Batch API vs Realtime LLM Calls: Cost Comparison and When to Switch
Quick answer: Both OpenAI and Anthropic offer a 50% discount on batch API processing, with results returned within a 24-hour completion window. If any portion of your workload can tolerate async processing, the batch API is an immediate 50% cost reduction on that portion. In practice, many production systems have 40-60% of their workload batch-eligible but run it synchronously out of habit.
What the batch API is
Both OpenAI's Batch API and Anthropic's Message Batches API let you submit a large set of requests to be processed asynchronously within 24 hours, at half the price of synchronous API calls.
The discount has an infrastructure rationale: batch requests fill GPU capacity that would otherwise sit idle during off-peak hours. Because the provider can defer processing to whenever capacity is available, the economics are favorable enough to pass a 50% discount on to users.
Pricing: batch vs standard
| Model | Standard Input | Batch Input | Standard Output | Batch Output |
|---|---|---|---|---|
| GPT-4o | $2.50/M | $1.25/M | $10.00/M | $5.00/M |
| GPT-4.1 | $2.00/M | $1.00/M | $8.00/M | $4.00/M |
| GPT-4.1 Mini | $0.40/M | $0.20/M | $1.60/M | $0.80/M |
| Claude Sonnet 4 | $3.00/M | $1.50/M | $15.00/M | $7.50/M |
| Claude Haiku 4 | $0.80/M | $0.40/M | $4.00/M | $2.00/M |
Every model, every provider: 50% off.
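Because the 50%-off relationship is uniform, the table collapses to data plus one rule. A quick sanity check (the model keys below are display labels from the table, not API model IDs):

```python
# Per-million-token prices (USD) from the table above: (input, output) per tier.
PRICING = {
    "GPT-4o":          {"standard": (2.50, 10.00), "batch": (1.25, 5.00)},
    "GPT-4.1":         {"standard": (2.00,  8.00), "batch": (1.00, 4.00)},
    "GPT-4.1 Mini":    {"standard": (0.40,  1.60), "batch": (0.20, 0.80)},
    "Claude Sonnet 4": {"standard": (3.00, 15.00), "batch": (1.50, 7.50)},
    "Claude Haiku 4":  {"standard": (0.80,  4.00), "batch": (0.40, 2.00)},
}

# Batch is exactly half of standard for every model, input and output alike.
for model, tiers in PRICING.items():
    std_in, std_out = tiers["standard"]
    bat_in, bat_out = tiers["batch"]
    assert bat_in == std_in / 2 and bat_out == std_out / 2
```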
Which workloads are batch-eligible?
Batch processing works when:
- Requests are independent (no dependency on other request responses)
- Results can be consumed 1-24 hours after submission
- Output is stored for later use, not streamed to a waiting user
Clearly batch-eligible:
- Nightly document summarization
- Bulk data enrichment (e.g., categorizing 100K support tickets)
- Content moderation pipelines
- Translation of product catalogs
- Generating embeddings at scale
- Scheduled report generation
- Evaluation runs (offline evals)
- Training data generation
- SEO content generation pipelines
Clearly NOT batch-eligible:
- Chat interfaces (user waiting for response)
- Realtime code completion (IDE plugins)
- Live customer support agents
- Any feature where a user is actively waiting
Gray area (needs evaluation):
- Email drafting suggestions (usually has minutes of tolerance)
- Background insight generation shown on next page load
- Pre-generating suggestions for a user's next session
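The three eligibility criteria reduce to a small predicate. A sketch of that decision rule (the `Workload` fields and the 24-hour threshold are my own framing of the criteria above, not a provider API):

```python
from dataclasses import dataclass

@dataclass
class Workload:
    independent_requests: bool      # no request needs another request's response
    latency_tolerance_hours: float  # how long results can sit before they're consumed
    user_waiting: bool              # a user is actively blocked on the response

def batch_eligible(w: Workload) -> bool:
    # All three criteria must hold; the 24-hour threshold matches the
    # providers' batch completion window.
    return (
        w.independent_requests
        and not w.user_waiting
        and w.latency_tolerance_hours >= 24
    )

nightly_summaries = Workload(independent_requests=True, latency_tolerance_hours=24, user_waiting=False)
chat_reply = Workload(independent_requests=True, latency_tolerance_hours=0, user_waiting=True)
assert batch_eligible(nightly_summaries)
assert not batch_eligible(chat_reply)
```

Gray-area workloads like email drafting fall out naturally: requests are independent and no user is blocked, but the tolerance is minutes rather than hours, so the rule flags them for evaluation rather than an automatic yes.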
Cost savings example
A content company running 10M tokens/day of document summarization:
Standard API (Claude Sonnet 4):
- Input: 8M × $3.00/M = $24/day
- Output: 2M × $15.00/M = $30/day
- Total: $54/day = $1,620/month
Batch API (Claude Sonnet 4):
- Input: 8M × $1.50/M = $12/day
- Output: 2M × $7.50/M = $15/day
- Total: $27/day = $810/month
Savings: $810/month = $9,720/year with zero quality change.
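The arithmetic scales linearly with volume, so it is easy to plug in your own numbers. The calculation above, using the Claude Sonnet 4 rates and the same 30-day month:

```python
STD_IN, STD_OUT = 3.00, 15.00     # Claude Sonnet 4 standard $/M tokens
BATCH_IN, BATCH_OUT = 1.50, 7.50  # batch $/M tokens

def daily_cost(m_in, m_out, in_rate, out_rate):
    """Cost in dollars for m_in/m_out millions of input/output tokens per day."""
    return m_in * in_rate + m_out * out_rate

standard = daily_cost(8, 2, STD_IN, STD_OUT)    # $54/day
batch = daily_cost(8, 2, BATCH_IN, BATCH_OUT)   # $27/day
monthly_savings = (standard - batch) * 30       # $810/month
annual_savings = monthly_savings * 12           # $9,720/year
```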
Implementing the OpenAI Batch API
```python
import json

import openai

client = openai.OpenAI()

# Build one request per document; custom_id lets you match results to inputs later.
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4.1",
            "messages": [{"role": "user", "content": doc}],
            "max_tokens": 500,
        },
    }
    for i, doc in enumerate(documents)
]

# Write the requests as JSONL, one JSON object per line.
with open("batch.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file, then create the batch against it.
with open("batch.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}")
```
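Submitting is only half the job: you also need to poll for completion and parse the output file. A sketch of that side using the published `batches.retrieve` and `files.content` endpoints (error handling trimmed; `client` and `batch.id` come from the submission snippet above):

```python
import json
import time

def wait_for_batch(client, batch_id, poll_seconds=60):
    """Poll until the batch reaches a terminal status."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in ("completed", "failed", "expired", "cancelled"):
            return batch
        time.sleep(poll_seconds)

def parse_results(jsonl_text):
    """Map each custom_id to the assistant's reply from the output JSONL."""
    results = {}
    for line in jsonl_text.splitlines():
        item = json.loads(line)
        body = item["response"]["body"]
        results[item["custom_id"]] = body["choices"][0]["message"]["content"]
    return results

# Usage (assumes `client` and `batch` from the submission snippet):
# done = wait_for_batch(client, batch.id)
# if done.status == "completed":
#     summaries = parse_results(client.files.content(done.output_file_id).text)
```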
When batch API doesn't make sense
- Your workload is already realtime and latency-sensitive. Don't break a working system for 50% savings if the latency would hurt users.
- Batch size is very small (<100 requests). The overhead of creating batch files isn't worth it for small jobs.
- You need immediate error feedback. Batch processing errors surface hours later; realtime calls fail immediately so you can retry faster.
The simplest decision rule: if a user isn't waiting for the response, use batch. The 50% discount is free money.
See how to reduce LLM API costs for the full cost optimization playbook, and compare live batch pricing across providers with the LLMversus cost calculator.