Batch API vs Realtime LLM Calls: Cost Comparison and When to Switch
Quick answer: Both OpenAI and Anthropic offer a 50% discount on batch API processing, with results returned within a 24-hour completion window. If any portion of your workload can tolerate async processing, the batch API is an immediate 50% cost reduction on that portion. In practice, many production systems have 40-60% of their workload batch-eligible but run it synchronously out of habit.
What the batch API is
Both OpenAI's Batch API and Anthropic's Message Batches API let you submit a large set of requests to be processed asynchronously within 24 hours, at half the price of synchronous API calls.
The discount has an infrastructure rationale: batch requests fill GPU capacity that would otherwise sit idle during off-peak hours. Because the provider can defer processing to whenever capacity is available, the economics are favorable enough to pass a 50% discount on to users.
Pricing: batch vs standard
| Model | Standard Input | Batch Input | Standard Output | Batch Output |
|---|---|---|---|---|
| GPT-4o | $2.50/M | $1.25/M | $10.00/M | $5.00/M |
| GPT-4.1 | $2.00/M | $1.00/M | $8.00/M | $4.00/M |
| GPT-4.1 Mini | $0.40/M | $0.20/M | $1.60/M | $0.80/M |
| Claude Sonnet 4 | $3.00/M | $1.50/M | $15.00/M | $7.50/M |
| Claude Haiku 4 | $0.80/M | $0.40/M | $4.00/M | $2.00/M |
Every model, every provider: 50% off.
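Because the 50%-off relationship is uniform, the table collapses to data plus one rule. A quick sanity check (the model keys below are display labels from the table, not API model IDs):

```python
# Per-million-token prices (USD) from the table above: (input, output) per tier.
PRICING = {
    "GPT-4o":          {"standard": (2.50, 10.00), "batch": (1.25, 5.00)},
    "GPT-4.1":         {"standard": (2.00,  8.00), "batch": (1.00, 4.00)},
    "GPT-4.1 Mini":    {"standard": (0.40,  1.60), "batch": (0.20, 0.80)},
    "Claude Sonnet 4": {"standard": (3.00, 15.00), "batch": (1.50, 7.50)},
    "Claude Haiku 4":  {"standard": (0.80,  4.00), "batch": (0.40, 2.00)},
}

# Batch is exactly half of standard for every model, input and output alike.
for model, tiers in PRICING.items():
    std_in, std_out = tiers["standard"]
    bat_in, bat_out = tiers["batch"]
    assert bat_in == std_in / 2 and bat_out == std_out / 2
```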
Which workloads are batch-eligible?
Batch processing works when:
- Requests are independent (no dependency on other request responses)
- Results can be consumed 1-24 hours after submission
- Output is stored for later use, not streamed to a waiting user
Clearly batch-eligible:
- Nightly document summarization
- Bulk data enrichment (e.g., categorizing 100K support tickets)
- Content moderation pipelines
- Translation of product catalogs
- Generating embeddings at scale
- Scheduled report generation
- Evaluation runs (offline evals)
- Training data generation
- SEO content generation pipelines
Clearly NOT batch-eligible:
- Chat interfaces (user waiting for response)
- Realtime code completion (IDE plugins)
- Live customer support agents
- Any feature where a user is actively waiting
Gray area (needs evaluation):
- Email drafting suggestions (usually has minutes of tolerance)
- Background insight generation shown on next page load
- Pre-generating suggestions for a user's next session
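The three eligibility criteria reduce to a small predicate. A sketch of that decision rule (the `Workload` fields and the 24-hour threshold are my own framing of the criteria above, not a provider API):

```python
from dataclasses import dataclass

@dataclass
class Workload:
    independent_requests: bool      # no request needs another request's response
    latency_tolerance_hours: float  # how long results can sit before they're consumed
    user_waiting: bool              # a user is actively blocked on the response

def batch_eligible(w: Workload) -> bool:
    # All three criteria must hold; the 24-hour threshold matches the
    # providers' batch completion window.
    return (
        w.independent_requests
        and not w.user_waiting
        and w.latency_tolerance_hours >= 24
    )

nightly_summaries = Workload(independent_requests=True, latency_tolerance_hours=24, user_waiting=False)
chat_reply = Workload(independent_requests=True, latency_tolerance_hours=0, user_waiting=True)
assert batch_eligible(nightly_summaries)
assert not batch_eligible(chat_reply)
```

Gray-area workloads like email drafting fall out naturally: requests are independent and no user is blocked, but the tolerance is minutes rather than hours, so the rule flags them for evaluation rather than an automatic yes.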
Cost savings example
A content company running 10M tokens/day of document summarization:
Standard API (Claude Sonnet 4):
- Input: 8M × $3.00/M = $24/day
- Output: 2M × $15.00/M = $30/day
- Total: $54/day = $1,620/month
Batch API (Claude Sonnet 4):
- Input: 8M × $1.50/M = $12/day
- Output: 2M × $7.50/M = $15/day
- Total: $27/day = $810/month
Savings: $810/month = $9,720/year with zero quality change.
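The arithmetic scales linearly with volume, so it is easy to plug in your own numbers. The calculation above, using the Claude Sonnet 4 rates and the same 30-day month:

```python
STD_IN, STD_OUT = 3.00, 15.00     # Claude Sonnet 4 standard $/M tokens
BATCH_IN, BATCH_OUT = 1.50, 7.50  # batch $/M tokens

def daily_cost(m_in, m_out, in_rate, out_rate):
    """Cost in dollars for m_in/m_out millions of input/output tokens per day."""
    return m_in * in_rate + m_out * out_rate

standard = daily_cost(8, 2, STD_IN, STD_OUT)    # $54/day
batch = daily_cost(8, 2, BATCH_IN, BATCH_OUT)   # $27/day
monthly_savings = (standard - batch) * 30       # $810/month
annual_savings = monthly_savings * 12           # $9,720/year
```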
Implementing the OpenAI Batch API
```python
import json

import openai

client = openai.OpenAI()

# Build one request per document; custom_id lets you match results to inputs later.
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4.1",
            "messages": [{"role": "user", "content": doc}],
            "max_tokens": 500,
        },
    }
    for i, doc in enumerate(documents)
]

# Write the requests as JSONL, one JSON object per line.
with open("batch.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file, then create the batch against it.
with open("batch.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}")
```
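Submitting is only half the job: you also need to poll for completion and parse the output file. A sketch of that side using the published `batches.retrieve` and `files.content` endpoints (error handling trimmed; `client` and `batch.id` come from the submission snippet above):

```python
import json
import time

def wait_for_batch(client, batch_id, poll_seconds=60):
    """Poll until the batch reaches a terminal status."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in ("completed", "failed", "expired", "cancelled"):
            return batch
        time.sleep(poll_seconds)

def parse_results(jsonl_text):
    """Map each custom_id to the assistant's reply from the output JSONL."""
    results = {}
    for line in jsonl_text.splitlines():
        item = json.loads(line)
        body = item["response"]["body"]
        results[item["custom_id"]] = body["choices"][0]["message"]["content"]
    return results

# Usage (assumes `client` and `batch` from the submission snippet):
# done = wait_for_batch(client, batch.id)
# if done.status == "completed":
#     summaries = parse_results(client.files.content(done.output_file_id).text)
```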
When batch API doesn't make sense
- Your workload is already realtime and latency-sensitive. Don't break a working system for 50% savings if the latency would hurt users.
- Batch size is very small (<100 requests). The overhead of creating batch files isn't worth it for small jobs.
- You need immediate error feedback. Batch processing errors surface hours later; realtime calls fail immediately so you can retry faster.
The simplest decision rule: if a user isn't waiting for the response, use batch. The 50% discount is free money.
See how to reduce LLM API costs for the full cost optimization playbook, and compare live batch pricing across providers with the LLMversus cost calculator.