LLM Cost Optimization: The Complete 2026 Playbook
Quick answer: LLM cost optimization has five levers in order of impact: (1) right-size model selection — can save 60-80%, (2) enable prompt caching — can save 50-90% on input, (3) migrate async workloads to batch API — 50% off immediately, (4) optimize prompt length — 20-40% input reduction, (5) add cost attribution and kill waste — typically 10-20% from unused features. Start in that order.
The cost optimization hierarchy
Not all optimizations are equal. Here's the impact ladder:
| Lever | Effort | Typical Savings | Time to Implement |
|---|---|---|---|
| Model right-sizing | Medium | 60-80% | 1-2 weeks |
| Prompt caching | Low | 50-90% on input | 1-3 days |
| Batch API migration | Medium | 50% on batch workload | 1-2 weeks |
| Prompt length reduction | Low-Medium | 20-40% on input | 1-2 weeks |
| Cost attribution | Medium | 10-20% from waste | 1-2 weeks |
| Response caching | Medium | 15-40% on total calls | 1 week |
| Open-source migration | High | 50-90% at scale | 4-12 weeks |
Do them roughly in this order. The top three deliver the most value with the least effort.
Lever 1: Model right-sizing
The principle is simple: use the cheapest model that achieves acceptable quality for each task. The implementation requires classifying your tasks by quality tier.
Tier classification process:
- List every LLM-powered feature in your product
- For each feature, ask: "What happens if output quality drops by 20%?" Use the answer to assign a tier — quality-tolerant tasks go in the lowest tier, tasks where degradation is costly go higher
- Run each tier on the cheapest model in that tier and evaluate
- Only upgrade a model if evaluation shows clear quality degradation
Most teams discover that 60-80% of their token volume runs on Tier 1 models unnecessarily.
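The tier process above can be sketched as a simple routing table. The tier assignments and model names here are illustrative assumptions, not recommendations — validate them against your own evaluations:

```python
# Hypothetical tier-to-model routing. Model IDs and tier assignments are
# placeholders; replace them with whatever your evaluations approve.
TIER_MODELS = {
    1: "gpt-4o-mini",  # high-volume, quality-tolerant tasks
    2: "gpt-4o",       # user-facing tasks needing solid quality
    3: "gpt-4.1",      # tasks where errors are expensive
}

FEATURE_TIERS = {
    "autocomplete": 1,
    "summarization": 1,
    "customer-support": 2,
    "contract-review": 3,
}

def model_for(feature: str) -> str:
    """Return the cheapest approved model for a feature's tier."""
    tier = FEATURE_TIERS.get(feature, 3)  # unknown features default to the safest tier
    return TIER_MODELS[tier]
```

Centralizing the mapping means a tier change is a one-line edit rather than a hunt through every call site.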
See the LLMversus model comparison for a full breakdown by tier and the cheapest LLM API ranking for pricing.
Lever 2: Prompt caching
Any prompt that contains a large static prefix (system prompt, documentation, user history, retrieved context) that appears in many requests is a caching candidate.
Quick audit: Look at your top 5 endpoints by token volume. For each one, answer: "What percentage of the input tokens are identical across different user requests?" If the answer is >20%, caching will deliver meaningful savings.
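The audit question can be approximated programmatically. This sketch measures the shared prefix across sampled prompts at the character level — a rough proxy for token overlap, since providers typically cache a common prefix:

```python
import os

def shared_prefix_pct(prompts: list[str]) -> float:
    """Percent of the average prompt covered by the prefix all prompts share.
    Character-level proxy for token overlap; directionally useful for an audit."""
    common = os.path.commonprefix(prompts)
    avg_len = sum(len(p) for p in prompts) / len(prompts)
    return 100 * len(common) / avg_len
```

Run it on a sample of real requests per endpoint; anything scoring above ~20% is a caching candidate. Note this only counts a *leading* shared prefix — if static content sits after dynamic content, reorder the prompt (static first) before measuring.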
See LLM API caching strategies for the full implementation guide.
Lever 3: Batch API migration
Survey every scheduled job, background task, and asynchronous pipeline in your system. Any of these that call an LLM API synchronously when they don't need to is a batch migration opportunity.
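As a sketch of what migration looks like, here's a helper that converts queued jobs into the JSONL request lines the OpenAI Batch API expects. Field names follow the documented Batch request format, but verify against current docs before relying on them:

```python
import json

def build_batch_lines(jobs: list[dict], model: str) -> list[str]:
    """Convert queued jobs into Batch API JSONL request lines."""
    lines = []
    for i, job in enumerate(jobs):
        lines.append(json.dumps({
            "custom_id": f"job-{i}",           # used to match results back to jobs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model, "messages": job["messages"]},
        }))
    return lines
```

Write the lines to a `.jsonl` file, upload it with `files.create(purpose="batch")`, then submit with `batches.create(..., completion_window="24h")` — the 50% discount applies automatically.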
See batch API vs realtime cost for the full analysis.
Lever 4: Prompt engineering for cost
Prompt length directly determines input token cost. Every token you remove from your prompts is a proportional cost reduction.
Prompt audit checklist:
- [ ] Remove instructions the model follows by default (e.g., "be helpful", "be accurate")
- [ ] Compress multi-step instructions into shorter equivalents
- [ ] Trim few-shot examples to 2-3 tight examples
- [ ] Remove conversational filler ("Great! I'd be happy to help with that...")
- [ ] Replace verbose schema descriptions with compact examples
- [ ] Audit system prompts quarterly — they grow over time as teams add instructions
Output length control:
- Add explicit length instructions for every task ("respond in 2-3 sentences", "keep your answer under 100 words")
- Set `max_tokens` at the API level as a hard cap
- For structured extraction, return JSON only, no explanation
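Pairing the two controls can be sketched as follows — the prompt instruction shapes the answer, and `max_tokens` is the backstop (it truncates mid-sentence, so it should never be the primary control). The helper name and the tokens-per-word ratio are illustrative assumptions:

```python
def capped_request(model: str, system: str, user: str, word_limit: int) -> dict:
    """Build request kwargs with a length instruction plus a hard token cap."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": f"{system} Keep answers under {word_limit} words."},
            {"role": "user", "content": user},
        ],
        # Rough backstop: ~1.5 tokens per English word, plus slack.
        "max_tokens": int(word_limit * 1.5) + 50,
    }
```

Unpack the result into `client.chat.completions.create(**capped_request(...))`.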
Lever 5: Cost attribution and waste elimination
You can't optimize what you can't measure. Add a tagging layer to all LLM calls:
```python
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=messages,
    # Tag every call so spend can be attributed by feature, env, and segment
    user="feature:customer-support|env:production|segment:pro",
)
```
Then aggregate by feature to find your top token consumers. You'll typically find:
- 2-3 features consuming 60-70% of spend
- 1-2 features running in staging with production models
- At least one feature that was deprecated but its LLM calls are still running
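The aggregation step can be sketched as below, assuming call logs carry the `user` tag string and a `total_tokens` count from the API's usage object:

```python
from collections import defaultdict

def spend_by_feature(calls: list[dict]) -> dict[str, int]:
    """Sum token usage by the feature tag embedded in each call's user field."""
    totals: dict[str, int] = defaultdict(int)
    for call in calls:
        # Parse "feature:X|env:Y|segment:Z" back into a dict
        tags = dict(pair.split(":", 1) for pair in call["user"].split("|"))
        totals[tags["feature"]] += call["total_tokens"]
    return dict(totals)
```

Sort the result descending and the top two or three entries are where every subsequent lever should be applied first.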
Tools: Helicone, Langfuse, Portkey, or custom logging to your data warehouse.
The governance layer
Optimization without governance regresses over time. Establish:
- Cost budget per feature: Set monthly token budgets for each LLM-powered feature. Alert when usage exceeds 80% of budget.
- Model change approval: New model upgrades require a cost impact assessment. No engineer should be able to silently swap a Haiku call for an Opus call in production.
- Quarterly prompt audit: Schedule 2 hours every quarter to review top-10 prompts by token cost. Prompts accumulate bloat naturally over time.
- Staging model policy: Document the approved models for each environment. Staging always uses the cheapest model in the same family.
- Runaway protection: Circuit breakers that stop LLM calls if per-feature cost exceeds a threshold per hour. Prevent accidental infinite loops from burning thousands of dollars.
Implementation checklist
Week 1:
- [ ] Add cost attribution tags to all LLM calls
- [ ] Pull token usage by feature for the last 30 days
- [ ] Identify top 3 features by token spend
Week 2:
- [ ] Audit top 3 features for caching opportunities
- [ ] Implement prompt caching on eligible features
- [ ] Identify batch-eligible workloads
Week 3:
- [ ] Migrate first batch-eligible workload to Batch API
- [ ] Run model right-sizing evaluation on Tier 1 features
- [ ] Audit and trim system prompts
Week 4:
- [ ] Set spending alerts and feature budgets
- [ ] Document staging model policy
- [ ] Calculate and report total savings from implementation
Use the LLMversus cost calculator to model savings from each optimization before implementation.