Self-Hosted vs API LLM: True Cost Comparison for 2026
Quick answer: For most teams, managed LLM APIs are cheaper than self-hosting until you exceed roughly 100-500M tokens/month of sustained load — the exact break-even depends on model size, quality requirements, and engineering costs. Below that threshold, the infrastructure and maintenance overhead of self-hosting costs more than the API savings.
The self-hosted cost model
Self-hosting an LLM involves three cost categories:
1. GPU compute: An A100 80GB GPU runs $2.50-$3.50/hour on major cloud providers (Lambda Labs, CoreWeave, RunPod, vast.ai). An H100 runs $4-6/hour. You can reduce this with spot/reserved pricing (30-40% discount) but spot instances can be interrupted.
2. Engineering overhead: Setting up vLLM, TGI, or Ollama for production serving, managing scaling, load balancing, monitoring, and updates. Estimate 1-2 engineer-weeks for initial setup and 2-4 hours/week for maintenance. At a $150K/year engineer cost, that's roughly $15,000-$25,000/year in overhead.
3. Operational costs: Storage, networking, monitoring tooling. Typically $200-$500/month for a modest deployment.
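The three categories combine into a single monthly figure. A minimal sketch, assuming one always-on A100 at the mid-range prices above (the function name and defaults are illustrative, not quotes):

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_self_hosted_cost(gpu_rate_per_hour, gpu_count=1,
                             engineering_per_month=2_000,  # ~$24K/yr amortized
                             ops_per_month=350):           # storage, monitoring
    """Rough total monthly cost: GPU compute + engineering + operations."""
    compute = gpu_rate_per_hour * gpu_count * HOURS_PER_MONTH
    return compute + engineering_per_month + ops_per_month

# One A100 at $2.80/hr: $2,044 compute + $2,000 eng + $350 ops = $4,394/month
total = monthly_self_hosted_cost(2.80)
```

Note that the fixed engineering and operational costs dominate at small scale, which is the core of the break-even argument below.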
Throughput benchmarks for common models
| Model | GPU | Tokens/Second | $/Hour (Lambda) | $/1M tokens |
|---|---|---|---|---|
| Llama 4 Scout (17B active) | A100 80GB | ~80,000 | $2.80 | $0.010 |
| Llama 4 Maverick (400B MoE) | 8×A100 | ~15,000 | $22.40 | $0.415 |
| DeepSeek V3 (685B) | 8×H100 | ~8,000 | $44.00 | $1.528 |
| Mistral Large (123B) | 4×A100 | ~18,000 | $11.20 | $0.174 |
Throughput figures are aggregate (combined input + output) tokens under continuous batching at full GPU utilization; single-request decoding is far slower, so the per-token costs above assume a saturated deployment.
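The $/1M tokens column is just the hourly GPU price divided by millions of tokens served per hour. A minimal helper (function name is illustrative):

```python
def cost_per_million_tokens(dollars_per_hour, tokens_per_second):
    """Compute cost per 1M tokens from hourly GPU price and throughput."""
    millions_per_hour = tokens_per_second * 3600 / 1_000_000
    return dollars_per_hour / millions_per_hour

cost_per_million_tokens(2.80, 80_000)   # Llama 4 Scout: ≈ $0.0097
cost_per_million_tokens(11.20, 18_000)  # Mistral Large: ≈ $0.173
```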
API cost comparison for same models
| Model | Provider | Blended $/1M tokens (60% input / 40% output) |
|---|---|---|
| Llama 4 Scout | Together AI | $0.33 |
| Llama 4 Maverick | Fireworks AI | $0.48 |
| DeepSeek V3 | DeepSeek API | $0.59 |
| Mistral Large | Mistral AI | $1.56 |
The break-even analysis
For Llama 4 Scout as an example:
- Self-hosted cost: $0.010/1M tokens (compute only) + ~$2,000/month engineering amortized
- API cost: $0.33/1M tokens
Break-even monthly volume:
(API cost - self-hosted compute cost) × volume = engineering overhead
($0.33 - $0.01) × volume_in_millions = $2,000
volume = $2,000 / $0.32 = ~6,250M tokens/month
That's 6.25 billion tokens per month — substantial. For most teams, API wins until you're at very high volume.
For Mistral Large (higher API price):
($1.56 - $0.174) × volume = $2,000
volume = ~1,440M tokens/month
About 1.4 billion tokens/month — still high, but reachable for mid-scale applications.
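The two worked examples above follow from one formula: fixed monthly overhead divided by the per-million-token savings. A small sketch (assuming the article's $2,000/month overhead figure):

```python
def break_even_millions(api_price, self_hosted_price, overhead_per_month=2_000):
    """Monthly volume (in millions of tokens) at which self-hosting's
    per-token savings cover the fixed engineering overhead."""
    return overhead_per_month / (api_price - self_hosted_price)

break_even_millions(0.33, 0.010)   # Llama 4 Scout: 6,250M tokens/month
break_even_millions(1.56, 0.174)   # Mistral Large: ≈ 1,443M tokens/month
```

The break-even volume scales linearly with overhead and inversely with the API-vs-compute price gap, so cheaper API pricing pushes the threshold up fast.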
When self-hosting makes sense
Volume: >500M tokens/month of sustained production load.
Data privacy: If your data cannot leave your infrastructure (healthcare, finance, defense), self-hosting is sometimes the only option regardless of cost.
Fine-tuning: If you need full control over fine-tuning open models on proprietary data (frequent retraining, custom adapters), self-hosting is usually the practical route, though some hosted providers can also serve fine-tuned weights.
Latency control: If you need consistent <50ms TTFT at high concurrency, dedicated GPU infrastructure outperforms shared API endpoints.
Regulatory: Some compliance frameworks require on-premise or VPC-isolated deployment.
Practical recommendations
- <50M tokens/month: Use managed APIs. Don't self-host.
- 50-500M tokens/month: Evaluate hybrid — managed API for realtime, batch API for async workloads. Explore third-party hosted inference (Together, Fireworks) as a middle option.
- >500M tokens/month: Evaluate self-hosting for your highest-volume, most price-sensitive workloads. Keep managed APIs for low-volume, high-quality tasks.
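The three tiers above can be expressed as a simple decision helper (thresholds are the article's rules of thumb; the function name is illustrative):

```python
def deployment_recommendation(monthly_tokens_millions):
    """Map sustained monthly volume (millions of tokens) to a deployment tier."""
    if monthly_tokens_millions < 50:
        return "managed API"
    if monthly_tokens_millions <= 500:
        return "hybrid: managed API for realtime, batch/hosted inference for async"
    return "evaluate self-hosting for high-volume workloads"
```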
See best open source LLMs for the current top self-hostable models, and the LLMversus cost calculator for exact break-even modeling.