LLM Speed Comparison 2026
Compare time to first token (TTFT), throughput (tokens/sec), and cost per speed across 92 large language models. Filter by provider to find the fastest API for your use case.
Last updated April 22, 2026
Quick Answer
The fastest LLM API in 2026 is Llama 3.3 70B on Groq at ~500 tokens/sec with sub-100ms TTFT. For frontier-quality models, Gemini 2.0 Flash leads at ~230 tok/s, with GPT-4o-mini close behind at ~180 tok/s. For the best speed-to-quality ratio, GPT-4o and Claude Sonnet 4 deliver strong benchmark scores with TTFT of roughly 300-400ms.
2026 LLM Speed Benchmarks — Published Figures
Independent throughput and TTFT numbers from Artificial Analysis and provider-published SLAs, cross-checked against our own streaming tests. Prices are shown as input / output cost per million tokens ($/M) so you can pick the right speed-to-cost point.
| Model | Provider | Tokens/sec | TTFT (ms) | Input / Output $/M | Best for |
|---|---|---|---|---|---|
| Llama 3.3 70B | Groq | 500 | 80 | $0.59 / $0.79 | Batch, agents, cheap streaming |
| Llama 3.1 8B | Cerebras / Groq | 450 | 70 | $0.10 / $0.10 | Classification, extraction |
| Gemini 2.0 Flash | Google | 230 | 180 | $0.10 / $0.40 | Chatbots, long-context |
| GPT-4o-mini | OpenAI | 180 | 260 | $0.15 / $0.60 | Chatbots, tool calls |
| Claude Haiku 4 | Anthropic | 150 | 290 | $1.00 / $5.00 | Agents, support triage |
| GPT-4o | OpenAI | 85 | 310 | $2.50 / $10.00 | Frontier quality + speed |
| Claude Sonnet 4 | Anthropic | 72 | 410 | $3.00 / $15.00 | Research, coding |
| Gemini 2.5 Pro | Google | 68 | 560 | $1.25 / $10.00 | Long-context, value |
| Claude Opus 4 | Anthropic | 55 | 720 | $15.00 / $75.00 | Deep reasoning, research |
| o4-mini | OpenAI | 90 | 1200 | $1.10 / $4.40 | Math, code (reasoning) |
| DeepSeek R1 | DeepSeek | 38 | 1800 | $0.55 / $2.19 | Cheap reasoning |
Numbers reflect ~1K-token input, 512-token streamed output on default regions as of April 22, 2026. TTFT varies ±30% regionally; tokens/sec is stable within ±10%.
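As a rough way to read the table, the wall-clock time for a streamed reply can be approximated as TTFT plus output tokens divided by throughput. The sketch below applies that to three rows from the table, using the 512-token output from the note above. The function name and the simple additive model are assumptions for illustration; it ignores queueing, network jitter, and the regional variation mentioned above.

```python
def estimated_response_seconds(ttft_ms: float, tokens_per_sec: float,
                               output_tokens: int = 512) -> float:
    """Approximate time for a full streamed reply: TTFT + generation time."""
    return ttft_ms / 1000.0 + output_tokens / tokens_per_sec


# model -> (tokens/sec, TTFT in ms), copied from the table above
ROWS = {
    "Llama 3.3 70B (Groq)": (500, 80),
    "GPT-4o-mini": (180, 260),
    "Claude Opus 4": (55, 720),
}

for model, (tps, ttft_ms) in ROWS.items():
    seconds = estimated_response_seconds(ttft_ms, tps)
    print(f"{model}: ~{seconds:.1f}s for a 512-token reply")
```

For short replies the TTFT term dominates; for long replies throughput does, which is why the two columns can rank models differently.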
All Models — Ranked by Speed
| Model | Provider | TTFT (ms) | Tokens/sec | Input $/M | Arena ELO |
|---|---|---|---|---|---|
| Gemini 2.0 Flash Lite | Google | 100 | 180 | $0.075 | 1200 |
| Phi-4 | Microsoft | 100 | 160 | $0.065 | 1150 |
| Gemini 2.0 Flash | Google | 120 | 160 | $0.100 | 1260 |
| GPT-4.1 Nano | OpenAI | 120 | 150 | $0.100 | 1150 |
| Grok 3-mini | xAI | 130 | 140 | $0.300 | 1175 |
| GPT-4.1 Mini | OpenAI | 140 | 120 | $0.400 | 1180 |
| Qwen 3 235B MoE | Alibaba | 150 | 100 | $0.455 | 1310 |
| Gemini Experimental 1206 | Google | 150 | 100 | $0.00 | 1300 |
| GPT-4.5 | OpenAI | 150 | 100 | $75.00 | 1290 |
| DeepSeek R1 (Groq) | Groq | 150 | 100 | $0.750 | 1290 |
| DeepSeek R1 (Together) | Together AI | 150 | 100 | $3.00 | 1290 |
| Gemini 2.0 Flash Thinking | Google | 150 | 100 | $0.00 | 1280 |
| Claude 3.5 Sonnet | Anthropic | 150 | 100 | $3.00 | 1270 |
| Gemini 2.5 Flash | Google | 150 | 100 | $0.300 | 1270 |
| o1-mini | OpenAI | 150 | 100 | $1.10 | 1270 |
| ChatGPT-4o Latest | OpenAI | 150 | 100 | $5.00 | 1265 |
| QwQ 32B | Alibaba | 150 | 100 | $0.150 | 1260 |
| GPT-4o (Aug 2024) | OpenAI | 150 | 100 | $2.50 | 1255 |
| DeepSeek R1 Distill Llama 70B | DeepSeek | 150 | 100 | $0.700 | 1250 |
| Command A | Cohere | 150 | 100 | $2.50 | 1240 |
| DeepSeek R1 Distill Qwen 32B | DeepSeek | 150 | 100 | $0.290 | 1240 |
| Llama 3.1 405B (Fireworks) | Fireworks AI | 150 | 100 | $3.00 | 1240 |
| GPT-4 Turbo | OpenAI | 150 | 100 | $10.00 | 1240 |
| Grok 2 | xAI | 150 | 100 | $2.00 | 1240 |
| Llama 3.1 405B | Meta | 150 | 100 | $3.00 | 1240 |
| Sonar Reasoning | Perplexity | 150 | 100 | $2.00 | 1240 |
| Llama 3.1 405B (Together) | Together AI | 150 | 100 | $3.50 | 1240 |
| Gemini 1.5 Pro | Google | 150 | 100 | $1.25 | 1230 |
| Grok 2 Vision | xAI | 150 | 100 | $2.00 | 1230 |
| Pixtral Large | Mistral AI | 150 | 100 | $2.00 | 1230 |
| Qwen 2.5 72B | Alibaba | 150 | 100 | $0.120 | 1230 |
| Qwen 2.5 72B (Together) | Together AI | 150 | 100 | $1.20 | 1230 |
| Amazon Nova Pro | Amazon | 150 | 100 | $0.800 | 1220 |
| Claude 3.5 Haiku | Anthropic | 150 | 100 | $0.800 | 1220 |
| Claude Haiku 4 | Anthropic | 150 | 130 | $1.00 | 1220 |
| Llama 3.3 70B (Fireworks) | Fireworks AI | 150 | 100 | $0.900 | 1220 |
| Llama 3.3 70B (Groq) | Groq | 150 | 100 | $0.590 | 1220 |
| Llama 3.3 70B | Meta | 150 | 100 | $0.120 | 1220 |
| Mistral Medium 3 | Mistral AI | 150 | 100 | $0.400 | 1220 |
| Llama 3.3 70B (Together) | Together AI | 150 | 100 | $0.880 | 1220 |
| Llama 3.2 90B Vision | Meta | 150 | 100 | $0.900 | 1210 |
| DeepSeek V2.5 | DeepSeek | 150 | 100 | $0.140 | 1200 |
| Mixtral 8x22B (Fireworks) | Fireworks AI | 150 | 100 | $0.900 | 1200 |
| Sonar Pro | Perplexity | 150 | 100 | $3.00 | 1200 |
| WizardLM-2 8x22B | Microsoft | 150 | 100 | $0.620 | 1200 |
| Llama 3.1 70B | Meta | 150 | 100 | $0.400 | 1195 |
| Phi-3.5 MoE | Microsoft | 150 | 100 | $0.170 | 1195 |
| Gemini 1.5 Flash | Google | 150 | 100 | $0.075 | 1190 |
| Gemma 2 27B | Google | 150 | 100 | $0.650 | 1190 |
| Yi-Large | 01.AI | 150 | 100 | $3.00 | 1185 |
| Amazon Nova Lite | Amazon | 150 | 100 | $0.060 | 1170 |
| Gemma 2 9B (Groq) | Groq | 150 | 100 | $0.200 | 1170 |
| Phi-3 Medium | Microsoft | 150 | 100 | $0.170 | 1170 |
| Yi-Lightning | 01.AI | 150 | 100 | $0.140 | 1165 |
| Gemma 2 9B | Google | 150 | 100 | $0.030 | 1160 |
| Mixtral 8x7B (Groq) | Groq | 150 | 100 | $0.240 | 1160 |
| Llama 3.2 11B Vision | Meta | 150 | 100 | $0.245 | 1160 |
| Phi-3.5 Mini | Microsoft | 150 | 100 | $0.130 | 1160 |
| Qwen 2.5 7B | Alibaba | 150 | 100 | $0.040 | 1160 |
| Sonar | Perplexity | 150 | 100 | $1.00 | 1160 |
| InternLM 2.5 20B | Shanghai AI Lab | 150 | 100 | $0.180 | 1155 |
| Gemini 1.5 Flash 8B | Google | 150 | 100 | $0.037 | 1150 |
| Mistral Nemo 12B | Mistral AI | 150 | 100 | $0.020 | 1140 |
| Amazon Nova Micro | Amazon | 150 | 100 | $0.035 | 1130 |
| Command R7B | Cohere | 150 | 100 | $0.038 | 1120 |
| GPT-3.5 Turbo | OpenAI | 150 | 100 | $0.500 | 1120 |
| Llama 3.1 8B (Groq) | Groq | 150 | 100 | $0.050 | 1120 |
| Llama 3.1 8B | Meta | 150 | 100 | $0.020 | 1120 |
| Mistral 7B | Mistral AI | 150 | 100 | $0.110 | 1100 |
| Mistral 7B (Together) | Together AI | 150 | 100 | $0.200 | 1100 |
| Codestral 22B | Mistral AI | 150 | 100 | $0.300 | -- |
| Qwen 2.5 Coder 32B | Alibaba | 150 | 100 | $0.660 | -- |
| GPT-4.1 | OpenAI | 160 | 85 | $2.00 | 1200 |
| Mistral Small | Mistral AI | 160 | 120 | $0.150 | 1185 |
| o4-mini | OpenAI | 180 | 105 | $1.10 | 1260 |
| GPT-4o Mini | OpenAI | 180 | 120 | $0.150 | 1220 |
| Grok 3 | xAI | 200 | 90 | $3.00 | 1285 |
| Llama 4 Scout | Meta | 200 | 110 | $0.080 | 1250 |
| DeepSeek V3 | DeepSeek | 220 | 85 | $0.259 | 1280 |
| GPT-4o | OpenAI | 230 | 95 | $2.50 | 1260 |
| Qwen 2.5 Max | Alibaba | 240 | 80 | $0.160 | 1260 |
| Llama 4 Maverick | Meta | 250 | 90 | $0.150 | 1290 |
| Command R | Cohere | 250 | 85 | $0.150 | 1140 |
| Mistral Large | Mistral AI | 280 | 75 | $0.500 | 1245 |
| Claude Sonnet 4 | Anthropic | 320 | 78 | $3.00 | 1280 |
| o3-mini | OpenAI | 350 | 25 | $1.10 | 1280 |
| Command R+ | Cohere | 350 | 65 | $2.50 | 1200 |
| Gemini 2.5 Pro | Google | 400 | 70 | $1.25 | 1430 |
| Claude Opus 4 | Anthropic | 500 | 50 | $5.00 | 1503 |
| o1 | OpenAI | 500 | 20 | $15.00 | 1310 |
| o3 | OpenAI | 600 | 15 | $2.00 | 1340 |
| DeepSeek R1 | DeepSeek | 1,800 | 45 | $0.500 | 1310 |
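If you want to apply the same kind of filtering to these rows in code, one possible sketch is shown below: keep the rows that clear a quality floor and a price ceiling (and optionally a provider set), then sort by TTFT. The `ModelRow` type and `fastest` helper are hypothetical names introduced for illustration, and only a handful of rows from the table are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRow:
    name: str
    provider: str
    ttft_ms: int
    tokens_per_sec: int
    input_usd_per_m: float
    arena_elo: int


# A few rows copied from the table above (illustrative subset, not all 92).
ROWS = [
    ModelRow("Phi-4", "Microsoft", 100, 160, 0.065, 1150),
    ModelRow("Grok 3-mini", "xAI", 130, 140, 0.30, 1175),
    ModelRow("GPT-4o Mini", "OpenAI", 180, 120, 0.15, 1220),
    ModelRow("DeepSeek V3", "DeepSeek", 220, 85, 0.259, 1280),
    ModelRow("Llama 4 Maverick", "Meta", 250, 90, 0.15, 1290),
    ModelRow("Claude Sonnet 4", "Anthropic", 320, 78, 3.00, 1280),
]


def fastest(rows, min_elo=0, max_input_usd_per_m=float("inf"), providers=None):
    """Return rows meeting the quality and price limits, lowest TTFT first."""
    keep = [
        r for r in rows
        if r.arena_elo >= min_elo
        and r.input_usd_per_m <= max_input_usd_per_m
        and (providers is None or r.provider in providers)
    ]
    # Ties on TTFT are broken by higher throughput.
    return sorted(keep, key=lambda r: (r.ttft_ms, -r.tokens_per_sec))


for row in fastest(ROWS, min_elo=1250, max_input_usd_per_m=1.00):
    print(row.name, row.ttft_ms, "ms,", row.tokens_per_sec, "tok/s")
```

Sorting on TTFT mirrors the ordering of the table; swapping the sort key to throughput would suit batch workloads better.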
The Right Speed Tier for Your Workload
Best for Chatbots
Gemini 2.0 Flash
Sub-200ms TTFT, ~230 tok/s throughput, and frontier-lite quality at $0.10/$0.40 per million input/output tokens. GPT-4o-mini is a close alternative on the OpenAI stack (~260ms TTFT, ~180 tok/s).
Best for Batch
Groq Llama 3.3 70B
~500 tok/s streaming shreds through backfills and bulk extraction. For proprietary models, use OpenAI/Anthropic/Google batch endpoints (50% off, 24h SLA).
Best for Agents
GPT-4o-mini or Claude Haiku 4
Agents make many small tool calls — low TTFT on every hop matters more than peak throughput. Avoid reasoning models (o3, R1) inside tight agent loops.
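To see why per-hop TTFT dominates in agent loops, here is a minimal sketch that sums latency over sequential tool-call hops, using TTFT and throughput figures from the published benchmark table above. The hop count and tokens per hop are assumed values, and the model ignores tool execution time and any hidden reasoning tokens, so it likely understates the cost of reasoning models.

```python
def agent_loop_seconds(hops: int, ttft_ms: float, tokens_per_sec: float,
                       tokens_per_hop: int):
    """Sequential hops: each pays TTFT plus the time to generate its output.

    Returns (total_seconds, seconds_spent_waiting_for_first_tokens).
    """
    ttft_total = hops * ttft_ms / 1000.0
    generation_total = hops * tokens_per_hop / tokens_per_sec
    return ttft_total + generation_total, ttft_total


HOPS = 8             # assumed number of sequential tool calls
TOKENS_PER_HOP = 60  # assumed visible output tokens per call

# (name, TTFT ms, tokens/sec) from the published-figures table
for name, ttft_ms, tps in [("GPT-4o-mini", 260, 180), ("o4-mini", 1200, 90)]:
    total, waiting = agent_loop_seconds(HOPS, ttft_ms, tps, TOKENS_PER_HOP)
    print(f"{name}: ~{total:.1f}s total, {waiting / total:.0%} waiting on first tokens")
```

With short outputs per hop, a large share of the loop is spent waiting for first tokens, so a lower-TTFT model shortens the loop even if its peak throughput is unremarkable.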
Frequently Asked Questions
- What is the fastest LLM in 2026?
- The fastest LLM in 2026 by raw throughput is Llama 3.3 70B on Groq at ~500 tokens/sec, followed by Cerebras Llama 3.1 8B at ~450 tok/s. For frontier-quality models, Gemini 2.0 Flash leads at ~230 tok/s with GPT-4o-mini close behind at ~180 tok/s. By time to first token (TTFT) in the full ranking, Gemini 2.0 Flash Lite responds in about 100ms and GPT-4.1 Nano in about 120ms.
- Why is Groq so much faster than OpenAI or Anthropic?
- Groq uses custom LPU (Language Processing Unit) hardware purpose-built for sequential token generation, while OpenAI and Anthropic run on GPUs optimized for parallel training. Groq's architecture eliminates most memory-bandwidth bottlenecks, so it can stream 5-10x more tokens per second than GPU-hosted models of the same size — but only for open-weight models (Llama, Mixtral, DeepSeek distills) since Groq can't host proprietary weights.
- What is TTFT (Time to First Token)?
- TTFT is how many milliseconds pass between sending your API request and receiving the first streamed token back. It's the latency users actually feel — lower TTFT means chat interfaces feel instant. TTFT is driven by model load time, prompt length, and provider infrastructure, and is separate from throughput (tokens/sec) which measures generation speed after the first token arrives.
- Does LLM speed affect response quality?
- Speed and quality are loosely correlated but not causally linked. Smaller distilled models (GPT-4.1 Nano, Gemini Flash Lite, Haiku) are faster because they have fewer parameters, which generally means weaker reasoning. But hardware matters too: Groq-hosted Llama 3.3 70B is ~6x faster than the same model on standard GPU providers with no quality difference. Pick speed tier based on task: simple extraction tolerates fast/small, complex reasoning needs slower frontier models.
- Which LLM is fastest for chatbots?
- For real-time chatbots, Gemini 2.0 Flash (~180ms TTFT, ~230 tok/s) and GPT-4o-mini (~260ms TTFT, ~180 tok/s) are the sweet spot, both with frontier-lite quality. If you need absolutely minimal latency and an open-weight model fits your use case, Groq Llama 3.3 70B delivers sub-100ms TTFT with ~500 tok/s streaming.
- Which LLM is fastest for batch processing?
- For batch workloads, throughput (tokens/sec) matters more than TTFT. Gemini 2.0 Flash at ~230 tok/s and Groq-hosted Llama at ~500 tok/s are the top picks. Also consider batch APIs: OpenAI, Anthropic, and Google all offer 50% discounts on batch endpoints with 24-hour SLAs, which often beats real-time throughput on $/token.
- Which LLM is fastest for agents?
- Agents benefit from low TTFT on every tool-call hop. GPT-4o-mini and Claude Haiku 4 are the preferred choices — both deliver sub-300ms TTFT with strong function-calling reliability. Avoid reasoning models (o3, DeepSeek R1) in agent loops unless the task truly requires deep thinking, because their hidden chain-of-thought adds seconds per call.
- Does streaming improve perceived speed?
- Yes. Without streaming, users wait for the full response (TTFT plus generation time); with streaming, they see the first tokens after TTFT alone. A response that takes 4 seconds to generate in full can start appearing after roughly 400ms when streamed. Every major provider supports SSE streaming; enable it wherever you can tolerate partial output.
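To check TTFT and throughput yourself, a provider-agnostic sketch is to time any iterator of streamed chunks: TTFT is the gap between sending the request and the first chunk, and throughput is the rate of the chunks that follow. The `measure_stream` helper and the `fake_stream` generator below are hypothetical stand-ins; in practice you would pass the streaming iterator returned by your provider's SDK, and chunks do not always map one-to-one to tokens.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float, int]:
    """Return (ttft_seconds, chunks_per_second_after_first, chunk_count)."""
    start = clock()  # taken just before the stream is first pulled
    first = last = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    generation_time = last - first
    # Throughput excludes the first chunk, whose delay is counted as TTFT.
    rate = (count - 1) / generation_time if generation_time > 0 else float("nan")
    return first - start, rate, count


def fake_stream(n: int = 20, ttft: float = 0.05, gap: float = 0.01):
    """Simulated stream (an assumption for the demo, not a real API)."""
    time.sleep(ttft)
    for i in range(n):
        if i:
            time.sleep(gap)
        yield "tok"


ttft_s, rate, n = measure_stream(fake_stream())
print(f"TTFT ~{ttft_s * 1000:.0f}ms, ~{rate:.0f} chunks/sec over {n} chunks")
```

Injecting the clock keeps the helper testable without real sleeps; for real measurements, repeat the call several times and compare medians, since TTFT varies by region and load.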