LLM Speed Comparison 2026

Compare time to first token (TTFT), throughput (tokens/sec), and price relative to speed across 92 large language models. Filter by provider to find the fastest API for your use case.

Data verified Apr 20, 2026

Last updated April 22, 2026

Quick Answer

The fastest LLM API in 2026 is Llama 3.3 70B on Groq at ~500 tokens/sec with sub-100ms TTFT. For frontier-quality models, Gemini 2.0 Flash leads at ~230 tok/s, with GPT-4o-mini close behind at ~180 tok/s. For the best speed-to-quality ratio, GPT-4o and Claude Sonnet 4 deliver strong benchmark scores with TTFT of roughly 310-410ms.

2026 LLM Speed Benchmarks — Published Figures

Independent throughput and TTFT numbers from Artificial Analysis and provider-published SLAs, cross-checked against our own streaming tests. Prices are shown as input / output cost in $ per million tokens so you can pick the right speed-to-cost point.

| Model | Provider | Tokens/sec | TTFT (ms) | Input / Output $/M | Best for |
|---|---|---|---|---|---|
| Llama 3.3 70B | Groq | 500 | 80 | $0.59 / $0.79 | Batch, agents, cheap streaming |
| Llama 3.1 8B | Cerebras / Groq | 450 | 70 | $0.10 / $0.10 | Classification, extraction |
| Gemini 2.0 Flash | Google | 230 | 180 | $0.10 / $0.40 | Chatbots, long-context |
| GPT-4o-mini | OpenAI | 180 | 260 | $0.15 / $0.60 | Chatbots, tool calls |
| Claude Haiku 4 | Anthropic | 150 | 290 | $1.00 / $5.00 | Agents, support triage |
| GPT-4o | OpenAI | 85 | 310 | $2.50 / $10.00 | Frontier quality + speed |
| Claude Sonnet 4 | Anthropic | 72 | 410 | $3.00 / $15.00 | Research, coding |
| Gemini 2.5 Pro | Google | 68 | 560 | $1.25 / $10.00 | Long-context, value |
| Claude Opus 4 | Anthropic | 55 | 720 | $15.00 / $75.00 | Deep reasoning, research |
| o4-mini | OpenAI | 90 | 1200 | $1.10 / $4.40 | Math, code (reasoning) |
| DeepSeek R1 | DeepSeek | 38 | 1800 | $0.55 / $2.19 | Cheap reasoning |

Numbers reflect ~1K-token input, 512-token streamed output on default regions as of April 22, 2026. TTFT varies ±30% regionally; tokens/sec is stable within ±10%.
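To make the price column concrete, the following is a minimal sketch of how the input / output $/M figures translate into a per-request cost for the test shape described above (~1K input tokens, 512 output tokens). The function name `request_cost_usd` is hypothetical, and the prices are copied from three rows of the table; treat the results as rough estimates rather than billing figures.

```python
def request_cost_usd(input_price_per_m: float, output_price_per_m: float,
                     input_tokens: int = 1000, output_tokens: int = 512) -> float:
    """Cost in USD of one request, given prices in $ per million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# (input, output) prices in $/M tokens, taken from the table above.
prices = {
    "Llama 3.3 70B (Groq)": (0.59, 0.79),
    "Gemini 2.0 Flash": (0.10, 0.40),
    "Claude Sonnet 4": (3.00, 15.00),
}

for model, (inp, out) in prices.items():
    print(f"{model}: ${request_cost_usd(inp, out):.6f} per request")
```

Multiplying the per-request figure by expected daily request volume gives a quick way to compare speed tiers on cost.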

All Models — Ranked by Speed

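| Model | Provider | TTFT (ms) | Tokens/sec | Input $/M | Arena ELO |
|---|---|---|---|---|---|
| Gemini 2.0 Flash Lite | Google | 100 | 180 | $0.075 | 1200 |
| Phi-4 | Microsoft | 100 | 160 | $0.065 | 1150 |
| Gemini 2.0 Flash | Google | 120 | 160 | $0.100 | 1260 |
| GPT-4.1 Nano | OpenAI | 120 | 150 | $0.100 | 1150 |
| Grok 3-mini | xAI | 130 | 140 | $0.300 | 1175 |
| GPT-4.1 Mini | OpenAI | 140 | 120 | $0.400 | 1180 |
| Qwen 3 235B MoE | Alibaba | 150 | 100 | $0.455 | 1310 |
| Gemini Experimental 1206 | Google | 150 | 100 | $0.00 | 1300 |
| GPT-4.5 | OpenAI | 150 | 100 | $75.00 | 1290 |
| DeepSeek R1 (Groq) | Groq | 150 | 100 | $0.750 | 1290 |
| DeepSeek R1 (Together) | Together AI | 150 | 100 | $3.00 | 1290 |
| Gemini 2.0 Flash Thinking | Google | 150 | 100 | $0.00 | 1280 |
| Claude 3.5 Sonnet | Anthropic | 150 | 100 | $3.00 | 1270 |
| Gemini 2.5 Flash | Google | 150 | 100 | $0.300 | 1270 |
| o1-mini | OpenAI | 150 | 100 | $1.10 | 1270 |
| ChatGPT-4o Latest | OpenAI | 150 | 100 | $5.00 | 1265 |
| QwQ 32B | Alibaba | 150 | 100 | $0.150 | 1260 |
| GPT-4o (Aug 2024) | OpenAI | 150 | 100 | $2.50 | 1255 |
| DeepSeek R1 Distill Llama 70B | DeepSeek | 150 | 100 | $0.700 | 1250 |
| Command A | Cohere | 150 | 100 | $2.50 | 1240 |
| DeepSeek R1 Distill Qwen 32B | DeepSeek | 150 | 100 | $0.290 | 1240 |
| Llama 3.1 405B (Fireworks) | Fireworks AI | 150 | 100 | $3.00 | 1240 |
| GPT-4 Turbo | OpenAI | 150 | 100 | $10.00 | 1240 |
| Grok 2 | xAI | 150 | 100 | $2.00 | 1240 |
| Llama 3.1 405B | Meta | 150 | 100 | $3.00 | 1240 |
| Sonar Reasoning | Perplexity | 150 | 100 | $2.00 | 1240 |
| Llama 3.1 405B (Together) | Together AI | 150 | 100 | $3.50 | 1240 |
| Gemini 1.5 Pro | Google | 150 | 100 | $1.25 | 1230 |
| Grok 2 Vision | xAI | 150 | 100 | $2.00 | 1230 |
| Pixtral Large | Mistral AI | 150 | 100 | $2.00 | 1230 |
| Qwen 2.5 72B | Alibaba | 150 | 100 | $0.120 | 1230 |
| Qwen 2.5 72B (Together) | Together AI | 150 | 100 | $1.20 | 1230 |
| Amazon Nova Pro | Amazon | 150 | 100 | $0.800 | 1220 |
| Claude 3.5 Haiku | Anthropic | 150 | 100 | $0.800 | 1220 |
| Claude Haiku 4 | Anthropic | 150 | 130 | $1.00 | 1220 |
| Llama 3.3 70B (Fireworks) | Fireworks AI | 150 | 100 | $0.900 | 1220 |
| Llama 3.3 70B (Groq) | Groq | 150 | 100 | $0.590 | 1220 |
| Llama 3.3 70B | Meta | 150 | 100 | $0.120 | 1220 |
| Mistral Medium 3 | Mistral AI | 150 | 100 | $0.400 | 1220 |
| Llama 3.3 70B (Together) | Together AI | 150 | 100 | $0.880 | 1220 |
| Llama 3.2 90B Vision | Meta | 150 | 100 | $0.900 | 1210 |
| DeepSeek V2.5 | DeepSeek | 150 | 100 | $0.140 | 1200 |
| Mixtral 8x22B (Fireworks) | Fireworks AI | 150 | 100 | $0.900 | 1200 |
| Sonar Pro | Perplexity | 150 | 100 | $3.00 | 1200 |
| WizardLM-2 8x22B | Microsoft | 150 | 100 | $0.620 | 1200 |
| Llama 3.1 70B | Meta | 150 | 100 | $0.400 | 1195 |
| Phi-3.5 MoE | Microsoft | 150 | 100 | $0.170 | 1195 |
| Gemini 1.5 Flash | Google | 150 | 100 | $0.075 | 1190 |
| Gemma 2 27B | Google | 150 | 100 | $0.650 | 1190 |
| Yi-Large | 01.AI | 150 | 100 | $3.00 | 1185 |
| Amazon Nova Lite | Amazon | 150 | 100 | $0.060 | 1170 |
| Gemma 2 9B (Groq) | Groq | 150 | 100 | $0.200 | 1170 |
| Phi-3 Medium | Microsoft | 150 | 100 | $0.170 | 1170 |
| Yi-Lightning | 01.AI | 150 | 100 | $0.140 | 1165 |
| Gemma 2 9B | Google | 150 | 100 | $0.030 | 1160 |
| Mixtral 8x7B (Groq) | Groq | 150 | 100 | $0.240 | 1160 |
| Llama 3.2 11B Vision | Meta | 150 | 100 | $0.245 | 1160 |
| Phi-3.5 Mini | Microsoft | 150 | 100 | $0.130 | 1160 |
| Qwen 2.5 7B | Alibaba | 150 | 100 | $0.040 | 1160 |
| Sonar | Perplexity | 150 | 100 | $1.00 | 1160 |
| InternLM 2.5 20B | Shanghai AI Lab | 150 | 100 | $0.180 | 1155 |
| Gemini 1.5 Flash 8B | Google | 150 | 100 | $0.037 | 1150 |
| Mistral Nemo 12B | Mistral AI | 150 | 100 | $0.020 | 1140 |
| Amazon Nova Micro | Amazon | 150 | 100 | $0.035 | 1130 |
| Command R7B | Cohere | 150 | 100 | $0.038 | 1120 |
| GPT-3.5 Turbo | OpenAI | 150 | 100 | $0.500 | 1120 |
| Llama 3.1 8B (Groq) | Groq | 150 | 100 | $0.050 | 1120 |
| Llama 3.1 8B | Meta | 150 | 100 | $0.020 | 1120 |
| Mistral 7B | Mistral AI | 150 | 100 | $0.110 | 1100 |
| Mistral 7B (Together) | Together AI | 150 | 100 | $0.200 | 1100 |
| Codestral 22B | Mistral AI | 150 | 100 | $0.300 | -- |
| Qwen 2.5 Coder 32B | Alibaba | 150 | 100 | $0.660 | -- |
| GPT-4.1 | OpenAI | 160 | 85 | $2.00 | 1200 |
| Mistral Small | Mistral | 160 | 120 | $0.150 | 1185 |
| o4-mini | OpenAI | 180 | 105 | $1.10 | 1260 |
| GPT-4o Mini | OpenAI | 180 | 120 | $0.150 | 1220 |
| Grok 3 | xAI | 200 | 90 | $3.00 | 1285 |
| Llama 4 Scout | Meta | 200 | 110 | $0.080 | 1250 |
| DeepSeek V3 | DeepSeek | 220 | 85 | $0.259 | 1280 |
| GPT-4o | OpenAI | 230 | 95 | $2.50 | 1260 |
| Qwen 2.5 Max | Alibaba | 240 | 80 | $0.160 | 1260 |
| Llama 4 Maverick | Meta | 250 | 90 | $0.150 | 1290 |
| Command R | Cohere | 250 | 85 | $0.150 | 1140 |
| Mistral Large | Mistral | 280 | 75 | $0.500 | 1245 |
| Claude Sonnet 4 | Anthropic | 320 | 78 | $3.00 | 1280 |
| o3-mini | OpenAI | 350 | 25 | $1.10 | 1280 |
| Command R+ | Cohere | 350 | 65 | $2.50 | 1200 |
| Gemini 2.5 Pro | Google | 400 | 70 | $1.25 | 1430 |
| Claude Opus 4 | Anthropic | 500 | 50 | $5.00 | 1503 |
| o1 | OpenAI | 500 | 20 | $15.00 | 1310 |
| o3 | OpenAI | 600 | 15 | $2.00 | 1340 |
| DeepSeek R1 | DeepSeek | 1,800 | 45 | $0.500 | 1310 |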

The Right Speed Tier for Your Workload

Best for Chatbots

Gemini 2.0 Flash

Sub-200ms TTFT, 230 tok/s throughput, and frontier-lite quality at $0.10/$0.40 per million input/output tokens. GPT-4o-mini is a close alternative on the OpenAI stack (260ms TTFT, 180 tok/s).

Best for Batch

Groq Llama 3.3 70B

~500 tok/s streaming shreds through backfills and bulk extraction. For proprietary models, use OpenAI/Anthropic/Google batch endpoints (50% off, 24h SLA).

Best for Agents

GPT-4o-mini or Claude Haiku 4

Agents make many small tool calls — low TTFT on every hop matters more than peak throughput. Avoid reasoning models (o3, R1) inside tight agent loops.
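A back-of-the-envelope sketch shows why per-hop TTFT dominates in agent loops. It assumes each hop costs roughly TTFT plus output tokens divided by throughput, and that hops run sequentially. The hop count (8) and tokens per hop (60) are illustrative assumptions, the function names are hypothetical, and the TTFT and throughput figures come from the published-figures table. Tool execution time and hidden reasoning tokens are ignored, so real reasoning-model loops would be slower than this estimate.

```python
def hop_latency_s(ttft_ms: float, tokens_per_sec: float, output_tokens: int) -> float:
    """Estimated seconds for one model call: wait for the first token, then generate."""
    return ttft_ms / 1000.0 + output_tokens / tokens_per_sec


def loop_latency_s(ttft_ms: float, tokens_per_sec: float,
                   output_tokens: int, hops: int) -> float:
    """Estimated model time for a sequential agent loop of `hops` calls."""
    return hops * hop_latency_s(ttft_ms, tokens_per_sec, output_tokens)


HOPS = 8             # assumed number of tool-call hops
TOKENS_PER_HOP = 60  # assumed short tool-call output

# (TTFT in ms, tokens/sec) from the published-figures table.
models = {
    "GPT-4o-mini": (260, 180),
    "Claude Haiku 4": (290, 150),
    "DeepSeek R1": (1800, 38),
}

for name, (ttft_ms, tps) in models.items():
    total = loop_latency_s(ttft_ms, tps, TOKENS_PER_HOP, HOPS)
    print(f"{name}: ~{total:.1f}s of model time across {HOPS} hops")
```

With short outputs, the TTFT term is a large share of every hop, which is why a high-TTFT model compounds badly over many hops.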

Frequently Asked Questions

What is the fastest LLM in 2026?
The fastest LLM in 2026 by raw throughput is Llama 3.3 70B on Groq at ~500 tokens/sec, followed by Cerebras Llama 3.1 8B at ~450 tok/s. For frontier-quality models, Gemini 2.0 Flash leads at ~230 tok/s with GPT-4o-mini close behind at ~180 tok/s. By time to first token (TTFT), the Groq and Cerebras Llama deployments respond in under 100ms, while Gemini 2.0 Flash Lite and GPT-4.1 Nano top the full ranking at roughly 100-120ms.
Why is Groq so much faster than OpenAI or Anthropic?
Groq uses custom LPU (Language Processing Unit) hardware purpose-built for sequential token generation, while OpenAI and Anthropic run on GPUs optimized for parallel training. Groq's architecture eliminates most memory-bandwidth bottlenecks, so it can stream 5-10x more tokens per second than GPU-hosted models of the same size. This applies only to open-weight models (Llama, Mixtral, DeepSeek distills), since Groq can't host proprietary weights.
What is TTFT (Time to First Token)?
TTFT is how many milliseconds pass between sending your API request and receiving the first streamed token back. It's the latency users actually feel — lower TTFT means chat interfaces feel instant. TTFT is driven by model load time, prompt length, and provider infrastructure, and is separate from throughput (tokens/sec) which measures generation speed after the first token arrives.
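The two metrics can be separated with a small timing wrapper around any token stream. The sketch below is provider-agnostic: `measure_stream` and `fake_stream` are hypothetical helpers, and the simulated delays are arbitrary. In practice you would pass the iterator returned by your provider's streaming client instead of the fake stream.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float, int]:
    """Consume a token stream and return (ttft_seconds, tokens_per_sec, token_count).

    tokens_per_sec covers generation after the first token arrives,
    so it is independent of TTFT.
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    generation_time = last - first
    tps = (count - 1) / generation_time if generation_time > 0 else 0.0
    return first - start, tps, count


def fake_stream(n: int = 20, first_delay: float = 0.05,
                gap: float = 0.005) -> Iterator[str]:
    """Simulated stream: one initial delay, then evenly spaced tokens."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(gap)
        yield f"tok{i}"


ttft, tps, n = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, throughput: {tps:.0f} tok/s over {n} tokens")
```

Start the clock immediately before sending the request, so that network round-trip and queueing time are included in TTFT.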
Does LLM speed affect response quality?
Speed and quality are loosely correlated but not causally linked. Smaller distilled models (GPT-4.1 Nano, Gemini Flash Lite, Haiku) are faster because they have fewer parameters, which generally means weaker reasoning. But hardware matters too: Groq-hosted Llama 3.3 70B is ~6x faster than the same model on standard GPU providers with no quality difference. Pick speed tier based on task: simple extraction tolerates fast/small, complex reasoning needs slower frontier models.
Which LLM is fastest for chatbots?
For real-time chatbots, Gemini 2.0 Flash and GPT-4o-mini are the sweet spot: TTFT of roughly 180-260ms, throughput of roughly 180-230 tok/s, and frontier-lite quality. If you need absolutely minimal latency and an open-weight model fits your use case, Groq Llama 3.3 70B delivers sub-100ms TTFT with ~500 tok/s streaming.
Which LLM is fastest for batch processing?
For batch workloads, throughput (tokens/sec) matters more than TTFT. Gemini 2.0 Flash at ~230 tok/s and Groq-hosted Llama at ~500 tok/s are the top picks. Also consider batch APIs: OpenAI, Anthropic, and Google all offer 50% discounts on batch endpoints with 24-hour SLAs, which often beats real-time endpoints on $/token.
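The throughput-versus-discount trade-off can be estimated with simple arithmetic. The sketch below assumes a hypothetical job of 10 million output tokens generated over a single sequential stream (real jobs would run many requests concurrently), counts output tokens only, and uses the GPT-4o-mini output price from the published-figures table as an example. The function names are illustrative.

```python
def serial_generation_hours(total_output_tokens: int, tokens_per_sec: float) -> float:
    """Hours to generate all output tokens over one sequential stream."""
    return total_output_tokens / tokens_per_sec / 3600.0


def output_cost_usd(total_output_tokens: int, price_per_m: float,
                    discount: float = 0.0) -> float:
    """Output-token cost in USD, with an optional fractional discount."""
    return total_output_tokens / 1_000_000 * price_per_m * (1.0 - discount)


JOB_TOKENS = 10_000_000  # assumed job size, for illustration only

for name, tps in {"Groq Llama 3.3 70B": 500, "Gemini 2.0 Flash": 230}.items():
    hours = serial_generation_hours(JOB_TOKENS, tps)
    print(f"{name}: ~{hours:.1f} h on one stream")

# GPT-4o-mini output price ($0.60/M), real-time versus a 50% batch discount.
print(f"real-time: ${output_cost_usd(JOB_TOKENS, 0.60):.2f}")
print(f"batch:     ${output_cost_usd(JOB_TOKENS, 0.60, discount=0.5):.2f}")
```

If the job can wait for the batch window, the discount applies regardless of how fast the real-time endpoint streams.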
Which LLM is fastest for agents?
Agents benefit from low TTFT on every tool-call hop. GPT-4o-mini and Claude Haiku 4 are the preferred choices — both deliver sub-300ms TTFT with strong function-calling reliability. Avoid reasoning models (o3, DeepSeek R1) in agent loops unless the task truly requires deep thinking, because their hidden chain-of-thought adds seconds per call.
Does streaming improve perceived speed?
Yes. Streaming cuts perceived latency sharply because users see output arrive progressively rather than waiting for the full response: the wait they notice drops from the full generation time to roughly the TTFT. A response that takes 4 seconds to generate in full can feel like 400ms when streamed. Every major provider supports SSE streaming; enable it wherever you can tolerate partial output.
