LLM Benchmark Comparison 2026

Compare benchmark scores across 25 large language models, ranked by Arena ELO with Coding ELO and MMLU scores alongside.

Benchmark Rankings

Model                   Arena ELO   Coding ELO   MMLU (%)
o4-mini                 1350        1380         91
Gemini 2.5 Pro          1340        1350         92
Claude Opus 4           1330        1360         91.5
DeepSeek R1             1310        1330         89
o3-mini                 1310        1340         87.3
Grok 3                  1300        1290         89
GPT-4.1                 1290        1320         90.2
Llama 4 Maverick        1290        1280         88
Claude Sonnet 4         1280        1305         88.7
DeepSeek V3             1280        1300         87.5
Gemini 2.0 Flash        1260        1240         85.5
GPT-4o                  1260        1265         88.7
Qwen 2.5 Max            1260        1250         86
Llama 4 Scout           1250        1230         85
Mistral Large           1245        1240         86.5
GPT-4.1 Mini            1240        1230         84.5
Claude Haiku 4          1220        1195         83
GPT-4o Mini             1220        1200         82
Grok 3 Mini             1220        1200         82
Command R+              1200        1160         82
Gemini 2.0 Flash Lite   1200        1170         80
Mistral Small           1185        1160         79
GPT-4.1 Nano            1180        1150         78.5
Phi-4                   1150        1130         80.5
Command R               1140        1100         75.5
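
The column-sorting behavior described above can be sketched in a few lines of Python. This is a minimal, hypothetical example (not the page's actual implementation); the three rows are taken from the table, and the `rank_by` helper name and column keys are assumptions for illustration.

```python
# A subset of the table as a list of dicts; keys mirror the table headers.
rows = [
    {"model": "o4-mini",        "arena_elo": 1350, "coding_elo": 1380, "mmlu": 91.0},
    {"model": "Gemini 2.5 Pro", "arena_elo": 1340, "coding_elo": 1350, "mmlu": 92.0},
    {"model": "Claude Opus 4",  "arena_elo": 1330, "coding_elo": 1360, "mmlu": 91.5},
]

def rank_by(rows, column):
    """Return model names ordered best-first on the given benchmark column."""
    return [r["model"] for r in sorted(rows, key=lambda r: r[column], reverse=True)]

print(rank_by(rows, "mmlu"))        # Gemini 2.5 Pro leads on MMLU
print(rank_by(rows, "coding_elo"))  # o4-mini leads on Coding ELO
```

Switching the `column` argument reproduces the effect of clicking a different header: the same rows, re-ranked by a different benchmark.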