# LLM Benchmark Comparison 2026
Benchmark scores for 25 large language models, sorted by Arena ELO in descending order. The table below reports Arena ELO, Coding ELO, and MMLU for each model.
## Benchmark Rankings
| Model | Arena ELO | Coding ELO | MMLU (%) |
|---|---:|---:|---:|
| o4-mini | 1350 | 1380 | 91 |
| Gemini 2.5 Pro | 1340 | 1350 | 92 |
| Claude Opus 4 | 1330 | 1360 | 91.5 |
| DeepSeek R1 | 1310 | 1330 | 89 |
| o3-mini | 1310 | 1340 | 87.3 |
| Grok 3 | 1300 | 1290 | 89 |
| GPT-4.1 | 1290 | 1320 | 90.2 |
| Llama 4 Maverick | 1290 | 1280 | 88 |
| Claude Sonnet 4 | 1280 | 1305 | 88.7 |
| DeepSeek V3 | 1280 | 1300 | 87.5 |
| Gemini 2.0 Flash | 1260 | 1240 | 85.5 |
| GPT-4o | 1260 | 1265 | 88.7 |
| Qwen 2.5 Max | 1260 | 1250 | 86 |
| Llama 4 Scout | 1250 | 1230 | 85 |
| Mistral Large | 1245 | 1240 | 86.5 |
| GPT-4.1 Mini | 1240 | 1230 | 84.5 |
| Claude Haiku 4 | 1220 | 1195 | 83 |
| GPT-4o Mini | 1220 | 1200 | 82 |
| Grok 3 Mini | 1220 | 1200 | 82 |
| Command R+ | 1200 | 1160 | 82 |
| Gemini 2.0 Flash Lite | 1200 | 1170 | 80 |
| Mistral Small | 1185 | 1160 | 79 |
| GPT-4.1 Nano | 1180 | 1150 | 78.5 |
| Phi-4 | 1150 | 1130 | 80.5 |
| Command R | 1140 | 1100 | 75.5 |
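The rankings above can be re-sorted by any benchmark column programmatically. A minimal sketch, using a small subset of rows copied from the table (the column names `arena_elo`, `coding_elo`, and `mmlu` are illustrative labels, not part of any published API):

```python
# Re-sort a slice of the leaderboard by a chosen benchmark column.
# Rows are copied verbatim from the table above (subset only).
models = [
    # (model, arena_elo, coding_elo, mmlu)
    ("o4-mini",        1350, 1380, 91.0),
    ("Gemini 2.5 Pro", 1340, 1350, 92.0),
    ("Claude Opus 4",  1330, 1360, 91.5),
    ("DeepSeek R1",    1310, 1330, 89.0),
    ("GPT-4.1",        1290, 1320, 90.2),
]

# Map each column name to its tuple index.
COLUMNS = {"arena_elo": 1, "coding_elo": 2, "mmlu": 3}

def rank_by(rows, column):
    """Return rows sorted descending by the named benchmark column."""
    idx = COLUMNS[column]
    return sorted(rows, key=lambda row: row[idx], reverse=True)

# Example: rank the subset by Coding ELO instead of Arena ELO.
for name, *scores in rank_by(models, "coding_elo"):
    print(name, scores)
```

Sorting by `coding_elo` promotes Claude Opus 4 above Gemini 2.5 Pro, illustrating that the column chosen for ranking changes the ordering.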