Evaluation

Benchmark

Quick Answer

A standardized test dataset and evaluation protocol used to compare performance across models.

A benchmark is a standard dataset plus an evaluation protocol for comparing models fairly. Good benchmarks are diverse, challenging, and reproducible. Well-known examples include MMLU (knowledge), HumanEval (coding), and GSM8K (math), among many others. Benchmarks drive progress but have limitations: models can overfit to published benchmarks, and each benchmark measures only a specific capability, so no single one captures overall quality. A suite of benchmarks gives a more complete picture than any single score, though creating benchmarks that are truly representative remains challenging.
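The core of most benchmark protocols is simple: run the model on every item in a fixed dataset and score its outputs against reference answers. Below is a minimal sketch of exact-match accuracy scoring, the metric used by benchmarks like GSM8K; the `model` function here is a hypothetical stand-in for a real LLM call, answering a toy arithmetic dataset.

```python
def model(question: str) -> str:
    # Hypothetical model: solves a toy addition benchmark.
    # In practice this would be an API call to an actual LLM.
    a, b = (int(x) for x in question.split(" + "))
    return str(a + b)

# A benchmark pairs fixed inputs with reference answers.
benchmark = [
    ("2 + 2", "4"),
    ("10 + 5", "15"),
    ("7 + 8", "15"),
]

def exact_match_accuracy(dataset) -> float:
    """Fraction of items where the model output exactly matches the reference."""
    correct = sum(model(q).strip() == ref for q, ref in dataset)
    return correct / len(dataset)

print(exact_match_accuracy(benchmark))  # → 1.0
```

Because the dataset and scoring rule are fixed, any two models evaluated this way produce directly comparable numbers, which is what makes a benchmark a benchmark.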

Last verified: 2026-04-08
