HumanEval

Quick Answer

A benchmark that measures code generation ability by checking the functional correctness of solutions to programming tasks.

HumanEval is the de facto standard coding benchmark: 164 hand-written Python programming problems, each graded pass/fail by executing the generated code against unit tests. Because grading is execution-based rather than text-based, it objectively measures actual coding ability, and it has become a focal point of competition between coding models, with pass rates ranging from near 0% for non-coding models to 92%+ for top models. HumanEval has real limitations: its problems are short, self-contained functions rather than production code. Even so, it remains the standard yardstick for comparing models, and harder variants such as HumanEval+, which adds far more test cases per problem, address some of its weaknesses.
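
To make the pass/fail grading concrete, here is a minimal sketch of an execution-based check. The field names (prompt, test, entry_point) mirror the published HumanEval dataset format, but the passes helper and the toy problem are illustrative; the real harness also sandboxes the candidate code in an isolated process with a timeout.

```python
# Minimal sketch of HumanEval-style execution grading.
# WARNING: exec-ing model output is unsafe; the official harness
# isolates it in a separate process. This is for illustration only.

def passes(problem: dict, completion: str) -> bool:
    """Return True if the model's completion passes the unit tests."""
    program = (
        problem["prompt"]         # function signature + docstring
        + completion              # model-generated function body
        + "\n" + problem["test"]  # defines check(candidate) with asserts
        + f"\ncheck({problem['entry_point']})"
    )
    try:
        exec(program, {})         # any failed assert raises -> fail
        return True
    except Exception:
        return False

# Toy problem in the HumanEval format (hypothetical example).
problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": "def check(candidate):\n"
            "    assert candidate(2, 3) == 5\n"
            "    assert candidate(-1, 1) == 0\n",
    "entry_point": "add",
}
print(passes(problem, "    return a + b\n"))  # True  (correct)
print(passes(problem, "    return a - b\n"))  # False (fails a test)
```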
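
Scores are conventionally reported as pass@k: generate n samples per problem, count the c that pass, and estimate the probability that at least one of k randomly drawn samples passes. The sketch below uses the unbiased estimator from the original Codex paper (Chen et al., 2021) that introduced HumanEval; the sample numbers are made up.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), computed as a
    stable running product to avoid huge binomial coefficients."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: some draw must pass
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# e.g. 200 samples on one problem, 140 of which pass the tests:
print(pass_at_k(200, 140, 1))   # 0.7  -- pass@1 equals the raw pass rate
print(pass_at_k(200, 140, 10))  # ~1.0 -- near certain with 10 tries
```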

Last verified: 2026-04-08
