Evaluation
SWE-Bench
Quick Answer
Software Engineering Benchmark: evaluating models on real GitHub issues and the pull requests that resolved them.
SWE-Bench evaluates models on 2,294 real GitHub issues drawn from open-source repositories. Given an issue, the model must write code that fixes the bug or implements the feature, and evaluation is end-to-end: the generated patch is applied to the repository and its test suite is run. This makes SWE-Bench far more realistic than HumanEval, and far harder; pass rates range from roughly 1-5% for open models to about 40% for the strongest systems. As a newer, more demanding benchmark that mirrors real engineering work, SWE-Bench has become a key driver of coding capability.
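To make the setup concrete, here is a minimal sketch that loads SWE-Bench tasks with the Hugging Face `datasets` library and inspects the fields a model is given. The dataset name `princeton-nlp/SWE-bench` and the field names reflect the public release at the time of writing and are assumptions, not a stable API.

```python
# Sketch: inspect SWE-Bench task instances from the public Hugging Face
# release. Dataset and field names are assumptions based on the
# princeton-nlp/SWE-bench release and may change over time.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
task = ds[0]

# Each task pairs a real GitHub issue with the repository state it was
# filed against; the model must produce a patch that the harness applies
# before rerunning the repository's tests.
print(task["repo"])               # source repository, e.g. "django/django"
print(task["instance_id"])        # unique identifier for this task
print(task["problem_statement"])  # the issue text the model sees
print(task["FAIL_TO_PASS"])       # tests the patch must turn from failing to passing
```

A model's output is scored by whether the listed failing tests pass after its patch is applied, while previously passing tests must continue to pass, which is what makes the evaluation end-to-end rather than string-matching.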
Last verified: 2026-04-08