Evaluation
SWE-Bench
Quick Answer
Software Engineering Benchmark: evaluating models on real GitHub issues and the pull requests that resolved them.
SWE-Bench evaluates models on 2,294 real GitHub issues drawn from open-source repositories. Given an issue, the model must write code that fixes the bug or implements the feature, and evaluation is end-to-end: the generated patch is applied to the repository and its test suite is run. This makes SWE-Bench far more realistic than HumanEval, and far harder; pass rates range from roughly 1-5% for open models to about 40% for the strongest systems. As a newer, more demanding benchmark that mirrors real engineering work, SWE-Bench has become a key driver of coding capability.
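To make the setup concrete, here is a minimal sketch that loads SWE-Bench tasks with the Hugging Face `datasets` library and inspects the fields a model is given. The dataset name `princeton-nlp/SWE-bench` and the field names reflect the public release at the time of writing and are assumptions, not a stable API.

```python
# Sketch: inspect SWE-Bench task instances from the public Hugging Face
# release. Dataset and field names are assumptions based on the
# princeton-nlp/SWE-bench release and may change over time.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
task = ds[0]

# Each task pairs a real GitHub issue with the repository state it was
# filed against; the model must produce a patch that the harness applies
# before rerunning the repository's tests.
print(task["repo"])               # source repository, e.g. "django/django"
print(task["instance_id"])        # unique identifier for this task
print(task["problem_statement"])  # the issue text the model sees
print(task["FAIL_TO_PASS"])       # tests the patch must turn from failing to passing
```

A model's output is scored by whether the listed failing tests pass after its patch is applied, while previously passing tests must continue to pass, which is what makes the evaluation end-to-end rather than string-matching.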
Last verified: 2026-04-08