Evaluation
HellaSwag
Quick Answer
A commonsense reasoning benchmark in which a model picks the most plausible continuation of a short scene description.
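HellaSwag contains roughly 70K multiple-choice questions. Each gives a short context, drawn from ActivityNet video captions or WikiHow articles, and four candidate endings; the task is to pick the ending that most plausibly happens next. The wrong endings are machine-generated and adversarially filtered, so they read fluently but defy common sense. The questions are easy for humans (about 95% accuracy) but were hard for the models available when the benchmark was released in 2019. Modern large language models score roughly 88% or higher. Because it targets everyday plausibility rather than factual recall, HellaSwag complements other benchmarks and remains a standard part of evaluation suites.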
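A minimal sketch of how accuracy on a multiple-choice benchmark like this can be computed: score every candidate ending against its context, pick the highest-scoring one, and count matches with the gold label. The field names (`ctx`, `endings`, `label`), the two items, and the word-overlap scorer are all made up for illustration; a real evaluation would typically replace `overlap_score` with a language model's (often length-normalized) log-likelihood of each ending.

```python
from typing import Callable, Sequence


def predict(context: str, endings: Sequence[str],
            score_fn: Callable[[str, str], float]) -> int:
    """Return the index of the highest-scoring ending (first one wins ties)."""
    scores = [score_fn(context, ending) for ending in endings]
    return max(range(len(endings)), key=scores.__getitem__)


def accuracy(items: Sequence[dict],
             score_fn: Callable[[str, str], float]) -> float:
    """Fraction of items whose predicted ending matches the gold label."""
    correct = sum(
        predict(item["ctx"], item["endings"], score_fn) == item["label"]
        for item in items
    )
    return correct / len(items)


def overlap_score(context: str, ending: str) -> float:
    """Toy stand-in scorer: share of ending words that also occur in the context."""
    ctx_words = set(context.lower().split())
    end_words = ending.lower().split()
    return sum(word in ctx_words for word in end_words) / max(len(end_words), 1)


# Hypothetical items written for this sketch, not taken from the dataset.
ITEMS = [
    {
        "ctx": "A woman fills a bucket with water and soap",
        "endings": [
            "she washes the dog with the soap and water",
            "he flies a kite over the ocean",
            "the car drives into a garage",
            "they sing loudly on stage",
        ],
        "label": 0,
    },
    {
        "ctx": "A man puts a pan on the stove",
        "endings": [
            "he cracks eggs into it",
            "a man puts a pan on a man",
            "the stove sings a song",
            "he swims away",
        ],
        "label": 0,
    },
]

if __name__ == "__main__":
    print(accuracy(ITEMS, overlap_score))
```

The second item is answered incorrectly by the toy scorer: the distractor that simply reuses words from the context outscores the sensible ending. Surface cues of this kind are what a commonsense benchmark is designed not to reward.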
Last verified: 2026-04-08