Evaluation

GSM8K

Quick Answer

Grade School Math 8K: a benchmark of roughly 8,500 grade school math word problems.

GSM8K contains roughly 8,500 grade school math word problems, each needing a few steps of basic arithmetic to solve. Evaluation is exact match on the final numeric answer, so the benchmark tests multi-step reasoning and arithmetic rather than recalled knowledge. Scores depend heavily on prompting: chain-of-thought prompting raises accuracy significantly over direct answering, and accuracy varies widely across models. Recent models reach 90%+ with chain-of-thought. GSM8K is a standard benchmark for evaluating math reasoning.
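Exact-match scoring can be sketched in a few lines of Python. This is a minimal illustration, not an official scoring script. It relies on GSM8K reference solutions ending with a line of the form "#### <answer>", and it assumes that the last number in a model's output is its final answer, which is a common heuristic but not the only one. The function names and the sample strings are hypothetical and are not taken from the dataset.

```python
import re
from decimal import Decimal, InvalidOperation

# Integers or decimals, optionally signed, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize(number_text):
    """Convert text like '1,250.00' to a Decimal; return None if unparseable."""
    cleaned = number_text.replace(",", "").replace("$", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def reference_answer(solution):
    """Read the final answer that follows the '####' marker in a reference solution."""
    return normalize(solution.split("####")[-1])


def predicted_answer(model_output):
    """Assumption: the last number in the model output is its final answer."""
    matches = _NUMBER.findall(model_output)
    return normalize(matches[-1]) if matches else None


def exact_match_accuracy(pairs):
    """Score (model_output, reference_solution) string pairs by exact match."""
    pairs = list(pairs)
    correct = 0
    for output, solution in pairs:
        predicted = predicted_answer(output)
        if predicted is not None and predicted == reference_answer(solution):
            correct += 1
    return correct / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    # Made-up example strings for illustration only.
    samples = [
        ("3 + 4 = 7, so the answer is 7.", "3 + 4 = 7\n#### 7"),
        ("I think the total is 8.", "3 + 4 = 7\n#### 7"),
    ]
    print(exact_match_accuracy(samples))
```

Comparing parsed Decimal values rather than raw strings means "7", "7.0", and "1,250.00" versus "1250" are not penalized for formatting alone. A chain-of-thought response is scored the same way, since only the final number is compared.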

Last verified: 2026-04-08
