Evaluation

GPQA

Quick Answer

Graduate-Level Google-Proof Q&A: a benchmark of graduate-level multiple-choice science questions written by domain experts in biology, physics, and chemistry.

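GPQA consists of 448 multiple-choice questions in biology, physics, and chemistry, written by domain experts who have or are pursuing PhDs. It is extremely difficult: in the original paper, experts in the relevant field reached about 65% accuracy, while highly skilled non-experts with unrestricted web access reached about 34%, which is what "Google-proof" refers to. The strongest GPT-4 baseline in that paper scored about 39%, and later models have scored higher. Because it tests deep specialized knowledge rather than lookup, GPQA is useful for evaluating advanced reasoning and for showing where models fall short of expert-level performance.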
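GPQA scores are plain accuracy over multiple-choice questions. Each question has four answer options, so random guessing yields 25%, and reported percentages are easier to read against that floor. The sketch below shows one way to score a run. The `Item` class, the `accuracy` function, and the toy answers are hypothetical illustrations, not part of any official GPQA tooling or data.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One scored question (hypothetical structure, for illustration only)."""
    gold: str       # correct option letter, e.g. "B"
    predicted: str  # option letter the model chose


def accuracy(items):
    """Fraction of items where the predicted letter matches the gold letter."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(
        1 for it in items
        if it.predicted.strip().upper() == it.gold.strip().upper()
    )
    return correct / len(items)


# Four answer options per question, so uniform guessing gets 1 in 4 right.
CHANCE = 1 / 4

# Toy answers, invented for this example; not real GPQA questions or results.
toy = [Item("A", "A"), Item("B", "c"), Item("C", "C"), Item("D", "d")]

acc = accuracy(toy)
print(f"accuracy={acc:.2%}, lift over chance={acc - CHANCE:+.2%}")
```

Comparison is case-insensitive here so that "d" and "D" count as the same choice. A real harness would also need to handle unparseable or missing answers, which this sketch does not.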

Last verified: 2026-04-08
