Evaluation
GPQA
Quick Answer
Graduate-Level Google-Proof Q&A: a benchmark of graduate-level, multiple-choice science questions written by domain experts.
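GPQA consists of 448 multiple-choice questions in biology, physics, and chemistry, written by domain experts who hold or are pursuing PhDs. The questions are extremely difficult: in the original paper, experts in the relevant field reached about 65% accuracy, while skilled non-experts with unrestricted web access reached only about 34%, which is what "Google-proof" refers to. The strongest GPT-4-based baseline reported at release scored about 39%, and later models have reported higher scores. Because the questions demand deep, specialized knowledge, GPQA is useful for evaluating advanced reasoning and for showing where models still fall short of expert performance.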
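GPQA questions are multiple choice with four options, so a score only means something relative to the 25% expected from random guessing. The sketch below shows one way such a benchmark could be scored. The Item class, the accuracy and chance_accuracy helpers, and the toy items are all hypothetical illustrations, not the official GPQA harness or real GPQA questions.

```python
from dataclasses import dataclass


@dataclass
class Item:
    question: str
    choices: list[str]  # answer options; GPQA uses four per question
    answer: int         # index of the correct option


def accuracy(items, predict):
    """Fraction of items where predict(item) returns the correct option index."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)


def chance_accuracy(items):
    """Expected accuracy of uniform random guessing over each item's options."""
    if not items:
        raise ValueError("no items to score")
    return sum(1 / len(item.choices) for item in items) / len(items)


# Hypothetical toy items standing in for real questions.
items = [
    Item("toy question 1", ["a", "b", "c", "d"], 2),
    Item("toy question 2", ["a", "b", "c", "d"], 0),
    Item("toy question 3", ["a", "b", "c", "d"], 1),
    Item("toy question 4", ["a", "b", "c", "d"], 3),
]


def always_first(item):
    """Trivial baseline 'model' that always picks the first option."""
    return 0


print(accuracy(items, always_first))  # 0.25
print(chance_accuracy(items))         # 0.25
```

Reporting a model's accuracy next to the chance baseline makes it clear how much of a score reflects actual knowledge: a result near 25% on four-option questions is indistinguishable from guessing.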
Last verified: 2026-04-08