Evaluation

ROUGE Score

Quick Answer

A metric measuring recall of n-grams between model output and reference summaries.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures summarization quality. ROUGE-1/2/L measure unigram/bigram/longest-common-subsequence overlap. ROUGE is standard for summarization but has limitations. ROUGE might penalize reasonable paraphrases. Better metrics exist but ROUGE remains standard. ROUGE is less relevant for open-ended generation.

Last verified: 2026-04-08

Compare models

See how different LLMs compare on benchmarks, pricing, and speed.

Browse all models →