Architecture

Grouped-Query Attention

Quick Answer

An attention variant in which multiple query heads share a smaller set of key and value heads, reducing KV cache memory.

Grouped-query attention (GQA) reduces KV cache memory by letting multiple query heads share a smaller set of key and value heads. In standard multi-head attention, every query head has its own key and value head; GQA instead groups query heads, for example 8 query heads per KV head, so a model with 32 query heads needs only 4 KV heads. Because KV cache size is proportional to the number of KV heads, this cuts cache memory by the group factor (8× in that example) with only a modest impact on quality. GQA is increasingly standard in modern efficient models and is particularly valuable for long-context and high-throughput serving, where the KV cache dominates memory use.
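The grouping mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation; the shapes, function names, and the choice to assign each KV head to a block of consecutive query heads are assumptions for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """GQA for one sequence (no batching, no causal mask).

    q: (n_q_heads, seq_len, head_dim)
    k, v: (n_kv_heads, seq_len, head_dim), with n_kv_heads <= n_q_heads
    """
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = n_q_heads // n_kv_heads

    # Broadcast each KV head to the `group_size` query heads that share it.
    # Only the original (n_kv_heads, seq_len, head_dim) tensors would live
    # in the KV cache; the repeat happens at compute time.
    k = np.repeat(k, group_size, axis=0)
    v = np.repeat(v, group_size, axis=0)

    head_dim = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    return softmax(scores) @ v
```

With `n_kv_heads == n_q_heads` this reduces to standard multi-head attention; with `n_kv_heads == 1` it is multi-query attention. The memory saving comes from caching only the un-repeated K and V tensors.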

Last verified: 2026-04-08
