Architecture
Key-Value Cache
Quick Answer
A cache of the key and value vectors computed for previous tokens, reused at each decoding step to speed up inference.
During autoregressive generation, recomputing the keys and values of all previous tokens for each new token is expensive. The KV cache stores the key and value vectors already computed for earlier tokens so they can be reused: each step computes only the new token's query, key, and value, and the query attends to the cached keys and values. This reduces the per-token cost of generation from O(n²) to O(n), making the KV cache essential for practical inference. Its memory footprint grows linearly with sequence length (and with batch size), so managing cache memory matters for long sequences and high-throughput serving. Some models use multi-query or grouped-query attention, which share key/value projections across heads, to shrink the cache.
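The mechanism above can be sketched for a single attention head. This is a minimal NumPy illustration, not any model's real implementation: the projection matrices and token embeddings are random placeholders, and the loop appends each token's key and value to a growing cache while only the new query attends over it. The final check confirms the cached computation matches full recomputation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 8  # head dimension (arbitrary for this sketch)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # scaled dot-product attention for one query over cached keys/values
    scores = q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

# hypothetical token embeddings standing in for a decoded sequence
tokens = rng.standard_normal((5, d))

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outputs = []
for x in tokens:
    # per step: compute only this token's q, k, v; reuse everything cached
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    outputs.append(attend(q, K_cache, V_cache))

# reference: recompute all keys/values from scratch for the last position
K_full, V_full = tokens @ Wk, tokens @ Wv
ref = attend(tokens[-1] @ Wq, K_full, V_full)
assert np.allclose(outputs[-1], ref)
```

Note the trade-off this makes explicit: the cache holds one key and one value vector per token per layer per head, which is exactly the memory that multi-query and grouped-query attention reduce by sharing key/value heads.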
Last verified: 2026-04-08