KV Cache
Stored attention keys and values for past tokens. Grows linearly with context length.
On every decoding step, a transformer attends to all previous tokens. To avoid recomputing their keys and values from scratch, the engine caches them. The KV cache size is `2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_element` (the factor of 2 covers keys and values). At long contexts on big models, the KV cache can rival the memory used by the weights themselves. Quantizing it from FP16/BF16 down to FP8 or INT8 halves its footprint at a small quality cost.
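To make the formula concrete, here is a minimal sketch that plugs in numbers. The configuration (80 layers, 8 KV heads of dimension 128, 32K context, batch of 4) is a hypothetical 70B-class model with grouped-query attention, not something stated above.

```python
# Minimal sketch: estimate KV cache size from model shape parameters.
# The specific config values below are illustrative assumptions, not from the text.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_element: int = 2) -> int:
    """2 * layers * kv_heads * head_dim * seq_len * batch * bytes (keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_element

if __name__ == "__main__":
    # Hypothetical 70B-class model: 80 layers, 8 KV heads of dim 128,
    # 32K-token context, batch of 4, FP16 cache (2 bytes per element).
    size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                          seq_len=32_768, batch=4, bytes_per_element=2)
    print(f"KV cache: {size / 2**30:.1f} GiB")  # 40.0 GiB at these settings
```

Under these assumptions the cache comes to 40 GiB at FP16; setting `bytes_per_element=1` (FP8 or INT8) drops it to 20 GiB, which is the halving mentioned above.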