KV Cache
Stored attention keys and values for past tokens. Grows linearly with context length.
On every decoding step, a transformer attends to all previous tokens. To avoid recomputing their keys and values from scratch, the engine caches them. The KV cache size is `2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_element` (the factor of 2 covers keys and values). At long contexts on big models, the KV cache can rival the memory used by the weights themselves. Quantizing it from FP16/BF16 down to FP8 or INT8 halves its footprint at a small quality cost.
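To make the formula concrete, here is a minimal sketch that plugs in numbers. The configuration (80 layers, 8 KV heads of dimension 128, 32K context, batch of 4) is a hypothetical 70B-class model with grouped-query attention, not something stated above.

```python
# Minimal sketch: estimate KV cache size from model shape parameters.
# The specific config values below are illustrative assumptions, not from the text.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_element: int = 2) -> int:
    """2 * layers * kv_heads * head_dim * seq_len * batch * bytes (keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_element

if __name__ == "__main__":
    # Hypothetical 70B-class model: 80 layers, 8 KV heads of dim 128,
    # 32K-token context, batch of 4, FP16 cache (2 bytes per element).
    size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                          seq_len=32_768, batch=4, bytes_per_element=2)
    print(f"KV cache: {size / 2**30:.1f} GiB")  # 40.0 GiB at these settings
```

Under these assumptions the cache comes to 40 GiB at FP16; setting `bytes_per_element=1` (FP8 or INT8) drops it to 20 GiB, which is the halving mentioned above.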