Why Your Long-Context LLM Runs Out of Memory
The KV cache is the line item nobody mentions until you set context to 128k and the GPU explodes. Here is what it is and how to size it.
If you have ever set a long context and watched your GPU OOM the first time you actually used it, the culprit was almost certainly the KV cache.
What it is
A transformer attends to every previous token. To avoid recomputing keys and values from scratch on every decode step, the inference engine stores them. That is the KV cache: one key and value block per layer, per token.
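To make that concrete, here is a toy sketch of the bookkeeping. This is not any real engine's code (production engines use preallocated, paged buffers); the shapes and config numbers are purely illustrative.

```
import torch

num_layers, num_kv_heads, head_dim = 4, 8, 128  # toy config
cache = [
    {"k": torch.empty(1, num_kv_heads, 0, head_dim),   # [batch, kv_heads, seq, head_dim]
     "v": torch.empty(1, num_kv_heads, 0, head_dim)}
    for _ in range(num_layers)
]

def append_token(cache, layer, k_new, v_new):
    """Store this step's key/value so future steps can attend to them."""
    cache[layer]["k"] = torch.cat([cache[layer]["k"], k_new], dim=2)
    cache[layer]["v"] = torch.cat([cache[layer]["v"], v_new], dim=2)

# Each decode step adds one new K/V block to every layer:
for layer in range(num_layers):
    append_token(cache, layer,
                 torch.randn(1, num_kv_heads, 1, head_dim),
                 torch.randn(1, num_kv_heads, 1, head_dim))
```

The cache grows linearly with the number of tokens, which is exactly why long contexts get expensive.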
The formula
KV cache size in bytes:
```
2 (K and V) × num_layers × num_kv_heads × head_dim × seq_len × batch × bytes_per_element
```
At FP16, bytes_per_element = 2.
For Llama 3.1 70B (80 layers, 8 KV heads, 128 head dim) at 32k context, batch 1, FP16:
```
2 × 80 × 8 × 128 × 32768 × 1 × 2 bytes ≈ 10.7 GB
```
That is just KV. Weights are separate.
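If you want to compute this yourself, the formula is a one-liner. The config numbers below are the published Llama 3.1 70B shapes; swap in your own model's.

```
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_element=2):
    """Straight implementation of the formula above (the 2 accounts for K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_element

# Llama 3.1 70B at 32k context, batch 1, FP16:
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=32768)
print(f"{size / 1e9:.1f} GB")  # -> 10.7 GB
```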
Why GQA matters
Grouped Query Attention shares each KV head across several query heads, which reduces num_kv_heads. Llama 3.1 70B has 8 KV heads against 64 query heads; full multi-head attention would cache keys and values for all 64. Same model size, KV cache ~8x smaller. This is the architectural trick that made long-context Llama practical.
Older full multi-head models (Llama 2 7B and 13B, for example) burn KV cache fast. Modern models with GQA or similar KV-compression schemes (Llama 3.x, Qwen 2.5, DeepSeek) are far cheaper per token.
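You can see the ratio by running the same 70B geometry with a hypothetical 64 KV heads, i.e. what full multi-head attention would store:

```
per_token = lambda kv_heads: 2 * 80 * kv_heads * 128 * 2  # bytes per token at FP16
print(f"GQA (8 KV heads):  {per_token(8)  * 32768 / 1e9:.1f} GB at 32k")   # ~10.7 GB
print(f"MHA (64 KV heads): {per_token(64) * 32768 / 1e9:.1f} GB at 32k")   # ~85.9 GB
```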
Quantizing the KV cache
The KV cache is stored in FP16 by default. An 8-bit format (FP8 or INT8) cuts it in half; 4-bit cuts it to a quarter. The quality cost is usually small for chat and more visible on reasoning-heavy workloads. vLLM and TensorRT-LLM support FP8 KV cache; llama.cpp supports 8-bit and 4-bit cache types.
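Plugging different element widths into the same formula shows the savings for the 70B/32k example. The engine flags in the comment are the commonly used ones, but check your version's docs.

```
# Llama 3.1 70B at 32k context, batch 1, by KV element width.
# In vLLM this is typically selected with kv_cache_dtype="fp8";
# in llama.cpp with --cache-type-k / --cache-type-v.
for name, bytes_per_element in [("FP16", 2), ("FP8/INT8", 1), ("INT4", 0.5)]:
    size = 2 * 80 * 8 * 128 * 32768 * 1 * bytes_per_element
    print(f"{name:9s} {size / 1e9:5.1f} GB")
# FP16 10.7 GB, FP8/INT8 5.4 GB, INT4 2.7 GB
```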
When to care
Below 8k context, the KV cache is a rounding error. At 32k, it is a meaningful slice of VRAM. At 128k+, it can dominate. Long-document QA, repo-scale code context, and multi-turn agent histories all hit the KV cache hard.
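The per-token cost is fixed by the architecture, so the total scales linearly with context. For the same 70B config:

```
per_token_bytes = 2 * 80 * 8 * 128 * 2          # ~320 KiB per token at FP16
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {per_token_bytes * ctx / 1e9:5.1f} GB")
# 8k -> 2.7 GB, 32k -> 10.7 GB, 128k -> 42.9 GB
```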
Sizing the right way
Set your context length to the longest workload you actually run. Add the KV cache to weight memory. Then add 20% headroom. The calculator does this. Do not size on weights alone and assume the rest is small.
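A rough sketch of that rule. The weight figure here is a simple params-times-bytes estimate (an assumption, not an engine measurement); real engines add activation and framework overhead on top, which the headroom is meant to absorb.

```
def required_vram_gb(params_b, num_layers, num_kv_heads, head_dim,
                     max_context, batch=1, weight_bytes=2, kv_bytes=2, headroom=0.2):
    """Weights + KV cache + headroom, in GB."""
    weights = params_b * 1e9 * weight_bytes
    kv = 2 * num_layers * num_kv_heads * head_dim * max_context * batch * kv_bytes
    return (weights + kv) * (1 + headroom) / 1e9

# Llama 3.1 70B, FP16 weights, FP16 KV, 32k max context:
print(f"{required_vram_gb(70, 80, 8, 128, 32768):.0f} GB")  # ~181 GB
```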