โ† All posts

How Much VRAM Does Llama 3.1 70B Actually Need?

2026-05-09

A walk through the real numbers for running Llama 3.1 70B at every popular quantization: weights, KV cache, activations, and headroom.

The marketing pages for Llama 3.1 70B all quote the same number: 140 GB at FP16. That number is correct and useless. Nobody runs the model that way unless they have a server full of A100s, and even then most of them quantize.

Here are the numbers you actually need.

Weights only

At FP16, the weights take 70.6B params x 2 bytes = 141.2 GB. At INT8 you halve it to ~71 GB. Q5_K_M lands at ~49 GB. The popular Q4_K_M sits at about 43 GB. Q3_K_M cuts to ~34 GB and Q2_K to ~23 GB.
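To make the weight numbers reproducible, here is a minimal sketch of the same arithmetic. The effective bits-per-weight figures for the GGUF quants are assumptions backed out from the sizes above, not exact spec values.

```python
# Weight memory for Llama 3.1 70B at common quantizations.
# Bits-per-weight for the GGUF quants are approximate effective averages
# (assumed here to match the sizes quoted above), including scales/zeros.
PARAMS = 70.6e9

bits_per_weight = {
    "FP16":   16.0,
    "INT8":    8.0,
    "Q5_K_M":  5.5,   # assumption
    "Q4_K_M":  4.85,  # assumption
    "Q3_K_M":  3.9,   # assumption
    "Q2_K":    2.6,   # assumption
}

for name, bpw in bits_per_weight.items():
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{name:7s} ~{gb:5.1f} GB")
```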

Those are weights only. Real inference adds three more line items.

KV cache, the silent budget

The KV cache stores the attention keys and values for every past token. Llama 3.1 70B has 80 layers and 8 KV heads at 128 dimensions per head. At 8k context that is roughly 2.6 GB at FP16 for a single sequence. At 32k context, 10.5 GB. At 128k context, 42 GB. The KV cache scales linearly with sequence length, which is why the same model can fit on 80 GB at chat context but spill on long-document QA.

Quantizing the KV cache to FP8 or INT8 cuts it in half. Almost every modern serving engine supports this. Quality cost is small for chat workloads.
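A minimal sketch of the KV-cache math above, using the layer and head counts from the model. The function name and the bytes_per_elem knob (2 for FP16, 1 for FP8/INT8) are illustrative, not any engine's API.

```python
# KV cache size for one sequence: 2 (K and V) x layers x KV heads x head dim
# x bytes per element x tokens. Architecture numbers are Llama 3.1 70B's.
def kv_cache_gb(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gb(ctx):5.1f} GB at FP16, "
          f"{kv_cache_gb(ctx, bytes_per_elem=1):5.1f} GB at FP8")
```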

Activations and overhead

Inference activation memory is small compared to weights, usually 1-3 GB. The serving engine itself (vLLM, TensorRT-LLM, llama.cpp) adds roughly another 1 GB of overhead.

Real totals

At Q4_K_M with 8k context and FP16 KV: weights 43 + KV 2.6 + activations 1.5 + overhead 1 = ~48 GB. Add 20% headroom and you want a card with 58 GB. An A100 80 GB is comfortable. Two 24 GB consumer cards (4090, 3090) fit with the model split evenly across them, but with essentially no margin.

At Q4_K_M with 32k context: ~56 GB before headroom, ~67 GB with. The A100 still works, but the total now exceeds what 2x 24 GB consumer cards can hold.

At Q4_K_M with 128k context: the KV cache alone is 42 GB, pushing the total past 85 GB. You need an A100 80 GB or H100 plus aggressive KV quantization: FP8 KV cuts the cache to 21 GB and brings the total back to roughly 67 GB.
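The same budget, as a rough sketch you can rerun with your own numbers. The function name is illustrative; the default activation and overhead figures are the ballpark values used above.

```python
# Total VRAM budget: weights + KV cache + activations + engine overhead,
# plus a headroom multiplier. Defaults are the ballpark figures above.
def total_vram_gb(weights_gb, kv_gb, activations_gb=1.5, overhead_gb=1.0,
                  headroom=0.20):
    base = weights_gb + kv_gb + activations_gb + overhead_gb
    return base, base * (1 + headroom)

for ctx, kv_fp16 in ((8_192, 2.6), (32_768, 10.5), (131_072, 42.0)):
    base, padded = total_vram_gb(weights_gb=43, kv_gb=kv_fp16)
    print(f"{ctx:>7} tokens: ~{base:.0f} GB, ~{padded:.0f} GB with 20% headroom")
```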

Tokens per second

Token generation is bandwidth-bound. On an A100 80 GB at 2039 GB/s, reading 43 GB of weights per token gives a peak of ~47 tok/s. In practice, ~30. On 2x 4090, ~25 tok/s with tensor parallel overhead. On M3 Max 128 GB at 400 GB/s, ~9 tok/s.
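The back-of-envelope version: peak decode speed is memory bandwidth divided by the bytes read per token, which for a dense model is roughly the full weight size. The bandwidth figures below are the ones quoted above; real throughput lands well under the peak.

```python
# Bandwidth-bound decode ceiling: every generated token reads all the weights once.
def peak_tok_per_s(bandwidth_gb_s, weights_gb=43):
    return bandwidth_gb_s / weights_gb

for name, bw in (("A100 80 GB", 2039), ("M3 Max", 400)):
    print(f"{name}: ~{peak_tok_per_s(bw):.0f} tok/s peak")
```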

What this means for your buy decision

If you only ever run 70B at 8k context, a single 48 GB card (RTX 6000 Ada, L40S) is the cleanest setup. If you want long context, A100 80 GB is the floor. If you want it on a Mac, M2/M4 Ultra is the only sane single-machine option, and you live with ~10 tok/s.

The calculator on the homepage runs all of this live, with your context length and quantization plugged in.