Updated for Llama 3.1, DeepSeek V3, Qwen 2.5

How much VRAM does this LLM actually need?

Pick a model, set a quantization, choose a context length. The calculator shows the total memory needed, breaks down where it goes, and ranks every GPU and Mac by whether it fits, with a rough tokens-per-second estimate for each.

13 models supported · 22 hardware options · 6 quantizations · Always free, no signup
Calculator

Pick a model and a GPU. We do the math.

Numbers update live. The total includes weights, KV cache for your context length, activations, and a 20% headroom buffer because real workloads spike.

Pick your setup

Model: Llama 3.1 8B. Meta's small workhorse. Runs on most consumer GPUs at 4-bit. Good fit for chat assistants and code helpers.

Quantization: Q4. The popular 4-bit. ~70% memory cut, small quality drop.

Context length: 512 to 131,072 tokens
Total VRAM needed
8.4 GB
with 20% headroom. Base 7.0 GB
Model weights: 4.5 GB
KV cache: 1.0 GB
Activations: 512 MB
Overhead: 1.0 GB
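If you want to sanity-check the total by hand, the arithmetic is small enough to fit in a few lines. A minimal Python sketch of the same estimate (the function is illustrative, not the calculator's actual code; the 1 GB overhead default and 20% headroom match the description in the FAQ below):

```python
def estimate_vram_gb(weights_gb, kv_cache_gb, activations_gb,
                     overhead_gb=1.0, headroom=0.20):
    """Inference estimate: weights + KV cache + activations + engine overhead,
    padded with headroom because real workloads spike."""
    base = weights_gb + kv_cache_gb + activations_gb + overhead_gb
    return base * (1 + headroom)

# The breakdown above: 4.5 + 1.0 + 0.5 + 1.0 = 7.0 GB base, 8.4 GB with 20% headroom
print(round(estimate_vram_gb(4.5, 1.0, 0.5), 1))  # -> 8.4
```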
What runs this
22 options fit
0 tight · 0 short
RTX 4060 Ti 16GB
16 GB · 288 GB/s
~41 tok/s
53% used
RTX 4070 Ti Super
16 GB · 672 GB/s
~96 tok/s
53% used
RTX 4080 Super
16 GB · 736 GB/s
~106 tok/s
53% used
RTX 4090
24 GB · 1008 GB/s
~145 tok/s
35% used
RTX 3090
24 GB · 936 GB/s
~134 tok/s
35% used
RTX 5090
32 GB · 1792 GB/s
~257 tok/s
26% used
RTX A6000
48 GB · 768 GB/s
~110 tok/s
18% used
RTX 6000 Ada
48 GB · 960 GB/s
~138 tok/s
18% used
A100 40GB
40 GB · 1555 GB/s
~223 tok/s
21% used
A100 80GB
80 GB · 2039 GB/s
~292 tok/s
11% used
H100 80GB SXM
80 GB · 3350 GB/s
~480 tok/s
11% used
H200 141GB
141 GB · 4800 GB/s
~688 tok/s
6% used
B200 192GB
192 GB · 8000 GB/s
~1147 tok/s
4% used
Radeon RX 7900 XTX
24 GB · 960 GB/s
~138 tok/s
35% used
Instinct MI300X
192 GB · 5300 GB/s
~760 tok/s
4% used
Mac Studio M2 Ultra 192GB
144 GB · 800 GB/s
~115 tok/s
8% used
MacBook Pro M3 Max 128GB
96 GB · 400 GB/s
~57 tok/s
12% used
MacBook Pro M4 Max 128GB
96 GB · 546 GB/s
~78 tok/s
12% used
Models

Open-weight LLMs we cover

Llama 3.1 8B
2024
Llama · 8.03B

Meta's small workhorse. Runs on most consumer GPUs at 4-bit. Good fit for chat assistants and code helpers.

License: Llama 3.1 Community
Llama 3.1 70B
2024
Llama · 70.6B

GPT-4 class on a single machine if you have 2x consumer GPUs or one A100/H100. Quantize to fit.

License: Llama 3.1 Community
Llama 3.1 405B
2024
Llama · 405B

Frontier-tier open weights. Even at 4-bit the weights alone top 200 GB, so plan on a multi-GPU server.

License: Llama 3.1 Community
DeepSeek V3 (671B MoE)
2024
DeepSeek · 671B

MoE design means total weights are huge but active compute is closer to 37B. Memory still fills up.

License: DeepSeek License (commercial OK)
DeepSeek R1
2025
DeepSeek · 671B

Reasoning-tuned variant. Same memory profile as V3, just thinks longer per token.

License: DeepSeek License
Qwen 2.5 7B
2024
Qwen · 7.62B

Alibaba's strong small model with permissive license. GQA keeps the KV cache lean for long contexts.

License: Apache 2.0
Qwen 2.5 32B
2024
Qwen · 32.5B

Sweet spot between 7B and 70B. Fits on a single 3090/4090 at 4-bit, on M3 Max at 8-bit.

License: Apache 2.0
Qwen 2.5 72B
2024
Qwen · 72.7B

Comparable to Llama 3.1 70B, often better at math and code. Same memory ballpark.

License: Qwen License
Mistral 7B
2023
Mistral · 7.24B

The classic small open model. Still a fine choice if you want minimal VRAM and full Apache freedom.

License: Apache 2.0
Mixtral 8x7B (MoE)
2023
Mistral · 46.7B

Eight 7B experts. Total parameter count is 46.7B, but only ~13B are active per token. Surprisingly fast.

License: Apache 2.0
Gemma 2 9B
2024
Gemma · 9.24B

Google's small open model. Ranks well on chat benchmarks, runs on a single mid-tier GPU.

License: Gemma Terms
Gemma 2 27B
2024
Gemma · 27.2B

Mid-size Google model. Fits on a 3090 at 4-bit. Good middle ground for local use.

License: Gemma Terms
Phi-4 14B
2024
Phi · 14.7B

Microsoft's data-quality bet, MIT licensed. Punches above its weight at math and reasoning.

License: MIT
Common setups

Three popular ways to run a local LLM

Indie dev

RTX 4090 + 8B model

24 GB single card. Llama 3.1 8B at FP16 with a 32k context fits with room. Q4 lets you push to 32B class. Best price-to-tokens-per-second for one user.

~160 tok/s (8B at Q4)
Power user

Mac Studio M3 Ultra 256GB

Quietest single machine that handles 70B class at FP16 and DeepSeek V3 at aggressive low-bit quants. Slower than NVIDIA, no fan noise, no Linux maintenance.

~25 tok/s
Small team

2x A100 80GB

160 GB across two cards. Serve Llama 3.1 70B at FP16 with tensor parallelism and batching for multiple users. Llama 3.1 405B is still out of reach at Q4; its weights alone top 200 GB.

~40 tok/s × N
FAQ

Real questions, real answers

How much VRAM do I need to run Llama 3.1 70B?

At 4-bit quantization (Q4_K_M), Llama 3.1 70B weighs about 40 GB. Add 2-4 GB for an 8k-context KV cache and 1-2 GB of overhead, and you land near 48 GB total. A single 48 GB card (RTX 6000 Ada, A40) fits it. Two 24 GB consumer cards (2x 4090, 2x 3090) fit it with tensor parallelism. An A100 80 GB is the safe single-card choice.

Can my Mac run a local LLM?

Apple Silicon Macs use unified memory, so the GPU shares system RAM. About 75% of it is available to the GPU by default. An M2/M3/M4 Pro with 36 GB runs 7B-13B comfortably. An M3/M4 Max with 64-128 GB handles 32B and 70B at 4-bit. The M2 Ultra 192 GB and M3 Ultra 256 GB Mac Studios are the only single-machine options that fit DeepSeek V3 at aggressive low-bit quantization.
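As a planning number, the usable share is easy to compute; the default GPU limit can be raised, but roughly 75% is a safe assumption (a sketch, not Apple's exact accounting):

```python
def mac_usable_vram_gb(total_ram_gb, gpu_share=0.75):
    """Unified memory: roughly 75% of system RAM is available to the GPU by default."""
    return total_ram_gb * gpu_share

print(mac_usable_vram_gb(192))  # Mac Studio M2 Ultra   -> 144.0, the figure in the list above
print(mac_usable_vram_gb(128))  # MacBook Pro M3/M4 Max -> 96.0
```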

What is quantization and how much does it hurt quality?

Quantization reduces the bits per weight. FP16 is the native training precision (16 bits, no loss). Q8 halves memory with negligible drop. Q4_K_M (~4.85 bits) is the popular sweet spot, ~70% memory cut and a small quality hit. Q3 and below start showing real degradation, particularly on math and coding tasks. The MMLU drop from FP16 to Q4 is usually 1-3 points.
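The memory side of this is straightforward arithmetic: parameter count times effective bits per weight, divided by eight. A rough sketch (the bits-per-weight table is approximate; real GGUF files add a little for block scales and mixed tensor types):

```python
BITS_PER_WEIGHT = {"fp16": 16, "q8_0": 8.5, "q4_k_m": 4.85, "q3_k_m": 3.9}  # approximate

def weight_memory_gb(params_billion, quant):
    """Weight footprint only; KV cache and overhead come on top."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

print(round(weight_memory_gb(70.6, "fp16"), 1))    # Llama 3.1 70B at FP16 -> ~141 GB
print(round(weight_memory_gb(70.6, "q4_k_m"), 1))  # the same model at Q4  -> ~43 GB
```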

What is the KV cache and why does it grow with context?

Every transformer layer keeps the keys and values for past tokens so it does not recompute attention. Memory scales linearly with context length. For Llama 3.1 70B at 32k context, the KV cache is around 10 GB on its own. Long-context models with high KV head counts (older designs without MQA or GQA) burn KV cache the fastest. Grouped-Query Attention (GQA) cuts this dramatically.
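The growth is easy to reproduce from a model's config. A sketch of the standard formula, plugged in with Llama 3.1 70B's published shape (80 layers, 8 KV heads, 128 head dim); this is illustrative, not the calculator's exact code:

```python
def kv_cache_gb(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache = 2 (K and V) * layers * KV heads * head dim * bytes, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_len * per_token_bytes / 1024**3

# Llama 3.1 70B (GQA: 8 KV heads) at 32k context, FP16 cache -> the ~10 GB quoted above
print(round(kv_cache_gb(32_768, 80, 8, 128), 1))

# The same model with full multi-head attention (64 KV heads) would need 8x as much
print(round(kv_cache_gb(32_768, 80, 64, 128), 1))   # -> ~80 GB

# Quantizing the cache to FP8 halves it (see the KV-cache quantization answer below)
print(round(kv_cache_gb(32_768, 80, 8, 128, bytes_per_value=1), 1))  # -> ~5 GB
```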

Why is the calculator showing more VRAM than I see in benchmarks?

Benchmark numbers often quote the model weights only and assume a tiny context. This calculator includes weights, KV cache for your context length, activation memory, and a 1 GB engine overhead, then adds 20% headroom because real workloads spike. If you set context to 2048 and ignore overhead, you will roughly match the marketing numbers.

Should I buy two 3090s or one 4090 for local LLMs?

Two used 3090s give you 48 GB total at roughly the same price as one 4090, which has 24 GB. For models above 24 GB the 2x 3090 setup wins. Below 24 GB, the single 4090 is faster, runs cooler, and avoids the multi-GPU complexity. The break point is whether you want to run 70B-class models locally.

Does AMD work for local LLMs?

Yes, with friction. ROCm support has improved a lot. RX 7900 XTX with 24 GB is a viable LLM card on Linux through llama.cpp Vulkan or ROCm builds. Datacenter MI300X with 192 GB is excellent if you can get one. Day-to-day, most tooling (vLLM, transformers, ExLlamaV2) is more polished on NVIDIA.

What is the difference between active and total parameters in MoE?

Mixture of Experts models (DeepSeek V3, Mixtral) have many expert subnetworks. Each token uses only a few. DeepSeek V3 has 671B total weights but only 37B active per token. You still need to fit all 671B weights in memory because any expert can fire. Speed scales with active params, memory scales with total.
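A small sketch to make the split concrete, using the figures from this answer (671B total, 37B active) and an assumed ~4.85-bit quantization; the function names are illustrative:

```python
def moe_resident_memory_gb(total_params_b, bits_per_weight):
    """Every expert must be loadable, so memory scales with TOTAL parameters."""
    return total_params_b * bits_per_weight / 8

def moe_read_per_token_gb(active_params_b, bits_per_weight):
    """Only the routed experts are touched, so speed scales with ACTIVE parameters."""
    return active_params_b * bits_per_weight / 8

# DeepSeek V3 at ~4.85 bits per weight
print(round(moe_resident_memory_gb(671, 4.85)))    # ~407 GB must fit in (pooled) memory
print(round(moe_read_per_token_gb(37, 4.85), 1))   # ~22 GB read per generated token
```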

What changes if I quantize the KV cache to 8-bit or 4-bit?

KV cache memory is independent of weight quantization. Setting it to 8-bit halves cache memory, 4-bit quarters it. Most modern serving engines (vLLM, llama.cpp, TensorRT-LLM) support FP8/INT8 KV cache. Quality impact is small on chat workloads, larger on reasoning.

How fast will the model actually run?

Token generation is bound by memory bandwidth, not compute: each generated token reads the active weights once. Tokens per second is roughly (bandwidth in GB/s) / (active params in GB). A 4090 at 1008 GB/s running 8B-Q4 (4 GB) gives ~250 tok/s peak, ~160 in practice. The calculator uses a 65% efficiency factor.
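As a sketch, the rule of thumb looks like the following; the 65% efficiency factor is the one quoted above, and real numbers move with engine, batch size, and context:

```python
def tokens_per_second(bandwidth_gbs, active_weights_gb, efficiency=0.65):
    """Decode is bandwidth-bound: each generated token reads the active weights once."""
    return bandwidth_gbs / active_weights_gb * efficiency

# RTX 4090 (1008 GB/s) running an 8B model at Q4 (~4 GB of weights)
print(round(tokens_per_second(1008, 4, efficiency=1.0)))  # theoretical peak -> ~252 tok/s
print(round(tokens_per_second(1008, 4)))                  # with 65% factor  -> ~164 tok/s
```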

Does context length affect generation speed?

Yes, but mostly through KV cache reads. The first token in a long prompt takes a while (prefill is compute-bound). After that, every generated token reads KV from all prior tokens. At very long contexts, KV bandwidth starts dominating and tok/s drops.
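To put numbers on the drop-off, here is a hedged extension of the bandwidth rule above: each decoded token reads the weights plus the KV cache accumulated so far. The 8B shape (32 layers, 8 KV heads, 128 head dim) and the 65% factor are the same assumptions as before; treat the output as a rough shape, not a benchmark:

```python
def decode_tok_per_s(bandwidth_gbs, weights_gb, context_len,
                     n_layers=32, n_kv_heads=8, head_dim=128, efficiency=0.65):
    """Each decoded token reads the weights plus the whole KV cache built so far."""
    kv_gb = context_len * 2 * n_layers * n_kv_heads * head_dim * 2 / 1024**3  # FP16 cache
    return bandwidth_gbs / (weights_gb + kv_gb) * efficiency

# RTX 4090 (1008 GB/s), 8B model at Q4 (~4 GB of weights)
print(round(decode_tok_per_s(1008, 4, 2_048)))    # short context -> ~154 tok/s
print(round(decode_tok_per_s(1008, 4, 131_072)))  # full 128k     -> ~33 tok/s
```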

Is the calculator accurate for fine-tuning or training?

No, this is for inference only. Training adds optimizer states (AdamW keeps two FP32 moments, 8 bytes per param), gradients, and activation memory even with checkpointing. A 7B full fine-tune wants ~80 GB. LoRA fine-tuning is much cheaper, around 10 GB for a 7B on top of the model weights.
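For a back-of-the-envelope check, a hedged sketch under common mixed-precision assumptions (BF16 weights and gradients, FP32 Adam moments), ignoring activations; the exact total depends on optimizer, sequence length, and checkpointing:

```python
def full_finetune_gb(params_billion,
                     weight_bytes=2,      # BF16 weights
                     grad_bytes=2,        # BF16 gradients
                     optimizer_bytes=8):  # AdamW: two FP32 moments per parameter
    """Lower bound for full fine-tuning; activations and buffers come on top."""
    return params_billion * (weight_bytes + grad_bytes + optimizer_bytes)

print(round(full_finetune_gb(7)))  # -> ~84 GB, the same ballpark as the ~80 GB above
```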

What if my model is not in the list?

Pick the closest size and architecture. The calculator works from each model's published config (parameter count, hidden size, layer count, KV heads, head dim), so for a model with a similar architecture the results are within 5-10%. We will keep adding models, especially Mistral Large, Llama 3.3, and the newer Chinese open-weight releases.

Can I run Llama 70B on CPU?

Yes, with patience. Quantize to Q4 or lower (~40 GB), use llama.cpp on a machine with 64+ GB of RAM and DDR5. Expect 1-3 tokens per second on a fast desktop, 5-10 on a server with high memory bandwidth. Apple Silicon Macs are far better than x86 CPUs for this because their unified memory offers much higher bandwidth.

Why do you cap the suggestions at 18 hardware options?

Eighteen covers the spectrum from cheap consumer cards to top-tier datacenter and Apple Silicon. Adding more turns the list into noise. If your specific card is missing, find the closest match by VRAM and bandwidth; the verdict will still hold.