Updated for Llama 3.1, DeepSeek V3, Qwen 2.5

How much VRAM does this LLM actually need?

Pick a model, set a quantization, choose a context length. The calculator shows the total memory needed, breaks down where it goes, and ranks every GPU and Mac by whether it fits, with a rough tokens-per-second estimate for each.

13 models supported · 22 hardware options · 6 quantizations · Always free, no signup
Calculator

Pick a model and a GPU. We do the math.

Numbers update live. The total includes weights, KV cache for your context length, activations, and a 20% headroom buffer because real workloads spike.

Pick your setup

Model: Llama 3.1 8B. Meta's small workhorse. Runs on most consumer GPUs at 4-bit. Good fit for chat assistants and code helpers.

Quantization: Q4. The popular 4-bit. ~70% memory cut, small quality drop.

Context length: 512 to 131,072 tokens
Total VRAM needed
8.4 GB
with 20% headroom. Base 7.0 GB
Model weights: 4.5 GB
KV cache: 1.0 GB
Activations: 512 MB
Overhead: 1.0 GB
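If you want to sanity-check the total by hand, the arithmetic is small enough to fit in a few lines. A minimal Python sketch of the same estimate (the function is illustrative, not the calculator's actual code; the 1 GB overhead default and 20% headroom match the description in the FAQ below):

```python
def estimate_vram_gb(weights_gb, kv_cache_gb, activations_gb,
                     overhead_gb=1.0, headroom=0.20):
    """Inference estimate: weights + KV cache + activations + engine overhead,
    padded with headroom because real workloads spike."""
    base = weights_gb + kv_cache_gb + activations_gb + overhead_gb
    return base * (1 + headroom)

# The breakdown above: 4.5 + 1.0 + 0.5 + 1.0 = 7.0 GB base, 8.4 GB with 20% headroom
print(round(estimate_vram_gb(4.5, 1.0, 0.5), 1))  # -> 8.4
```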
What runs this
22 options fit
0 tight · 0 short
RTX 4060 Ti 16GB
16 GB · 288 GB/s
~41 tok/s
53% used
RTX 4070 Ti Super
16 GB · 672 GB/s
~96 tok/s
53% used
RTX 4080 Super
16 GB · 736 GB/s
~106 tok/s
53% used
RTX 4090
24 GB · 1008 GB/s
~145 tok/s
35% used
RTX 3090
24 GB · 936 GB/s
~134 tok/s
35% used
RTX 5090
32 GB · 1792 GB/s
~257 tok/s
26% used
RTX A6000
48 GB · 768 GB/s
~110 tok/s
18% used
RTX 6000 Ada
48 GB · 960 GB/s
~138 tok/s
18% used
A100 40GB
40 GB · 1555 GB/s
~223 tok/s
21% used
A100 80GB
80 GB · 2039 GB/s
~292 tok/s
11% used
H100 80GB SXM
80 GB · 3350 GB/s
~480 tok/s
11% used
H200 141GB
141 GB · 4800 GB/s
~688 tok/s
6% used
B200 192GB
192 GB · 8000 GB/s
~1147 tok/s
4% used
Radeon RX 7900 XTX
24 GB · 960 GB/s
~138 tok/s
35% used
Instinct MI300X
192 GB · 5300 GB/s
~760 tok/s
4% used
Mac Studio M2 Ultra 192GB
144 GB · 800 GB/s
~115 tok/s
8% used
MacBook Pro M3 Max 128GB
96 GB · 400 GB/s
~57 tok/s
12% used
MacBook Pro M4 Max 128GB
96 GB · 546 GB/s
~78 tok/s
12% used
Models

Open-weight LLMs we cover

Llama 3.1 8B
2024
Llama · 8.03B

Meta's small workhorse. Runs on most consumer GPUs at 4-bit. Good fit for chat assistants and code helpers.

License: Llama 3.1 Community
Llama 3.1 70B
2024
Llama · 70.6B

GPT-4 class on a single machine if you have 2x consumer GPUs or one A100/H100. Quantize to fit.

License: Llama 3.1 Community
Llama 3.1 405B
2024
Llama · 405B

Frontier-tier open weights. Even at 4-bit the weights alone top 200 GB, so plan on a multi-GPU server.

License: Llama 3.1 Community
DeepSeek V3 (671B MoE)
2024
DeepSeek · 671B

MoE design means total weights are huge but active compute is closer to 37B. Memory still fills up.

License: DeepSeek License (commercial OK)
DeepSeek R1
2025
DeepSeek · 671B

Reasoning-tuned variant. Same memory profile as V3, just thinks longer per token.

License: DeepSeek License
Qwen 2.5 7B
2024
Qwen · 7.62B

Alibaba's strong small model with permissive license. GQA keeps the KV cache lean for long contexts.

License: Apache 2.0
Qwen 2.5 32B
2024
Qwen · 32.5B

Sweet spot between 7B and 70B. Fits on a single 3090/4090 at 4-bit, on M3 Max at 8-bit.

License: Apache 2.0
Qwen 2.5 72B
2024
Qwen · 72.7B

Comparable to Llama 3.1 70B, often better at math and code. Same memory ballpark.

License: Qwen License
Mistral 7B
2023
Mistral · 7.24B

The classic small open model. Still a fine choice if you want minimal VRAM and full Apache freedom.

License: Apache 2.0
Mixtral 8x7B (MoE)
2023
Mistral · 46.7B

Eight 7B experts. Total parameter count is 46.7B, but only ~13B are active per token. Surprisingly fast.

License: Apache 2.0
Gemma 2 9B
2024
Gemma · 9.24B

Google's small open model. Ranks well on chat benchmarks, runs on a single mid-tier GPU.

License: Gemma Terms
Gemma 2 27B
2024
Gemma · 27.2B

Mid-size Google model. Fits on a 3090 at 4-bit. Good middle ground for local use.

License: Gemma Terms
Phi-4 14B
2024
Phi · 14.7B

Microsoft's data-quality bet, MIT licensed. Punches above its weight at math and reasoning.

License: MIT
Common setups

Three popular ways to run a local LLM

Indie dev

RTX 4090 + 8B model

24 GB single card. Llama 3.1 8B at FP16 with a 32k context fits with room. Q4 lets you push to 32B class. Best price-to-tokens-per-second for one user.

~160 tok/s (8B at Q4)
Power user

Mac Studio M3 Ultra 256GB

Quietest single machine that handles 70B class at FP16 and DeepSeek V3 at aggressive low-bit quants. Slower than NVIDIA, no fan noise, no Linux maintenance.

~25 tok/s
Small team

2x A100 80GB

160 GB across two cards. Serve Llama 3.1 70B at FP16 with tensor parallelism and batching for multiple users. Llama 3.1 405B is still out of reach at Q4; its weights alone top 200 GB.

~40 tok/s × N
FAQ

Real questions, real answers

How much VRAM do I need to run Llama 3.1 70B?

At 4-bit quantization (Q4_K_M), Llama 3.1 70B weighs about 40 GB. Add 2-4 GB for an 8k-context KV cache and 1-2 GB of overhead, and you land near 48 GB total. A single 48 GB card (RTX 6000 Ada, A40) fits it. Two 24 GB consumer cards (2x 4090, 2x 3090) fit it with tensor parallelism. An A100 80 GB is the safe single-card choice.

Can my Mac run a local LLM?

Apple Silicon Macs use unified memory, so the GPU shares system RAM. About 75% of it is available to the GPU by default. An M2/M3/M4 Pro with 36 GB runs 7B-13B comfortably. An M3/M4 Max with 64-128 GB handles 32B and 70B at 4-bit. The M2 Ultra 192 GB and M3 Ultra 256 GB Mac Studios are the only single-machine options that fit DeepSeek V3 at aggressive low-bit quantization.
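As a planning number, the usable share is easy to compute; the default GPU limit can be raised, but roughly 75% is a safe assumption (a sketch, not Apple's exact accounting):

```python
def mac_usable_vram_gb(total_ram_gb, gpu_share=0.75):
    """Unified memory: roughly 75% of system RAM is available to the GPU by default."""
    return total_ram_gb * gpu_share

print(mac_usable_vram_gb(192))  # Mac Studio M2 Ultra   -> 144.0, the figure in the list above
print(mac_usable_vram_gb(128))  # MacBook Pro M3/M4 Max -> 96.0
```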

What is quantization and how much does it hurt quality?

Quantization reduces the bits per weight. FP16 is the native training precision (16 bits, no loss). Q8 halves memory with negligible drop. Q4_K_M (~4.85 bits) is the popular sweet spot, ~70% memory cut and a small quality hit. Q3 and below start showing real degradation, particularly on math and coding tasks. The MMLU drop from FP16 to Q4 is usually 1-3 points.
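The memory side of this is straightforward arithmetic: parameter count times effective bits per weight, divided by eight. A rough sketch (the bits-per-weight table is approximate; real GGUF files add a little for block scales and mixed tensor types):

```python
BITS_PER_WEIGHT = {"fp16": 16, "q8_0": 8.5, "q4_k_m": 4.85, "q3_k_m": 3.9}  # approximate

def weight_memory_gb(params_billion, quant):
    """Weight footprint only; KV cache and overhead come on top."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

print(round(weight_memory_gb(70.6, "fp16"), 1))    # Llama 3.1 70B at FP16 -> ~141 GB
print(round(weight_memory_gb(70.6, "q4_k_m"), 1))  # the same model at Q4  -> ~43 GB
```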

What is the KV cache and why does it grow with context?

Every transformer layer keeps the keys and values for past tokens so it does not recompute attention. Memory scales linearly with context length. For Llama 3.1 70B at 32k context, the KV cache is around 10 GB on its own. Long-context models with high KV head counts (older designs without MQA or GQA) burn KV cache the fastest. Grouped-Query Attention (GQA) cuts this dramatically.
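The growth is easy to reproduce from a model's config. A sketch of the standard formula, plugged in with Llama 3.1 70B's published shape (80 layers, 8 KV heads, 128 head dim); this is illustrative, not the calculator's exact code:

```python
def kv_cache_gb(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache = 2 (K and V) * layers * KV heads * head dim * bytes, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_len * per_token_bytes / 1024**3

# Llama 3.1 70B (GQA: 8 KV heads) at 32k context, FP16 cache -> the ~10 GB quoted above
print(round(kv_cache_gb(32_768, 80, 8, 128), 1))

# The same model with full multi-head attention (64 KV heads) would need 8x as much
print(round(kv_cache_gb(32_768, 80, 64, 128), 1))   # -> ~80 GB

# Quantizing the cache to FP8 halves it (see the KV-cache quantization answer below)
print(round(kv_cache_gb(32_768, 80, 8, 128, bytes_per_value=1), 1))  # -> ~5 GB
```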

Why is the calculator showing more VRAM than I see in benchmarks?

Benchmark numbers often quote the model weights only and assume a tiny context. This calculator includes weights, KV cache for your context length, activation memory, and a 1 GB engine overhead, then adds 20% headroom because real workloads spike. If you set context to 2048 and ignore overhead, you will roughly match the marketing numbers.

Should I buy two 3090s or one 4090 for local LLMs?

Two used 3090s give you 48 GB total at roughly the same price as one 4090, which has 24 GB. For models above 24 GB the 2x 3090 setup wins. Below 24 GB, the single 4090 is faster, runs cooler, and avoids the multi-GPU complexity. The break point is whether you want to run 70B-class models locally.

Does AMD work for local LLMs?

Yes, with friction. ROCm support has improved a lot. RX 7900 XTX with 24 GB is a viable LLM card on Linux through llama.cpp Vulkan or ROCm builds. Datacenter MI300X with 192 GB is excellent if you can get one. Day-to-day, most tooling (vLLM, transformers, ExLlamaV2) is more polished on NVIDIA.

What is the difference between active and total parameters in MoE?

Mixture of Experts models (DeepSeek V3, Mixtral) have many expert subnetworks. Each token uses only a few. DeepSeek V3 has 671B total weights but only 37B active per token. You still need to fit all 671B weights in memory because any expert can fire. Speed scales with active params, memory scales with total.
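A small sketch to make the split concrete, using the figures from this answer (671B total, 37B active) and an assumed ~4.85-bit quantization; the function names are illustrative:

```python
def moe_resident_memory_gb(total_params_b, bits_per_weight):
    """Every expert must be loadable, so memory scales with TOTAL parameters."""
    return total_params_b * bits_per_weight / 8

def moe_read_per_token_gb(active_params_b, bits_per_weight):
    """Only the routed experts are touched, so speed scales with ACTIVE parameters."""
    return active_params_b * bits_per_weight / 8

# DeepSeek V3 at ~4.85 bits per weight
print(round(moe_resident_memory_gb(671, 4.85)))    # ~407 GB must fit in (pooled) memory
print(round(moe_read_per_token_gb(37, 4.85), 1))   # ~22 GB read per generated token
```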

What changes if I quantize the KV cache to 8-bit or 4-bit?

KV cache memory is independent of weight quantization. Setting it to 8-bit halves cache memory, 4-bit quarters it. Most modern serving engines (vLLM, llama.cpp, TensorRT-LLM) support FP8/INT8 KV cache. Quality impact is small on chat workloads, larger on reasoning.

How fast will the model actually run?

Token generation is bound by memory bandwidth, not compute: each generated token reads the active weights once. Tokens per second is roughly (bandwidth in GB/s) / (active params in GB). A 4090 at 1008 GB/s running 8B-Q4 (4 GB) gives ~250 tok/s peak, ~160 in practice. The calculator uses a 65% efficiency factor.
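As a sketch, the rule of thumb looks like the following; the 65% efficiency factor is the one quoted above, and real numbers move with engine, batch size, and context:

```python
def tokens_per_second(bandwidth_gbs, active_weights_gb, efficiency=0.65):
    """Decode is bandwidth-bound: each generated token reads the active weights once."""
    return bandwidth_gbs / active_weights_gb * efficiency

# RTX 4090 (1008 GB/s) running an 8B model at Q4 (~4 GB of weights)
print(round(tokens_per_second(1008, 4, efficiency=1.0)))  # theoretical peak -> ~252 tok/s
print(round(tokens_per_second(1008, 4)))                  # with 65% factor  -> ~164 tok/s
```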

Does context length affect generation speed?

Yes, but mostly through KV cache reads. The first token in a long prompt takes a while (prefill is compute-bound). After that, every generated token reads KV from all prior tokens. At very long contexts, KV bandwidth starts dominating and tok/s drops.
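To put numbers on the drop-off, here is a hedged extension of the bandwidth rule above: each decoded token reads the weights plus the KV cache accumulated so far. The 8B shape (32 layers, 8 KV heads, 128 head dim) and the 65% factor are the same assumptions as before; treat the output as a rough shape, not a benchmark:

```python
def decode_tok_per_s(bandwidth_gbs, weights_gb, context_len,
                     n_layers=32, n_kv_heads=8, head_dim=128, efficiency=0.65):
    """Each decoded token reads the weights plus the whole KV cache built so far."""
    kv_gb = context_len * 2 * n_layers * n_kv_heads * head_dim * 2 / 1024**3  # FP16 cache
    return bandwidth_gbs / (weights_gb + kv_gb) * efficiency

# RTX 4090 (1008 GB/s), 8B model at Q4 (~4 GB of weights)
print(round(decode_tok_per_s(1008, 4, 2_048)))    # short context -> ~154 tok/s
print(round(decode_tok_per_s(1008, 4, 131_072)))  # full 128k     -> ~33 tok/s
```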

Is the calculator accurate for fine-tuning or training?

No, this is for inference only. Training adds optimizer states (AdamW keeps two FP32 moments, 8 bytes per param), gradients, and activation memory even with checkpointing. A 7B full fine-tune wants ~80 GB. LoRA fine-tuning is much cheaper, around 10 GB for a 7B on top of the model weights.
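For a back-of-the-envelope check, a hedged sketch under common mixed-precision assumptions (BF16 weights and gradients, FP32 Adam moments), ignoring activations; the exact total depends on optimizer, sequence length, and checkpointing:

```python
def full_finetune_gb(params_billion,
                     weight_bytes=2,      # BF16 weights
                     grad_bytes=2,        # BF16 gradients
                     optimizer_bytes=8):  # AdamW: two FP32 moments per parameter
    """Lower bound for full fine-tuning; activations and buffers come on top."""
    return params_billion * (weight_bytes + grad_bytes + optimizer_bytes)

print(round(full_finetune_gb(7)))  # -> ~84 GB, the same ballpark as the ~80 GB above
```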

What if my model is not in the list?

Pick the closest size and architecture. The calculator works from each model's published config (parameter count, hidden size, layer count, KV heads, head dim), so for a model with a similar architecture the results are within 5-10%. We will keep adding models, especially Mistral Large, Llama 3.3, and the newer Chinese open-weight releases.

Can I run Llama 70B on CPU?

Yes, with patience. Quantize to Q4 or lower (~40 GB), use llama.cpp on a machine with 64+ GB of RAM and DDR5. Expect 1-3 tokens per second on a fast desktop, 5-10 on a server with high memory bandwidth. Apple Silicon Macs are far better than x86 CPUs for this because their unified memory offers much higher bandwidth.

Why do you cap the suggestions at 18 hardware options?

Eighteen covers the spectrum from cheap consumer cards to top-tier datacenter and Apple Silicon. Adding more turns the list into noise. If your specific card is missing, find the closest match by VRAM and bandwidth; the verdict will still hold.