โ† All posts

Q4 vs Q8 vs FP16: Which Quantization Should You Actually Use?

2026-05-08

A practical guide to picking quantization for local LLMs. What each format gives up, what it saves, and when the tradeoff matters.

Quantization is the cheat code for running big models on small hardware. The question is how cheap is too cheap.

The four levels you actually meet

FP16/BF16 is the native training precision. 16 bits per weight, no quality loss, full memory cost. Use it when you have headroom and the workload is sensitive (research, evaluation, agents that compose many calls).

INT8 / Q8 is the safe halving. Memory drops 50%, quality drop is below 1 MMLU point on most models. Almost always better than FP16 for serving because you get more headroom for KV cache and batching.

Q5_K_M sits at about 5.5 bits per weight in GGUF format. It is a small step down from Q8 with another 30% memory cut. Safe for most chat and code workloads.

Q4_K_M is the popular default. ~4.85 bits per weight, a ~70% memory cut from FP16, and a quality drop of 1-3 MMLU points. The reason a 30B-class model fits on a 24 GB card (a 70B at Q4 still needs roughly 40 GB for weights alone).
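The memory math above is easy to check yourself. A minimal sketch, using the approximate bits-per-weight figures quoted in this section (the exact averages vary slightly by GGUF file):

```python
# Rough VRAM needed for model weights alone at each quantization level.
# Bits-per-weight values are the approximate figures quoted above;
# real GGUF files vary slightly. KV cache and activations are extra.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8": 8.0,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.85,
}

def weight_memory_gb(params_billion: float, fmt: str) -> float:
    """Weights-only footprint in GB for a model with the given parameter count."""
    bits = BITS_PER_WEIGHT[fmt]
    return params_billion * 1e9 * bits / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"70B @ {fmt}: {weight_memory_gb(70, fmt):.1f} GB")
```

Running this for a 70B model shows why Q4 is the workhorse: FP16 needs 140 GB, Q8 70 GB, and Q4_K_M about 42 GB.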

Where Q4 starts to bite

Q4 quality matches FP16 on standard benchmarks but starts to drift on:

  • Long-form code where one wrong token cascades.
  • Math reasoning where intermediate steps need to be exact.
  • Agentic workflows that chain 10+ calls. Errors compound.
  • Non-English work, especially for languages that were undertrained.

For chat and search, Q4 is fine. For shipping code agents, prefer Q5 or Q8 if you have the memory.
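The compounding point is worth quantifying. A quick sketch with illustrative per-call success rates (the 0.99 vs 0.97 numbers are assumptions for the example, not benchmark results):

```python
# Small per-call quality drops compound multiplicatively across an
# agent chain: a 2-point drop per call becomes a much larger gap
# over 10 chained calls. Rates here are illustrative, not measured.
def chain_success(per_call: float, n_calls: int) -> float:
    """Probability every call in an n-step chain succeeds."""
    return per_call ** n_calls

for rate in (0.99, 0.97):
    print(f"{rate:.2f} per call -> {chain_success(rate, 10):.3f} over 10 calls")
```

A 99%-per-call model finishes a 10-step chain about 90% of the time; a 97% model only about 74%. That is why a quality drop invisible on single-turn benchmarks shows up loudly in agents.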

Below Q4

Q3 cuts another ~20% memory and starts showing real degradation. Use it only when nothing else fits. Q2 is a last resort. The model still talks, but you are running a different model than the one you benchmarked.

KV cache quantization is independent

You can run weights at Q4 and KV cache at INT8 or even INT4. The two are orthogonal. KV cache quantization is a free win on most modern engines.
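To see what KV cache quantization buys on its own, here is a minimal sizing sketch. The model shape (80 layers, 8 KV heads, head dim 128, roughly a Llama-3-70B-class layout) is an assumption for illustration:

```python
# KV cache size is independent of weight precision: it scales with
# context length and the element width you store K/V at.
# Model shape below is an assumed Llama-3-70B-like layout (GQA).
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: float, batch: int = 1) -> float:
    """Total K and V tensor size in GB (the leading 2 is K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch / 1e9

for name, width in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"KV @ {name}, 32k ctx: {kv_cache_gb(80, 8, 128, 32768, width):.1f} GB")
```

At a 32k context this shape needs about 10.7 GB of KV cache at FP16, halving to ~5.4 GB at INT8, regardless of whether the weights are FP16 or Q4.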

Picking in 30 seconds

  • Plenty of VRAM, want best quality: BF16 or Q8
  • Limited VRAM, single-user chat: Q4_K_M
  • Limited VRAM, agents or code: Q5_K_M
  • Mac with unified memory: Q5 or Q4 (capacity is plentiful, but decode speed is bound by memory bandwidth)
  • Below 16 GB: smallest model possible at Q4 or Q5
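The cheat sheet above can be encoded as a tiny lookup. The VRAM thresholds here are rough rules of thumb I am assuming for the sketch, not hard cutoffs:

```python
# Encodes the 30-second cheat sheet above. Thresholds are rough
# rules of thumb (assumptions), not hard cutoffs.
def pick_quant(vram_gb: float, workload: str) -> str:
    """workload: 'chat', 'code', or 'agents'."""
    if vram_gb >= 48:
        return "BF16 or Q8"
    if vram_gb < 16:
        return "smallest model at Q4 or Q5"
    if workload in ("code", "agents"):
        return "Q5_K_M"
    return "Q4_K_M"

print(pick_quant(24, "code"))   # a 24 GB card running a code agent
```

Adjust the boundaries to your own hardware; the point is that the decision fits in ten lines.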

The calculator on the home page lets you flip between quantizations and see the memory and tok/s impact in real time.