
Quantization

Reducing the bits per weight. Cuts memory at a small quality cost.

Modern LLMs train at FP16 or BF16 (16 bits per weight). Quantization stores the weights at fewer bits while keeping computation in higher precision where it matters. Common GGUF formats: Q8_0 (8.5 bits per weight, near-lossless), Q5_K_M (~5.5-bit), Q4_K_M (~4.85-bit, a popular default), Q3_K_M (~3.9-bit, aggressive), Q2_K (~2.6-bit, last resort). The effective bit counts are fractional because each block of weights also stores a small scale factor. Tradeoff: fewer bits mean less memory but a larger quality drop.
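The core idea can be sketched as block-wise symmetric quantization: each block of weights shares one scale, and the weights themselves are stored as small integers. This is a simplified illustration, not the actual GGUF K-quant format, and the function names are hypothetical:

```python
import numpy as np

def quantize_4bit(weights, block_size=32):
    """Symmetric 4-bit block quantization (illustrative sketch).

    Each block of `block_size` weights shares one float scale;
    the weights are rounded to integers in [-8, 7] (4 bits).
    """
    w = weights.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    # Reconstruct approximate float weights for computation.
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_4bit(w)
mean_abs_error = np.abs(dequantize(q, s) - w).mean()
```

Storage per block here is 32 × 4 bits plus one scale, which is where the fractional bits-per-weight figures in the formats above come from: the per-block metadata is amortized across the weights it covers.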
