โ† All terms

Q4_K_M

GGUF 4-bit quantization, mixed precision. The popular default for fitting big models on consumer hardware.

Q4_K_M is a GGUF quantization type that averages about 4.85 bits per weight. It is mixed precision: in llama.cpp's K-quant scheme, half of the attention.wv and feed_forward.w2 tensors are stored at the higher-precision Q6_K, while the remaining tensors use Q4_K. Memory use drops by roughly 70% versus FP16 (4.85 / 16 ≈ 0.30), with a quality cost typically in the range of 1-3 points on MMLU. Llama 3.1 70B at Q4_K_M is a ~43 GB file, which is why it fits across two 24 GB RTX 3090s.
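The size figures above follow from simple arithmetic: parameter count times bits per weight, divided by 8. A minimal sketch, assuming ~70.6B parameters for Llama 3.1 70B and the ~4.85 bpw average quoted above (actual GGUF files differ slightly due to metadata and per-tensor choices):

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough quantized model size in GB (1 GB = 1e9 bytes):
    params * bits-per-weight / 8 bytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Llama 3.1 70B (~70.6B params) at Q4_K_M's ~4.85 bpw vs FP16's 16 bpw:
q4 = gguf_size_gb(70.6, 4.85)    # ~42.8 GB -> fits in 48 GB of VRAM
fp16 = gguf_size_gb(70.6, 16.0)  # ~141.2 GB
savings = 1 - q4 / fp16          # ~0.70, i.e. the ~70% memory cut
print(round(q4, 1), round(fp16, 1), round(savings, 2))
```

The same estimate works for any model: multiply billions of parameters by ~0.61 to get an approximate Q4_K_M file size in GB.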

Related terms