Glossary

Quick definitions for terms that come up when sizing local LLMs.

VRAM
Video memory on a GPU. Where the model weights, KV cache, and activations live during inference.
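
As a rough sizing sketch, total VRAM is roughly additive across those pieces (the function name and the overhead figure below are illustrative assumptions, not a precise model):

```python
# Rough VRAM budget for inference: weights + KV cache + a fudge factor
# for activations and runtime buffers. Numbers are purely illustrative.
def vram_gb(weights_gb: float, kv_cache_gb: float, overhead_gb: float = 1.5) -> float:
    return weights_gb + kv_cache_gb + overhead_gb

print(vram_gb(42.0, 1.1))  # ~44.6 GB for a 42 GB model with ~1 GB of KV cache
```
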
KV Cache
Stored attention keys and values for past tokens. Grows linearly with context length.
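
A minimal sketch of the usual back-of-envelope formula (the example config is Llama-3-8B-like; exact figures vary by model and cache dtype):

```python
# KV cache bytes = 2 (K and V) x layers x KV heads x head_dim x tokens x bytes/elem.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:  # fp16 default
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Llama-3-8B-like config: 32 layers, 8 KV heads, head_dim 128, 8k context
print(kv_cache_bytes(32, 8, 128, 8192) / 1e9)  # ~1.07 GB
```
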
Quantization
Storing weights at lower bit widths (e.g. 4-bit instead of 16-bit). Cuts memory roughly in proportion, at a small quality cost.
Q4_K_M
A GGUF 4-bit k-quant variant ("M" for medium) that keeps some tensors at higher precision. The popular default for fitting big models on consumer hardware.
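
To estimate weight size at a given quantization, multiply parameter count by bits per weight (the ~4.8 bits/weight figure for Q4_K_M is an approximation; the effective average varies by model):

```python
# Weight footprint in GB: 1e9 params x bits / 8 bits-per-byte / 1e9 bytes-per-GB.
def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    return n_params_b * bits_per_weight / 8

print(weights_gb(70, 4.8))  # ~42 GB at Q4_K_M (approximate bits/weight)
print(weights_gb(70, 16))   # ~140 GB at fp16, for comparison
```
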
Context Length
How many tokens the model can attend to in one forward pass. Drives KV cache memory.
Grouped Query Attention (GQA)
An attention variant that shares each key/value head across a group of query heads, shrinking the KV cache by the group ratio.
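
The savings factor is just query heads divided by KV heads; a quick check using a Llama-2-70B-style config (64 query heads, 8 KV heads):

```python
# GQA shrinks the KV cache by the query-to-KV head ratio.
n_query_heads, n_kv_heads = 64, 8  # Llama-2-70B-style config
print(f"KV cache is {n_query_heads // n_kv_heads}x smaller than full multi-head attention")
```
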
Mixture of Experts (MoE)
Architecture with many expert subnetworks where only a few activate per token.
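
For sizing, the key point is that memory scales with total parameters while per-token speed scales with active parameters; Mixtral-8x7B's commonly cited figures illustrate the gap:

```python
# MoE: all experts must fit in memory, but only active params are read per token.
total_params_b = 46.7   # Mixtral-8x7B total (commonly cited figure)
active_params_b = 12.9  # params active per token
print(f"memory scales with {total_params_b}B params, speed with {active_params_b}B")
```
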
Tokens per Second (tok/s)
How fast the model generates output. For single-stream generation, typically bound by memory bandwidth rather than compute.
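
A simple upper-bound sketch (assumption: each generated token reads every active weight from memory once):

```python
# Decode-speed ceiling: tok/s <= bandwidth / bytes of weights read per token.
def max_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

print(max_tok_s(800, 42))  # ~19 tok/s for a 42 GB model at ~800 GB/s (M2 Ultra-class)
```
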
Unified Memory
Apple Silicon's shared CPU/GPU memory pool. The GPU can address system RAM directly, with roughly 75% of it available by default.
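
A quick estimate of GPU-usable memory under that default cap (assumption: the ~75% figure; the limit can be raised on recent macOS versions):

```python
# GPU-addressable share of unified memory, assuming the ~75% default cap.
def usable_unified_gb(total_ram_gb: float, fraction: float = 0.75) -> float:
    return total_ram_gb * fraction

print(usable_unified_gb(64))  # ~48 GB of a 64 GB Mac usable for model + KV cache
```
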
Tensor Parallelism
Splitting a single model across multiple GPUs by partitioning weight matrices.
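
A rough per-GPU estimate, assuming an even split plus some per-GPU runtime overhead (the overhead figure is illustrative):

```python
# Each GPU holds ~1/N of the weights plus its own runtime overhead.
def per_gpu_gb(model_gb: float, n_gpus: int, overhead_gb: float = 1.0) -> float:
    return model_gb / n_gpus + overhead_gb

print(per_gpu_gb(42, 2))  # ~22 GB per GPU for a 42 GB model on 2 GPUs
```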