Glossary

Quick definitions for terms that come up when sizing local LLMs.

VRAM
Video memory on a GPU. Where the model weights, KV cache, and activations live during inference.
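
As a rough sizing sketch, total VRAM is roughly additive across those pieces (the function name and the overhead figure below are illustrative assumptions, not a precise model):

```python
# Rough VRAM budget for inference: weights + KV cache + a fudge factor
# for activations and runtime buffers. Numbers are purely illustrative.
def vram_gb(weights_gb: float, kv_cache_gb: float, overhead_gb: float = 1.5) -> float:
    return weights_gb + kv_cache_gb + overhead_gb

print(vram_gb(42.0, 1.1))  # ~44.6 GB for a 42 GB model with ~1 GB of KV cache
```
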
KV Cache
Stored attention keys and values for past tokens. Grows linearly with context length.
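
A minimal sketch of the usual back-of-envelope formula (the example config is Llama-3-8B-like; exact figures vary by model and cache dtype):

```python
# KV cache bytes = 2 (K and V) x layers x KV heads x head_dim x tokens x bytes/elem.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:  # fp16 default
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Llama-3-8B-like config: 32 layers, 8 KV heads, head_dim 128, 8k context
print(kv_cache_bytes(32, 8, 128, 8192) / 1e9)  # ~1.07 GB
```
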
Quantization
Storing weights at lower bit widths (e.g. 4-bit instead of 16-bit). Cuts memory roughly in proportion, at a small quality cost.
Q4_K_M
A GGUF 4-bit k-quant variant ("M" for medium) that keeps some tensors at higher precision. The popular default for fitting big models on consumer hardware.
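
To estimate weight size at a given quantization, multiply parameter count by bits per weight (the ~4.8 bits/weight figure for Q4_K_M is an approximation; the effective average varies by model):

```python
# Weight footprint in GB: 1e9 params x bits / 8 bits-per-byte / 1e9 bytes-per-GB.
def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    return n_params_b * bits_per_weight / 8

print(weights_gb(70, 4.8))  # ~42 GB at Q4_K_M (approximate bits/weight)
print(weights_gb(70, 16))   # ~140 GB at fp16, for comparison
```
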
Context Length
How many tokens the model can attend to in one forward pass. Drives KV cache memory.
Grouped Query Attention (GQA)
An attention variant that shares each key/value head across a group of query heads, shrinking the KV cache by the group ratio.
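
The savings factor is just query heads divided by KV heads; a quick check using a Llama-2-70B-style config (64 query heads, 8 KV heads):

```python
# GQA shrinks the KV cache by the query-to-KV head ratio.
n_query_heads, n_kv_heads = 64, 8  # Llama-2-70B-style config
print(f"KV cache is {n_query_heads // n_kv_heads}x smaller than full multi-head attention")
```
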
Mixture of Experts (MoE)
Architecture with many expert subnetworks where only a few activate per token.
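
For sizing, the key point is that memory scales with total parameters while per-token speed scales with active parameters; Mixtral-8x7B's commonly cited figures illustrate the gap:

```python
# MoE: all experts must fit in memory, but only active params are read per token.
total_params_b = 46.7   # Mixtral-8x7B total (commonly cited figure)
active_params_b = 12.9  # params active per token
print(f"memory scales with {total_params_b}B params, speed with {active_params_b}B")
```
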
Tokens per Second (tok/s)
How fast the model generates output. For single-stream generation, typically bound by memory bandwidth rather than compute.
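
A simple upper-bound sketch (assumption: each generated token reads every active weight from memory once):

```python
# Decode-speed ceiling: tok/s <= bandwidth / bytes of weights read per token.
def max_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

print(max_tok_s(800, 42))  # ~19 tok/s for a 42 GB model at ~800 GB/s (M2 Ultra-class)
```
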
Unified Memory
Apple Silicon's shared CPU/GPU memory pool. The GPU can address system RAM directly, with roughly 75% of it available by default.
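
A quick estimate of GPU-usable memory under that default cap (assumption: the ~75% figure; the limit can be raised on recent macOS versions):

```python
# GPU-addressable share of unified memory, assuming the ~75% default cap.
def usable_unified_gb(total_ram_gb: float, fraction: float = 0.75) -> float:
    return total_ram_gb * fraction

print(usable_unified_gb(64))  # ~48 GB of a 64 GB Mac usable for model + KV cache
```
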
Tensor Parallelism
Splitting a single model across multiple GPUs by partitioning weight matrices.
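
A rough per-GPU estimate, assuming an even split plus some per-GPU runtime overhead (the overhead figure is illustrative):

```python
# Each GPU holds ~1/N of the weights plus its own runtime overhead.
def per_gpu_gb(model_gb: float, n_gpus: int, overhead_gb: float = 1.0) -> float:
    return model_gb / n_gpus + overhead_gb

print(per_gpu_gb(42, 2))  # ~22 GB per GPU for a 42 GB model on 2 GPUs
```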