โ† All terms

VRAM

Video memory on a GPU. The space the model weights, KV cache, and activations live in during inference.

VRAM is dedicated GPU memory, separate from system RAM. NVIDIA consumer cards range from 8 GB to 32 GB; datacenter cards go up to 192 GB, and AMD's Instinct MI300X also offers 192 GB. Apple Silicon has no separate VRAM: the GPU shares unified memory with the CPU, with roughly 75% of that memory available to the GPU by default. For LLM inference, more VRAM means bigger models, longer context, or larger batches.
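A rough sketch of how those pieces add up. The helper below is hypothetical (not from any library); it estimates VRAM as weights plus KV cache, ignoring activations and framework overhead, and assumes example shapes (layer count, KV heads, head dimension) typical of an 8B-parameter model:

```python
def estimate_vram_gb(params_b, bytes_per_param=2,
                     n_layers=32, n_kv_heads=8, head_dim=128,
                     context_len=8192, batch=1, kv_bytes=2):
    """Rough estimate: weights + KV cache, in GB.

    Ignores activations and runtime overhead, so treat it as a floor.
    Defaults assume FP16 weights and an 8B-class architecture.
    """
    # Weights: one entry per parameter.
    weights = params_b * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, per KV head, per token.
    kv = 2 * n_layers * n_kv_heads * head_dim * context_len * batch * kv_bytes
    return (weights + kv) / 1e9

# An 8B model at FP16 with an 8k context needs ~17 GB,
# which already exceeds a 16 GB consumer card.
print(round(estimate_vram_gb(8), 1))
```

Quantizing to 4-bit (`bytes_per_param=0.5`) brings the same model under 6 GB, which is why quantization is the usual path to fitting larger models in limited VRAM.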