Context Length
How many tokens the model can attend to in one forward pass. Drives KV cache memory.
Context length is the maximum number of tokens (sub-word units, roughly word fragments) the model can consider at once. Llama 3.1 supports up to 128k tokens. Memory grows linearly with context length, because every additional token appends one key and one value vector per layer to the KV cache. Setting the context to 128k "just in case" is a common cause of VRAM blowups.
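To make the linear growth concrete, here is a minimal sketch of the standard KV-cache size formula. The architecture numbers used below (32 layers, 8 KV heads via grouped-query attention, head dim 128) match the published Llama 3.1 8B configuration; fp16 cache precision and batch size 1 are assumptions for illustration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2,
                   batch_size: int = 1) -> int:
    # Factor of 2 covers keys AND values: each token stores one K vector
    # and one V vector per KV head, per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * bytes_per_elem * batch_size)

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
full = kv_cache_bytes(32, 8, 128, 128 * 1024)
print(f"128k context: {full / 2**30:.1f} GiB")    # ~16.0 GiB
modest = kv_cache_bytes(32, 8, 128, 8 * 1024)
print(f"8k context:    {modest / 2**30:.1f} GiB")  # ~1.0 GiB
```

Under these assumptions, the cache alone costs about 16 GiB at 128k context versus about 1 GiB at 8k, which is why right-sizing the context window is one of the cheapest ways to reclaim VRAM.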