Grouped Query Attention (GQA)
An attention variant that shares each set of keys and values across a group of query heads, drastically shrinking the KV cache.
Standard multi-head attention keeps a separate set of keys and values for every query head. GQA instead shares one K/V head across a group of query heads. Llama 3.1 70B, for example, uses 8 KV heads for 64 query heads, an 8x reduction in KV-cache size. This is the architectural change that made long-context Llama practical without requiring massive memory.
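A minimal NumPy sketch of the idea (illustrative, not any library's API): each K/V head is repeated so that every query head in a group attends against the same cached keys and values. Only the un-repeated K/V tensors would need to live in the cache.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped query attention sketch.

    q: (n_q_heads, seq, d)  -- one query head per attention head
    k, v: (n_kv_heads, seq, d)  -- fewer K/V heads; this is all the cache stores
    """
    n_q, seq, d = q.shape
    n_kv = k.shape[0]
    assert n_q % n_kv == 0, "query heads must divide evenly into groups"
    group = n_q // n_kv

    # Broadcast each K/V head to its group of query heads (compute-time only;
    # the KV cache holds just the n_kv_heads copies).
    k = np.repeat(k, group, axis=0)  # (n_q_heads, seq, d)
    v = np.repeat(v, group, axis=0)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over keys
    return w @ v  # (n_q_heads, seq, d)
```

With 8 query heads and 2 KV heads this caches 4x less K/V data; the Llama 3.1 70B ratio (64 query / 8 KV) gives the 8x reduction mentioned above.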