Tokens per Second (tok/s)
How fast the model generates output. Bound by memory bandwidth, not compute.
During generation, the model reads all of its active weights once per token, so each token takes at least as long as streaming those weights from memory. Tokens per second roughly equals (memory bandwidth in GB/s) / (active weight size in GB). A 4090 at 1008 GB/s reading 4 GB of 8B-Q4 weights gives ~250 tok/s peak, ~160 in practice. Apple Silicon at 400-1100 GB/s generally gives lower numbers but runs quietly.
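
The arithmetic above can be sketched in a few lines of Python. This is a rough estimate under the bandwidth-bound assumption only. The function names and the `efficiency` parameter are hypothetical, and the 0.64 value is simply back-derived from the ~160 of ~250 figure above, not a measured constant.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters (in billions) times bytes per weight."""
    return params_billion * bits_per_weight / 8


def tokens_per_second(bandwidth_gb_s: float, active_weights_gb: float,
                      efficiency: float = 1.0) -> float:
    """Bandwidth-bound decode estimate: one full read of the active weights per token.

    efficiency is an assumed fudge factor (1.0 = theoretical peak).
    """
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb * efficiency


# 8B parameters at 4 bits per weight is about 4 GB of weights.
size = weights_gb(8, 4)
peak = tokens_per_second(1008, size)            # theoretical ceiling on a 4090
realistic = tokens_per_second(1008, size, 0.64)  # assumed real-world efficiency
print(f"{size:.1f} GB, peak {peak:.0f} tok/s, realistic ~{realistic:.0f} tok/s")
```

Swapping in a different bandwidth or quantization level shows how the ceiling moves. Halving the bits per weight roughly doubles the peak, as long as generation stays bandwidth-bound.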