Run Llama 3.1 8B on a Laptop GPU
The smallest sane local-LLM setup. 8B at Q4, 8k context, on any 8 GB+ GPU.
- Step 1
Pick the model and quant
Open the calculator, pick Llama 3.1 8B and Q4_K_M. Total memory at 8k context lands around 6 GB. Plenty of headroom on a 12 GB laptop GPU and tight but workable on 8 GB.
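The ~6 GB figure can be sanity-checked by hand: Q4_K_M weights for the 8B model are roughly 4.9 GB on disk, and an fp16 KV cache for Llama 3.1 8B (32 layers, 8 KV heads via GQA, head dim 128) adds 1 GiB at 8k tokens. A rough sketch, where the 4.9 GB weight size is an approximate GGUF file size, not an exact number:

```python
def kv_cache_gib(ctx_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate fp16 KV cache size: K and V tensors, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / 2**30

weights_gb = 4.9  # approximate Q4_K_M GGUF size for Llama 3.1 8B
total = weights_gb + kv_cache_gib(8192)
print(f"~{total:.1f} GB at 8k context")  # ~5.9 GB
```

Real runtimes add some overhead (compute buffers, the runtime itself), which is why the calculator lands a bit above this back-of-envelope number.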
- Step 2
Install llama.cpp or Ollama
Ollama is the friction-free way: one install, then `ollama run llama3.1:8b-instruct-q4_K_M`. llama.cpp gives you more control over quantization and KV cache settings if you care.
- Step 3
Set context to your real workload
Most chat fits in 4-8k tokens. Code workflows benefit from 16-32k. Do not crank context to 128k 'just in case': the KV cache grows linearly with context length, and you will OOM on the second long chat.
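The linear growth is easy to see numerically. Assuming Llama 3.1 8B's cache layout (32 layers, 8 KV heads via GQA, head dim 128, fp16 cache), a quick sweep:

```python
def kv_cache_gib(ctx_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # 2x for the K and V tensors; fp16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / 2**30

for ctx in (4096, 8192, 32768, 131072):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx):5.2f} GiB KV cache")
```

At 128k the cache alone is 16 GiB, twice the total VRAM of an 8 GB card, before you even load the ~5 GB of weights.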
- Step 4
Verify with a real prompt
Watch nvidia-smi or Activity Monitor while you ask it something that produces a long output. If memory usage hits 95%+ of VRAM, you are too close to the edge: drop a quant level or reduce context.
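If you would rather script the check than eyeball it, the `nvidia-smi` query flags below are real, but treat the parsing helper and the 95% threshold as an illustrative sketch:

```python
def vram_fraction(csv_line):
    """Parse one 'used, total' line (MiB) as emitted by:
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    """
    used, total = (int(x) for x in csv_line.split(","))
    return used / total

# Sample line, as if read from the command above on an 8 GB card
line = "7420, 8192"
if vram_fraction(line) > 0.95:
    print("too close to the edge: drop a quant level or shrink context")
else:
    print(f"ok: {vram_fraction(line):.0%} of VRAM in use")
```

Run the generation first, then sample memory at its peak; usage climbs as the KV cache fills, so an idle reading tells you little.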