โ† All playbooks

Run Llama 3.1 8B on a Laptop GPU

The smallest sane local-LLM setup. 8B at Q4, 8k context, on any 8 GB+ GPU.

  1. Pick the model and quant

    Open the calculator, pick Llama 3.1 8B and Q4_K_M. Total memory at 8k context lands around 6 GB: plenty of headroom on a 12 GB laptop GPU, tight but workable on 8 GB.
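
    For a rough sense of where that 6 GB comes from, here is a back-of-envelope estimate. The constants are assumptions, not measurements: ~4.7 effective bits per weight for Q4_K_M, Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dim 128), an fp16 KV cache, and ~0.5 GB of runtime overhead.

    ```python
    # Back-of-envelope VRAM estimate for Llama 3.1 8B, Q4_K_M, 8k context.
    # All constants are assumptions; treat the result as a ballpark, not a spec.
    PARAMS          = 8.03e9  # parameter count
    BITS_PER_WEIGHT = 4.7     # effective bits/weight for Q4_K_M (assumed)
    N_LAYERS        = 32
    N_KV_HEADS      = 8       # grouped-query attention
    HEAD_DIM        = 128
    KV_BYTES        = 2       # fp16 keys and values
    CONTEXT         = 8192
    OVERHEAD_GB     = 0.5     # scratch buffers, runtime overhead (rough guess)

    weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1024**3
    # K and V, per layer, per token: 2 * n_kv_heads * head_dim * bytes
    kv_gb = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES * CONTEXT / 1024**3
    total_gb = weights_gb + kv_gb + OVERHEAD_GB
    print(f"weights {weights_gb:.1f} GB + KV {kv_gb:.1f} GB + "
          f"overhead {OVERHEAD_GB} GB = ~{total_gb:.1f} GB")
    ```

    With those assumptions it prints roughly 4.4 + 1.0 + 0.5 ≈ 5.9 GB, which is where the ~6 GB figure comes from.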

  2. Install llama.cpp or Ollama

    Ollama is the friction-free way: one install, then `ollama run llama3.1:8b-instruct-q4_K_M`. llama.cpp gives you more control over quantization and KV cache settings if you care.
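
    One nice side effect of the Ollama route: once the server is running, the model is also reachable through Ollama's local HTTP API on port 11434, which makes the later steps easy to script. A minimal sketch, assuming the default port and the model tag from the command above:

    ```python
    # Minimal, non-streaming request against Ollama's local API (default port 11434).
    # Assumes `ollama run llama3.1:8b-instruct-q4_K_M` has already pulled the model.
    import json
    import urllib.request

    payload = {
        "model": "llama3.1:8b-instruct-q4_K_M",
        "prompt": "Explain what a KV cache is in two sentences.",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
    ```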

  3. Set context to your real workload

    Most chat fits in 4-8k. Code workflows benefit from 16-32k. Do not crank context to 128k 'just in case': the KV cache grows linearly with context length and you will OOM on the second long chat.
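
    To see why, here is the same KV-cache arithmetic as in step 1, run across a few context lengths (same assumed model shape, fp16 cache):

    ```python
    # KV cache size vs. context length for the assumed Llama 3.1 8B shape
    # (32 layers, 8 KV heads, head dim 128, fp16 cache). Growth is linear.
    N_LAYERS, N_KV_HEADS, HEAD_DIM, KV_BYTES = 32, 8, 128, 2
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES  # K + V

    for ctx in (4096, 8192, 16384, 32768, 131072):
        print(f"{ctx:>6} tokens -> {ctx * bytes_per_token / 1024**3:5.1f} GB KV cache")
    # ~0.5 GB at 4k, ~1 GB at 8k, ~4 GB at 32k, ~16 GB at 128k: the cache alone
    # outgrows an 8 GB card long before you reach 128k.
    ```

    In llama.cpp the context length is set with `-c`/`--ctx-size`; with Ollama you can pass it per request as `"options": {"num_ctx": 8192}` in the API payload.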

  4. Verify with a real prompt

    Watch nvidia-smi or Activity Monitor while you ask it something that produces a long output. If memory hits 95%+ you are too close to the edge: drop the quant or the context.
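
    If you want more than eyeballing, a small polling loop does the job. This sketch assumes an NVIDIA GPU with `nvidia-smi` on the PATH; on a Mac, stick with Activity Monitor.

    ```python
    # Poll GPU memory every 2 s while a long generation runs; flag 95%+ usage.
    import subprocess
    import time

    def gpu_memory_pct() -> float:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        first_gpu = out.strip().splitlines()[0]   # laptops usually have one GPU
        used, total = (float(x) for x in first_gpu.split(","))
        return 100.0 * used / total

    while True:
        pct = gpu_memory_pct()
        warn = "  <- too close to the edge" if pct >= 95 else ""
        print(f"GPU memory: {pct:5.1f}%{warn}")
        time.sleep(2)
    ```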