Set Up a 32k-Context Coding Assistant

Long context for repo-scale work without melting the GPU. Right model, right quant, right KV cache.

  1. Step 1

    Pick a long-context model

    Qwen 2.5 32B and 7B both handle 128k natively. DeepSeek Coder V2 is also strong on long code. Avoid older multi-head-attention models; their KV cache will eat you alive.
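    The warning about multi-head models comes down to KV-cache arithmetic. A rough sketch, assuming a Qwen 2.5 32B-like shape (64 layers, 8 KV heads via grouped-query attention, head dim 128) against a hypothetical model that caches all 40 attention heads:

    ```python
    def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                           bytes_per_elem: int = 2) -> int:
        """Bytes of KV cache per token: keys + values, across all layers."""
        return 2 * layers * kv_heads * head_dim * bytes_per_elem

    # Grouped-query attention (Qwen 2.5 32B-like shape, assumed): 8 KV heads
    gqa = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128)   # 256 KiB/token
    # Old-style multi-head attention: all 40 heads get cached
    mha = kv_bytes_per_token(layers=64, kv_heads=40, head_dim=128)  # 1.25 MiB/token

    ctx = 32_768
    print(f"GQA at 32k: {gqa * ctx / 2**30:.1f} GiB")  # 8.0 GiB
    print(f"MHA at 32k: {mha * ctx / 2**30:.1f} GiB")  # 40.0 GiB
    ```

    Same layer count and head dimension, five times the cache: that is the whole argument for preferring GQA models at long context.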

  2. Step 2

    Quantize the KV cache to FP8 or INT8

    Most inference engines support this. Relative to FP16, KV cache memory drops by half, and the quality cost on code is small.
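    The halving is easy to verify on paper. A sketch, again assuming a Qwen 2.5 32B-like KV shape (64 layers × 8 KV heads × head dim 128):

    ```python
    LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128  # assumed Qwen 2.5 32B-like shape
    CTX = 32_768

    def kv_cache_gib(bytes_per_elem: float) -> float:
        # keys + values, per layer, per KV head, per cached token
        return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem * CTX / 2**30

    print(f"FP16 KV: {kv_cache_gib(2):.1f} GiB")  # 8.0 GiB
    print(f"FP8  KV: {kv_cache_gib(1):.1f} GiB")  # 4.0 GiB -- half the memory
    ```

    FP8 and INT8 both store one byte per element, so either choice buys the same 2x saving over FP16.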

  3. Step 3

    Set context to 32k, not 128k

    32k holds a substantial slice of a codebase. 128k forces you to fight for memory and slows first-token prefill. Most repo-scale workflows fit in 32k.
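    The prefill cost is easiest to feel in seconds. A back-of-envelope sketch; the 1,500 tokens/s prefill rate below is a placeholder assumption, not a benchmark, so substitute your own measurement:

    ```python
    PREFILL_TOK_S = 1_500  # placeholder assumption; measure on your own hardware

    def time_to_first_token(prompt_tokens: int) -> float:
        """Seconds spent prefilling before the first output token appears."""
        return prompt_tokens / PREFILL_TOK_S

    for ctx in (32_768, 131_072):
        print(f"{ctx:>7}-token prompt -> ~{time_to_first_token(ctx):.0f} s to first token")
    ```

    Filling the window quadruples both the wait and the KV cache footprint, which is why 32k is the sweet spot for repo-scale work.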

  4. Step 4

    Verify in the calculator

    Open the home page calculator and set Qwen 2.5 32B, Q4_K_M, 32k context, KV 8-bit. The total lands near 22 GB, which fits comfortably on a 4090 or an M3 Pro.
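    The calculator's number can be sanity-checked by hand. A rough sketch: the parameter count, the ~4.5 bits-per-weight figure for Q4_K_M, and the KV shape are all assumptions here, and the calculator's exact accounting (runtime overhead, activations) may shift the total by a GB or two:

    ```python
    def total_vram_gb(params_b: float, bits_per_weight: float,
                      layers: int, kv_heads: int, head_dim: int,
                      ctx: int, kv_bytes: int) -> float:
        weights = params_b * 1e9 * bits_per_weight / 8          # quantized weights
        kv = 2 * layers * kv_heads * head_dim * kv_bytes * ctx  # keys + values
        return (weights + kv) / 1e9                             # decimal GB

    # Qwen 2.5 32B-ish, Q4_K_M (~4.5 bpw assumed), 32k context, 8-bit KV
    est = total_vram_gb(params_b=32.5, bits_per_weight=4.5,
                        layers=64, kv_heads=8, head_dim=128,
                        ctx=32_768, kv_bytes=1)
    print(f"~{est:.1f} GB")
    ```

    With these assumptions the estimate lands in the low 20s of GB, the same ballpark as the calculator's figure.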