Set Up a 32k-Context Coding Assistant
Long context for repo-scale work without melting the GPU. Right model, right quant, right KV cache.
- Step 1
Pick a long-context model
Qwen 2.5 32B and 7B both handle 128k context natively, and DeepSeek Coder V2 is also strong on long code. Avoid older models that use plain multi-head attention: without grouped-query attention (GQA), the KV cache will eat you alive.
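To see why GQA matters, here is a back-of-envelope KV cache sizing sketch. The dimensions are the commonly reported Qwen 2.5 32B shape (64 layers, 40 query heads, 8 KV heads, head dim 128) -- treat them as assumptions and verify against the model's `config.json`:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes per token: K and V tensors, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

GIB = 1024 ** 3
CTX = 32_768

# Assumed Qwen 2.5 32B shape: 64 layers, 8 KV heads (GQA), head dim 128.
gqa = kv_bytes_per_token(n_layers=64, n_kv_heads=8, head_dim=128)
# Hypothetical: the same model with full multi-head attention (40 KV heads).
mha = kv_bytes_per_token(n_layers=64, n_kv_heads=40, head_dim=128)

print(f"GQA at 32k: {gqa * CTX / GIB:.0f} GiB")  # 8 GiB
print(f"MHA at 32k: {mha * CTX / GIB:.0f} GiB")  # 40 GiB
```

At FP16, the GQA model holds a 32k context in about 8 GiB of KV cache; the same model with full multi-head attention would need about 40 GiB for the cache alone.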
- Step 2
Quantize the KV cache to FP8 or INT8
Most inference engines support this. KV cache memory drops by half relative to FP16, and the quality cost on code is small.
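As a sketch, here is how two common engines enable a quantized KV cache. The model path and name are placeholders; flag names are current as of recent llama.cpp and vLLM releases, so check your engine's `--help` before copying:

```shell
# llama.cpp server: q8_0 KV cache (quantized V cache requires flash attention, -fa)
llama-server -m ./qwen2.5-32b-q4_k_m.gguf -c 32768 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0

# vLLM: FP8 KV cache
vllm serve Qwen/Qwen2.5-32B-Instruct --max-model-len 32768 \
  --kv-cache-dtype fp8
```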
- Step 3
Set context to 32k, not 128k
32k holds a substantial codebase. 128k forces you to fight for memory and slows the first-token prefill. Most repo-scale workflows fit in 32k.
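The 32k-versus-128k tradeoff is easy to quantify. Using the same assumed Qwen 2.5 32B shape as above (64 layers, 8 KV heads, head dim 128) with an 8-bit cache:

```python
GIB = 1024 ** 3

# K+V, per layer, per KV head, 1 byte/element with an 8-bit cache.
per_token = 2 * 64 * 8 * 128 * 1  # 131072 bytes = 128 KiB per token

for ctx in (32_768, 131_072):
    print(f"{ctx:>7} tokens: {per_token * ctx / GIB:.0f} GiB KV cache")
# 32k  ->  4 GiB
# 128k -> 16 GiB
```

The cache grows linearly with context, but prefill attention FLOPs grow quadratically with prompt length, so a 4x longer prompt costs roughly 16x the attention work before the first token appears.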
- Step 4
Verify in the calculator
Open the home-page calculator and set Qwen 2.5 32B, Q4_K_M, 32k context, 8-bit KV cache. The total lands near 22 GB, which fits comfortably on a 4090 or an M3 Pro.