Blog
Long reads on local LLM hardware, quantization, and the actual numbers behind the marketing.
How Much VRAM Does Llama 3.1 70B Actually Need?
A walk through the real numbers (weights, KV cache, activations, headroom) for running Llama 3.1 70B at every popular quantization.
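As a taste of the arithmetic, here is a minimal sketch of the weights-only part of the estimate. The bytes-per-weight figures are approximate averages for common GGUF formats (block scales add overhead beyond the raw bit width), not exact file sizes.

```python
# Back-of-the-envelope weight memory for Llama 3.1 70B.
# Bytes-per-weight values are approximate GGUF averages.

PARAMS = 70.6e9  # Llama 3.1 70B parameter count

BYTES_PER_WEIGHT = {
    "FP16":   2.00,
    "Q8_0":   1.06,  # 8 bits + per-block FP16 scale
    "Q4_K_M": 0.60,  # ~4.8 bits effective
}

for fmt, bpw in BYTES_PER_WEIGHT.items():
    gib = PARAMS * bpw / 2**30
    print(f"{fmt:>7}: ~{gib:.0f} GiB for weights alone")
```

Weights alone land around 131 GiB at FP16, 70 GiB at Q8, and 40 GiB at Q4; the KV cache and activations come on top of that.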
Q4 vs Q8 vs FP16: Which Quantization Should You Actually Use?
A practical guide to picking quantization for local LLMs. What each format gives up, what it saves, and when the tradeoff matters.
Why Your Long-Context LLM Runs Out of Memory
The KV cache is the line item nobody mentions until you set context to 128k and the GPU explodes. Here is what it is and how to size it.
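The sizing rule is mechanical once you know the architecture. A minimal sketch, using Llama 3.1 70B's published configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and an FP16 cache:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * bytes_per_element * context_length
# Defaults match Llama 3.1 70B with an FP16 cache.

def kv_cache_gib(context_tokens: int, layers: int = 80,
                 kv_heads: int = 8, head_dim: int = 128,
                 bytes_per_el: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_el
    return per_token * context_tokens / 2**30

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gib(ctx):.1f} GiB of KV cache")
```

That works out to 320 KiB per token: roughly 2.5 GiB at an 8k context, but 40 GiB at 128k, which is why the same model that fits comfortably at short contexts suddenly does not.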
RTX 4090 vs Two RTX 3090s for Local LLMs
Same budget, very different ceilings. When the single 4090 wins and when the dual 3090 setup is the only sensible answer.
Running Local LLMs on Apple Silicon in 2026
Unified memory makes Macs surprisingly good at fitting big models, and disappointingly slow at generating tokens with them. Where the M-series shines and where it does not.