Run DeepSeek V3 on a Mac Studio
A single-machine path to running the 671B-parameter DeepSeek V3 quietly at home.
- Step 1
Get an M2 Ultra 192 GB or M3 Ultra 256 GB+
DeepSeek V3 at Q4 is ~340 GB at a flat 4 bits per weight, and popular Q4_K_M GGUFs run closer to 400 GB. macOS caps GPU-addressable memory at roughly 75% of RAM by default, so even Apple's 192 GB M2 Ultra leaves only ~144 GB usable — Q4 will not fit. You need Q3 or more aggressive quantization, or an M3 Ultra: its 256 GB config gives ~192 GB usable, and the 512 GB config fits Q4 outright.
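The ~75% cap is a soft default, not a hard limit. On Apple Silicon (macOS 14+) you can raise the GPU wired-memory limit with a sysctl; the number below is a sketch for a 192 GB machine, and the setting resets on reboot:

```shell
# Default GPU-addressable memory is ~75% of RAM. On a 192 GB M2 Ultra,
# raise the wired limit to ~176 GB (180224 MB), leaving headroom for the OS.
# Illustrative value -- push too high and macOS will start paging or panic.
sudo sysctl iogpu.wired_limit_mb=180224

# Verify the new limit:
sysctl iogpu.wired_limit_mb
```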
- Step 2
Use llama.cpp Metal
MLX is improving fast, but llama.cpp's Metal backend still has the most polished MoE inference path on macOS. Pull a GGUF at Q3_K_M, or IQ2_M if you are squeezing into 192 GB.
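A sketch of the download-and-serve flow. The repo and file names are illustrative — check which quants are actually published; large GGUFs ship in shards, and llama.cpp picks up the remaining shards automatically when pointed at the first one:

```shell
# Fetch a quantized GGUF (repo/filenames are assumptions -- verify first).
huggingface-cli download unsloth/DeepSeek-V3-GGUF \
  --include "*Q3_K_M*" --local-dir ./models

# Serve it. Metal is on by default in macOS builds; -ngl 99 offloads
# every layer to the GPU, -c sets the context window.
./llama-server \
  -m ./models/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf \
  -ngl 99 -c 8192 --host 127.0.0.1 --port 8080
```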
- Step 3
Expect ~10-15 tok/s
Even at the M3 Ultra's 819 GB/s of memory bandwidth, streaming the ~37B active parameters per token (DeepSeek V3 is an MoE — most of the 671B sit idle on any given token) puts you in the 10-15 tok/s range at best. Quiet, useful, not GPT-4 fast.
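A back-of-the-envelope roofline, not a benchmark: each generated token must stream the active parameters through memory once, so bandwidth divided by active bytes per token gives a hard ceiling (the ~0.55 bytes/param figure is my assumption for a Q4-ish quant):

```shell
# Decode-speed ceiling = memory bandwidth / bytes read per token.
awk 'BEGIN {
  bw_gbs      = 819    # GB/s, M3 Ultra memory bandwidth
  active_b    = 37     # billions of active params per token (MoE)
  bytes_param = 0.55   # assumed bytes/param at ~Q4 quantization
  printf "ceiling: %.1f tok/s\n", bw_gbs / (active_b * bytes_param)
}'
# -> ceiling: 40.2 tok/s
```

Real-world throughput lands well under the ceiling — KV-cache reads, router overhead, and imperfect bandwidth utilization all take their cut, which is how ~40 tok/s theoretical becomes 10-15 tok/s observed.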
- Step 4
Plug into a real client
llama.cpp's llama-server exposes an OpenAI-compatible HTTP API. Point Open WebUI, Cursor, or your code agents at localhost:8080. Now you have a frontier-tier model running on a desk machine.
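A quick smoke test against the server once it is up (llama-server serves whatever model it loaded, so no model field is required in the request):

```shell
# Hit the OpenAI-compatible chat endpoint on the local server.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
        "temperature": 0.7
      }'
```

Any client that speaks the OpenAI chat-completions format can use the same endpoint by setting its base URL to `http://localhost:8080/v1`.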