โ† All posts

Running Local LLMs on Apple Silicon in 2026

2026-05-02

Unified memory makes Macs surprisingly good at fitting big models, and memory bandwidth makes them disappointingly slow at generating tokens. Where the M-series shines and where it does not.

The Apple Silicon story for local LLMs fits in two sentences. Unified memory means the GPU can address up to ~75% of system RAM. Memory bandwidth is far lower than on NVIDIA datacenter cards. Both matter.

The numbers

| Mac | Unified RAM | Bandwidth | tok/s on Llama 3.1 8B Q4 |
|-----|-------------|-----------|--------------------------|
| M4 Pro 48 GB | 36 GB usable | 273 GB/s | ~70 |
| M3 Max 128 GB | 96 GB usable | 400 GB/s | ~110 |
| M4 Max 128 GB | 96 GB usable | 546 GB/s | ~140 |
| M2 Ultra 192 GB | 144 GB usable | 800 GB/s | ~200 |
| M4 Ultra 256 GB | 192 GB usable | 1092 GB/s | ~270 |
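The tok/s column tracks bandwidth almost linearly, because decoding is memory-bound: every generated token streams the full weight file through the GPU. A rough sketch of that ceiling; the ~4 bits/weight figure is an assumption for Q4-class quants (ignoring embeddings and KV-cache traffic), not a measured constant:

```python
GB = 1e9

def est_tok_s(bandwidth_gb_s, params_b, bits_per_weight=4.0):
    """Upper-bound tokens/sec = bandwidth / bytes streamed per token."""
    bytes_per_token = params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * GB / bytes_per_token

# Llama 3.1 8B at ~4 bits/weight reads roughly 4 GB per token.
for name, bw in [("M4 Pro", 273), ("M4 Max", 546), ("M2 Ultra", 800)]:
    print(f"{name}: ~{est_tok_s(bw, 8):.0f} tok/s ceiling")
```

These ceilings land within roughly 10% of the measured column above; the residual gap is runtime overhead and the exact bits-per-weight of the quant format used.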

Where Macs win

  • Quiet single-machine inference. No fan noise, no second PSU.
  • Fitting big models. M4 Ultra 256 GB runs DeepSeek V3 at Q4. No NVIDIA consumer card touches that.
  • Power. The whole machine pulls 30-100 W during inference. A 4090 alone pulls 400 W or more.
  • Battery laptops. M4 Max in a MacBook Pro lets you run a local LLM on a plane.

Where Macs lose

  • Tokens per second on small models. A 4090 demolishes any Mac on 7B Q4 through sheer memory bandwidth (1008 GB/s).
  • Multi-user serving. macOS is not a serving stack. Linux on NVIDIA wins for any team scenario.
  • Software polish. vLLM, TensorRT-LLM, and ExLlamaV2 all target NVIDIA first. llama.cpp's Metal backend is excellent but not the fastest path for every model.
  • Training. The MLX framework is improving but still well behind PyTorch on CUDA.

Buying advice

If your goal is to run 70B-class models on a single quiet machine and you do not need maximum tok/s, buy a Mac. M4 Max 128 GB is the sweet spot for most people. M4 Ultra 256 GB is the flex if you want DeepSeek V3 to fit.
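A back-of-the-envelope fit check makes the sweet spot concrete. This assumes the ~75% GPU share from above and ~4.5 bits/weight for a Q4_K_M-class quant; the KV-cache allowance is a rough guess and `fits` is a hypothetical helper, not a real tool:

```python
def fits(params_b, ram_gb, bits_per_weight=4.5, kv_cache_gb=6.0):
    """Does a quantized model fit in GPU-usable unified memory?"""
    usable_gb = ram_gb * 0.75                       # GPU's share of system RAM
    weights_gb = params_b * bits_per_weight / 8     # GB of weights at this quant
    return weights_gb + kv_cache_gb <= usable_gb

print(fits(70, 128))   # 70B Q4 on an M4 Max 128 GB -> True
print(fits(70, 48))    # 70B Q4 on an M4 Pro 48 GB  -> False
```

A 70B model at ~4.5 bits is about 39 GB of weights, which is why 128 GB machines handle 70B-class models comfortably while 48 GB machines do not.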

If your goal is fast 7B-13B for one user, buy a 4090. If you want 70B with throughput, buy two 3090s or rent cloud GPUs.

The calculator on the home page includes every Mac listed above and computes the actual tok/s you should expect.