Run Llama 3.1 70B on Two RTX 3090s
The classic budget 70B setup. 48 GB combined VRAM, tensor parallel, Q4 quant.
- Step 1
Pick a serving engine
vLLM and TensorRT-LLM both handle tensor parallelism cleanly. ExLlamaV2 is the connoisseur choice if you're running EXL2-quantized weights. Pick vLLM if you want one command and reasonable defaults; a minimal launch is sketched below.
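For the one-command path, a minimal vLLM launch looks roughly like this. The Hugging Face repo name is just one example of a prequantized 4-bit build, not something this guide prescribes; swap in whatever quant you actually serve:

```bash
# Minimal dual-3090 vLLM launch (sketch). The model repo below is one
# example of a 4-bit AWQ quant; substitute your own.
pip install vllm
vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --tensor-parallel-size 2 \
  --max-model-len 8192
```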
- Step 2
Quant to Q4_K_M
Llama 3.1 70B at Q4_K_M is ~43 GB of weights. Add ~2.5 GB of fp16 KV cache at 8k context (thanks to GQA), plus activations and per-GPU CUDA overhead: roughly 47-48 GB, or ~24 GB per card. That fits on dual 3090s, but only just; if you OOM, shrink the context or step down to Q4_K_S. The arithmetic below shows where the KV figure comes from.
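A quick sanity check on the KV-cache number, using Llama 3.1 70B's published shape (80 layers, 8 KV heads under GQA, head dim 128) and an fp16 cache; adjust the constants if your config differs:

```bash
# KV cache size, back of the envelope.
LAYERS=80 KV_HEADS=8 HEAD_DIM=128 BYTES=2 CTX=8192
# bytes per token = layers * kv_heads * head_dim * 2 (K and V) * bytes/elem
echo $(( LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES ))              # 327680 B/token
echo $(( LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES * CTX >> 20 ))  # ~2560 MiB at 8k
```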
- Step 3
Set tensor_parallel_size=2
vLLM: `--tensor-parallel-size 2`. vLLM shards each layer's weight matrices across the two GPUs, Megatron-style (attention heads split between cards, MLPs split column- then row-wise), and synchronizes activations with NCCL all-reduces every layer. Without an NVLink bridge those all-reduces ride the PCIe bus, so make sure both cards get at least x8 lanes; the checks below confirm the live link.
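Two checks worth running before launch, assuming a stock driver install: the PCIe link each card actually negotiated, and the GPU-to-GPU topology NCCL will route over:

```bash
# Confirm each card is at gen4 x8 or better (cards downshift at idle,
# so check under load for the real number).
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current \
  --format=csv
# How the two GPUs reach each other; SYS (cross-socket) means slower all-reduces.
nvidia-smi topo -m
```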
- Step 4
Watch the PSU
Two 3090s at stock limits draw 700W+ on the cards alone, and 3090s are notorious for brief transient spikes well above that. Cheap 850W supplies trip during inference bursts. 1200W gold-rated minimum. If your PSU is borderline, power-limit the cards to 300W via nvidia-smi (sketch below); inference is memory-bound enough that the throughput cost is small.
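The power-limit sketch, assuming GPUs 0 and 1 and root access:

```bash
# Cap each 3090 at 300 W. The limit resets on reboot, so script it into startup.
sudo nvidia-smi -pm 1          # persistence mode, keeps driver state loaded
sudo nvidia-smi -i 0 -pl 300   # GPU 0
sudo nvidia-smi -i 1 -pl 300   # GPU 1
# Watch real draw against the cap while serving:
nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 1
```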