โ† All playbooks

Run Llama 3.1 70B on Two RTX 3090s

The classic budget 70B setup. 48 GB combined VRAM, tensor parallel, Q4 quant.

  1. Pick a serving engine

    vLLM and TensorRT-LLM both handle tensor parallelism cleanly. ExLlamaV2 is the connoisseur choice for EXL2-quantized weights. Pick vLLM if you want one command and reasonable defaults.
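If you go the vLLM route, a single command brings up an OpenAI-compatible server. A sketch, with assumptions: the model repo name below is illustrative (any 4-bit 70B checkpoint works), and note that vLLM's first-class 4-bit paths are AWQ/GPTQ, while GGUF (Q4_K_M) support is more experimental.

```shell
# Sketch: serve a 4-bit Llama 3.1 70B across two GPUs with vLLM
# (model repo name is an assumption; substitute your quantized checkpoint)
vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92
```

`--gpu-memory-utilization` caps how much of each card vLLM claims for weights plus KV cache; leave a few percent free for the display server and CUDA overhead.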

  2. Quant to Q4_K_M

    Llama 3.1 70B at Q4_K_M is ~43 GB of weights. Add an 8k KV cache, activations, and headroom: ~52 GB total, which splits to ~26 GB per card. Comfortable on a dual-3090 rig.
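The arithmetic behind those numbers can be sketched as follows. The effective bits-per-weight (~4.85 for Q4_K_M) is an approximation; the KV dimensions come from the published Llama 3.1 70B architecture (80 layers, 8 KV heads via GQA, head dim 128, fp16 cache):

```shell
# Back-of-envelope VRAM budget for Llama 3.1 70B at Q4_K_M
awk 'BEGIN {
  weights = 70.6e9 * 4.85 / 8 / 1e9        # ~70.6B params at ~4.85 bits/weight -> GB
  # fp16 KV cache: 2 (K+V) * 80 layers * 8 KV heads * 128 head dim * 2 bytes, per token
  kv_per_tok = 2 * 80 * 8 * 128 * 2
  kv = kv_per_tok * 8192 / 1e9             # one full 8k-token sequence -> GB
  printf "weights ~%.1f GB, 8k KV ~%.1f GB per sequence\n", weights, kv
}'
```

Batching multiplies the KV term (each concurrent sequence gets its own cache), which is where the rest of the headroom toward ~52 GB goes.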

  3. Set tensor_parallel_size=2

    vLLM: `--tensor-parallel-size 2`. The engine shards each weight matrix across the GPUs (column- and row-parallel splits) and synchronizes activations with NCCL all-reduces. Make sure both cards sit in x8 PCIe slots at minimum.
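You can verify the link width each card actually negotiated without rebooting into the BIOS; this is a standard nvidia-smi query:

```shell
# Confirm both 3090s are running at x8 or better
# (x4 links will bottleneck the NCCL all-reduce traffic between the cards)
nvidia-smi --query-gpu=index,name,pcie.link.width.current,pcie.link.gen.current \
  --format=csv
```

Note that `width.current` can read lower than the slot's maximum when the GPUs are idle, since the link downclocks to save power; check it under load.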

  4. Watch the PSU

    Two 3090s under load draw 700 W+ on their own, before counting the CPU and the rest of the system. Cheap 850 W supplies trip their overcurrent protection during transient spikes. Budget a gold-rated 1200 W unit at minimum, or power-limit the cards to 300 W via nvidia-smi if your PSU is borderline.
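Power-limiting is a couple of commands. The GPU indices below are assumptions (check yours with `nvidia-smi -L`), and the limit resets at reboot unless persistence mode holds it:

```shell
# Cap each 3090 at 300 W (indices 0 and 1 assumed; verify with nvidia-smi -L)
sudo nvidia-smi -pm 1          # enable persistence mode so the limit sticks
sudo nvidia-smi -i 0 -pl 300   # -pl sets the power limit in watts
sudo nvidia-smi -i 1 -pl 300
```

A 300 W cap typically costs only a few percent of inference throughput on a 3090, since tensor-parallel inference is rarely power-bound; it mostly shaves the transient spikes that trip marginal PSUs.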