โ† All terms

Tensor Parallelism

Splitting a single model across multiple GPUs by partitioning weight matrices.

When a model is too big for one GPU, tensor parallelism splits each weight matrix across N GPUs row-wise or column-wise, then synchronizes the partial results with NCCL collectives (an all-reduce or all-gather per layer). vLLM exposes this with `--tensor-parallel-size N`. It works best with high-bandwidth interconnects (NVLink, fast PCIe); two RTX 3090s on PCIe 4.0 x8 (48 GB combined VRAM) can serve a 70B model, though only with ~4-bit quantization, since even INT8 weights for 70B parameters exceed 48 GB.
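To make the row/column split concrete, here is a minimal NumPy sketch that simulates both partitioning schemes on one machine. The two-shard setup and all names are illustrative, not vLLM internals; the `concatenate` and `+` steps stand in for NCCL's all-gather and all-reduce.

```python
import numpy as np

# Toy simulation of tensor parallelism across 2 "GPUs" (illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of activations
W = rng.standard_normal((8, 16))   # full weight matrix

# Column-wise split: each GPU holds half of W's output columns and
# computes its slice of the output independently.
W0, W1 = np.hsplit(W, 2)                          # two (8, 8) shards
y_col = np.concatenate([x @ W0, x @ W1], axis=1)  # "all-gather"

# Row-wise split: each GPU holds half of W's input rows and the
# matching slice of x; the partial products are summed.
W_top, W_bot = np.vsplit(W, 2)   # two (4, 16) shards
x0, x1 = np.hsplit(x, 2)         # matching activation shards
y_row = x0 @ W_top + x1 @ W_bot  # "all-reduce" (sum)

# Both schemes reproduce the unsharded matmul exactly.
assert np.allclose(y_col, x @ W)
assert np.allclose(y_row, x @ W)
```

In vLLM itself this is a one-liner: on a two-GPU box, `LLM(model=..., tensor_parallel_size=2)` in the Python API corresponds to passing `--tensor-parallel-size 2` on the command line.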