Tensor Parallelism

Tensor parallelism, also called intra-layer model parallelism, splits the computation inside an individual layer across multiple devices. Where pipeline parallelism cuts a model between its layers, tensor parallelism cuts within a layer: the large matrix multiplications at the heart of a transformer layer are partitioned by columns or rows across GPUs, each device computes its portion, and the partial results are combined with a small number of communication operations.

The approach was introduced at scale by NVIDIA’s Megatron-LM, which showed that the matrix multiplications in a transformer’s attention and feed-forward blocks can be split so cleanly that only a few all-reduce operations per layer are needed, and that this could be done in native PyTorch without a special compiler. Because every device cooperates on the same layer at the same time, tensor parallelism keeps a single very large layer’s work distributed, but it requires fast interconnects between the cooperating GPUs, so it is typically used within a single server while pipeline and data parallelism span across servers. The three together form what is commonly called 3D parallelism.

Why a business reader should care: tensor parallelism is what lets the enormous individual layers of frontier language models be computed at all, and it is a core reason high-bandwidth GPU interconnects matter so much for training.

Sources

Related