Pipeline Parallelism

Pipeline parallelism is a way to train a model that is too large to fit on one device by splitting it across devices along its sequence of layers. The first group of layers lives on the first device, the next group on the second, and so on, so that a forward pass flows through the devices in turn like a factory assembly line, with the backward pass flowing back the other way.
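To make the layer split concrete, here is a minimal sketch in plain Python that divides a model's layers into contiguous groups, one per device. The function name partition_layers and the even-split rule are assumptions for illustration; real frameworks often balance stages by compute or memory cost rather than by layer count.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous pipeline stages.

    Hypothetical helper: splits as evenly as possible by layer count,
    giving any leftover layers to the earliest stages.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need between 1 and num_layers stages")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages each take one additional layer.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Example: 10 layers across 4 devices.
for device, layers in enumerate(partition_layers(10, 4)):
    print(f"device {device}: layers {list(layers)}")
```

Each device holds only its own range of layers, so the activations leaving the last layer of one stage are what gets passed to the next device.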

The challenge is keeping the devices busy. If you simply hand one full batch down the line, only one device works at a time while the rest sit idle; this idle time is often called the pipeline bubble. The standard fix, introduced at scale by Google’s GPipe, is to divide each batch into smaller micro-batches and feed them in one after another, so that while device two works on the first micro-batch, device one is already starting the second. This staggering keeps multiple stages active at once. It does not remove the bubble entirely, because the pipeline still has to fill at the start and drain at the end, but the idle share shrinks as the number of micro-batches grows relative to the number of stages. Pipeline parallelism is usually combined with data parallelism and tensor parallelism; using all three together is commonly called 3D parallelism.
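The effect of micro-batching on idle time can be sketched with a toy schedule. The code below is an idealized illustration, not GPipe's implementation: it models only the forward pass, assumes every stage takes exactly one time step per micro-batch, and ignores communication cost. The names forward_schedule and idle_fraction are hypothetical.

```python
from fractions import Fraction


def forward_schedule(num_stages: int, num_microbatches: int) -> list[list]:
    """Return a grid where grid[t][s] is the micro-batch that stage s
    processes at time step t, or None if the stage is idle.

    Assumes unit-time stages: micro-batch i reaches stage s at step s + i.
    """
    total_steps = num_stages + num_microbatches - 1
    grid = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            mb = t - s
            row.append(mb if 0 <= mb < num_microbatches else None)
        grid.append(row)
    return grid


def idle_fraction(grid: list[list]) -> Fraction:
    """Share of all (time step, stage) slots in which a stage is idle."""
    slots = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return Fraction(idle, slots)


# Four stages: one full batch versus the same batch cut into 8 micro-batches.
print(idle_fraction(forward_schedule(4, 1)))  # 3/4
print(idle_fraction(forward_schedule(4, 8)))  # 3/11
```

Under these assumptions the idle share works out to (stages - 1) / (micro-batches + stages - 1), which is why adding micro-batches helps but never brings the bubble all the way to zero.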

Why a business reader should care: pipeline parallelism is one of the techniques that make it possible to train models far larger than any single chip can hold, which underpins the scale of today’s frontier models.
