Multi-Head Attention

Multi-head attention is the Transformer’s trick of running attention several times in parallel instead of just once. Rather than computing a single set of attention weights, the model projects each token’s representation into query, key, and value vectors in several smaller subspaces, called “heads”, runs the attention computation separately in each, then concatenates the results and passes them through a final linear projection. The “Attention Is All You Need” paper explains the motivation: “Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.” The original Transformer used eight parallel heads, each working in 64 dimensions of the model’s 512-dimensional representation.
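The mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's reference implementation: it assumes a single unbatched sequence, no masking, no bias terms, and random weights. The function name `multi_head_attention` and the weight names `w_q`, `w_k`, `w_v`, `w_o` are chosen here for illustration. The sizes (512 dimensions, eight heads) follow the original Transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Illustrative multi-head self-attention (no mask, no batch, no biases)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # size of each head's subspace

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Project every token into queries, keys, and values, then split by head.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently in each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                               # (heads, seq, d_head)

    # Concatenate the heads and apply the final output projection.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Hypothetical toy input: 5 tokens, random weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 512, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(size=(d_model, d_model)) * d_model ** -0.5 for _ in range(4)
)
out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

Each head produces its own 5-by-5 table of attention weights over the same five tokens, which is what lets different heads attend to different patterns.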

The intuition is that different heads can specialize. One head might track which adjective modifies which noun, another might follow long-range subject-verb agreement, and another might attend to nearby tokens for local fluency. Doing this with one undivided attention pass would force the model to average all of those signals together and lose resolution; splitting into heads lets it keep several distinct views of the same sequence and combine them at the end. Studies of trained models later found interpretable heads doing this kind of specialized work, though many heads turn out to be redundant and can be removed with little loss in accuracy.

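During generation, the model caches the keys and values of every head for every token processed so far, and that cache is a major part of the memory cost. Later efficiency techniques such as grouped-query attention shrink it by letting groups of query heads share a single key-value head, which reduces the number of distinct key-value heads while keeping the full set of query heads.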
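A back-of-the-envelope calculation shows the size of the effect. The sketch below counts only the stored keys and values, and every number in it is hypothetical (32 layers, 128-dimensional heads, a 4,096-token context, 2 bytes per stored value), not the configuration of any particular model. The function `kv_cache_bytes` is likewise named here for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate key-value cache size for one generation request."""
    # The leading 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Hypothetical model shape, chosen only to make the arithmetic concrete.
cfg = dict(num_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(num_kv_heads=32, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **cfg)   # 4 query heads share each KV head

print(f"standard multi-head: {mha / 2**30:.1f} GiB")  # 2.0 GiB
print(f"grouped-query:       {gqa / 2**30:.1f} GiB")  # 0.5 GiB
```

Under these assumptions, cutting the key-value heads from 32 to 8 cuts the cache to a quarter of its size, because the cache grows in direct proportion to the number of key-value heads.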

Why business readers should care: multi-head attention is a standard line in model architecture descriptions. Knowing it means “several attention computations in parallel, each watching for different patterns” turns an opaque phrase into a clear idea, and it explains where a chunk of a model’s memory footprint during inference comes from.
