“GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints” was submitted to arXiv on May 22, 2023 by Joshua Ainslie, James Lee-Thorp, and colleagues at Google Research. It addressed a bottleneck that only becomes visible when serving large models: the memory cost of attention at inference time.
In standard multi-head attention, every attention head has its own key and value projections. During generation the keys and values for every past token are cached (the KV cache), and at long context lengths storing and reading that cache dominates memory and bandwidth. An earlier fix, multi-query attention, shares a single key-value head across all query heads, which slashes the cache but can hurt quality and make training unstable. GQA splits the difference: it divides the query heads into groups and gives each group one shared key-value head, so the number of key-value heads is more than one but fewer than the number of query heads. This recovers most of the speed and memory benefit of multi-query attention while keeping quality close to full multi-head attention.
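To make the trade-off concrete, here is a minimal sketch of the arithmetic, not code from the paper. It estimates KV cache size as a function of the number of key-value heads and shows one way query heads can be mapped to shared key-value heads. The function names, the model shape (32 layers, 32 query heads, head dimension 128, 8192 tokens, 2 bytes per value), and the choice of consecutive grouping are all assumptions made for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every past token."""
    # Leading factor of 2: one cached tensor for keys, one for values.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Index of the shared key-value head that a given query head reads."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    # Assumption: groups are formed from consecutive query heads.
    return query_head // group_size


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 8192

cache_sizes = {
    name: kv_cache_bytes(N_LAYERS, n_kv, HEAD_DIM, SEQ_LEN)
    for name, n_kv in [("multi-head", 32), ("grouped-query", 8), ("multi-query", 1)]
}
for name, size in cache_sizes.items():
    print(f"{name:>13}: {size / 2**30:.3f} GiB")
# multi-head: 4.000 GiB, grouped-query: 1.000 GiB, multi-query: 0.125 GiB

# With 8 key-value heads, query heads 0-3 share KV head 0, heads 4-7 share KV head 1, ...
mapping = [kv_head_for_query_head(h, N_QUERY_HEADS, 8) for h in range(N_QUERY_HEADS)]
print(mapping)
```

Under these assumed numbers the cache shrinks in direct proportion to the number of key-value heads, which is why the group count acts as a dial between the multi-head and multi-query extremes.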
The paper also showed that you do not have to train a GQA model from scratch: an existing multi-head model can be “uptrained” into a GQA model using only about 5 percent of the original pretraining compute, by mean-pooling the key and value heads within each group into a single head and then briefly continuing pretraining.
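The checkpoint conversion step can be sketched as follows. This is a simplified illustration in plain Python, assuming each head's projection is stored as a small matrix (a list of rows) and that groups are formed from consecutive heads; `mean_pool_kv_heads` is a hypothetical helper name, and the continued-training step that follows the conversion is not shown.

```python
def mean_pool_kv_heads(head_weights, n_groups):
    """Average per-head projection matrices into n_groups shared matrices.

    head_weights: list of matrices, one per original head, each a list of rows.
    Returns a list of n_groups matrices with the same shape as each input.
    """
    n_heads = len(head_weights)
    if n_heads % n_groups != 0:
        raise ValueError("number of heads must be divisible by n_groups")
    group_size = n_heads // n_groups

    pooled = []
    for g in range(n_groups):
        # Assumption: group g consists of consecutive heads.
        members = head_weights[g * group_size:(g + 1) * group_size]
        n_rows, n_cols = len(members[0]), len(members[0][0])
        pooled.append([
            [sum(m[r][c] for m in members) / group_size for c in range(n_cols)]
            for r in range(n_rows)
        ])
    return pooled


# Toy example: 4 heads with 2x2 projections filled with the head index.
toy_key_heads = [[[float(h)] * 2 for _ in range(2)] for h in range(4)]
pooled_keys = mean_pool_kv_heads(toy_key_heads, n_groups=2)
print(pooled_keys)
# [[[0.5, 0.5], [0.5, 0.5]], [[2.5, 2.5], [2.5, 2.5]]]
```

The same pooling would be applied separately to the key projections and the value projections. Setting `n_groups=1` gives the multi-query case, and setting it to the original head count leaves the weights unchanged.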
GQA was adopted quickly because inference, not training, is where deployed models spend most of their lifetime cost. Llama 2’s larger models (34B and 70B), Mistral 7B, and many later open-weight models use it, making GQA one of the standard ingredients for serving large language models efficiently at long context.