Linear Attention

Linear attention refers to a family of techniques that reduce the cost of the Transformer’s self-attention from quadratic to linear in the length of the sequence. Standard self-attention compares every token to every other token, so doubling the sequence length roughly quadruples the work and memory. This quadratic scaling is the central obstacle to processing very long inputs, and linear attention methods aim to remove it.

The common trick is to avoid ever forming the full token-by-token attention matrix. Standard attention applies a softmax to similarity scores; linear attention methods either replace or approximate that softmax with a feature map applied separately to queries and keys, so that the computation can be reordered. Because matrix multiplication is associative, the feature-mapped keys can be multiplied with the values first, producing a small summary whose size does not depend on the sequence length. Every query then reuses that summary, and the cost becomes linear in sequence length. The Performer, introduced in 2020, approximates the softmax kernel with random features and comes with provable accuracy guarantees; other approaches use low-rank approximations or simpler kernel functions.
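The reordering can be checked numerically. The following is a minimal sketch in plain Python, not any particular library's implementation. It assumes a non-causal setting and uses elu(x) + 1 as the feature map, which is one simple positive choice among many; the function names are made up for illustration. Both functions compute the same normalized output, but the first scores every query against every key, while the second builds a key-value summary once and reuses it.

```python
import math
import random


def feature_map(x):
    """elu(x) + 1 elementwise; always positive (an illustrative choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_quadratic_order(Q, K, V):
    """Scores every query against every key: n * n similarity values."""
    fq = [feature_map(q) for q in Q]
    fk = [feature_map(k) for k in K]
    out = []
    for q in fq:
        scores = [dot(q, k) for k in fk]      # one score per key
        total = sum(scores)                   # normalizer for this query
        out.append([sum(s * v[j] for s, v in zip(scores, V)) / total
                    for j in range(len(V[0]))])
    return out


def attention_linear_order(Q, K, V):
    """Same result, but keys and values are combined first."""
    fq = [feature_map(q) for q in Q]
    fk = [feature_map(k) for k in K]
    d, dv = len(fk[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]       # d x dv summary, independent of n
    k_sum = [0.0] * d                         # sum of mapped keys, for the normalizer
    for k, v in zip(fk, V):                   # single pass over the sequence
        for i in range(d):
            k_sum[i] += k[i]
            for j in range(dv):
                kv[i][j] += k[i] * v[j]
    out = []
    for q in fq:                              # each query reuses the summary
        denom = dot(q, k_sum)
        out.append([sum(q[i] * kv[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
Q, K, V = (random_matrix(6, 4, rng) for _ in range(3))
a = attention_quadratic_order(Q, K, V)
b = attention_linear_order(Q, K, V)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print(max_diff)  # expected to be at floating-point rounding level
```

For sequence length n and feature dimension d, the first ordering needs on the order of n² · d multiply-adds, while the second needs on the order of n · d². The saving therefore matters when n is much larger than d. Note that this sketch shows exact equality for a kernel that already factorizes; methods that approximate softmax itself introduce an additional approximation error that is not modeled here.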

A useful side effect is that, in the causal (left-to-right) setting, many linear attention formulations can be rewritten as a recurrence that processes one token at a time while updating a fixed-size state. This connects linear attention to recurrent networks and to architectures like RWKV and RetNet, which exploit exactly this duality: they train in parallel over the whole sequence but run inference cheaply, with per-token cost and memory that do not grow with the length of the context.
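The recurrent view can be sketched as follows. This is an illustrative plain-Python example under stated assumptions, not the formulation used by RWKV or RetNet, which add their own decay and gating terms. It assumes causal attention with the elu(x) + 1 feature map, and the function names are hypothetical. The state consists of a d × dv matrix S and a length-d vector z, and its size stays the same however many tokens have been seen.

```python
import math
import random


def feature_map(x):
    """elu(x) + 1 elementwise; always positive (an illustrative choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def recurrent_linear_attention(Q, K, V):
    """Causal linear attention, one token at a time with a fixed-size state."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]   # running sum of outer(phi(k), v)
    z = [0.0] * d                        # running sum of phi(k), for normalization
    outputs = []
    for q, k, v in zip(Q, K, V):
        fq, fk = feature_map(q), feature_map(k)
        for i in range(d):               # fold the current token into the state
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
        denom = sum(fq[i] * z[i] for i in range(d))
        outputs.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                        for j in range(dv)])
    return outputs


def masked_quadratic_attention(Q, K, V):
    """Reference: the same kernel, computed over all earlier tokens explicitly."""
    fq = [feature_map(q) for q in Q]
    fk = [feature_map(k) for k in K]
    dv = len(V[0])
    outputs = []
    for t, q in enumerate(fq):
        # Token t may only attend to positions 0..t.
        scores = [sum(a * b for a, b in zip(q, k)) for k in fk[:t + 1]]
        total = sum(scores)
        outputs.append([sum(s * v[j] for s, v in zip(scores, V)) / total
                        for j in range(dv)])
    return outputs


rng = random.Random(0)
make = lambda n, m: [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(n)]
Q, K, V = make(8, 4), make(8, 4), make(8, 3)
steps = recurrent_linear_attention(Q, K, V)
reference = masked_quadratic_attention(Q, K, V)
print(max(abs(x - y) for rs, rr in zip(steps, reference) for x, y in zip(rs, rr)))
```

The two functions are meant to agree up to floating-point rounding. The recurrent version never looks back at earlier keys or values once they have been folded into S and z, which is what makes generation cheap: each new token costs the same amount of work regardless of how long the sequence already is.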

For a general reader, linear attention is one of the main engineering answers to the question of how to give models a longer context affordably. The tradeoff is usually some loss of fidelity compared to exact attention, so practitioners balance cost against quality for each application.
