Self-Attention

Self-attention is the form of attention that the Transformer is built on. Where the attention mechanism introduced in 2014 let a decoder look back at a separate input sentence, self-attention lets a sequence attend to itself: each token looks at every token in the same sequence, itself included, and weighs how much each one matters for building its own representation. The “Attention Is All You Need” paper defines it directly: “Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.”

In practice, every token is turned into three vectors: a query, a key, and a value. A token’s query is compared against the keys of every token in the sequence, itself included, to produce a set of weights that sum to one, and those weights are used to blend the values into a new, context-aware version of that token. The word “bank” gets a different representation in “river bank” than in “savings bank” because it attends to different neighbors. Because every position can directly reach every other position in a single step, self-attention captures long-range relationships that recurrent networks, which pass information step by step, tend to lose.
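The query, key, and value steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes a single attention head, no masking, and hand-picked toy vectors, whereas a real model produces the queries, keys, and values with learned projections of the token embeddings. The names `softmax`, `dot`, and `self_attention` are chosen here for illustration. The division by the square root of the key size follows the scaled dot-product form used in the Transformer paper.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product self-attention over one sequence.

    queries, keys, values: one vector (list of floats) per token.
    Returns the new per-token vectors and the attention weights.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query against every token's key.
        scores = [dot(q, k) / scale for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using those weights.
        blended = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Toy, hand-picked vectors for three tokens (illustrative assumption only).
tokens = ["river", "bank", "flows"]
Q = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
K = [[1.0, 0.0], [0.8, 0.2], [0.1, 0.9]]
V = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

outputs, weights = self_attention(Q, K, V)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each printed row shows how one token spreads its attention across the whole sequence. Changing the neighbors changes the weights, which is how the same word ends up with a different representation in a different context.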

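The trade-off is cost. Comparing every token to every other token means compute and memory grow with the square of the sequence length, so doubling the length roughly quadruples the attention cost. That is why long context is expensive, and why later work exists to ease the quadratic bottleneck: FlashAttention computes the same attention with far less memory traffic, while state-space models such as Mamba replace attention with a mechanism that scales linearly with sequence length.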
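To make the quadratic growth concrete, the short sketch below counts the query-key comparisons for one attention head in one layer. The function name `attention_score_count` is hypothetical and the sequence lengths are arbitrary examples. Real cost also depends on model width, the number of heads and layers, and the implementation, so treat the numbers as a scaling illustration rather than a pricing model.

```python
def attention_score_count(seq_len):
    """Query-key comparisons for one attention head in one layer."""
    # Every token's query is scored against every token's key.
    return seq_len * seq_len

for n in (1_000, 2_000, 4_000, 8_000):
    print(f"{n:>6} tokens -> {attention_score_count(n):>12,} pairwise scores")
```

Each doubling of the sequence length multiplies the count by four, which is the pattern behind the higher cost of longer prompts.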

Why business readers should care: self-attention is the single mechanism underneath nearly every modern large language model. When vendors talk about “context,” “attention,” or why longer prompts cost more, they are talking about the behavior and the quadratic cost of self-attention.
