RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE)

“RoFormer: Enhanced Transformer with Rotary Position Embedding” was submitted to arXiv on April 20, 2021 by Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. It introduced RoPE, the rotary position embedding, which quietly became the standard way modern Transformers encode where each token sits in a sequence.

A Transformer’s attention mechanism is, by itself, blind to order: it sees a bag of tokens and must be told their positions. The original Transformer added sinusoidal position vectors to the token embeddings, and learned position vectors are a common alternative. RoPE takes a different approach. It splits each token’s query and key vectors into pairs of dimensions and rotates every pair by an angle proportional to the token’s position, with each pair turning at its own frequency. Because rotations compose, the dot product between one token’s query and another token’s key then depends only on their relative distance, not on their absolute positions. So RoPE encodes absolute position mechanically while making attention naturally relative, and it does so without adding parameters.
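The relative-distance property can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names `rope_rotate` and `dot` and the toy vectors are made up for illustration, and it assumes the adjacent-pair layout and the base of 10000 described in the paper (some libraries pair dimensions differently).

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "this sketch needs an even dimension"
    out = []
    for i in range(0, d, 2):
        # The pair starting at index i turns at frequency base^(-i/d):
        # early pairs rotate quickly, later pairs slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))

# A different relative offset (1) for comparison.
score_other = dot(rope_rotate(q, 5), rope_rotate(k, 4))

print(score_near, score_far, score_other)
```

The first two scores agree up to floating-point error because only the offset between the two positions survives in the dot product, while the third differs because the offset changed. The function has no learned weights, which is the sense in which RoPE adds no parameters.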

This combination of relative-position behavior, no extra parameters, and no built-in limit on sequence length made RoPE attractive as models grew and context windows lengthened. It was adopted by GPT-NeoX, the Llama family, PaLM, and most open-weight LLMs since. It is also the substrate for later context-extension techniques such as position interpolation and NTK-aware scaling, which adjust the rotation angles to stretch a model to longer inputs than it was trained on.
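As a rough illustration of how context extension builds on RoPE, the sketch below shows linear position interpolation only, assuming it works by multiplying position indices by the ratio of trained length to target length before the angles are computed. The function name `rope_angles`, the `scale` parameter, and the context lengths are hypothetical; NTK-aware scaling changes the frequencies instead and is not shown.

```python
import math

def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angle of each dimension pair; scale < 1 compresses positions."""
    return [position * scale * base ** (-i / dim) for i in range(0, dim, 2)]

# Hypothetical lengths: trained on 2048 tokens, stretched to 8192.
train_len, target_len = 2048, 8192
scale = train_len / target_len  # 0.25

# Position 4096 lies beyond the trained window, but after scaling its
# angles match those of position 1024, which the model saw in training.
interpolated = rope_angles(4096, dim=8, scale=scale)
reference = rope_angles(1024, dim=8)

print(all(math.isclose(a, b) for a, b in zip(interpolated, reference)))
```

The trade-off in this scheme is that neighboring tokens end up closer together in angle than they were during training, which is one reason such methods are usually paired with some fine-tuning at the longer length.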

RoPE is another of the unshowy architectural choices, like layer normalization and GELU, that the modern Transformer absorbed and standardized on. Few users have heard of it, but it is running inside nearly every large model they touch.

