FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

“FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” was submitted to arXiv on May 27, 2022 by Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré of Stanford University and the University at Buffalo.

Earlier attempts to make Transformer attention faster relied on approximations that traded away model quality. FlashAttention instead computes exact attention but makes it IO-aware: it uses tiling to keep intermediate values in the GPU’s small, fast on-chip SRAM, so the full attention matrix is never written to the much slower high-bandwidth memory (HBM) and reads and writes to HBM are minimized. The paper reports a 3x speedup on GPT-2 (sequence length 1K) and a 15 percent end-to-end wall-clock speedup on BERT-large (sequence length 512) over the MLPerf 1.1 training speed record. It also reduces memory use enough to let Transformers handle sequences of up to 64K tokens, which were previously out of reach.
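
The tiling idea can be illustrated with a minimal sketch. This is not the paper's GPU kernel: it is plain Python for a single query row, and the function names (reference_attention, tiled_attention) and the block_size parameter are hypothetical, chosen only for this example. It assumes the blocks are combined with a running maximum and running sum (an "online" softmax), so the result matches ordinary attention even though only one block of scores exists at a time.

```python
import math

def reference_attention(q, keys, values):
    """Attention for one query row; builds the full list of scores first."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def tiled_attention(q, keys, values, block_size=4):
    """Same result, but visits keys and values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized output, relative to running_max
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        # Only this block's scores are held; this stands in for fast on-chip memory.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Calling tiled_attention and reference_attention on the same inputs should agree up to floating-point rounding for any block size. The tiled version never holds more than block_size scores at once, which is the property that lets a real kernel keep its working set in SRAM.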

FlashAttention became standard infrastructure, baked into the major training and inference stacks, and later versions pushed efficiency further. Like PagedAttention, it is a hardware-aware systems result rather than a new architecture, and it is a large part of why long-context models became practical and cheaper to run.
