“Efficient Memory Management for Large Language Model Serving with PagedAttention” was submitted to arXiv on September 12, 2023 and published at SOSP 2023. Its authors are Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica, based at UC Berkeley and collaborating institutions.
The paper identifies a key bottleneck in serving large language models: the key-value (KV) cache, which grows with every generated token and wastes memory when each request is given one contiguous, pre-reserved region. Borrowing virtual memory and paging from operating systems, the authors introduce PagedAttention, an attention algorithm that stores the KV cache in fixed-size blocks that need not be contiguous in memory. This nearly eliminates fragmentation and lets requests share cached prefixes. Built on it, the vLLM serving system delivers 2-4× higher throughput than systems such as FasterTransformer and Orca at the same latency, with the largest gains on longer sequences and bigger models.
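The paging idea can be illustrated with a toy block table. The following is a minimal sketch, not vLLM's actual implementation: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical, the block size is chosen small for readability, and only block bookkeeping is modeled (no tensors are stored or copied). It shows blocks being allocated on demand, a forked request sharing its parent's prefix blocks through reference counts, and copy-on-write when a shared, partially filled block is about to be written.

```python
BLOCK_SIZE = 4  # tokens per KV block; hypothetical value, kept small for readability


class BlockAllocator:
    """Pool of fixed-size physical KV blocks with reference counts."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.ref_count = [0] * num_blocks

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # Another sequence now points at the same physical block.
        self.ref_count[block] += 1
        return block

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free_blocks.append(block)


class Sequence:
    """One request: a block table mapping logical block index to physical block id."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or none exists yet): take any free block,
            # wherever it happens to live in physical memory.
            self.block_table.append(self.allocator.allocate())
        elif self.allocator.ref_count[self.block_table[-1]] > 1:
            # Last block is partially filled and shared: copy-on-write.
            # A real system would also copy the cached tensors here.
            shared = self.block_table[-1]
            self.block_table[-1] = self.allocator.allocate()
            self.allocator.release(shared)
        self.num_tokens += 1

    def fork(self):
        # The child reuses every existing block instead of duplicating the prefix.
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

    def release_all(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
parent = Sequence(allocator)
for _ in range(6):          # 6 tokens occupy 2 blocks of 4
    parent.append_token()
child = parent.fork()       # shares both blocks, allocates nothing
child.append_token()        # triggers copy-on-write on the second block
print(parent.block_table, child.block_table)
```

In this sketch a sequence never reserves more than one partially filled block, so unused space is bounded by `BLOCK_SIZE - 1` token slots per sequence, and the fork costs one extra block rather than a full copy of the prefix.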
vLLM became one of the most widely used open-source inference engines, and many teams adopted it as their default way to serve open-weight models in production. PagedAttention is a clear example of classic systems engineering, rather than a new model architecture, producing some of the largest practical efficiency gains of the LLM era.