Noam Shazeer co-authored both the Transformer paper and the sparse mixture-of-experts paper

Two ideas that define how today’s frontier models are built share an author. Noam Shazeer is listed second among the eight authors of “Attention Is All You Need” (arXiv 1706.03762), the 2017 paper that introduced the Transformer. He is also the first author of “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (arXiv 1701.06538), posted earlier the same year, which introduced a sparsely gated MoE layer that routes each input to only a few expert sub-networks, so model capacity can grow without a proportional increase in computation. Both the attention-based Transformer and sparse mixture-of-experts routing are now widely used in large language models.

Last verified June 6, 2026