Tri Dao is an Assistant Professor of Computer Science at Princeton University, where he leads the Dao AI Lab, and co-founder and chief scientist of Together AI. He completed his PhD in computer science at Stanford, and his research sits at the intersection of machine learning and computer systems, focused on hardware-aware algorithms and sequence models with long-range memory.
He is best known for two lines of work. The first is FlashAttention, “Fast and Memory-Efficient Exact Attention with IO-Awareness” (arXiv 2205.14135), which reorganizes the attention computation around the GPU memory hierarchy, using tiling and recomputation so that it never materializes the full attention matrix. The output is exact attention, identical to the standard computation, but it is produced substantially faster and with memory that grows linearly rather than quadratically with sequence length, which is why FlashAttention and its successors became standard infrastructure for training and serving large models. The second is Mamba, the selective state-space architecture he developed with Albert Gu, which processes sequences in time linear in their length and has emerged as the most prominent challenger to the Transformer’s dominance.
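The tiling idea behind FlashAttention described above can be illustrated with a small NumPy sketch. This is not the FlashAttention kernel itself: the real implementation is fused GPU code that tiles both queries and keys and keeps blocks in fast on-chip memory, and it also uses recomputation in the backward pass, which is not shown here. The function names and the block_size parameter below are hypothetical, chosen only for illustration. The sketch shows just the core trick: processing keys and values one block at a time while keeping a running softmax maximum and normalizer per query row, so the full score matrix is never built.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference version: materializes the full (n_q, n_k) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def tiled_attention(q, k, v, block_size=4):
    """Illustrative sketch: same result, computed one key/value block at a time.

    Only an (n_q, block_size) slice of scores exists at any moment.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)

    out = np.zeros((n_q, v.shape[-1]))     # unnormalized output accumulator
    row_max = np.full((n_q, 1), -np.inf)   # running max of scores per query
    row_sum = np.zeros((n_q, 1))           # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        scores = (q @ k_blk.T) * scale
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))

        # Rescale what was accumulated so far to the new running max.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max)

        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((6, 8))
    k = rng.standard_normal((10, 8))
    v = rng.standard_normal((10, 8))
    print(np.allclose(tiled_attention(q, k, v), naive_attention(q, k, v)))
```

Running the script compares the block-by-block result against the naive version on random inputs. The point of the design is that the rescaling step makes the answer independent of how the keys are split into blocks, which is what allows the computation to stay exact while only ever holding one block of scores.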
Across both, Dao’s signature is taking the costly quadratic core of modern models, attention over long sequences, and making it cheaper, either by computing it more cleverly or by replacing it with a fundamentally different mechanism. His work has been recognized with paper awards at ICML, COLM, and MLSys.
For business readers, Dao is a useful name to know because his contributions are the kind that quietly lower the cost of running AI: longer contexts, faster inference, and lower memory bills all trace in part to this line of efficiency research.