Google's Ironwood TPU is built for the age of inference

At Google Cloud Next on April 9, 2025, Google announced Ironwood, its seventh-generation Tensor Processing Unit and the first the company says was “designed specifically for inference,” meaning running trained models at scale rather than training them. Google reported that an Ironwood pod scales to 9,216 liquid-cooled chips delivering 42.5 exaflops, which it described as “more than 24x the compute power of the world’s largest supercomputer, El Capitan.” Each chip carries 192 GB of high-bandwidth memory (HBM) with 7.37 TB/s of bandwidth, six times the memory and 4.5 times the bandwidth of the prior Trillium generation.
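The headline numbers can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted above and derives what they imply per chip and per pod. The derived values are back-of-the-envelope estimates, not separately published specifications. It assumes decimal units (1 exaflop = 1,000 petaflops, 1 PB = 1,000,000 GB) and that the pod figure is a plain sum of per-chip peak compute. The function and variable names are illustrative.

```python
# Figures quoted in the announcement (see the paragraph above).
CHIPS_PER_POD = 9_216
POD_EXAFLOPS = 42.5
HBM_GB_PER_CHIP = 192
HBM_TBPS_PER_CHIP = 7.37
MEMORY_RATIO_VS_TRILLIUM = 6.0      # "six times the memory"
BANDWIDTH_RATIO_VS_TRILLIUM = 4.5   # "4.5 times the bandwidth"


def derived_figures() -> dict[str, float]:
    """Derive implied per-chip and per-pod values from the quoted numbers.

    Assumption: decimal unit prefixes, and pod compute is the simple
    sum of per-chip peak compute.
    """
    return {
        # 42.5 exaflops spread evenly across 9,216 chips, in petaflops.
        "per_chip_petaflops": POD_EXAFLOPS * 1_000 / CHIPS_PER_POD,
        # Total HBM across the pod, in petabytes.
        "pod_hbm_petabytes": CHIPS_PER_POD * HBM_GB_PER_CHIP / 1_000_000,
        # Trillium values implied by the stated ratios.
        "implied_trillium_hbm_gb": HBM_GB_PER_CHIP / MEMORY_RATIO_VS_TRILLIUM,
        "implied_trillium_tbps": HBM_TBPS_PER_CHIP / BANDWIDTH_RATIO_VS_TRILLIUM,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.2f}")
```

Under those assumptions, the quoted figures work out to roughly 4.6 petaflops per chip and about 1.77 PB of HBM across a full pod, and they imply a Trillium chip with 32 GB of HBM at roughly 1.64 TB/s.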

The shift in emphasis reflects how AI economics have changed. Once a model is trained, the cost that grows with usage is inference, and the rise of reasoning models that “think” through many steps before answering has made inference far more compute-hungry per query. Ironwood is Google’s answer to that demand, optimized for serving large foundation models and mixture-of-experts systems efficiently.

Ironwood continues a TPU program that Google first revealed publicly in 2016 and has iterated through seven generations. Google said the chip is nearly 30 times more power efficient than its first Cloud TPU from 2018, underscoring how much of the competition is now about performance per watt.

Why business readers should care: Ironwood signals that the center of gravity in AI compute is moving from training to inference, where the bill scales with every user query. Hardware purpose-built for serving reasoning models is a sign that running AI in production, not just building it, is now the dominant cost.
