xAI builds Colossus, a 100,000-GPU cluster, in 122 days

In late 2024 Elon Musk’s xAI built Colossus, an AI supercomputer in Memphis, Tennessee. According to NVIDIA, the system comprised 100,000 NVIDIA Hopper GPUs and was constructed in 122 days, against a normal timeline of “many months to years” for systems of that size. Training began 19 days after the first rack rolled onto the floor, and xAI said it was already doubling Colossus to 200,000 Hopper GPUs.

Colossus was built to train xAI’s Grok family of large language models. To wire 100,000 GPUs together, xAI used NVIDIA’s Spectrum-X Ethernet networking platform, which NVIDIA said maintained 95 percent data throughput with zero packet loss from flow collisions. Network performance of that kind determines whether a cluster of this scale actually delivers usable training speed.

The defining feature was speed of construction. Most frontier clusters take years to plan and assemble; xAI compressed the build into a few months by converting an existing industrial facility, an approach that became a talking point in the AI infrastructure race against OpenAI, Google and Anthropic.

Why business readers should care: Colossus reset expectations for how fast enormous compute can be deployed. When a competitor can stand up a 100,000-GPU cluster in four months, build speed itself becomes a strategic weapon, and the bottleneck shifts from chips to power, networking and construction logistics.
