The 1990 AT&T Long-Distance Network Collapse

On the afternoon of January 15, 1990, network controllers at AT&T’s operations center began seeing a flood of alarms as switching systems across the United States crashed and rebooted in a self-sustaining pattern. For roughly nine hours, about half of AT&T’s long-distance calls failed to connect. The outage affected tens of millions of call attempts and became one of the most visible software disasters of its era. A primary account, including AT&T’s own explanation through Larry Seese, the company’s director of technology development, was carried in the RISKS Digest (volume 9, issue 62), drawing on Telephony magazine’s reporting from the days after the failure.

The failure originated in software that AT&T had loaded into all 114 of its 4ESS toll switches the previous month. According to AT&T’s account, the defect was in the recovery code, the logic a switch runs to bring itself back into service after a brief internal fault. The bug had been latent: the new software had run for weeks without incident because the precise timing needed to trigger it had not yet occurred.

The trigger was ordinary. A switch, call it A, developed a minor internal problem and went through a short recovery, then signaled its neighbors that it was back in service. When a neighboring switch, call it B, received a second call-setup message from A while B was still resetting its own internal logic in response to the first, the recovery software took a wrong branch. In the contemporary engineering reconstructions of the code, the flaw was a misplaced break statement: a break inside an if clause nested within a larger switch statement caused control to exit the entire switch statement prematurely, skipping code that should have run and leaving data overwritten. The switch’s own error-detection logic noticed the corrupted state, concluded its processor was unreliable, and shut the switch down to reset, exactly as designed for an isolated fault.
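The control-flow mistake described above can be reproduced in a few lines of C. This is a minimal sketch and not the 4ESS source: the names handle_message, switch_state, msg_kind, and the three flags are hypothetical and chosen only for illustration. It shows the language rule involved, which is that a break inside an if does not leave the if; it leaves the nearest enclosing switch or loop.

```c
#include <stdbool.h>

/* All names here are hypothetical; this is not the 4ESS source. */
enum msg_kind { MSG_INCOMING, MSG_OTHER };

struct switch_state {
    bool status_updated;    /* sender marked as back in service */
    bool message_queued;    /* message buffered for later handling */
    bool params_processed;  /* follow-up step that is meant to always run */
};

void handle_message(enum msg_kind kind, bool buffer_empty,
                    struct switch_state *s)
{
    switch (kind) {
    case MSG_INCOMING:
        if (buffer_empty) {
            s->status_updated = true;
            break;  /* BUG: exits the whole switch statement, not just the if */
        } else {
            s->message_queued = true;
        }
        s->params_processed = true;  /* skipped whenever the break above runs */
        break;
    default:
        break;
    }
}
```

Calling handle_message(MSG_INCOMING, true, &state) leaves params_processed false, while the same call with buffer_empty set to false sets it to true. The follow-up step is silently skipped on only one of the two paths, which is why such a defect can pass tests that never exercise that path under the right conditions.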

The catastrophe was that the fault was not isolated. Because every switch ran identical software, a switch coming back up would send the same kind of message that had just knocked over its neighbor, and any neighbor that happened to be mid-reset would hit the same defect and crash in turn. The resets propagated through the network as a cascade, each recovering switch tripping others, so the system could not settle. This is a characteristic shape of distributed-systems failure: a single deterministic bug, replicated across identical nodes, amplified by the very recovery mechanism meant to provide resilience (see distributed-system).

AT&T stabilized the network not by patching the code in the moment but by reducing the messaging load, cutting the rate of the CCS7 signaling traffic so that switches had enough quiet time to finish resetting without immediately receiving another triggering message. With the cascade starved of its trigger, the network recovered, and the defective software was subsequently rolled back and corrected.
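The cascade and the effect of reducing message load can be illustrated with a toy simulation. This is a sketch under heavily simplified assumptions, none of which come from AT&T's account: six switches in a ring, a fixed reset window, a fixed reboot time, and a recovering switch that sends each neighbor exactly two messages separated by gap ticks. The function simulate_cascade and all constants are hypothetical. A switch crashes if a message arrives while it is still inside the reset window opened by an earlier message.

```c
#include <string.h>

#define N_SWITCHES   6    /* hypothetical ring of switches */
#define HORIZON      200  /* ticks to simulate */
#define RESET_WINDOW 3    /* ticks a switch needs after each message */
#define REBOOT_TIME  5    /* ticks a crashed switch stays down */

/* A switch that recovers at tick `when` sends each ring neighbor two
   messages: one at when + 1 and one `gap` ticks after that. */
static void announce(int arrivals[HORIZON][N_SWITCHES],
                     int from, int when, int gap)
{
    int neighbors[2] = { (from + 1) % N_SWITCHES,
                         (from + N_SWITCHES - 1) % N_SWITCHES };
    for (int i = 0; i < 2; i++) {
        int first = when + 1;
        int second = when + 1 + gap;
        if (first < HORIZON)  arrivals[first][neighbors[i]]++;
        if (second < HORIZON) arrivals[second][neighbors[i]]++;
    }
}

/* Returns the number of crashes observed within HORIZON ticks. */
int simulate_cascade(int gap)
{
    int arrivals[HORIZON][N_SWITCHES];
    int busy_until[N_SWITCHES] = {0};
    int down_until[N_SWITCHES] = {0};
    int crashes = 0;

    memset(arrivals, 0, sizeof arrivals);
    announce(arrivals, 0, 0, gap);  /* switch 0 recovers at tick 0 */

    for (int t = 0; t < HORIZON; t++) {
        for (int s = 0; s < N_SWITCHES; s++) {
            for (int m = 0; m < arrivals[t][s]; m++) {
                if (t < down_until[s])
                    continue;  /* still rebooting: message is dropped */
                if (t < busy_until[s]) {
                    /* message arrived mid-reset: crash and reboot */
                    crashes++;
                    down_until[s] = t + REBOOT_TIME;
                    busy_until[s] = 0;
                    announce(arrivals, s, down_until[s], gap);
                } else {
                    busy_until[s] = t + RESET_WINDOW;
                }
            }
        }
    }
    return crashes;
}
```

In this model, simulate_cascade(1) keeps producing crashes because each recovery announcement lands inside a neighbor's reset window, while simulate_cascade(RESET_WINDOW) produces none because the second message always arrives after the window has closed. The real network's topology and timing were far more complex; the sketch only shows why spacing messages out can starve a cascade of its trigger.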

The incident became a standard teaching case. It showed that rigorous testing is not the same as proof of correctness, since the code had been tested yet still shipped with a timing-dependent defect (a regression introduced by the new release; see regression-bug). It showed how fault-tolerance mechanisms can themselves become the vector of collapse when failures are correlated across identical components rather than independent. And it underscored why careful root-cause analysis matters: the visible symptom, switches resetting, was the designed safe response, while the actual cause was a single mishandled control-flow path buried in the recovery logic (see root-cause-analysis).

