The 2012 Leap-Second Outage

At the end of June 30, 2012, the leap second announced by the IERS was inserted as 23:59:60 UTC, and as the clock rolled over to July 1 a corner of the internet fell over. The extra second itself was scheduled and expected, but the way the Linux kernel applied it tripped a latent defect, and within minutes operators were watching CPU usage on their servers jump to 100 percent for no apparent reason. The bug was already known in the abstract: it had been fixed in the mainline kernel months earlier, but a great many production machines were still running older, unpatched versions on the night it mattered.

The technical fault was a livelock in the kernel’s leap-second handling. An earlier change had made the NTP subsystem use an hrtimer to trigger the leap-second adjustment, and as the kernel commit that fixed it describes, this “can cause a potential livelock.” The fix’s author, John Stultz, laid out the lock-ordering conflict in the commit message: one path held ntp_lock while waiting for xtime_lock, while another held xtime_lock and tried to take ntp_lock, so the timer code could fall into a state where both paths spun without making progress. The fix moved leap-second processing back into the regular second_overflow() handling, which removed the locking conflict and traded a little precision in when the leap second is applied for the elimination of the hang.
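The opposite-order locking described above can be illustrated with a small user-space sketch. This is a simplified model, not kernel code: the two Python locks merely borrow the names ntp_lock and xtime_lock, the path names and the run_opposite_order_demo() function are hypothetical, and a timeout is added so the demonstration ends instead of hanging. In the kernel there is no such timeout, so the stuck paths simply keep spinning.

```python
import threading

# Stand-ins for the two kernel locks named in the commit message.
ntp_lock = threading.Lock()
xtime_lock = threading.Lock()


def run_opposite_order_demo(timeout=0.2):
    """Run two paths that take the same two locks in opposite order.

    Returns a dict mapping each (hypothetical) path name to whether it
    managed to take its second lock before the timeout expired.
    """
    both_holding = threading.Barrier(2)  # each path now holds its first lock
    both_tried = threading.Barrier(2)    # each path has finished its attempt
    got_second = {}

    def path(name, first, second):
        with first:
            both_holding.wait()
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            if acquired:
                second.release()
            # Keep holding `first` until the other path has also given up,
            # so the outcome does not depend on thread scheduling.
            both_tried.wait()

    threads = [
        threading.Thread(target=path, args=("adjtimex_path", ntp_lock, xtime_lock)),
        threading.Thread(target=path, args=("tick_path", xtime_lock, ntp_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second


if __name__ == "__main__":
    # Neither path obtains its second lock: both values are False.
    print(run_opposite_order_demo())
```

Each path is blocked by the lock the other one holds, which is why neither can finish. Taking the locks in a single agreed order, or removing the need for one path to take both, breaks the cycle; the kernel fix took the second route.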

The user-visible symptom was distinctive and alarming: machines that had been idle suddenly pinned their CPUs at 100 percent. Applications that woke threads on timers, very commonly Java services, were hit hardest, because the broken timer handling caused them to spin tightly instead of sleeping. Mozilla recorded the incident in its own bug tracker as bug 769972, titled “Java is choking on leap second,” where the reporter noted that “Servers running java apps such as Hadoop and ElasticSearch and java doesn’t appear to be working” right after the midnight GMT leap second.

The recovery was as low-tech as the cause was subtle. Rebooting cleared the spinning state, and Reddit’s site was down for around an hour and a half while its servers were brought back. Operators who did not want to reboot found a quicker trick, captured in the Mozilla bug’s resolution: stop the NTP daemon and set the system clock by hand to the current time, which immediately dropped CPU consumption in the affected Java processes. Setting the clock anew sidestepped the broken leap-second code path entirely.
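As a rough illustration of that workaround, the sketch below builds the two commands an operator might have run, without executing them. The function name clock_reset_commands() is hypothetical, the service name ntp is an assumption (it varies by distribution, for example ntpd on some systems), and the exact commands recorded in the Mozilla bug may have differed. The date -s form shown is the GNU date option for setting the clock, and actually running it requires root.

```python
from datetime import datetime


def clock_reset_commands(now):
    """Return the workaround as a list of argv lists, without running them.

    `now` is the time to write back to the system clock, normally the
    current time as the machine already reports it.
    """
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Assumed service name; stops NTP so it does not interfere.
        ["service", "ntp", "stop"],
        # GNU date: set the system clock to the given timestamp.
        ["date", "-s", stamp],
    ]


if __name__ == "__main__":
    for argv in clock_reset_commands(datetime.now()):
        print(" ".join(argv))
```

The sketch only prints the commands so that it is safe to run anywhere; an operator would run them as root and restart the NTP daemon afterwards.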

The episode became the textbook case for the leap-second bug, and it carried a pointed lesson about software maintenance rather than about clocks. The defect had been understood and patched upstream before the leap second arrived, yet the outage still happened because the fix had not propagated to the running fleet in time. A scheduled, well-announced one-second adjustment took down major sites not because the problem was unknown, but because keeping deployed systems current with known fixes is its own hard discipline, separate from finding the bug in the first place.
