The AWS S3 us-east-1 Outage (2017)

On February 28, 2017, beginning at about 9:37 AM Pacific time, Amazon S3 in the US-EAST-1 region began returning errors at a high rate, and because so much of the web depends on S3 for storage, the failure rippled outward to a large number of other sites and services. Dashboards, image hosts, and smart-home devices faltered for several hours, and Amazon could not at first update its own status dashboard, because the dashboard’s administration console itself depended on S3.

Amazon’s published “Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region” lays out the cause plainly. An authorized S3 team member was debugging a billing-system issue and ran an established operational command to remove a small number of servers from one of the S3 subsystems. In Amazon’s words, “one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” A typo in a maintenance command, run against a system at enormous scale, did the damage.

The servers removed by mistake supported two subsystems that the rest of S3 depends on: the index subsystem, which manages the metadata and location information of every object in the region, and the placement subsystem, which allocates storage for new objects and itself requires a working index subsystem. Both had to be fully restarted. Amazon noted that neither subsystem had been completely restarted for many years, during which S3 had grown massively, so the restart and the safety checks it required took far longer than expected.

While the subsystems restarted, S3 could not service GET, LIST, PUT, or DELETE requests, and the failure propagated to other AWS services in the region that relied on S3, including new EC2 instance launches, EBS volumes that needed data from S3 snapshots, and Lambda. According to Amazon’s summary, S3 was operating normally again by 1:54 PM Pacific time, but the outage underscored how a single region’s failure could reach across the internet.

Amazon’s remediation focused on removing the sharp edges. The company modified the capacity-removal tool to operate more slowly and to refuse to take any subsystem below a safe minimum capacity, so a fat-fingered input could no longer remove too much at once. AWS also said it was accelerating work to partition core subsystems into smaller cells to shrink the blast radius of any future fault.
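The two guardrails described above, a minimum-capacity floor and a slower, batched removal, can be illustrated with a minimal sketch. This is not Amazon’s tool, whose implementation is not public; the names Subsystem, CapacityGuardError, plan_removal, and max_per_batch are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would breach the capacity floor."""


@dataclass
class Subsystem:
    # All fields are illustrative assumptions, not real S3 parameters.
    name: str
    active_servers: int
    min_servers: int  # minimum capacity the subsystem needs to stay healthy


def plan_removal(subsystem: Subsystem, requested: int, max_per_batch: int) -> list[int]:
    """Validate a removal request and split it into small batches.

    The whole request is rejected up front if it would leave the
    subsystem below its minimum, so a mistyped count fails safely
    instead of removing servers.
    """
    if requested <= 0 or max_per_batch <= 0:
        raise ValueError("requested and max_per_batch must be positive")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_servers:
        raise CapacityGuardError(
            f"removing {requested} servers would leave {subsystem.name} "
            f"with {remaining}, below its minimum of {subsystem.min_servers}"
        )

    # Rate limiting: never remove more than max_per_batch servers at once.
    batches = []
    left = requested
    while left > 0:
        step = min(max_per_batch, left)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=100, min_servers=80)
    print(plan_removal(index, requested=10, max_per_batch=4))
    try:
        plan_removal(index, requested=30, max_per_batch=4)  # mistyped input
    except CapacityGuardError as err:
        print("refused:", err)
```

In this sketch the check runs before any server is touched, so an oversized request is refused as a whole rather than partially applied. A real tool would presumably also pause between batches and re-check subsystem health, which is omitted here.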

The episode is one of the most cited examples of how operator error combined with insufficient guardrails can produce a large-scale outage, and of the hidden centralization in “the cloud”: a debugging command in one region knocked over a meaningful fraction of the visible internet for several hours. It also illustrates a recurring theme across modern infrastructure failures: the tooling around a system matters as much as the system itself.
