Amazon S3 US-EAST-1 outage of February 2017
Amazon · Amazon S3
On February 28, 2017, at 9:37 AM PST, an incident began in the Amazon S3 US-EAST-1 region. An S3 team member, while debugging a billing system issue, inadvertently removed a larger set of servers than intended. This action took down critical S3 subsystems, causing a service disruption. During recovery, the index subsystem began servicing GET, LIST, and DELETE requests by 12:26 PM PST and was fully recovered by 1:18 PM PST. The placement subsystem, which handles PUT requests, completed its recovery by 1:54 PM PST, at which point S3 was operating normally.
The incident was triggered by human error: an authorized S3 team member executed an established playbook command with an incorrect input, removing a significant portion of server capacity. The affected servers supported the S3 index subsystem, which manages object metadata and location, and the placement subsystem, which allocates storage for new objects. The tool used allowed too much capacity to be removed too quickly.
The disruption rendered S3 unable to service GET, LIST, PUT, and DELETE requests. This had a cascading effect on other AWS services in the US-EAST-1 region that rely on S3, including the S3 console, new EC2 instance launches, EBS volumes dependent on S3 snapshots, and AWS Lambda. Communication via the AWS Service Health Dashboard was also impaired due to its dependency on S3.
In response, Amazon modified the problematic tool to ensure slower capacity removal and added safeguards to prevent operations that would take any subsystem below its minimum required capacity. An audit of other operational tools is underway. The S3 team is also reprioritizing and accelerating plans to further partition the index subsystem into smaller cells to improve recovery times. Additionally, the AWS Service Health Dashboard administration console has been reconfigured to run across multiple AWS regions to enhance its resilience during future events.
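The safeguards described above amount to two checks on any capacity-removal operation: a rate limit on how much capacity can be removed at once, and a floor that blocks any request that would take a subsystem below its minimum required capacity. The following is a minimal illustrative sketch of that idea; the function name, parameters, and thresholds are assumptions for illustration, not Amazon's actual tooling.

```python
# Hypothetical sketch of the two safeguards: a minimum-capacity floor and a
# per-step removal rate limit. All names and numbers are illustrative only.

def plan_capacity_removal(current, requested, minimum_required, max_per_step):
    """Return how many servers may safely be removed in this step.

    Raises ValueError if honoring the full request would take the fleet
    below its minimum required capacity.
    """
    if current - requested < minimum_required:
        raise ValueError(
            f"removing {requested} of {current} servers would drop below "
            f"the minimum required capacity of {minimum_required}"
        )
    # Rate limit: remove capacity slowly, at most max_per_step at a time,
    # so an over-broad request cannot take out a large fleet in one action.
    return min(requested, max_per_step)
```

With this shape of check, an over-broad input still only removes a small increment per step (e.g., `plan_capacity_removal(100, 5, 80, 2)` allows just 2 servers), and a request that would breach the floor is rejected outright rather than partially executed.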