Postmortem Index

Explore incident reports from various companies


Human error. On February 28th 2017 9:37AM PST, the Amazon S3 team was debugging a minor issue. Despite using an established playbook, one of the commands intending to remove a small number of servers was issued with a typo, inadvertently causing a larger set of servers to be removed. These servers supported critical S3 systems. As a result, dependent systems required a full restart to correctly operate, and the system underwent widespread outages for US-EAST-1 (Northern Virginia) until final resolution at 1:54PM PST. Since Amazon’s own services such as EC2 and EBS rely on S3 as well, it caused a vast cascading failure which affected hundreds of companies.

Source | Edit Metadata | JSON file