Postmortem Index

Explore incident reports from various companies

AWS US East-1 power failure and service disruption in June 2012

Amazon

On Friday, June 29th, 2012, a large-scale electrical storm swept through Northern Virginia, impacting one of AWS's US East-1 Availability Zones. At 7:24 PM PDT, a voltage spike caused utility power loss in two datacenters. While one datacenter successfully transferred to generator power, the generators in the other datacenter started but failed to provide stable voltage, leaving the load on Uninterruptible Power Supply (UPS) units. Utility power failed a second time at 7:57 PM PDT, and again, generators in the affected facility failed to stabilize.
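
The cascade is easier to follow as a transfer-switch decision: utility power first, generator power only if its output is within tolerance, and otherwise the finite-capacity UPS. A minimal sketch of that logic, using hypothetical names and a deliberately simplified model rather than AWS's actual switchgear behavior:

```python
from enum import Enum, auto

class PowerSource(Enum):
    UTILITY = auto()
    GENERATOR = auto()
    UPS = auto()

def select_source(utility_ok: bool, generator_voltage_ok: bool) -> PowerSource:
    """Simplified automatic-transfer-switch decision (hypothetical model)."""
    if utility_ok:
        return PowerSource.UTILITY
    if generator_voltage_ok:
        return PowerSource.GENERATOR
    # The June 29th failure mode: generators running but out of voltage
    # tolerance, so the load stayed on finite-capacity UPS until it drained.
    return PowerSource.UPS
```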

As the UPS units depleted, servers began losing power at 8:04 PM PDT. Onsite personnel worked to stabilize the backup generator power, which was achieved by 8:14 PM PDT, and the full facility had power to all racks by 8:24 PM PDT. The root cause of the power outage was the failure of the backup generators to deliver stable power under load, despite having passed rigorous testing, including a full load test just two months prior. Although power was restored within an hour, service recovery was prolonged by bottlenecks in the server boot process and by control plane limitations.

Customer impact was significant, affecting a single-digit percentage of resources in the US East-1 Region, including EC2 instances, EBS volumes, RDS instances, and ELBs in the affected Availability Zone. Beyond direct resource unavailability, regional control planes for services like EC2, EBS, and ELB experienced degradation, hindering customers’ ability to manage resources or react to the outage by launching new instances in healthy Availability Zones. A bug in RDS Multi-AZ failover also prevented some instances from automatically recovering.

AWS outlined several remediation steps. For power stability, they will lengthen the time electrical switching equipment allows generators to reach stable power, expand power quality tolerances, and increase 24x7 onsite engineering staff for manual intervention. To improve service recovery, they will address server booting bottlenecks, optimize EBS volume recovery processes, and automate EC2/EBS control plane datastore failover.
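
The first two power-stability changes amount to tuning the transfer policy: wait longer for generators to stabilize, and accept a wider voltage band before rejecting their output. A rough sketch of that idea with illustrative, hypothetical numbers (AWS did not publish its actual settings):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransferPolicy:
    stabilization_window_s: float  # time generators are given to reach stable output
    voltage_tolerance_pct: float   # acceptable deviation from nominal voltage

# Illustrative values only, not AWS's real configuration.
OLD_POLICY = TransferPolicy(stabilization_window_s=10.0, voltage_tolerance_pct=5.0)
NEW_POLICY = TransferPolicy(stabilization_window_s=30.0, voltage_tolerance_pct=10.0)

def generator_acceptable(measured_v: float, nominal_v: float,
                         policy: TransferPolicy) -> bool:
    """True if generator output is close enough to nominal to carry the load."""
    deviation_pct = abs(measured_v - nominal_v) / nominal_v * 100.0
    return deviation_pct <= policy.voltage_tolerance_pct
```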

Further improvements include breaking ELB processing into multiple queues to enhance throughput and developing a backup DNS re-weighting mechanism for rapid traffic shifting away from impacted Availability Zones. A fix for the software bug affecting Multi-AZ RDS failovers, which was introduced in April, will also be rolled out to production.
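
The ELB change is a standard throughput pattern: partition the single work queue by load balancer so that one large or slow change set cannot head-of-line-block every other ELB. A minimal sketch of that partitioning, with hypothetical names throughout (this is not AWS's implementation):

```python
import queue
import threading

# Partition ELB configuration changes across several queues keyed by ELB ID,
# so a burst of changes for one load balancer cannot stall all the others.
NUM_QUEUES = 4
queues = [queue.Queue() for _ in range(NUM_QUEUES)]

def apply_change(elb_id: str, change: dict) -> None:
    # Placeholder for the real work: pushing updated config to the ELB fleet.
    print(f"applying {change} to {elb_id}")

def worker(q: queue.Queue) -> None:
    while True:
        elb_id, change = q.get()
        apply_change(elb_id, change)
        q.task_done()

def enqueue_change(elb_id: str, change: dict) -> None:
    # Hashing by ELB ID keeps one balancer's changes ordered within a single
    # queue while spreading distinct balancers across all of them.
    queues[hash(elb_id) % NUM_QUEUES].put((elb_id, change))

for q in queues:
    threading.Thread(target=worker, args=(q,), daemon=True).start()

enqueue_change("elb-frontend", {"action": "deregister", "instance": "i-0abc"})
for q in queues:
    q.join()
```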

Keywords

aws, us-east-1, power, generator, ec2, ebs, rds, elb