AWS US-East Region Service Event of October 22, 2012
Amazon · EBS
The incident began on Monday, October 22, 2012, at 10:00 AM PDT in the US-East Region. A latent bug in a data collection agent on Amazon Elastic Block Store (EBS) storage servers caused degraded performance and “stuck” volumes, and the degradation cascaded to Amazon Elastic Compute Cloud (EC2), Amazon Relational Database Service (RDS), and Elastic Load Balancing (ELB).
The root cause was a memory leak in the EBS data collection agent. After a hardware failure, a data collection server was replaced, but the corresponding DNS update failed to propagate to all internal DNS servers. A fraction of EBS storage servers therefore kept trying to contact the decommissioned server, triggering a latent bug in which the agent slowly consumed system memory rather than handling the failed connection gracefully. Because memory usage on EBS servers is highly dynamic, existing monitoring did not alarm on the gradual leak.
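The failure mode lends itself to a compact illustration. The sketch below is hypothetical Python, not Amazon's agent; the host name, port, and queueing scheme are assumptions. It shows how an agent that buffers data before each send, but never discards it after a failed connection, leaks memory slowly enough to evade coarse monitoring:

```python
import socket
import time

class CollectionAgent:
    def __init__(self, collector_host):
        self.collector_host = collector_host
        self.pending = []  # payloads queued for delivery

    def report(self, payload):
        # Every reporting cycle queues its payload before attempting delivery.
        self.pending.append(payload)
        try:
            with socket.create_connection((self.collector_host, 2003),
                                          timeout=2) as conn:
                while self.pending:
                    conn.sendall(self.pending.pop(0))
        except OSError:
            # Latent bug: a failed send leaves the payload queued forever.
            # A graceful handler would cap self.pending or drop stale entries.
            pass

# The stale DNS entry keeps pointing at the replaced, unreachable server,
# so every cycle grows self.pending a little more. The real agent ran
# indefinitely; the loop is bounded here only so the sketch terminates.
agent = CollectionAgent("decommissioned-collector.internal")
for _ in range(1000):
    agent.report(b"volume metrics ...")
    time.sleep(60)
```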
Customer impact was widespread. EBS volumes became unresponsive, which in turn impaired the EC2 instances using them. Aggressive API throttling during recovery disproportionately affected some customers, hindering their ability to manage their own resources. RDS instances became inaccessible and suffered failover problems, while ELB load balancers degraded, with their recovery further stalled by a shortage of Elastic IP (EIP) addresses. Full resolution, including ELB recovery, extended until 9:50 PM PDT.
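Why a blanket throttle hurts some callers more than others is easiest to see against a per-customer scheme. The following token-bucket sketch is an illustration only; AWS's actual throttling policy is not public, and the class, rates, and customer IDs are invented. Keeping a separate bucket per customer means one caller's burst cannot exhaust the shared request budget for everyone else:

```python
import time
from collections import defaultdict

class PerCustomerThrottle:
    """Token-bucket throttle keyed by customer ID (hypothetical sketch)."""

    def __init__(self, rate_per_sec=10.0, burst=20.0):
        self.rate = rate_per_sec
        self.burst = burst
        # Each customer starts with a full bucket when first seen.
        self.buckets = defaultdict(lambda: (burst, time.monotonic()))

    def allow(self, customer_id):
        tokens, last = self.buckets[customer_id]
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[customer_id] = (tokens - 1.0, now)
            return True
        self.buckets[customer_id] = (tokens, now)
        return False

throttle = PerCustomerThrottle(rate_per_sec=5.0, burst=10.0)
for call in range(12):
    # The first 10 back-to-back calls pass on the burst allowance; later
    # ones are rejected until this customer's bucket refills, without
    # affecting any other customer's quota.
    print(call, throttle.allow("customer-1234"))
```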
Remediation efforts include deploying monitoring targeted at this memory leak, broadening general memory monitoring, and enforcing per-process resource limits on EBS servers. Internal DNS configuration is being updated so that changes propagate reliably. On the API side, the aggressive throttling policy has been revised, and per-customer throttling monitoring is being added. RDS is receiving fixes for two failover-related software bugs. ELB improvements include ensuring adequate EIP capacity, reducing dependence on EBS, refining recovery workflows, and enhancing traffic-shifting logic.
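The per-process resource limit remediation can be made concrete with a minimal sketch. This is Linux-only, uses Python's standard resource module, and the specific limits are illustrative assumptions, not Amazon's actual values:

```python
import resource

# Cap the agent's virtual address space so a slow leak kills only this
# process instead of starving the whole storage server of memory.
SOFT = 512 * 1024 * 1024  # 512 MiB (assumed figure)
HARD = 768 * 1024 * 1024  # 768 MiB (assumed figure)
resource.setrlimit(resource.RLIMIT_AS, (SOFT, HARD))

try:
    buf = bytearray(1024 ** 3)  # a 1 GiB allocation now fails immediately
except MemoryError:
    print("allocation denied; agent can restart while storage is unaffected")
```

The design point is containment: with a hard cap, a leaking auxiliary process fails fast and restarts on its own, rather than silently degrading the storage server it shares.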