
Amazon EC2 and EBS Issues in Tokyo (AP-NORTHEAST-1) on August 23, 2019

Amazon · EC2, EBS

2019-08-23 · automation, cloud

On August 23, 2019, a single Availability Zone (AZ) in the Asia Pacific (Tokyo) region (AP-NORTHEAST-1) overheated, causing a portion of its EC2 servers to shut down. This affected EC2 instances and degraded EBS volume performance within that AZ. Other services, including RDS, Redshift, ElastiCache, and WorkSpaces, were also affected where their underlying EC2 instances were impacted.

The incident began at 12:36 JST. Cooling systems were restored by 15:21 JST, after which room temperatures began to normalize and affected instances could power back on. Most affected EC2 instances and EBS volumes had recovered by 18:30 JST. Separately, the EC2 RunInstances API showed elevated error rates from 13:21 JST, particularly for requests that used idempotency tokens or came from Auto Scaling groups; these issues were largely resolved by 16:05 JST.
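The timeline above implies a few durations that are easy to misread from raw timestamps. The following is a minimal sketch that derives them from the times quoted in this summary; the event names and the `minutes_between` helper are illustrative labels chosen here, not terms from the AWS report.

```python
from datetime import datetime

# Timestamps (JST, 2019-08-23) exactly as quoted in the summary.
# The dictionary keys are illustrative labels, not AWS terminology.
FMT = "%Y-%m-%d %H:%M"
events = {
    "incident_start": "2019-08-23 12:36",
    "run_instances_errors_start": "2019-08-23 13:21",
    "cooling_restored": "2019-08-23 15:21",
    "run_instances_largely_resolved": "2019-08-23 16:05",
    "most_resources_recovered": "2019-08-23 18:30",
}
times = {name: datetime.strptime(ts, FMT) for name, ts in events.items()}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two named events."""
    return int((times[end] - times[start]).total_seconds() // 60)


durations = {
    "start_to_cooling_restored": minutes_between(
        "incident_start", "cooling_restored"
    ),
    "start_to_most_recovered": minutes_between(
        "incident_start", "most_resources_recovered"
    ),
    "run_instances_elevated_errors": minutes_between(
        "run_instances_errors_start", "run_instances_largely_resolved"
    ),
}

for label, minutes in durations.items():
    hours, rem = divmod(minutes, 60)
    print(f"{label}: {hours}h {rem:02d}m")
```

By this arithmetic, cooling was out for about 2 hours 45 minutes, most resources took just under 6 hours to recover, and the RunInstances errors lasted about 2 hours 44 minutes.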

The root cause was a bug in a third-party datacenter control system used for cooling and optimization. During a failover between control hosts, a logic bug caused an excessive exchange of information that left the control system unresponsive. Most cooling systems correctly entered maximum cooling mode, but a small portion did not and shut down instead. Attempts to manually activate “purge” mode also failed because the Programmable Logic Controllers (PLCs) controlling the air handlers were unresponsive.

Customer impact included EC2 instance shutdowns and EBS performance degradation. Some customers running applications across multiple Availability Zones were still affected unexpectedly. In particular, those using Application Load Balancers with AWS Web Application Firewall or sticky sessions saw an increased rate of Internal Server Errors.

Remediation involved manually investigating and resetting equipment to restore cooling. The problematic failover mode in the third-party control system was disabled. AWS also began training operators to detect and recover from this failure, and is working to modify air conditioning units so that “purge” mode bypasses the PLCs entirely, in line with the approach used in newer datacenters.

Keywords

ec2, ebs, tokyo, ap-northeast-1, datacenter, cooling, overheating, third-party