{"UUID":"e44dd5d3-5105-45ed-9582-f9041a7cfdb9","URL":"https://aws.amazon.com/message/56489/","ArchiveURL":"","Title":"Amazon EC2 and EBS Issues in Tokyo (AP-NORTHEAST-1) on August 23, 2019","StartTime":"2019-08-23T03:36:00Z","EndTime":"2019-08-23T09:30:00Z","Categories":["automation","cloud"],"Keywords":["ec2","ebs","tokyo","ap-northeast-1","datacenter","cooling","overheating","third-party"],"Company":"Amazon","Product":"EC2, EBS","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T19:52:41.537549Z","Summary":"A bug in third-party datacenter control system code caused excessive interactions during a control-host failover, making the cooling control system unresponsive. Most of the datacenter correctly failed cooling into \"max cooling\" mode, but in a small portion the cooling units shut down instead, and the operator-initiated \"purge\" mode also failed because the PLCs controlling the air handlers had become unresponsive too. EC2 servers in one Tokyo AZ overheated and powered off; customers using ALB + AWS WAF or sticky sessions saw cross-AZ impact despite running multi-AZ.","Description":"On August 23, 2019, a single Availability Zone in the Asia Pacific (Tokyo) region (AP-NORTHEAST-1) experienced overheating, leading to the shutdown of a portion of EC2 servers. This resulted in impact to EC2 instances and EBS volume performance within that AZ. Other services like RDS, Redshift, ElastiCache, and Workspaces were also affected if their underlying EC2 instances were impacted.\n\nThe incident began at 12:36 JST. Cooling systems were restored by 15:21 JST, and room temperatures began to normalize, allowing affected instances to power back on. Most affected EC2 instances and EBS volumes recovered by 18:30 JST. EC2 RunInstances API experienced elevated error rates from 13:21 JST, particularly for requests using idempotency tokens or from Auto Scaling groups, with these issues largely resolved by 16:05 JST.\n\nThe root cause was a bug in a third-party datacenter control system used for cooling and optimization. During a control host failover, a logic bug caused excessive information exchange, rendering the control system unresponsive. While most cooling systems correctly entered a maximum cooling mode, a small portion failed and shut down. Attempts to manually activate a \"purge\" mode also failed due to unresponsive Programmable Logic Controllers (PLCs) controlling air handlers.\n\nCustomer impact included EC2 instance shutdowns and EBS performance degradation. Some customers running applications across multiple Availability Zones still experienced unexpected issues, particularly those using Application Load Balancers with AWS Web Application Firewall or sticky sessions, seeing increased Internal Server Errors.\n\nRemediation involved manual investigation and resetting of equipment to restore cooling. The problematic failover mode in the third-party control system was disabled. AWS also initiated operator training for detection and recovery and is working to modify air conditioning units to allow \"purge\" mode to bypass PLCs entirely, aligning with methods used in newer datacenters."}