{"UUID":"36858814-a276-4723-8bd2-ce1d46236417","URL":"https://aws.amazon.com/message/101925/","ArchiveURL":"","Title":"Amazon DynamoDB US-EAST-1 outage of October 2025","StartTime":"2025-10-19T23:48:00-07:00","EndTime":"2025-10-20T14:20:00-07:00","Categories":["automation","cloud"],"Keywords":["dynamodb","us-east-1","dns","race condition","ec2","network load balancer","lambda","aws"],"Company":"Amazon","Product":"DynamoDB","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T19:51:25.704356Z","Summary":"A latent race condition in DynamoDB's DNS management left the regional endpoint `dynamodb.us-east-1.amazonaws.com` with an empty record. Two redundant \"DNS Enactor\" processes raced when one was unusually delayed: a second Enactor applied a newer plan and began cleaning up older plans, the delayed Enactor then overwrote the regional endpoint with its stale plan, and the cleanup process deleted that plan. The DynamoDB outage cascaded into EC2 (DWFM \"congestive collapse\"), Network Load Balancer health-check flapping, and Lambda, with full recovery taking roughly 15 hours.","Description":"The incident began on October 19, 2025, at 11:48 PM PDT, with increased DynamoDB API error rates in the N. Virginia (us-east-1) Region. This initial disruption lasted until 2:40 AM PDT on October 20. The issue stemmed from DNS resolution failures for the DynamoDB regional endpoint, which prevented new connections to the service.\n\nThe root cause was a latent race condition within DynamoDB's automated DNS management system. Specifically, an unusual delay in one DNS Enactor allowed a second Enactor to apply a newer plan and then clean up older plans. The delayed Enactor then overwrote the regional endpoint with its stale, older plan, which was subsequently deleted by the cleanup process, resulting in an empty DNS record for dynamodb.us-east-1.amazonaws.com. This left the system in an inconsistent state requiring manual intervention.\n\nThe DynamoDB outage cascaded into multiple other AWS services. EC2 experienced increased API error rates, latencies, and instance launch failures from 11:48 PM PDT on October 19 until 1:50 PM PDT on October 20, because the DropletWorkflow Manager (DWFM) failed to re-establish leases with droplets. Network Load Balancer (NLB) saw increased connection errors between 5:30 AM and 2:09 PM PDT on October 20, caused by health check failures related to delayed network state propagation for new EC2 instances.\n\nLambda functions experienced API errors and latencies, with issues in function creation, updates, and event source processing. Other services such as ECS, EKS, Fargate, Amazon Connect, AWS Security Token Service (STS), and AWS Management Console sign-ins were also disrupted, with container launch failures, elevated call errors, API errors, and authentication failures, all stemming from the initial DynamoDB issue or its cascading effects.\n\nEngineers identified DynamoDB's DNS state as the source by 12:38 AM PDT on October 20, and temporary mitigations were applied by 1:15 AM. All DNS information was restored by 2:25 AM. For EC2, after DynamoDB recovered, DWFM entered a \"congestive collapse\" state, requiring throttling and selective restarts of DWFM hosts. Network Manager also experienced backlogs. NLB issues were mitigated by disabling automatic health check failovers. Full recovery across all impacted services was achieved by 2:20 PM PDT on October 20."}