Amazon DynamoDB US-EAST-1 outage of October 2025

Amazon · DynamoDB

2025-10-20 automation cloud

The incident began on October 19, 2025, at 11:48 PM PDT, with increased DynamoDB API error rates in the N. Virginia (us-east-1) Region. This initial disruption lasted until 2:40 AM PDT on October 20. The issue stemmed from endpoint resolution failures for DynamoDB, preventing new connections to the service.

The root cause was a latent race condition within DynamoDB’s automated DNS management system. An unusual delay in one DNS Enactor allowed a second Enactor to apply a newer plan and then begin cleaning up older plans. The delayed Enactor then overwrote the regional endpoint with its stale, older plan, and the cleanup process subsequently deleted that plan, leaving an empty DNS record for dynamodb.us-east-1.amazonaws.com. The system was left in an inconsistent state that required manual intervention to correct.
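
The sequence described above can be replayed in a small, deterministic simulation. This is only an illustrative sketch: `DnsStore`, `apply_plan`, `cleanup`, and `simulate` are hypothetical names, the plan numbers and IP addresses are invented, and the `guarded` version check is one possible safeguard chosen for illustration, not a description of how AWS fixed the system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DnsStore:
    """Hypothetical stand-in for the plan store plus the live DNS record."""
    plans: Dict[int, List[str]] = field(default_factory=dict)  # plan id -> IPs
    active_plan: Optional[int] = None

    def resolve(self) -> List[str]:
        # If the active plan has been deleted, the record resolves to nothing.
        return self.plans.get(self.active_plan, [])


def apply_plan(store: DnsStore, plan_id: int, guarded: bool) -> bool:
    """Point the endpoint at plan_id; when guarded, refuse to move backwards."""
    if guarded and store.active_plan is not None and plan_id < store.active_plan:
        return False  # stale plan rejected (assumed safeguard)
    store.active_plan = plan_id
    return True


def cleanup(store: DnsStore, newest_applied: int) -> None:
    """Delete every plan older than the one this Enactor applied."""
    for plan_id in [p for p in store.plans if p < newest_applied]:
        del store.plans[plan_id]


def simulate(guarded: bool) -> List[str]:
    store = DnsStore(plans={1: ["10.0.0.1"], 2: ["10.0.0.2"]})
    apply_plan(store, 2, guarded)      # second Enactor applies the newer plan
    apply_plan(store, 1, guarded)      # delayed Enactor applies its stale plan
    cleanup(store, newest_applied=2)   # second Enactor's cleanup removes plan 1
    return store.resolve()


if __name__ == "__main__":
    print("unguarded:", simulate(guarded=False))
    print("guarded:  ", simulate(guarded=True))
```

Without the check, the endpoint ends up pointing at a plan that no longer exists, so it resolves to an empty list. With the check, the stale write is rejected and the newer plan stays active.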

The DynamoDB outage cascaded into multiple other AWS services. EC2 experienced increased API error rates, latencies, and instance launch failures from 11:48 PM PDT on October 19 until 1:50 PM PDT on October 20, due to the DropletWorkflow Manager (DWFM) failing to re-establish leases with droplets. Network Load Balancer (NLB) saw increased connection errors between 5:30 AM and 2:09 PM PDT on October 20, caused by health check failures related to delayed network state propagation for new EC2 instances.

Lambda functions experienced API errors and latencies, with issues in function creation, updates, and event source processing. Other services like ECS, EKS, Fargate, Amazon Connect, AWS Security Token Service (STS), and AWS Management Console sign-ins also faced disruptions, including container launch failures, elevated call errors, API errors, and authentication failures, all stemming from the initial DynamoDB issue or its cascading effects.

Engineers identified DynamoDB’s DNS state as the source by 12:38 AM PDT on October 20, and temporary mitigations were applied by 1:15 AM PDT. All DNS information was restored by 2:25 AM PDT. For EC2, after DynamoDB recovered, DWFM entered a “congestive collapse” state, which required throttling and selective restarts of DWFM hosts. Network Manager also experienced backlogs. NLB issues were mitigated by disabling automatic health check failovers. Full recovery across all impacted services was achieved by 2:20 PM PDT on October 20.
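
A toy queue model helps show why throttling can help a system recover from congestive collapse. It is a generic sketch, not a model of DWFM internals. It assumes that a request which has waited longer than a timeout is wasted work and gets retried, and the function name `run` and all parameter values are hypothetical.

```python
from collections import deque
from typing import Optional


def run(backlog: int, capacity: int, timeout: int, ticks: int,
        admit_limit: Optional[int] = None) -> int:
    """Return how many requests complete successfully within `ticks`.

    backlog:     requests waiting to be admitted at the start
    capacity:    requests the worker can process per tick
    timeout:     a request that waited more than this many ticks is stale;
                 processing it is wasted and it is re-queued (assumed behaviour)
    admit_limit: maximum queue length; None means no throttling
    """
    waiting = backlog
    queue = deque()  # holds the tick at which each request was (re-)queued
    done = 0
    for now in range(ticks):
        # Admission: unthrottled admits everything, throttled fills to the limit.
        if admit_limit is None:
            room = waiting
        else:
            room = max(0, min(waiting, admit_limit - len(queue)))
        queue.extend([now] * room)
        waiting -= room
        # Processing: stale requests consume capacity but make no progress.
        for _ in range(min(capacity, len(queue))):
            enqueued_at = queue.popleft()
            if now - enqueued_at > timeout:
                queue.append(now)  # timed out, retried at the back of the queue
            else:
                done += 1
    return done


if __name__ == "__main__":
    print("unthrottled:", run(backlog=1000, capacity=10, timeout=5, ticks=200))
    print("throttled:  ", run(backlog=1000, capacity=10, timeout=5, ticks=200,
                              admit_limit=50))
```

With these numbers, the unthrottled run completes only the requests served in the first few ticks and then stalls, because every later request has already timed out by the time it reaches the front of the queue. The throttled run keeps the queue short enough that each admitted request is served in time, so the whole backlog drains.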

Keywords

dynamodb, us-east-1, dns, race condition, ec2, network load balancer, lambda, aws