
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region

Amazon · DynamoDB

A brief network disruption occurred on Sunday, September 20, at 2:19am PDT, affecting a portion of Amazon DynamoDB’s storage servers in the US-East Region. The affected storage servers lost contact with the internal metadata service and, once connectivity returned, queried it for their membership assignments. The metadata service's responses took longer than the allowed retrieval time, so those storage servers removed themselves from service and stopped accepting requests.

The root cause was a combination of factors. Rapid adoption of Global Secondary Indexes (GSIs) had significantly increased the size of storage server membership data, which the metadata service was not adequately provisioned to handle. Insufficient monitoring for membership size meant capacity planning did not account for these heavier requests. When the network disruption triggered simultaneous requests for these larger memberships, the metadata service became overloaded, leading to timeouts and a retry storm that further exacerbated the load.
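The retry storm described above is the kind of failure that randomized, capped exponential backoff is commonly used to dampen. The following is a minimal sketch of that general technique, not a description of how DynamoDB's storage servers actually retry; the function name `backoff_delays` and the `base` and `cap` values are assumptions chosen for illustration.

```python
import random


def backoff_delays(max_attempts, base=0.5, cap=30.0, rng=None):
    """Yield one randomized wait time (in seconds) per retry attempt.

    Hypothetical helper: `base` and `cap` are illustrative values,
    not figures taken from the incident report.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        # The upper bound doubles on every attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Draw uniformly from [0, ceiling] so that clients that failed at
        # the same moment do not all retry at the same moment.
        yield rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(6)):
        print(f"attempt {attempt}: wait {delay:.2f}s before retrying")
```

Spreading retries out in time this way gives an overloaded dependency room to drain its queue instead of receiving every failed request again at once.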

This led to a peak error rate of approximately 55% for DynamoDB customer requests by 2:37am PDT. Other AWS services dependent on DynamoDB, including SQS, EC2 Auto Scaling, and CloudWatch, also experienced elevated errors, increased latencies, or delays in operations.

To mitigate the issue, requests to the metadata service were paused at 5:06am PDT to relieve load, which allowed administrative requests to succeed and add significant capacity. DynamoDB was largely restored to normal operations by 7:10am PDT. A lingering issue with a metadata partition affecting a small number of customers was resolved on Monday.

Corrective actions include significantly increasing metadata service capacity, implementing stricter monitoring for performance dimensions like membership size, reducing the rate at which storage nodes request membership data, and lengthening the time allowed to process those queries. Longer-term, the DynamoDB service will be segmented so that it runs multiple metadata service instances, each serving only a portion of the storage servers, to contain the impact of future failures.
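The summary does not say how the segmentation will be implemented. As a rough illustration of the idea only, the sketch below assigns each storage server to one of several metadata service instances with a stable hash, so that an overloaded instance affects only its own share of servers. The function `assign_instance`, the server ID format, and the instance count are all hypothetical.

```python
import hashlib
from collections import Counter


def assign_instance(server_id: str, instance_count: int) -> int:
    """Map a storage server to a metadata service instance (hypothetical scheme).

    A cryptographic hash is used only because it is stable across
    processes and spreads IDs evenly; it is not a security measure.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    digest = hashlib.sha256(server_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then bucket it.
    return int.from_bytes(digest[:8], "big") % instance_count


if __name__ == "__main__":
    # Made-up server IDs, purely for illustration.
    servers = [f"storage-{n:04d}" for n in range(1000)]
    spread = Counter(assign_instance(s, 4) for s in servers)
    for instance, count in sorted(spread.items()):
        print(f"metadata instance {instance}: {count} storage servers")
```

With a fixed mapping like this, a failure in one instance is confined to the servers assigned to it. A production design would also need a way to rebalance when instances are added, which this sketch does not attempt.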

Keywords

dynamodb, us-east, aws, metadata service, network disruption, global secondary indexes, gsi, storage servers