Amazon SimpleDB US East Region Disruption on June 13

On June 13, the Amazon SimpleDB service in the US East Region experienced a disruption. The service was unavailable for all API calls, with the exception of a fraction of eventually consistent read calls, from 9:16 PM to 11:16 PM PDT. Following this, elevated error rates for CreateDomain and DeleteDomain API calls persisted until 1:30 AM PDT.

The incident began when multiple SimpleDB storage nodes simultaneously lost power in a single data center. While SimpleDB is designed to handle multiple node failures, this specific pattern resulted in a sudden and significant increase in load on the internal lock service as it rapidly de-registered the failed nodes.

This increased load raised handshake latencies between healthy SimpleDB nodes and the lock service. Nodes could not complete their periodic handshakes within the predefined “handshake timeout,” which was set too low for this load. After multiple retries and timeouts, both storage and metadata nodes removed themselves from the production cluster, causing API requests to return 500 server-side errors.
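
The self-removal behavior described above can be illustrated with a minimal Python sketch. The class name, the timeout value, and the retry limit are hypothetical; this summary does not give the actual values or implementation used by SimpleDB.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy model of a node that handshakes periodically with a lock service."""
    name: str
    handshake_timeout_s: float      # hypothetical timeout, in seconds
    max_failures: int               # hypothetical limit on consecutive timeouts
    consecutive_failures: int = 0
    in_cluster: bool = True

    def handshake(self, observed_latency_s: float) -> None:
        """Record one handshake attempt with the given observed latency."""
        if not self.in_cluster:
            return  # a node that has removed itself no longer handshakes
        if observed_latency_s <= self.handshake_timeout_s:
            self.consecutive_failures = 0  # success resets the counter
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.in_cluster = False  # node removes itself from the cluster


def survives(node: Node, latencies_s: list[float]) -> bool:
    """Replay a sequence of handshake latencies; report whether the node stays."""
    for latency in latencies_s:
        node.handshake(latency)
    return node.in_cluster


# Latencies rise once the lock service is overloaded (illustrative numbers).
surge = [0.2, 2.0, 2.5, 3.0]
short_timeout = Node("storage-1", handshake_timeout_s=1.0, max_failures=3)
long_timeout = Node("storage-2", handshake_timeout_s=5.0, max_failures=3)
print(survives(short_timeout, surge), survives(long_timeout, surge))
```

In this model the same latency surge removes the node with the short timeout but not the node with the longer one, which is the reasoning behind the later decision to raise the timeout.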

Recovery was then blocked because affected storage nodes could not rejoin the cluster without authorization from metadata nodes. Those metadata nodes were also down due to the same handshake timeout problem, creating a deadlock in which neither type of node could recover on its own.
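
The dependency behind this deadlock can be sketched as a small Python model. All names here are hypothetical, and the model assumes only what the summary states: a storage node needs a live metadata node to authorize its rejoin, and metadata nodes did not come back without intervention.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of the rejoin dependency between storage and metadata nodes."""
    metadata_live: set = field(default_factory=set)
    storage_live: set = field(default_factory=set)
    storage_down: set = field(default_factory=set)

    def try_rejoin_all(self) -> int:
        """Attempt to rejoin every down storage node; return how many rejoined."""
        if not self.metadata_live:
            return 0  # nobody can authorize the rejoin: no progress is possible
        rejoined = len(self.storage_down)
        self.storage_live |= self.storage_down
        self.storage_down = set()
        return rejoined

    def operator_restart_metadata(self, name: str) -> None:
        """Manual intervention: an engineer brings one metadata node back."""
        self.metadata_live.add(name)


cluster = Cluster(storage_down={"storage-1", "storage-2"})
print(cluster.try_rejoin_all())   # stuck while no metadata node is live
cluster.operator_restart_metadata("metadata-1")
print(cluster.try_rejoin_all())   # storage nodes can now be authorized
```

However many times the automatic rejoin is retried, it makes no progress until something outside the cycle (here, the operator) restores a metadata node.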

To resolve the issue, engineers manually increased the handshake timeout values and restarted a subset of metadata nodes, which could then authorize storage nodes and bring them back online. CreateDomain and DeleteDomain API calls were throttled until full recovery. Two improvements were identified: a longer lock service handshake timeout, and revised node self-removal behavior so that nodes do not immediately exit the cluster after multiple handshake timeouts.
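
The summary does not say what the revised self-removal behavior looks like. One possible approach, shown here purely as an assumption, is for a node to stay in the cluster and retry with capped exponential backoff instead of exiting. The class name and all constants in this Python sketch are hypothetical.

```python
from dataclasses import dataclass


def backoff_delay_s(failures: int, base_s: float = 1.0, cap_s: float = 8.0) -> float:
    """Capped exponential backoff: base, 2*base, 4*base, ... up to cap_s."""
    return min(cap_s, base_s * 2 ** max(failures - 1, 0))


@dataclass
class PatientNode:
    """Hypothetical node that backs off on handshake timeouts instead of exiting."""
    handshake_timeout_s: float
    consecutive_failures: int = 0
    in_cluster: bool = True  # never cleared by timeouts in this sketch

    def handshake(self, observed_latency_s: float) -> float:
        """Record one attempt; return seconds to wait before the next attempt."""
        if observed_latency_s <= self.handshake_timeout_s:
            self.consecutive_failures = 0
            return backoff_delay_s(1)  # back to the base interval
        self.consecutive_failures += 1
        # Waiting longer also reduces the load this node adds to the lock service.
        return backoff_delay_s(self.consecutive_failures)


node = PatientNode(handshake_timeout_s=1.0)
delays = [node.handshake(latency) for latency in [2.0, 2.0, 2.0, 2.0, 2.0, 0.3]]
print(delays, node.in_cluster)
```

The design choice in this sketch is that repeated timeouts slow the node down rather than remove it, so a temporary overload of the lock service does not turn into a cluster-wide exit.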
