{"UUID":"00b7f759-21f8-4767-8803-f09d863005cd","URL":"https://aws.amazon.com/message/65649/","ArchiveURL":"","Title":"Amazon SimpleDB US East Region Disruption on June 13","StartTime":"0001-01-01T00:00:00Z","EndTime":"0001-01-01T00:00:00Z","Categories":["cloud"],"Keywords":["simpledb","us east","power loss","lock service","handshake timeout","data center","api errors"],"Company":"Amazon","Product":"SimpleDB","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T19:51:05.703409Z","Summary":"Multiple SimpleDB storage nodes lost power simultaneously in one US-East data center. The lock service de-registered them rapidly, which spiked load and pushed handshake latencies above SimpleDB's too-aggressive handshake timeout. Healthy storage and metadata nodes failed their handshakes, removed themselves from the cluster, and couldn't rejoin because the metadata nodes that would authorize them had also taken themselves out. Recovery required manually raising the handshake timeout and restarting metadata nodes.","Description":"On June 13, the Amazon SimpleDB service in the US East Region experienced a disruption. The service was unavailable for all API calls, with the exception of a fraction of eventually consistent read calls, from 9:16 PM to 11:16 PM PDT. Following this, elevated error rates for CreateDomain and DeleteDomain API calls persisted until 1:30 AM PDT.\n\nThe incident began when multiple SimpleDB storage nodes simultaneously lost power in a single data center. While SimpleDB is designed to tolerate multiple node failures, this particular failure pattern caused a sudden, significant spike in load on the internal lock service as it rapidly de-registered the failed nodes.\n\nThe increased load drove up handshake latencies between healthy SimpleDB nodes and the lock service, and nodes could no longer complete their periodic handshakes within the predefined \"handshake timeout\", a value that had been set too low. After multiple retries and timeouts, both storage and metadata nodes removed themselves from the production cluster, causing API requests to return 500 server-side errors.\n\nRecovery then stalled: affected storage nodes could not rejoin the cluster without authorization from metadata nodes, but those metadata nodes were down due to the same handshake timeout problem, creating a deadlock in which neither could recover independently.\n\nTo resolve the issue, engineers manually increased the handshake timeout values and restarted a subset of metadata nodes, which could then authorize the storage nodes and bring them back online. CreateDomain and DeleteDomain API calls were throttled until full recovery. Two key improvements were identified: lengthening the lock service handshake timeout, and revising node self-removal behavior so that nodes no longer exit the cluster immediately after multiple handshake timeouts."}