Postmortem Index

Amazon S3 Availability Event: July 20, 2008

Amazon · Amazon S3

2008-07-20 cloud

On July 20, 2008, at 8:40am PDT, error rates in Amazon S3 began climbing quickly across all datacenters. By 8:50am PDT, error rates were significantly elevated and very few customer requests were completing successfully. Engineers were engaged by 8:55am PDT, and by 9:41am PDT they had determined that servers within Amazon S3 were having difficulty communicating with each other.

The core issue was that Amazon S3 servers, which use a gossip protocol to spread server state information, were spending almost all of their time gossiping, and many were failing while doing so. As a result, the system could not process most customer requests. At 10:32am PDT, the S3 team decided to shut down all server-to-server communication and all request processing components, clear the system’s state, and then reactivate the request processing components. The shutdown and state clearing were complete by 11:05am PDT.
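To illustrate why gossip spreads state (correct or not) so quickly, here is a minimal sketch of a generic push-gossip round. It is not Amazon S3's actual protocol, whose details the report does not give. The function name `gossip_rounds` and its parameters are hypothetical, and the model assumes each informed server forwards an update to one random peer per round.

```python
import random


def gossip_rounds(num_servers: int, seed: int = 0, max_rounds: int = 1000) -> int:
    """Count push-gossip rounds until every server has seen an update.

    Assumed model (not S3's real design): server 0 starts with the update,
    and in each round every informed server forwards it to one other
    server chosen uniformly at random.
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < num_servers and rounds < max_rounds:
        for sender in list(informed):
            # Pick a peer other than the sender itself.
            peer = rng.randrange(num_servers - 1)
            if peer >= sender:
                peer += 1
            informed.add(peer)
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, "servers:", gossip_rounds(n), "rounds")
```

Because the informed set can roughly double each round, the number of rounds grows far more slowly than the number of servers. The same property that makes gossip efficient also lets a bad piece of state reach the whole fleet in a short time.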

Internal communication between servers was restored by 2:20pm PDT, at which point reactivation of request processing components began concurrently in the US and EU. The EU location returned to normal by 3:10pm PDT, and the US location by 4:58pm PDT.

The root cause was identified as message corruption. A handful of internal state messages had a single bit corrupted, making the system state information incorrect. Unlike customer object data, which uses MD5 checksums, there was no protection in place to detect corruption of this internal state information, allowing it to spread throughout the system and cause widespread communication failures.
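The effect of a single flipped bit, and how a checksum catches it, can be sketched as follows. This is an illustration under stated assumptions, not the format S3 uses: the JSON encoding, the trailing CRC32, and the names `encode_state`, `decode_state`, and `flip_bit` are all hypothetical.

```python
import json
import zlib


def encode_state(state: dict) -> bytes:
    """Serialize a state message and append a 4-byte CRC32 of the payload."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    checksum = zlib.crc32(payload).to_bytes(4, "big")
    return payload + checksum


def decode_state(message: bytes) -> dict:
    """Verify the trailing CRC32 before trusting the payload."""
    payload, checksum = message[:-4], message[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        raise ValueError("corrupted state message rejected")
    return json.loads(payload)


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Return a copy of the message with exactly one bit inverted."""
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    message = encode_state({"server": "node-17", "status": "up"})
    print(decode_state(message))  # intact message decodes normally
    try:
        decode_state(flip_bit(message, 10))
    except ValueError as err:
        print(err)  # the receiver drops the message instead of acting on it
```

Without the check in `decode_state`, a flipped bit that still produces a parseable message would be accepted and forwarded like any other state update. A CRC detects every single-bit error, which is the failure mode described above; a cryptographic hash would serve equally well for this purpose.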

As remediation, Amazon S3 deployed changes to significantly reduce the time needed to restore the system and modified how it gossips about failed servers to prevent similar behavior. Additional monitoring and alarming for gossip rates and failures were implemented. Checksums are also being added so that corrupted system state messages are detected and rejected before they can spread.
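The monitoring and alarming mentioned above might look something like the sliding-window check below. This is only a sketch: the class name `GossipMonitor`, the window size, and the 20% threshold are assumptions chosen for illustration, not values from the report.

```python
from collections import deque


class GossipMonitor:
    """Tracks recent gossip attempts and flags an abnormal failure ratio."""

    def __init__(self, window: int = 100, max_failure_ratio: float = 0.2):
        # Only the most recent `window` outcomes are kept.
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.max_failure_ratio = max_failure_ratio

    def record(self, success: bool) -> None:
        """Record the outcome of one gossip exchange."""
        self.results.append(success)

    def failure_ratio(self) -> float:
        """Fraction of recorded exchanges in the window that failed."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alarm(self) -> bool:
        """True when failures in the window exceed the configured ratio."""
        return self.failure_ratio() > self.max_failure_ratio


if __name__ == "__main__":
    monitor = GossipMonitor(window=10, max_failure_ratio=0.2)
    for outcome in [True] * 7 + [False] * 3:
        monitor.record(outcome)
    print(monitor.failure_ratio(), monitor.should_alarm())
```

A bounded window keeps the alarm responsive to a sudden change, such as the rapid rise in failed gossip exchanges described in this incident, rather than averaging it away over a long history.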

Keywords

s3, amazon, aws, availability, gossip protocol, message corruption, datacenter, us, eu, amazon s3, outage