Postmortem Index

Explore incident reports from various companies

Amazon Kinesis Data Streams US-EAST-1 Degradation July 2024

Amazon · Amazon Kinesis Data Streams

2024-07-30 – 2024-07-31 cloud config-change

A service disruption occurred in the Northern Virginia (US-EAST-1) Region on July 30th, 2024, impacting several AWS services with increased latencies and elevated error rates between 2:45 PM PDT and 9:37 PM PDT. The issue stemmed from a degradation in one of Amazon Kinesis Data Streams’ internally-used cells. Although the triggering routine deployment began at 9:09 AM PDT, customer-facing impact did not start until 2:45 PM PDT. Initial improvements were observed at 5:39 PM PDT, with significant recovery by 7:21 PM PDT, and normal operations restored by 9:37 PM PDT. Backlogs for CloudWatch Logs and S3 events took longer to clear, fully resolving by 5:50 AM PDT on July 31st and 2:38 AM PDT on August 1st, respectively.

The root cause was an impairment in a Kinesis Data Streams cell that had recently been migrated to a new architecture. This specific cell handled a novel workload profile characterized by an unusually high number of very low-throughput shards. During a routine deployment, the cell management system, which focuses on balancing work based on throughput, unevenly distributed these low-throughput shards. This resulted in a small number of hosts processing an excessive number of shards.
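The failure mode described above can be illustrated with a toy balancer. This is a hypothetical sketch, not the actual Kinesis cell management algorithm: a greedy assigner that equalizes per-host throughput while ignoring per-host shard counts. With a few hot shards and many near-idle ones, throughput ends up "balanced" while shard counts skew wildly.

```python
import heapq

def assign_by_throughput(shard_throughputs, num_hosts):
    """Greedily place each shard on the host with the least total
    throughput, ignoring how many shards each host already holds."""
    heap = [(0.0, h) for h in range(num_hosts)]  # (total throughput, host id)
    counts = [0] * num_hosts
    for tput in sorted(shard_throughputs, reverse=True):
        total, host = heapq.heappop(heap)
        counts[host] += 1
        heapq.heappush(heap, (total + tput, host))
    return counts

# Hypothetical workload: 4 hot shards plus 10,000 very low-throughput shards.
shards = [100.0] * 4 + [0.01] * 10_000
counts = assign_by_throughput(shards, num_hosts=8)
print(sorted(counts))  # [1, 1, 1, 1, 2500, 2500, 2500, 2500]
```

Throughput per host is well balanced (the cold hosts never catch up to the hot ones), yet half the hosts end up managing 2,500 shards each while the rest hold one. Per-shard overhead, not throughput, is what overwhelms those hosts.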

These overloaded hosts generated abnormally large status messages for the cell management system, which were then delayed or unprocessed. The system misinterpreted these delays as host failures and initiated a rapid redistribution of shards. This “redistribution storm” subsequently overloaded a critical subsystem responsible for provisioning secure connections for Kinesis data plane communication, leading to impaired traffic processing.
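The misclassification step can be sketched as a simple timeout check. All names here are hypothetical illustrations, not the real system's API: a monitor that cannot distinguish a host whose status report is merely delayed from one that has crashed, so it reassigns every shard on the "failed" host in a single burst.

```python
REPORT_TIMEOUT_S = 5.0  # hypothetical threshold

def plan_reassignments(report_ages, shard_map):
    """Return the shards the monitor would move, treating any host whose
    last status report is older than the timeout as failed."""
    moves = []
    for host, age in report_ages.items():
        if age > REPORT_TIMEOUT_S:  # a slow report looks identical to a crash
            moves.extend(shard_map.get(host, []))
    return moves

# host-a is overloaded (large, delayed status message), not dead -- but its
# thousands of shards are queued for reassignment at once, and each move
# requires provisioning new secure connections: the redistribution storm.
shard_map = {"host-a": [f"shard-{i}" for i in range(2500)], "host-b": ["shard-x"]}
ages = {"host-a": 12.0, "host-b": 1.0}
print(len(plan_reassignments(ages, shard_map)))  # 2500
```

Each reassigned shard drives work to the connection-provisioning subsystem, which is why a burst of this size could overload it even though no host had actually failed.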

Customers experienced elevated error rates and latencies across various AWS services, including CloudWatch Logs, Amazon Data Firehose, the Amazon S3 event framework, Amazon Elastic Container Service (ECS), AWS Lambda, Amazon Redshift, Amazon Managed Workflows for Apache Airflow (MWAA), and AWS Glue. For instance, ECS tasks using blocking log drivers were impacted, Lambda functions experienced missing CloudWatch Logs, and Redshift users faced intermittent connection issues. While Firehose experienced increased failures, no data loss was reported.

Engineering teams mitigated the issue by deploying changes to shed load from less time-sensitive internal workloads and by adding additional capacity to the secure connection provisioning subsystem. Further capacity improvements were made to enhance connection provisioning. These changes, including increased capacity for data plane connections, new load-shedding tooling, and adjusted connection limits, have been implemented in US-EAST-1 and other regions with the new Kinesis architecture. AWS plans further updates to the cell management system to better manage such workload profiles.

Keywords

kinesis · us-east-1 · cloudwatch logs · s3 · firehose · ecs · lambda · redshift