{"UUID":"43164f6c-de49-4e61-aa58-a394a400d431","URL":"https://launchdarkly.com/blog/what-happened-what-we-learned-and-how-were-improving/","ArchiveURL":"","Title":"LaunchDarkly service disruption due to AWS us-east-1 outage and internal cascading failures (October 2025)","StartTime":"2025-10-20T06:50:00Z","EndTime":"2025-10-21T07:05:00Z","Categories":["automation","cascading-failure","cloud"],"Keywords":["launchdarkly","aws","us-east-1","feature flags","streaming","sdk","ec2","dynamodb"],"Company":"LaunchDarkly","Product":"LaunchDarkly platform","SourcePublishedAt":"2025-10-27T23:51:18Z","SourceFetchedAt":"2026-05-04T19:51:40.69344Z","Summary":"The AWS us-east-1 outage degraded EC2/Lambda/DynamoDB/Route 53, leaving LaunchDarkly's US web app and API unable to autoscale, degrading client-side streaming SDKs, and dropping events. After AWS recovered, an internal change meant to shed load reverted flag delivery to a legacy routing path with cold caches; SDKs hammered the streaming service with retries, the load balancer became unresponsive, and EC2 provisioning issues prevented scale-out, taking server-side streaming globally to ~99% errors and keeping US streaming down for another ~12 hours.","Description":"On October 19, 2025, at 11:50 PM PT, LaunchDarkly experienced a widespread service disruption. The initial phase was triggered by a major AWS us-east-1 outage, which degraded services including EC2, Lambda, DynamoDB, and Route 53. This left LaunchDarkly's US web application and API unstable and unable to autoscale, and flag delivery updates delayed or unavailable. Client-side streaming SDKs in the US were significantly impacted, and event ingestion degraded, causing data loss.\n\nBy October 20, 11:40 AM PT, AWS services began to recover, and LaunchDarkly's web application and API returned to normal. However, a second phase of disruption began shortly after. 
An internal change, intended to reduce load, inadvertently reverted flag delivery to a legacy routing path with cold caches. This triggered excessive retries from SDKs, overwhelming the streaming service and its load balancer, which became unresponsive. Ongoing EC2 provisioning issues prevented the infrastructure from scaling out.\n\nThis cascading failure left server-side SDKs in all regions experiencing connection errors, peaking at approximately 99% globally. While the EU and APAC regions recovered by mid-afternoon, US-based streaming remained unavailable until late that night. The US commercial environment was hit hardest, with the main application degraded or unavailable, flag delivery failing, and event data lost. All streaming services fully recovered by October 21, 12:05 AM PT.\n\nLaunchDarkly is implementing several improvements: decoupling the Flag Delivery Network from the feature management application, scaling load balancers, and accelerating migration to a new fault-tolerant delivery architecture. It is also enhancing SDK behavior to support automatic failover from streaming to polling, relocating disaster recovery orchestration systems out of us-east-1, and improving multi-region availability and DR testing. Incident communication processes are also being updated to provide customer workarounds earlier."}