LaunchDarkly service disruption due to AWS us-east-1 outage and internal cascading failures (October 2025)
LaunchDarkly · LaunchDarkly platform
On October 19, 2025, at 11:50 PM PT, LaunchDarkly experienced a widespread service disruption. The initial phase was triggered by a major AWS us-east-1 outage that degraded services including EC2, Lambda, DynamoDB, and Route 53. As a result, the LaunchDarkly web application and API in the US became unstable and unable to autoscale, and flag delivery updates were delayed or unavailable. Client-side streaming SDKs in the US were significantly impacted, and event ingestion degraded, causing data loss.
By October 20, 11:40 AM PT, AWS services began to recover, and LaunchDarkly’s web application and API returned to normal. However, a second phase of disruption began shortly after. An internal change intended to reduce load inadvertently reverted flag delivery to a legacy routing path whose caches were cold; requests against those cold caches were slow or failed, prompting SDKs to retry en masse. The resulting retry storm overwhelmed the streaming service and its load balancer, which became unresponsive, and ongoing EC2 provisioning issues prevented the infrastructure from scaling out to absorb the load.
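A standard client-side defense against this failure mode is exponential backoff with jitter, which spreads reconnect attempts out over time instead of letting them arrive in synchronized waves. The TypeScript sketch below illustrates the idea; the function names and constants are hypothetical and are not taken from LaunchDarkly’s SDKs.

```typescript
// Hypothetical reconnect helper illustrating exponential backoff with
// "full" jitter. Names and constants are illustrative, not SDK APIs.

const BASE_DELAY_MS = 1_000;  // first retry after up to ~1s
const MAX_DELAY_MS = 30_000;  // cap so clients keep probing for recovery

function backoffWithJitter(attempt: number): number {
  // Exponential growth, capped, then a uniformly random delay in
  // [0, cappedDelay]: reconnecting clients spread out over the window
  // instead of hitting the load balancer in synchronized waves.
  const cappedDelay = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  return Math.random() * cappedDelay;
}

async function connectWithBackoff(connect: () => Promise<void>): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await connect(); // e.g. open the streaming connection
      return;          // connected successfully
    } catch {
      const delayMs = backoffWithJitter(attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

With full jitter, clients that all lost their connections at the same moment reconnect roughly uniformly over the backoff window rather than all at once, which is what keeps a recovering service from being knocked over again.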
This cascading failure caused connection errors for server-side SDKs across all regions, with error rates reaching approximately 99% globally. While the EU and APAC regions recovered by mid-afternoon, US-based streaming remained unavailable until late that night. The US commercial environment was the most affected, experiencing degradation or unavailability of the main application, flag delivery failures, and event data loss. All streaming services fully recovered by October 21, 12:05 AM PT.
LaunchDarkly is implementing several improvements: decoupling the Flag Delivery Network from the feature management application, scaling load balancers, and accelerating migration to a new fault-tolerant delivery architecture. The company is also enhancing SDK behavior to support automatic failover from streaming to polling (sketched below), relocating disaster recovery orchestration systems out of us-east-1, and improving multi-region availability and DR testing. In addition, incident communication processes are being updated so that customer workarounds are provided earlier.
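For the streaming-to-polling failover specifically, the sketch below shows one way an SDK might structure it. The `FlagSource` interface and its `runStream` and `pollOnce` methods are assumptions made for illustration, not actual LaunchDarkly SDK calls.

```typescript
// Hypothetical failover sketch: prefer streaming, drop to polling while the
// stream is down, then periodically retry streaming. All names here are
// assumptions for illustration, not LaunchDarkly SDK APIs.

type Flags = Record<string, unknown>;

interface FlagSource {
  /** Resolves or rejects when the streaming connection ends. */
  runStream(onUpdate: (flags: Flags) => void): Promise<void>;
  /** One-shot fetch of the full flag payload. */
  pollOnce(): Promise<Flags>;
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function deliverFlags(
  source: FlagSource,
  onUpdate: (flags: Flags) => void,
  pollIntervalMs = 30_000,  // how often to poll while degraded
  pollWindowMs = 120_000,   // how long to poll before retrying the stream
): Promise<never> {
  for (;;) {
    try {
      await source.runStream(onUpdate); // normal path: updates are pushed
    } catch {
      // Stream is down: keep flags reasonably fresh via polling for a
      // while, then loop around and attempt the stream again.
      const retryStreamAt = Date.now() + pollWindowMs;
      while (Date.now() < retryStreamAt) {
        try {
          onUpdate(await source.pollOnce());
        } catch {
          // Polling failed too; the SDK keeps evaluating cached flags.
        }
        await sleep(pollIntervalMs);
      }
    }
  }
}
```

The key design choice in this pattern is that polling is a temporary degraded mode: the client periodically returns to streaming so it regains low-latency updates once the service recovers, rather than staying on the slower polling path indefinitely.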