Postmortem Index

Explore incident reports from various companies

incident.io service disruption during AWS us-east-1 outage on October 20, 2025

incident.io

On October 20, 2025, incident.io experienced a service disruption during a major AWS us-east-1 outage (07:11 to 10:53 UTC). Although incident.io’s core platform is hosted on Google Cloud, several key services were impacted due to dependencies on AWS-hosted third parties. These included Scribe (AI note taker), SAML authentication, Status Pages, and on-call notifications, leading to intermittent unavailability and delays for users.

Specific failures included Scribe being unable to join calls from 07:34 to 10:07 UTC, and again from 12:40 to 17:37 UTC, because of issues with an AWS-hosted transcription provider. SAML authentication was degraded between 07:17 and 09:28 UTC due to an AWS-hosted authentication provider, exacerbated by issues with Slack. Status Pages saw a burst of errors in two regions (europe-west3 and australia-southeast1) from 07:54 to 08:19 UTC due to an infrastructure provider. On-call notifications via SMS and phone were delayed from 07:20 to 09:26 UTC due to a telecom provider outage, which also cascaded to other notification types.

A critical issue was the failure of incident.io’s deployment pipeline: it could not pull the golang:1.24.9-alpine base image from Docker Hub, which itself runs on AWS, so urgent fixes could not be shipped. Attempted workarounds, such as pulling through mirror.gcr.io or relying on cached images, were unsuccessful until Docker Hub partially recovered at 09:45 UTC. Separately, scaling up the Kubernetes deployments for escalation workers produced a net decrease in throughput: Postgres dead tuples had accumulated in the escalation-acquisition index, making each escalation acquisition an expensive operation, so the extra workers mainly added database pressure.
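To illustrate the kind of registry fallback the workaround attempted, here is a minimal Go sketch, not incident.io’s actual pipeline, that shells out to docker pull and tries mirror.gcr.io before docker.io. The helper name and CI wiring are assumptions; during this incident the mirror route also failed until Docker Hub partially recovered.

```go
package main

import (
	"fmt"
	"os/exec"
)

// pullWithFallback tries each registry in order and stops at the first
// successful pull. Registry hosts and the image reference are illustrative.
func pullWithFallback(image string, registries []string) error {
	var lastErr error
	for _, reg := range registries {
		ref := reg + "/" + image
		if out, err := exec.Command("docker", "pull", ref).CombinedOutput(); err != nil {
			lastErr = fmt.Errorf("pull %s: %v: %s", ref, err, out)
			continue // try the next registry
		}
		fmt.Println("pulled", ref)
		return nil
	}
	return lastErr
}

func main() {
	// mirror.gcr.io serves cached copies of Docker Hub "library" images;
	// docker.io is Docker Hub itself.
	registries := []string{"mirror.gcr.io", "docker.io"}
	if err := pullWithFallback("library/golang:1.24.9-alpine", registries); err != nil {
		fmt.Println("all registries failed:", err)
	}
}
```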
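To make the dead-tuple effect concrete, the following Go sketch, again illustrative rather than incident.io’s code, reads live- and dead-tuple counts for a hypothetical escalations table from pg_stat_user_tables. A high dead-to-live ratio means the index scan behind each acquisition must step over many dead entries, which is why adding workers raised load faster than throughput.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// The connection string and table name ("escalations") are assumptions for illustration.
	db, err := sql.Open("postgres", "postgres://localhost/incidents?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var live, dead int64
	err = db.QueryRowContext(context.Background(),
		`SELECT n_live_tup, n_dead_tup
		   FROM pg_stat_user_tables
		  WHERE relname = $1`, "escalations").Scan(&live, &dead)
	if err != nil {
		log.Fatal(err)
	}

	// When this ratio climbs, autovacuum is falling behind and queries that
	// scan the escalation-acquisition index pay for every dead entry.
	fmt.Printf("live=%d dead=%d dead/live=%.2f\n", live, dead, float64(dead)/float64(live+1))
}
```

Lowering the table’s autovacuum thresholds (for example autovacuum_vacuum_scale_factor) is the usual lever for keeping such dead tuples in check, which matches the autovacuum tuning listed in the remediations.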

Another contributing factor was a traffic management feature flag that failed to apply globally: a recent usability tweak meant limits were applied only to the top 10 organizations by traffic volume, rendering the flag ineffective as a broader control. Remediation efforts included removing the Docker Hub dependency, implementing region switching for Scribe, optimizing escalation acquisition logic and autovacuum settings, and planning for telecom redundancy. The incident highlighted the complex nature of third-party dependencies and the difficulty of predicting cascading failures during large-scale cloud outages.
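As a rough sketch of that failure mode, with invented names and thresholds: if a traffic-limit flag only enumerates the largest organizations and everyone else falls through to an unlimited default, it cannot act as a global brake during an incident.

```go
package main

import "fmt"

// trafficLimits is a hypothetical model of the flag described above:
// explicit limits for the top organizations only, plus a default for the rest.
type trafficLimits struct {
	perOrg        map[string]int // requests/minute for the top orgs by volume
	globalDefault int            // 0 is treated as "no limit"
}

func (t trafficLimits) limitFor(orgID string) int {
	if l, ok := t.perOrg[orgID]; ok {
		return l
	}
	return t.globalDefault
}

func main() {
	limits := trafficLimits{
		perOrg:        map[string]int{"org-1": 100, "org-2": 100}, // top-N only
		globalDefault: 0,                                          // no global cap
	}
	fmt.Println(limits.limitFor("org-1"))    // 100: limited
	fmt.Println(limits.limitFor("org-9999")) // 0: the flag is a no-op for this org
}
```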

Keywords

aws, us-east-1, google cloud, scribe, authentication, notifications, deployment pipeline, postgres