{"UUID":"f9f4ba4c-686c-46a4-982c-8dc21bd8a448","URL":"https://status.incident.io/incidents/01JRDFKAGE07YYDY0KZR137BX3/write-up","ArchiveURL":"","Title":"incident.io database outage due to PGAudit","StartTime":"2025-04-09T14:16:00Z","EndTime":"2025-04-09T14:27:00Z","Categories":["cascading-failure"],"Keywords":["postgres","pgaudit","database","migration","locks","outage","incidentio"],"Company":"incident.io","Product":"database","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T19:52:53.897457Z","Summary":"After a Postgres 17 upgrade the weekend before, PGAudit was re-enabled based on staging testing. A routine migration to create an empty table and add an index triggered a pathological interaction with PGAudit: the extension hung while holding critical locks, ignored timeout signals, and blocked other DB operations across the dashboard, mobile app, Slack app, and API. The primary was restarted to break the deadlock (~2 minutes hard outage), then PGAudit was removed entirely.","Description":"On Wednesday, April 9, 2025, between 14:16 and 14:27 UTC, incident.io experienced intermittent availability issues, culminating in a 2-minute database outage from 14:25 to 14:27 UTC. This affected their dashboard, mobile app, Slack app, and API, though on-call alerts remained operational.\n\nThe incident stemmed from a pathological interaction involving the PGAudit extension. PGAudit had been re-enabled in production after a recent Postgres 17 upgrade and successful staging environment testing.\n\nAt 14:16 UTC, a routine database migration, intended to create an empty table and add an index, triggered PGAudit to become unresponsive. The extension held critical database locks and ignored timeout signals, preventing the release of these locks.\n\nAttempts to kill the offending PGAudit processes failed, leading to cascading effects and intermittent slowness. To resolve the deadlock, the primary database was restarted, causing a 2-minute hard outage. 
Following restoration, PGAudit was temporarily disabled and then completely removed to prevent recurrence.\n\nincident.io plans further investigation, an internal debrief, and enhancements to their processes and monitoring capabilities to better detect and prevent similar situations."}