Postmortem Index

incident.io database outage due to PGAudit

incident.io · database

2025-04-09 cascading-failure

On Wednesday, April 9, 2025, between 14:16 and 14:27 UTC, incident.io experienced intermittent availability issues, culminating in a 2-minute database outage from 14:25 to 14:27 UTC. This affected their dashboard, mobile app, Slack app, and API, though on-call alerts remained operational.

The incident stemmed from a pathological interaction involving the PGAudit extension. PGAudit had been re-enabled in production after a recent Postgres 17 upgrade and successful staging environment testing.

At 14:16 UTC, a routine database migration, intended to create an empty table and add an index, triggered PGAudit to become unresponsive. The extension held critical database locks and ignored timeout signals, preventing the release of these locks.
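A common guard for migrations like this is to bound how long they may wait for or hold locks. The sketch below is illustrative only and is not taken from incident.io's tooling: the helper `wrap_migration`, the table and index names, and the timeout values are all hypothetical. It only builds a SQL script using Postgres's `lock_timeout` and `statement_timeout` settings. In this incident the extension reportedly ignored timeout signals, so such settings would not necessarily have prevented the problem.

```python
def wrap_migration(statements, lock_timeout_ms=2000, statement_timeout_ms=10000):
    """Return one transactional SQL script with timeouts scoped to it.

    SET LOCAL limits the settings to the enclosing transaction, so they
    do not leak into other work done on the same connection.
    """
    if lock_timeout_ms <= 0 or statement_timeout_ms <= 0:
        raise ValueError("timeouts must be positive")
    lines = [
        "BEGIN;",
        f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms';",
        f"SET LOCAL statement_timeout = '{statement_timeout_ms}ms';",
    ]
    for stmt in statements:
        # Normalise each statement so it ends with exactly one semicolon.
        lines.append(stmt.strip().rstrip(";") + ";")
    lines.append("COMMIT;")
    return "\n".join(lines)


# Hypothetical migration: create an empty table, then index it.
MIGRATION = [
    "CREATE TABLE example_events (id bigint PRIMARY KEY, created_at timestamptz NOT NULL)",
    "CREATE INDEX example_events_created_at_idx ON example_events (created_at)",
]

script = wrap_migration(MIGRATION)
print(script)
```

The generated script could be handed to any Postgres client. If a timeout fires, the statement errors out and the transaction rolls back, which normally releases its locks.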

Attempts to kill the offending PGAudit processes failed, leading to cascading effects and intermittent slowness. To release the stuck locks, the primary database was restarted, causing a 2-minute hard outage. Once service was restored, PGAudit was temporarily disabled and then removed completely to prevent recurrence.
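The escalation described above (try to stop the blocking process, then restart when that fails) can be sketched as a small decision function. This is an assumption-laden illustration, not incident.io's runbook: the `Backend` record, `next_action`, and the thresholds are hypothetical. The SQL uses standard Postgres facilities (`pg_stat_activity`, `pg_blocking_pids`, `pg_cancel_backend`, `pg_terminate_backend`).

```python
from dataclasses import dataclass

# Query listing sessions that are currently blocked, and by which PIDs.
BLOCKED_SQL = """
SELECT a.pid,
       a.state,
       now() - a.xact_start AS xact_age,
       pg_blocking_pids(a.pid) AS blocked_by
FROM pg_stat_activity AS a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0;
"""


@dataclass
class Backend:
    """Hypothetical record describing one blocking backend."""
    pid: int
    blocking_seconds: float   # how long it has been blocking others
    cancel_attempts: int      # pg_cancel_backend calls already tried
    terminate_attempts: int   # pg_terminate_backend calls already tried


def next_action(b, grace_seconds=5.0, max_attempts=2):
    """Pick the gentlest remaining step for a blocking backend."""
    if b.blocking_seconds < grace_seconds:
        return "wait"
    if b.cancel_attempts < max_attempts:
        # Cancels the current query only; the session stays connected.
        return f"SELECT pg_cancel_backend({b.pid});"
    if b.terminate_attempts < max_attempts:
        # Ends the whole session.
        return f"SELECT pg_terminate_backend({b.pid});"
    # Neither request took effect: a human decides on restart or failover.
    return "escalate"


print(next_action(Backend(pid=4242, blocking_seconds=30.0,
                          cancel_attempts=2, terminate_attempts=2)))
```

Returning `"escalate"` rather than restarting automatically keeps the most disruptive step, the one that caused the 2-minute outage here, behind a human decision.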

incident.io plans further investigation, an internal debrief, and improvements to its processes and monitoring to better detect and prevent similar situations.

Keywords

postgres, pgaudit, database, migration, locks, outage, incidentio