Postmortem Index

Mandrill Postgres XID Wraparound Outage February 2019

Mandrill · Mandrill

2019-02-04 – 2019-02-05 · automation · time

Mandrill experienced a significant outage from February 4, 05:35 UTC, to February 5, 22:09 UTC, which severely impacted its ability to send emails. During this period, only 80% of queued emails were being sent. The incident was triggered when a PostgreSQL database, specifically shard4, entered a safety shutdown mode.

The root cause was an impending transaction ID (XID) wraparound in PostgreSQL. Postgres uses 32-bit XIDs drawn from a counter that increments as transactions are assigned IDs. The autovacuum process must periodically freeze old rows so that their XIDs do not wrap around and appear to be in the future. When autovacuum falls too far behind, Postgres enters a safety shutdown mode and refuses new writes rather than risk that corruption. Shard4 was a “hotter” shard due to the hashing algorithm used for load balancing, so it carried a higher write load. This likely caused autovacuum on shard4 to fall behind or fail, until the XID counter approached its upper limit.
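The "appearing to be in the future" behavior follows from comparing 32-bit XIDs on a circle rather than on a line. The sketch below is an illustration of that arithmetic only, not Postgres source code: the function name `xid_precedes` is hypothetical, and the special reserved XIDs that Postgres handles separately are ignored.

```python
XID_SPACE = 2**32   # XIDs are 32-bit, so the counter wraps modulo 2^32
HALF_SPACE = 2**31  # each XID sees ~2.1 billion XIDs as "past", the rest as "future"


def xid_precedes(a: int, b: int) -> bool:
    """Return True if XID `a` is considered older than XID `b`.

    Equivalent to checking that the signed 32-bit difference (a - b)
    is negative, which is how a circular comparison is usually done.
    """
    diff = (a - b) % XID_SPACE
    return diff >= HALF_SPACE


old_row_xid = 100  # illustrative XID stamped on an old, never-frozen row

# One million transactions later, the old row is still "in the past".
print(xid_precedes(old_row_xid, old_row_xid + 1_000_000))       # True

# More than 2^31 transactions later, the same row looks "in the future".
print(xid_precedes(old_row_xid, old_row_xid + HALF_SPACE + 1))  # False
```

Once the second case occurs, the old row would no longer be visible to new transactions, which is the data loss that the safety shutdown is designed to prevent.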

The primary customer impact was a reduced ability to send transactional emails. Many jobs failed as they attempted to write to the affected database shard, leading to increased job queues and low disk space on Mandrill app servers. Mandrill later refunded affected users for purchases made between January 1 and February 13, 2019, and credited them for future purchases.

Initial recovery attempts, including a full VACUUM and a dump-and-restore operation, were estimated to need days or weeks to complete. A more radical solution was adopted instead: truncating the large “Search” and “Url” tables directly on the locked database. This freed the associated XIDs, allowing the vacuum to complete within an hour. Operational data for job queues was also moved to other shards, and storage volumes on app servers were replaced. After the incident, new XID wraparound alerting was implemented, and incident response protocols were strengthened with trained incident commanders and defined roles.
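As a rough illustration of what XID wraparound alerting can look like, the sketch below classifies per-shard XID ages against thresholds. It is not Mandrill's implementation: the threshold fractions, function names, and sample ages are all assumptions made for the example. In practice the ages would typically come from a query such as `SELECT datname, age(datfrozenxid) FROM pg_database;` run against each shard.

```python
from typing import Dict

WRAPAROUND_LIMIT = 2**31  # about 2.1 billion XIDs before wraparound

# Hypothetical alert thresholds, expressed as fractions of the limit.
WARN_FRACTION = 0.5
CRITICAL_FRACTION = 0.8


def classify_xid_age(age: int) -> str:
    """Map a database's oldest-XID age to an alert level."""
    fraction = age / WRAPAROUND_LIMIT
    if fraction >= CRITICAL_FRACTION:
        return "critical"
    if fraction >= WARN_FRACTION:
        return "warning"
    return "ok"


def check_shards(ages: Dict[str, int]) -> Dict[str, str]:
    """Classify every shard given a mapping of shard name to XID age."""
    return {shard: classify_xid_age(age) for shard, age in ages.items()}


# Made-up sample values, chosen only to exercise each alert level.
sample_ages = {
    "shard1": 150_000_000,
    "shard2": 1_200_000_000,
    "shard4": 1_900_000_000,
}

for shard, level in check_shards(sample_ages).items():
    print(f"{shard}: {level}")
```

Alerting on a fraction of the limit, rather than on a fixed count, leaves time to intervene while a manual vacuum is still feasible; the right thresholds depend on how quickly each shard consumes XIDs.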

Keywords

mandrill · postgres · xid · wraparound · database · autovacuum · email · outage