GitHub.com availability issues in September 2012

GitHub · GitHub.com

GitHub.com experienced two outages and a period of degraded performance early in the week of September 10, 2012, totaling 1 hour and 46 minutes of downtime plus another hour of significantly degraded performance. The incidents stemmed from a newly deployed 3-node MySQL cluster and its high-availability management stack.

The first incident on Monday, September 10th, began with a database schema migration that generated unusually high load. This caused Percona Replication Manager’s health checks to fail on the primary MySQL server, triggering an automated failover. The new primary had a cold InnoDB buffer pool, leading to poor performance and a subsequent failback to the original server.
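
To illustrate how a load spike alone can trip an automated failover, here is a minimal Python sketch. It is not Percona Replication Manager's actual logic: `should_fail_over`, the timeout, the failure threshold, and the latency figures are all hypothetical, chosen only for illustration. The point is that a health check with a fixed timeout cannot tell a slow primary from a dead one.

```python
from typing import Sequence


def should_fail_over(
    latencies_s: Sequence[float],
    timeout_s: float,
    max_consecutive_failures: int,
) -> bool:
    """Return True once enough consecutive health checks exceed the timeout.

    Hypothetical model: each entry in latencies_s is how long one health
    check took to answer. A check slower than timeout_s counts as failed,
    whether the server is down or merely overloaded.
    """
    consecutive = 0
    for latency in latencies_s:
        if latency > timeout_s:
            consecutive += 1
            if consecutive >= max_consecutive_failures:
                return True
        else:
            # Any successful check resets the failure streak.
            consecutive = 0
    return False


# Illustrative numbers only: a healthy primary answers quickly, while a
# primary busy with a heavy migration answers slowly but is still alive.
healthy = [0.05, 0.04, 0.06, 0.05]
under_migration_load = [0.05, 2.5, 3.1, 2.8]

print(should_fail_over(healthy, timeout_s=1.0, max_consecutive_failures=3))
print(should_fail_over(under_migration_load, timeout_s=1.0, max_consecutive_failures=3))
```

In this toy model the overloaded primary is treated exactly like a failed one, which mirrors the kind of unnecessary failover described above. Raising the threshold or timeout trades slower detection of real failures for fewer false positives.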

The second, more severe incident occurred on Tuesday, September 11th. After discovering replication issues on a standby node, an attempt to disable Pacemaker’s maintenance-mode resulted in a Pacemaker segfault and a cluster state partition. This led to two simultaneous master elections, with one electing a stale node as primary, causing 7 minutes of data drift.
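
The danger of two simultaneous elections can be illustrated with a generic majority-quorum rule. The sketch below is an assumption-laden illustration, not a description of how Pacemaker was configured in this incident (the summary does not say). `has_quorum` and `partitions_allowed_to_elect` are hypothetical helper names.

```python
from typing import List


def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition may elect a primary only if it sees a strict majority."""
    return reachable_nodes > cluster_size // 2


def partitions_allowed_to_elect(partition_sizes: List[int], cluster_size: int) -> List[int]:
    """Return the sizes of the partitions that hold a strict majority."""
    return [size for size in partition_sizes if has_quorum(size, cluster_size)]


# A 3-node cluster split 2 + 1: only the 2-node side may elect a primary.
print(partitions_allowed_to_elect([2, 1], cluster_size=3))

# A 3-node cluster split 1 + 1 + 1: no side may elect a primary.
print(partitions_allowed_to_elect([1, 1, 1], cluster_size=3))
```

Because two disjoint partitions cannot both contain more than half of the nodes, a strict-majority rule permits at most one election winner. A split-brain outcome like the one described above implies that, for whatever reason, no such guard was effective at the time.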

Customers saw general site unavailability and degraded performance. In addition, Tuesday’s data drift left MySQL and Redis inconsistent with each other: some dashboard events appeared on the wrong users’ dashboards, and 16 private repositories were briefly routed to the wrong owners, making them accessible to non-collaborators for 7 minutes.

Remediation includes reconfiguring Pacemaker so that failover of the primary database must be initiated manually, investigating ways to warm the InnoDB buffer pool so that a failover does not degrade performance, and conducting a full audit of the Pacemaker and Heartbeat stack to address the segfault. The status site also had availability issues during Tuesday’s outage; these were resolved by migrating it to a production database.
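
Why a cold buffer pool hurts, and why warming it helps, can be sketched with a toy cache. This is a plain LRU cache, which is only a rough stand-in for InnoDB's buffer pool; `LRUCache`, `hit_ratio`, the capacity, and the workload are hypothetical and chosen for illustration.

```python
from collections import OrderedDict
from typing import Iterable, List


class LRUCache:
    """Toy least-recently-used page cache (not InnoDB's real algorithm)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pages: "OrderedDict[int, bool]" = OrderedDict()

    def access(self, page_id: int) -> bool:
        """Touch a page; return True on a cache hit, False on a miss."""
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
            return True
        self.pages[page_id] = True
        if len(self.pages) > self.capacity:
            # Evict the least recently used page.
            self.pages.popitem(last=False)
        return False


def hit_ratio(cache: LRUCache, workload: Iterable[int]) -> float:
    """Fraction of accesses in the workload served from the cache."""
    results: List[bool] = [cache.access(page) for page in workload]
    return sum(results) / len(results)


hot_pages = list(range(100))  # pages the application reads constantly

cold = LRUCache(capacity=200)  # freshly promoted server, empty cache
warm = LRUCache(capacity=200)
for page in hot_pages:
    warm.access(page)  # warm-up pass before taking traffic

print(hit_ratio(cold, hot_pages))
print(hit_ratio(warm, hot_pages))
```

In this model the cold cache misses on every first access, so each read would go to disk, while the pre-warmed cache serves the same workload entirely from memory. Real warm-up strategies differ in how they choose which pages to preload.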

Keywords

github, mysql, pacemaker, percona, replication, failover, cluster, innodb