Postmortem Index

GitHub.com outage of December 2012

GitHub · GitHub.com

2012-12-22 – 2012-12-23 · automation · config-change · security

On Saturday, December 22nd, 2012, GitHub experienced a significant outage following scheduled maintenance. The maintenance was an in-service software upgrade of the aggregation switches. The network was unstable for the first 20-30 minutes, and the team decided to revert the upgrade if the problems persisted.

The core problem arose during forensic data gathering after 12:15 PST, when an agent on one aggregation switch was terminated. Through unlucky timing, this termination caused the peer switch to perform a disruptive MLAG failover rather than a stateful one. The link between the switches stayed up just long enough for the heartbeat to time out before any link failure was detected, which triggered the disruptive failover. The network froze for approximately 90 seconds.
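
The timing race described above can be sketched as a single decision function. This is an illustrative model only, not the vendor's MLAG implementation: the names `Failover` and `classify_failover` are hypothetical, the timeout value is assumed for the example, and a real switch tracks far more state than one comparison.

```python
from enum import Enum


class Failover(Enum):
    STATEFUL = "stateful"      # the non-disruptive path
    DISRUPTIVE = "disruptive"  # the path taken in this incident


def classify_failover(link_down_after_s: float, heartbeat_timeout_s: float) -> Failover:
    """Classify the peer's reaction after the agent on the other switch stops.

    link_down_after_s:   seconds from agent termination until the peer sees
                         the inter-switch link go down.
    heartbeat_timeout_s: how long the peer waits for a heartbeat before
                         giving up (hypothetical parameter).
    """
    # If the link drops before the heartbeat timer expires, the peer learns
    # about the failure from the link itself and can fail over statefully.
    if link_down_after_s < heartbeat_timeout_s:
        return Failover.STATEFUL
    # Otherwise the heartbeat is lost while the link still looks healthy,
    # which is the case the summary describes as triggering disruption.
    return Failover.DISRUPTIVE


HEARTBEAT_TIMEOUT_S = 5.0  # assumed for illustration, not taken from this summary

print(classify_failover(1.0, HEARTBEAT_TIMEOUT_S).value)  # stateful
print(classify_failover(5.5, HEARTBEAT_TIMEOUT_S).value)  # disruptive
```

The point of the sketch is that the outcome depends on which of two events the peer observes first, so a small shift in timing flips the result.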

The network freeze had a cascading impact on GitHub’s fileserver architecture, which uses Pacemaker, Heartbeat, and DRBD in active/passive pairs. Many fileservers, whose paired nodes are placed in different racks for redundancy, exceeded their heartbeat timeouts and issued STONITH (fencing) commands to their partner nodes. Because the network was compromised, some STONITH commands were not delivered. The result was a “split-brain” scenario in which both nodes in a pair believed they were active, and which ultimately left both nodes powered off.
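
The way one network freeze can produce different outcomes across fileserver pairs can be sketched with a toy model. It assumes, for illustration only, that each node fences its partner as soon as the freeze outlasts its heartbeat timeout, and that each STONITH command is independently delivered or lost. The function name, outcome labels, and the 10-second timeout are hypothetical and do not come from GitHub's configuration.

```python
def simulate_pair(
    freeze_s: float,
    heartbeat_timeout_s: float,
    a_fences_b_delivered: bool,
    b_fences_a_delivered: bool,
) -> str:
    """Return the end state of one active/passive pair after a network freeze."""
    # A freeze shorter than the timeout goes unnoticed by the cluster software.
    if freeze_s <= heartbeat_timeout_s:
        return "healthy"

    # Both nodes have lost contact, so each tries to fence the other.
    # A node stays up only if the command aimed at it was lost.
    a_up = not b_fences_a_delivered
    b_up = not a_fences_b_delivered

    if a_up and b_up:
        return "split-brain"      # both nodes believe they are active
    if not a_up and not b_up:
        return "both-off"         # each node shut the other down
    return "single-survivor"      # fencing worked as intended


FREEZE_S = 90.0              # approximate freeze duration from the summary
HEARTBEAT_TIMEOUT_S = 10.0   # assumed value for illustration

print(simulate_pair(FREEZE_S, HEARTBEAT_TIMEOUT_S, False, False))  # split-brain
print(simulate_pair(FREEZE_S, HEARTBEAT_TIMEOUT_S, True, True))    # both-off
print(simulate_pair(FREEZE_S, HEARTBEAT_TIMEOUT_S, True, False))   # single-survivor
```

In this model, only the case where exactly one command arrives ends cleanly, which is why an unreliable network is the worst moment for fencing to run.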

GitHub.com was placed into maintenance mode, and the entire operations team was paged for recovery. Recovery involved downgrading the aggregation switches and identifying from logs which node in each fileserver pair had previously been active, to ensure data consistency. Because the problem was so widespread, this took over five hours to complete.

To prevent similar incidents, GitHub plans to work with its network vendor to revisit MLAG failover timeouts, establish a functional duplicate of its production environment for testing, and place fileserver high-availability software into maintenance mode before any network changes. The team is also reviewing all high-availability configurations and collaborating with its hosting provider to reduce the fileservers’ reliance on network infrastructure.

Keywords

aggregation switches, mlag, network, fileserver, pacemaker, heartbeat, drbd, stonith, high availability, outage