GitHub network problems on November 30, 2012
On Friday, November 30th, 2012, GitHub experienced a day-long period of sporadic slow responses and intermittent errors, including 18 minutes of complete unavailability. A small percentage of repositories were additionally affected when a fileserver pair went offline as a side effect of the network problems. The incident was resolved by Saturday morning, December 1st, 2012.
The problems arose during the migration to a new aggregation network. First, a feature on the access switches that was intended to detect partial link failures had been misconfigured. When one redundant link was disconnected during troubleshooting, the feature erroneously disabled the remaining redundant links as well. This caused the 18 minutes of hard downtime.
The underlying and more prolonged issue was a bug in the new aggregation switches. These switches failed to learn a significant percentage of MAC addresses, forcing them to flood traffic across all ports. This flooding saturated the links between the access and aggregation switches, causing the widespread performance degradation throughout the day.
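The effect of a switch that cannot learn MAC addresses can be illustrated with a toy model. The following is a minimal sketch, not a model of the vendor's switches: the `Switch` class, the `learning_enabled` flag, the `copies_sent` helper, and the four-port layout are all assumptions made for illustration, and the bug is simplified to "no learning at all" rather than the partial failure described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Frame = Tuple[str, str, int]  # (source MAC, destination MAC, ingress port)


@dataclass
class Switch:
    """Toy layer-2 learning switch (hypothetical, for illustration only)."""
    ports: List[int]
    learning_enabled: bool = True  # False is a simplified stand-in for the bug
    mac_table: Dict[str, int] = field(default_factory=dict)

    def forward(self, src_mac: str, dst_mac: str, in_port: int) -> List[int]:
        """Return the ports on which a copy of the frame is sent."""
        if self.learning_enabled:
            # Normal behavior: remember which port the sender lives on.
            self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is not None:
            # Known destination: send a single copy (or none if it is local).
            return [] if out_port == in_port else [out_port]
        # Unknown destination: flood to every port except the ingress port.
        return [p for p in self.ports if p != in_port]


def copies_sent(switch: Switch, frames: List[Frame]) -> int:
    """Total number of frame copies the switch puts on its links."""
    return sum(len(switch.forward(*frame)) for frame in frames)


# Two hosts, A on port 1 and B on port 2, exchange four frames.
FRAMES: List[Frame] = [
    ("A", "B", 1),
    ("B", "A", 2),
    ("A", "B", 1),
    ("B", "A", 2),
]

healthy = copies_sent(Switch(ports=[1, 2, 3, 4]), FRAMES)
broken = copies_sent(Switch(ports=[1, 2, 3, 4], learning_enabled=False), FRAMES)
```

In this toy exchange the healthy switch floods only the first frame and sends 6 copies in total, while the switch that never learns floods every frame and sends 12. On a real network with many ports and hosts the multiplier is far larger, which is consistent with the link saturation described above.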
To mitigate the issue, core processes on the aggregation switches were restarted, which allowed the switches to learn MAC addresses again and restored performance. GitHub plans to deploy a permanent software update from the vendor. To prevent a recurrence, GitHub intends to invest in a duplicate network stack for a staging environment, enhance automated network monitoring, and conduct incident response exercises to improve troubleshooting and avoid tunnel vision.