GitHub DNS Outage on January 8, 2014
GitHub · DNS infrastructure
On Wednesday, January 8, 2014, GitHub experienced an outage of its DNS infrastructure. The incident began at 13:20 PST during a rollout of firewall and DNS server configuration changes aimed at improving DDoS defenses. Customers saw 42 minutes of full service downtime, followed by an additional 1 hour and 35 minutes of partial downtime affecting a subset of repositories.
The initial problem stemmed from a bug in the Puppet manifests. During the configuration rollout, only the authoritative name server was restarted; the caching name server was not. As a result, the caching name server kept requesting DNS records from an old IP address that was no longer serving, and queries timed out.
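The failure mode described above, where a configuration change reaches two daemons but only one is restarted, can be caught with a simple consistency check. The following is a minimal sketch under stated assumptions, not GitHub's tooling: the service names, file paths, and the `SERVICE_CONFIGS` mapping are hypothetical. In Puppet itself this relationship is normally expressed with the `notify` or `subscribe` metaparameters.

```python
# Hypothetical mapping from each service to the config files it reads.
# None of these names or paths come from the original incident report.
SERVICE_CONFIGS = {
    "authoritative-dns": {"/etc/dns/zones.conf", "/etc/dns/listen.conf"},
    "caching-dns": {"/etc/dns/forwarders.conf", "/etc/dns/listen.conf"},
}


def services_needing_restart(changed_files, service_configs=SERVICE_CONFIGS):
    """Return the sorted names of services that read any changed file."""
    changed = set(changed_files)
    return sorted(
        name for name, files in service_configs.items() if files & changed
    )


def missed_restarts(changed_files, restarted, service_configs=SERVICE_CONFIGS):
    """Return services whose config changed but which were not restarted."""
    restarted = set(restarted)
    needed = services_needing_restart(changed_files, service_configs)
    return [name for name in needed if name not in restarted]
```

A rollout script could call `missed_restarts` after applying changes and fail loudly if the result is non-empty, rather than leaving one daemon running with stale settings.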
During the incident response, a deployment of the DNS system was triggered. This deployment relied on an internal provisioning service that itself depended on a functioning DNS infrastructure. This circular dependency, combined with inadequate sanity checks on the output of the provisioning API call, led to the generation of corrupted DNS zone files, which caused many records to return NXDOMAIN.
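To illustrate the kind of sanity check that was missing, here is a hedged sketch of validating record data before a zone file is regenerated. It assumes the provisioning API's response has already been parsed into a dictionary of record names to values; the function names, the `required_names` list, and the 20% shrink threshold are all illustrative assumptions, not details from the incident report.

```python
def validate_zone_records(new_records, previous_records, required_names,
                          max_shrink_ratio=0.2):
    """Return a list of problems; an empty list means the data looks sane.

    new_records and previous_records map a record name to its value.
    """
    problems = []
    if not new_records:
        problems.append("provisioning API returned no records")
    missing = sorted(set(required_names) - set(new_records))
    if missing:
        problems.append("missing required records: " + ", ".join(missing))
    if previous_records:
        # Count names that exist today but would vanish from the new zone.
        lost = len(set(previous_records) - set(new_records))
        if lost / len(previous_records) > max_shrink_ratio:
            problems.append(
                f"{lost} of {len(previous_records)} existing records would disappear"
            )
    return problems


def choose_records(new_records, previous_records, required_names):
    """Keep serving the previous zone data if the new data fails validation."""
    problems = validate_zone_records(new_records, previous_records, required_names)
    if problems:
        return previous_records, problems
    return new_records, problems
```

The design choice here is to fail closed: when the upstream data looks implausible, the last known-good zone keeps being served and the problems are reported, instead of publishing a zone that answers NXDOMAIN for names that should exist.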
Even after DNS query timeouts were resolved and the missing DNS records restored, GitHub’s performance remained degraded. A subset of fileservers experienced memory exhaustion due to the large number of processes spawned during the DNS outage. This created back pressure that prevented connections to healthy fileservers. Engineers manually removed the misbehaving fileservers and performed checks on DRBD block devices. Full service was restored by 15:47 PST.
To prevent future occurrences, GitHub is decoupling its internal and external DNS infrastructure, reviewing configuration management code for service restart bugs, and addressing the circular dependency in the provisioning system. It is also reviewing fileserver management systems, implementing Linux cgroups for process accounting, and analyzing fileserver code for tight DNS dependencies to improve resilience against similar events.
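One way to surface circular dependencies like the one between the DNS deployment and the provisioning service is to record service dependencies as a graph and search it for cycles. The sketch below is a generic depth-first search, not a description of GitHub's remediation; the service names in the usage example are hypothetical.

```python
def find_cycle(dependencies):
    """Return one dependency cycle as a list of names, or None if acyclic.

    dependencies maps a service name to the services it depends on.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []  # services on the current depth-first search path

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in dependencies.get(node, ()):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in dependencies:
        if state.get(node, UNVISITED) == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical example: deploying DNS needs the provisioning API,
# which in turn needs working DNS.
graph = {
    "dns-deploy": ["provisioning-api"],
    "provisioning-api": ["dns"],
    "dns": ["dns-deploy"],
}
cycle = find_cycle(graph)
```

The search is recursive, which is fine for the small graphs typical of a service inventory; a very deep graph would call for an iterative version.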