{"UUID":"65c669af-6430-4bea-90e0-16d536798892","URL":"https://github.blog/news-insights/the-library/dns-outage-post-mortem/","ArchiveURL":"","Title":"GitHub DNS Outage on January 8, 2014","StartTime":"2014-01-08T21:20:00Z","EndTime":"2014-01-08T23:47:00Z","Categories":["automation","cascading-failure","config-change","security"],"Keywords":["dns","outage","puppet","configuration","nxdomain","fileservers","memory","circular dependency"],"Company":"GitHub","Product":"DNS infrastructure","SourcePublishedAt":"2014-01-19T02:50:54Z","SourceFetchedAt":"2026-05-04T19:51:42.321765Z","Summary":"A Puppet manifest bug restarted only the authoritative nameserver (not the caching one) after an IP change, causing query timeouts. The deploy run during incident response then rebuilt the zone file from an internal provisioning API call that itself depended on DNS, producing a corrupt zone with `NXDOMAIN` for many records. Memory exhaustion from spawned processes on the fileservers extended impact to 1h35m of partial downtime.","Description":"On Wednesday, January 8, 2014, GitHub experienced an outage of its DNS infrastructure. The incident began at 13:20 PST during a rollout of firewall and DNS server configuration changes aimed at improving DDoS defenses. This led to 42 minutes of full service downtime for customers, followed by an additional 1 hour and 35 minutes of partial downtime affecting a subset of repositories.\n\nThe initial problem stemmed from a bug in Puppet manifests. During the configuration rollout, only the authoritative name server was restarted, while the caching name server was not. This resulted in the caching name server attempting to request DNS records from an old, no longer serving IP address, causing query timeouts.\n\nDuring the incident response, a deployment of the DNS system was triggered. This deployment relied on an internal provisioning service that itself depended on a functioning DNS infrastructure. 
This circular dependency, combined with inadequate sanity checks on the API call's output, led to the generation of corrupted DNS zone files, causing many records to return NXDOMAIN.\n\nEven after DNS query timeouts were resolved and missing DNS records restored, GitHub's performance remained degraded. A subset of fileservers experienced memory exhaustion due to a significant number of processes spawned during the DNS outage. This created back pressure, preventing connections to healthy fileservers. Engineers manually removed misbehaving fileservers and performed checks on DRBD block devices. Full service was restored by 15:47 PST.\n\nTo prevent future occurrences, GitHub is decoupling internal and external DNS infrastructure, reviewing configuration management code for service restart bugs, and addressing the circular dependency in the provisioning system. They are also reviewing fileserver management systems, implementing Linux cgroups for process accounting, and analyzing fileserver code for tight DNS dependencies to improve resilience against similar events."}