GitHub DNS infrastructure failure and service degradation on October 11, 2024

GitHub · DNS

On October 11, 2024, starting at 05:59 UTC, GitHub experienced an incident that degraded performance across several services. The root cause was a database migration, after which the DNS infrastructure in one of GitHub’s sites began failing to resolve lookups.

Attempts to recover the database cascaded into further failures that affected the entire DNS system for that site. Customer impact began around 17:31 UTC: 4% of Copilot users saw degraded IDE code completions, 25% of Actions workflow users experienced delays of more than 5 minutes, and 100% of code search requests failed for approximately four hours.

An initial mitigation attempt at 18:05 UTC involved repointing the degraded DNS site to a different site. While this restored internal connectivity within the affected site, it inadvertently caused new issues with cross-site connectivity from healthy sites back to the degraded one.

At 20:52 UTC, a new remediation plan was initiated, focusing on deploying temporary DNS resolution capabilities directly into the degraded site. DNS resolution began to recover at 21:46 UTC, with full health restored by 22:16 UTC. Lingering issues specific to code search were fully resolved at 01:11 UTC on October 12, marking the end of the incident 19 hours and 12 minutes after the DNS failures began at 05:59 UTC. Measured from the start of customer impact at 17:31 UTC, the customer-facing portion lasted 7 hours and 40 minutes.
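The durations implied by the timestamps in this summary can be checked with a few lines of Python. This is only an illustrative sketch: the helper and variable names are invented here, and the timestamps are copied from the summary rather than from any GitHub tooling. It suggests that the 19 hours and 12 minutes figure is measured from the first DNS failures at 05:59 UTC, while the window from the start of customer impact at 17:31 UTC is shorter.

```python
from datetime import datetime, timezone


def utc(day, hhmm):
    """Build an October 2024 UTC timestamp from a day-of-month and 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2024, 10, day, hour, minute, tzinfo=timezone.utc)


def fmt(delta):
    """Format a timedelta as hours and minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return f"{total_minutes // 60}h {total_minutes % 60:02d}m"


# Timestamps as stated in the summary (names are illustrative).
dns_failure_start = utc(11, "05:59")
customer_impact_start = utc(11, "17:31")
second_plan_start = utc(11, "20:52")
dns_recovery_start = utc(11, "21:46")
code_search_resolved = utc(12, "01:11")

# Whole incident: first DNS failures until code search was fully resolved.
incident_duration = fmt(code_search_resolved - dns_failure_start)

# Customer-facing window: start of customer impact until final resolution.
customer_impact_duration = fmt(code_search_resolved - customer_impact_start)

# Time from starting the second remediation plan to the first DNS recovery.
plan_to_recovery = fmt(dns_recovery_start - second_plan_start)

print("Incident duration:       ", incident_duration)
print("Customer impact duration:", customer_impact_duration)
print("Second plan to recovery: ", plan_to_recovery)
```

Using timezone-aware `datetime` values keeps the subtraction correct across the midnight boundary between October 11 and October 12.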

GitHub is actively working to enhance the resiliency and automation processes surrounding this infrastructure to improve diagnosis and resolution times for future incidents. The team also continued to restore the original functionality within the site after public service was restored.

Keywords

github, dns, database migration, october 2024, copilot, actions, code search, outage