Postmortem Index


GitHub.com database configuration change causes 36-minute outage

GitHub · GitHub.com

On August 14, 2024, GitHub.com experienced an incident causing all services to be inaccessible for all users between 23:02 UTC and 23:38 UTC. The issue began when an erroneous configuration change was rolled out to GitHub.com databases at 22:59 UTC.

The root cause was this configuration change, which impaired the databases' ability to respond to health check pings from the routing service. As a result, the database hosts were marked as unhealthy, which made the production read-only database endpoint inaccessible.
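The failure mode described above can be illustrated with a minimal sketch. It is a toy model, not GitHub's routing service: `Host`, `responds_to_ping`, `route_read`, and `apply_config` are hypothetical names chosen for illustration. It assumes only what the summary states, namely that a host which stops answering health checks is treated as unhealthy, and that the read-only endpoint becomes unreachable once no healthy host remains.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    """A database host as seen by a hypothetical routing layer."""
    name: str
    responds_to_ping: bool = True  # stand-in for a real health probe result


def healthy_hosts(hosts: List[Host]) -> List[Host]:
    """Keep only the hosts that currently answer the health check."""
    return [h for h in hosts if h.responds_to_ping]


def route_read(hosts: List[Host]) -> str:
    """Pick a host for a read query, or fail if no host is healthy."""
    pool = healthy_hosts(hosts)
    if not pool:
        # Every host failed its health check, so the endpoint has no backends.
        raise ConnectionError("read-only endpoint has no healthy hosts")
    return pool[0].name


def apply_config(hosts: List[Host], breaks_health_check: bool) -> None:
    """Roll a config out to every host; a bad config silences health checks."""
    for h in hosts:
        h.responds_to_ping = not breaks_health_check


if __name__ == "__main__":
    fleet = [Host("db-1"), Host("db-2")]
    print(route_read(fleet))                        # reads are served normally
    apply_config(fleet, breaks_health_check=True)   # erroneous change reaches all hosts
    try:
        route_read(fleet)
    except ConnectionError as exc:
        print(f"outage: {exc}")
    apply_config(fleet, breaks_health_check=False)  # revert restores routing
    print(route_read(fleet))
```

In this model the databases themselves may still be running during the outage. Only their health signal is lost, which is consistent with the summary's note that reverting the change was enough to restore connectivity.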

As a direct consequence, the GitHub.com application could no longer reach critical data for read operations, resulting in widespread inaccessibility across the platform. Despite the severity of the outage, no data loss or corruption was reported.

The incident was mitigated by reverting the erroneous configuration change, which restored connectivity to the databases. Traffic resumed and services recovered to full health by 23:38 UTC. The incident was officially resolved at 00:30 UTC on August 15 after continued monitoring.
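The timestamps in this summary can be checked with a short calculation. The sketch below uses only the four times stated in the text; the variable and function names are illustrative.

```python
from datetime import datetime, timezone


def utc(day: int, hour: int, minute: int) -> datetime:
    """Build a UTC timestamp in August 2024."""
    return datetime(2024, 8, day, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


rollout = utc(14, 22, 59)       # erroneous configuration change rolled out
impact_start = utc(14, 23, 2)   # services become inaccessible
recovered = utc(14, 23, 38)     # services back to full health
resolved = utc(15, 0, 30)       # incident officially resolved

time_to_impact = minutes_between(rollout, impact_start)  # 3
outage_duration = minutes_between(impact_start, recovered)  # 36
monitoring_window = minutes_between(recovered, resolved)  # 52

print(time_to_impact, outage_duration, monitoring_window)
```

The 36-minute figure matches the outage duration in the title. The change took about 3 minutes to produce user-visible impact, and monitoring continued for 52 minutes after recovery before the incident was closed.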

To prevent recurrence, GitHub has implemented additional guardrails in its database change management process. The company is also treating several repair items as highest priority, including faster rollback functionality and improved resilience to dependency failures.

Keywords

github.com, database, configuration change, outage, read-only, health check, routing service, august 2024