Postmortem Index

GitHub November 2021 Availability Incident due to MySQL Schema Migration

GitHub · GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, Webhooks

On November 27, 2021, starting at 20:40 UTC and lasting 2 hours and 50 minutes, GitHub experienced an incident that significantly impacted the availability of core services. Affected services included GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks.

The incident stemmed from a novel failure mode during a schema migration on a large MySQL table. Specifically, during the final rename step of the migration, a significant portion of GitHub’s MySQL read replicas entered a semaphore deadlock. This caused the affected read replicas to enter a crash-recovery state.

The crash-recovery state of the deadlocked replicas led to an increased load on the remaining healthy read replicas. This cascading effect resulted in an insufficient number of active read replicas to handle production requests, thereby degrading the availability of core GitHub services for users. Write operations remained healthy, and no data corruption occurred.
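The capacity arithmetic behind this kind of cascade can be sketched in a few lines of Python. This is an illustrative model only: the function names and every number below are hypothetical and are not taken from GitHub's report. It assumes read traffic is spread evenly across the healthy replicas and that each replica has a fixed query capacity.

```python
import math


def min_replicas_needed(total_qps: float, capacity_qps: float) -> int:
    """Smallest number of replicas that can serve total_qps without overload."""
    return math.ceil(total_qps / capacity_qps)


def tolerable_failures(total_qps: float, replicas: int, capacity_qps: float) -> int:
    """How many replicas can drop out before the remaining ones are overloaded."""
    return max(0, replicas - min_replicas_needed(total_qps, capacity_qps))


def is_overloaded(total_qps: float, replicas: int, failed: int, capacity_qps: float) -> bool:
    """True when the healthy replicas cannot absorb the evenly spread read load."""
    healthy = replicas - failed
    if healthy <= 0:
        return True  # no replica left to serve reads
    return total_qps / healthy > capacity_qps


# Hypothetical cluster: 12 replicas, 10,000 QPS each, 90,000 QPS of reads.
print(tolerable_failures(90_000, 12, 10_000))  # replicas the pool can afford to lose
print(is_overloaded(90_000, 12, 5, 10_000))    # losing 5 leaves 7, each over capacity
```

In this model, once the number of failed replicas exceeds the headroom, every remaining replica is over capacity at the same time, which is why losing a significant portion of the pool at once degrades reads for everyone rather than for a fraction of requests.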

During mitigation, GitHub attempted to increase capacity by promoting healthy internal replicas to production, but this was not sufficient. To restore service, production traffic was proactively removed from broken replicas until they could successfully process the table rename and recover. Once recovered, these replicas were returned to production, restoring normal operations.

To prevent similar incidents and reduce recovery time, GitHub is prioritizing functional partitioning efforts, which will allow migrations to run in canary mode on a single shard before being applied more widely. Additionally, internal procedures are being updated to increase the over-provisioning of each cluster. Schema migrations have been paused while the specific failure scenario is investigated further and improvements to the migration tooling are identified.
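A canary-style rollout of the kind described above could look roughly like the following Python sketch. It is not GitHub's tooling: `canary_rollout`, `migrate`, and `is_healthy` are hypothetical names, and the two callbacks stand in for whatever actually applies a schema change and checks replica health on a shard.

```python
from typing import Callable, Iterable, List


def canary_rollout(
    shards: Iterable[str],
    migrate: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply `migrate` one shard at a time, stopping at the first unhealthy shard.

    Returns the shards that were migrated and then verified healthy.
    """
    verified: List[str] = []
    for shard in shards:
        migrate(shard)
        if not is_healthy(shard):
            # Halt the rollout so the damage stays confined to this one shard.
            break
        verified.append(shard)
    return verified


# Hypothetical usage: the second shard fails its health check after migration.
attempted: List[str] = []
result = canary_rollout(
    ["shard-a", "shard-b", "shard-c"],
    migrate=attempted.append,
    is_healthy=lambda shard: shard != "shard-b",
)
print(result)     # shards migrated and verified
print(attempted)  # shards the migration touched before halting
```

The point of the design is that a failure mode like the one in this incident would surface on the first shard, with the remaining shards untouched and still able to serve traffic.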

Keywords

github, mysql, schema migration, read replicas, deadlock, availability, november 2021