GitHub background job system degraded availability October 2020

On October 9, 2020, starting at 21:30 UTC, GitHub experienced an incident that lasted 2 hours and 32 minutes. It degraded availability for several services, including issues, pull requests, webhooks, GitHub Actions, and GitHub Pages.

The incident originated during routine reprovisioning of ZooKeeper nodes. New hosts were introduced too rapidly, which resulted in the election of a second ZooKeeper leader. This effectively created two logically distinct ZooKeeper clusters where only one should have existed.
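A guard against this failure mode is to confirm that the ensemble reports exactly one leader before adding the next host. The following is a minimal sketch, not GitHub's actual tooling. It assumes each node's response to ZooKeeper's `srvr` four-letter command has already been collected as text (that response includes a `Mode:` line such as `leader` or `follower`). The host names and the functions `parse_mode`, `count_leaders`, and `safe_to_add_next_host` are hypothetical.

```python
from typing import Dict, Optional


def parse_mode(srvr_output: str) -> Optional[str]:
    """Return the value of the 'Mode:' line in `srvr` output, or None if absent."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def count_leaders(outputs: Dict[str, str]) -> int:
    """Count how many hosts report themselves as leader."""
    return sum(1 for text in outputs.values() if parse_mode(text) == "leader")


def safe_to_add_next_host(outputs: Dict[str, str]) -> bool:
    """Allow the next provisioning step only if every host answered
    and exactly one of them is the leader."""
    modes = [parse_mode(text) for text in outputs.values()]
    if any(mode is None for mode in modes):
        return False  # a host gave no usable answer; do not proceed
    return modes.count("leader") == 1
```

A provisioning script could call `safe_to_add_next_host` between steps and halt when it returns `False`, which would turn the "too rapidly" condition into an explicit gate rather than a matter of operator timing.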

While the ZooKeeper hosts were in this inconsistent state, a single broker in the Kafka cluster that powers GitHub’s internal background job system connected to the newly formed second ZooKeeper cluster and elected itself as the Kafka controller. The result was two distinct Kafka clusters serving conflicting state information to clients.
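A common signal for a duplicate controller is Kafka's `ActiveControllerCount` broker metric, which is 1 on the broker acting as controller and 0 on the others, so a healthy cluster sums to exactly 1. The sketch below assumes those per-broker values have already been scraped into a dictionary. The broker names and the functions `controller_brokers` and `controller_state` are hypothetical, and the source does not say how GitHub detected the condition.

```python
from typing import Dict, List


def controller_brokers(active_controller_count: Dict[str, int]) -> List[str]:
    """Return the brokers that currently claim to be the controller."""
    return sorted(
        broker for broker, value in active_controller_count.items() if value == 1
    )


def controller_state(active_controller_count: Dict[str, int]) -> str:
    """Classify the cluster by how many brokers claim the controller role."""
    claimants = len(controller_brokers(active_controller_count))
    if claimants == 1:
        return "ok"
    if claimants == 0:
        return "no_controller"
    return "split_brain"  # more than one controller, as in this incident
```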

The conflicting Kafka state caused approximately 10% of write requests to the background job service to fail, which led to a backlog of jobs. No background jobs were lost, however, thanks to the retry behavior implemented in clients and the presence of redundant queueing systems.
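The combination of client retries and a redundant queue can be illustrated with a short sketch. This is an assumption about the general pattern, not GitHub's client code: `enqueue_with_retry` and its parameters are hypothetical, and each backend is modeled as a callable that raises an exception when a write fails.

```python
import time
from typing import Callable, Sequence


def enqueue_with_retry(
    job: dict,
    backends: Sequence[Callable[[dict], None]],
    attempts_per_backend: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Try each queue backend in order, retrying with exponential backoff.

    Returns the index of the backend that accepted the job and raises
    RuntimeError if every backend rejected every attempt.
    """
    last_error = None
    for index, send in enumerate(backends):
        for attempt in range(attempts_per_backend):
            try:
                send(job)
                return index
            except Exception as exc:  # broad on purpose: any write failure
                last_error = exc
                if attempt < attempts_per_backend - 1:
                    sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise RuntimeError("all queue backends rejected the job") from last_error
```

With a primary and a secondary backend passed in that order, a write that keeps failing against the primary falls through to the secondary instead of being dropped, which matches the outcome described above: delayed jobs, but no lost ones.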

To prevent similar occurrences, GitHub updated its ZooKeeper provisioning checklist. Additionally, the company plans to introduce automation for both ZooKeeper and Kafka cluster maintenance.
