{"UUID":"27ad61bb-5843-4164-bb8f-3c8def55d77c","URL":"https://github.blog/news-insights/company-news/github-availability-report-october-2020/","ArchiveURL":"","Title":"GitHub background job system degraded availability October 2020","StartTime":"2020-10-09T21:30:00Z","EndTime":"2020-10-10T00:02:00Z","Categories":["automation","config-change","security"],"Keywords":["github","zookeeper","kafka","background jobs","outage","october 2020","availability","incident"],"Company":"GitHub","Product":"background job system","SourcePublishedAt":"2020-11-04T17:30:43Z","SourceFetchedAt":"2026-05-04T19:51:21.694001Z","Summary":"During routine ZooKeeper reprovisioning, replacement hosts were added too quickly and elected a second leader, creating two distinct ZooKeeper clusters. A Kafka broker for the background-job system connected to the new cluster and elected itself controller, so two Kafka clusters served conflicting state to clients; ~10% of background-job writes failed over 2h32m.","Description":"On October 9, 2020, starting at 21:30 UTC, GitHub experienced an incident that lasted for two hours and 32 minutes. This led to a degraded state of availability for several services, including issues, pull requests, webhooks, GitHub Actions, and GitHub Pages.\n\nThe root cause was traced to routine reprovisioning of ZooKeeper nodes. New hosts were introduced too rapidly, which resulted in the election of a second ZooKeeper leader. This effectively created two logically distinct ZooKeeper clusters where only one should have existed.\n\nWhile the ZooKeeper hosts were in this inconsistent state, a single Kafka broker, which powers GitHub's internal background job system, connected to the newly formed second ZooKeeper cluster and elected itself as the Kafka controller. This resulted in two distinct Kafka clusters serving conflicting state information to clients.\n\nThe conflicting Kafka state caused approximately 10% of write requests to the background job service to fail. 
This led to a backlog of jobs; however, thanks to the retry behavior implemented in clients and the presence of redundant queueing systems, no background jobs were lost.\n\nTo prevent similar occurrences, GitHub updated its ZooKeeper provisioning checklist. Additionally, the company plans to introduce automation for both ZooKeeper and Kafka cluster maintenance."}