{"UUID":"e696c413-9af6-4e51-b073-51edbdb1ed2a","URL":"https://github.blog/news-insights/company-news/addressing-githubs-recent-availability-issues-2/","ArchiveURL":"","Title":"GitHub availability incidents in February and March 2026","StartTime":"0001-01-01T00:00:00Z","EndTime":"0001-01-01T00:00:00Z","Categories":["automation","cascading-failure","cloud","config-change","security"],"Keywords":["github","availability","performance","database","authentication","actions","redis","cache"],"Company":"GitHub","Product":"GitHub platform, GitHub Actions, authentication, user management, Redis","SourcePublishedAt":"2026-03-11T21:41:51Z","SourceFetchedAt":"2026-05-04T19:52:42.241712Z","Summary":"Two popular client apps had been quietly increasing read traffic 10×, then a Saturday change shortened a user-settings cache TTL from 12h to 2h. On Monday's peak, the combined write amplification from cache rewrites plus read load overwhelmed the core auth/user-management database cluster, cascading through every service that depends on it (github.com, API, Actions, Git over HTTPS, Copilot, etc.).","Description":"GitHub experienced significant availability and performance issues across its platform in early 2026, with three major incidents occurring on February 2, February 9, and March 5. These issues arose during a period of rapid usage growth, exposing scaling limitations and architectural coupling that allowed localized problems to cascade across critical services.\n\nThe February 9 incident was a high-impact event caused by an overloaded core database cluster supporting authentication and user management. This was triggered by a tenfold increase in read traffic from two popular client applications, combined with a cache TTL reduction from 12 to 2 hours for user settings, leading to a surge in write volume. The combined load overwhelmed the database, impacting all dependent services.\n\nOn February 2, GitHub Actions hosted runners experienced an outage due to a telemetry gap that misapplied security policies to internal storage accounts, blocking VM metadata access across all regions. On March 5, another Actions incident occurred when a Redis cluster failover exposed a latent configuration issue, leaving the cluster without a writable primary and requiring manual intervention.\n\nCommon contributing factors across these incidents included insufficient isolation between critical components, inadequate safeguards for load shedding, and gaps in end-to-end validation and monitoring. The impact was broad, affecting multiple GitHub services and extending incident durations.\n\nIn response, GitHub is prioritizing near-term stabilization by redesigning its user cache system for higher volume, expediting capacity planning, and auditing critical infrastructure. Efforts are also focused on further isolating key dependencies like GitHub Actions and Git to prevent cascading failures and protecting downstream components during traffic spikes.\n\nLonger-term investments include migrating infrastructure to Azure for improved vertical and horizontal scaling and global resiliency, aiming to serve 50% of traffic from Azure by July. Additionally, GitHub is breaking down its monolith into more isolated services and data domains to enable independent scaling, isolated change management, and localized load shedding decisions."}