{"UUID":"b1b44424-63b6-4e26-8654-b1bfb77ac481","URL":"https://github.blog/2022-03-23-an-update-on-recent-service-disruptions/","ArchiveURL":"","Title":"GitHub mysql1 cluster repeated service disruptions (March 2022)","StartTime":"2022-03-16T14:09:00Z","EndTime":"2022-03-23T17:40:00Z","Categories":["automation","config-change","security"],"Keywords":["github","mysql","database","cluster","failover","resource contention","webhooks","actions"],"Company":"GitHub","Product":"mysql1 cluster","SourcePublishedAt":"2022-03-23T20:39:01Z","SourceFetchedAt":"2026-05-04T19:52:24.677606Z","Summary":"Peak-hour load on the shared `mysql1` cluster repeatedly exhausted ProxySQL connections over a week, requiring four primary failovers plus an emergency index and proactive throttling of webhooks and Actions. Memory profiling turned on to debug performance later triggered another connection failure, requiring yet another failover.","Description":"Over several weeks in March 2022, GitHub experienced multiple service disruptions stemming from resource contention within its shared `mysql1` database cluster. These incidents primarily manifested as the database proxying technology exhausting its maximum connection limit, leading to widespread service degradation. Affected services included Git operations, webhooks, pull requests, API requests, issues, GitHub Packages, GitHub Codespaces, GitHub Actions, and GitHub Pages.\n\nThe underlying cause was identified as resource contention in the `mysql1` cluster during periods of peak load, exacerbated by poor query performance under specific circumstances. The cluster, configured with a classic primary-replica setup, struggled to handle heavy read/write traffic when connection limits were reached. 
A subsequent attempt to analyze performance by enabling memory profiling on the database proxy inadvertently triggered another connection failure.\n\nCustomers experienced significant productivity impacts because affected GitHub services could not perform write operations. This included disruptions to core functionality such as Git pushes, webhook processing, and interactions with GitHub's API, packages, and CI/CD services.\n\nInitial recovery involved failing over to healthy replicas. After the second incident, an emergency index was added to address the main performance problem. Further incidents led to proactive measures such as throttling webhook traffic to reduce load. GitHub is now auditing load patterns, implementing performance fixes, moving traffic to other databases, and reviewing change management procedures to prevent future occurrences."}