GitHub February 2020 mysql1 service disruptions

GitHub · mysql1 database cluster

2020-02-19 – 2020-02-27 · automation, config-change, security

GitHub experienced multiple service interruptions in late February 2020, totaling 8 hours and 14 minutes across four distinct events. These incidents led to degraded service due to unexpected variations in database load and an unintended configuration issue within the mysql1 database cluster.

The incidents occurred on February 19, 20, 25, and 27. The first was triggered by an analytics query inadvertently hitting the master database. The second was a planned master promotion that recreated similar load issues. The third and fourth incidents involved active database connections exceeding critical thresholds, causing stalled writes and slow performance on the mysql1 cluster.

A primary root cause was ProxySQL’s file descriptor limit being silently capped at 65,536, far below the intended 1,073,741,824. The intended value exceeded the system-level kernel limit of 1,048,576, so it did not take effect and the lower cap applied instead. This left ProxySQL unable to handle high load. A race condition during remediation also delayed raising the file descriptor limit.
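The mismatch described above can be caught by comparing the intended limit with both the kernel maximum and the limit a process actually has. The following is a minimal sketch, not GitHub's tooling: the function names are hypothetical, and reading `/proc/sys/fs/nr_open` assumes a Linux host and that this is the kernel limit the report refers to.

```python
from __future__ import annotations

from pathlib import Path


def diagnose_fd_limit(intended: int, kernel_max: int, effective: int) -> list[str]:
    """Return findings about a file descriptor limit (hypothetical helper)."""
    findings = []
    # A requested limit above the kernel maximum cannot be applied as written.
    if intended > kernel_max:
        findings.append(
            f"intended limit {intended} exceeds kernel maximum {kernel_max}"
        )
    # The process may be running with far fewer descriptors than planned.
    if effective < intended:
        findings.append(
            f"effective limit {effective} is below intended limit {intended}"
        )
    return findings


def read_local_limits() -> tuple[int, int]:
    """Read (soft RLIMIT_NOFILE of this process, kernel per-process maximum).

    Linux only: assumes /proc/sys/fs/nr_open is readable.
    """
    import resource  # Unix-only standard library module

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    kernel_max = int(Path("/proc/sys/fs/nr_open").read_text().strip())
    return soft, kernel_max


if __name__ == "__main__":
    # Values taken from the incident summary above.
    for finding in diagnose_fd_limit(1_073_741_824, 1_048_576, 65_536):
        print(finding)
```

Run with the numbers from the summary, the check reports both problems: the intended value is above the kernel maximum, and the effective limit is below the intended one. A check of this kind could run at service start or in monitoring, so that a silently lowered limit surfaces before load exposes it.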

In response, GitHub improved observability and performance monitoring for ProxySQL and temporarily froze production deployments to stabilize the system. Immediate remediation included a significant data partitioning effort for the “abilities” table, which reduced load on the mysql1 cluster master by 20% and queries per second by 15%.

Long-term initiatives include auditing and lowering reads from master databases, expanding the use of feature flags for faster recovery, completing in-flight functional partitioning (expected to reduce writes by 60% and storage by 70%), refining dashboards for better deploy safety, and investing in additional data partitioning and sharding for horizontal scalability.

Keywords

mysql, database, proxysql, file descriptors, partitioning, sharding, github, february 2020