CircleCI DB performance issue
CircleCI · database
On July 7, 2015, CircleCI experienced a severe, lengthy outage in which its build queue came to a complete standstill. The outage began when GitHub push hooks resumed with unprecedented intensity after a GitHub outage, producing a sustained surge in new build requests.
The rapid insertion of these new builds caused severe performance degradation in CircleCI’s main database, which underpins the complex build queue system. The database quickly became unresponsive, going from normal operation to fully locked within minutes due to resource contention and slow, timing-out queries.
Customers faced a complete halt in their build processes: the queue was dequeuing only one build per minute instead of many per second. Many queued builds sat long enough to lose their value, and the platform became largely inaccessible.
Initial attempts to salvage the queue and throttle incoming requests were unsuccessful. Engineers then used Clojure's live patching capabilities to disable problematic queries and modify code in production. After that, they cleared the “usage queue” and the “run queue” using scripts, a process that took over an hour.
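To illustrate what throttling incoming build requests can look like in general, here is a minimal token-bucket sketch. It is not CircleCI's implementation, and the summary does not describe how their throttling was attempted. The class name `TokenBucket`, the method `try_admit`, and the `rate` and `capacity` values are all hypothetical, chosen only for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit at most `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate            # tokens refilled per second (hypothetical value)
        self.capacity = capacity    # maximum burst size (hypothetical value)
        self.clock = clock          # injectable clock, so the behavior can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()         # time of the most recent refill

    def try_admit(self) -> bool:
        """Return True if one request may proceed now, False if it should be deferred."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    # Simulate a burst of 20 hooks arriving at once against a small bucket.
    bucket = TokenBucket(rate=2.0, capacity=5.0)
    admitted = sum(bucket.try_admit() for _ in range(20))
    print(f"admitted {admitted} of 20 requests in the burst")
```

In a design like this, a request that is not admitted would be deferred or rejected before it reaches the database, so a surge of hooks turns into a bounded insert rate rather than unbounded contention. Whether that would have helped in this incident is not something the summary establishes.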
After regaining control of the database and queue, CircleCI switched to new, upgraded database hardware. This allowed the team to restore service, scale capacity, and eventually clean up the temporary fixes, bringing the system back to normal operation by July 8, 2015.