Postmortem Index

CircleCI Linux build queue backing up October 2015

CircleCI · linux build queue

2015-10-14 – 2015-10-15 config-change

On Wednesday, October 14, 2015, at around 20:17 UTC, CircleCI's Linux build queue began backing up. The operations team found that its standard demand-management tools had no effect: capacity was available but was not being used for builds. The queue kept growing, and the incident was escalated to engineering.

Initial investigation focused on elevated database load, made worse by peak Wednesday-afternoon traffic. No recent change correlated directly with the load, but suspicious changes were rolled back anyway. By then the database was already saturated with queued operations. A failover to a different primary database at 00:11 UTC on Thursday brought temporary relief, but the queued operations quickly returned.

By Thursday 07:00 UTC, runnable builds were being processed, but a large number of builds were blocked from reaching the runnable state. Attempts to promote these builds through the normal code path flooded the next queue. In addition, the build scheduler’s throttling mechanisms, designed to back off during failures, were misfiring under normal conditions and holding throughput below what was needed. The result was a 17-hour backlog of builds.
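To make the throttling behavior concrete, here is a minimal sketch of a throttle that backs off only after failures and adds no delay under normal conditions. It is an illustration, not CircleCI's scheduler code: the class name, method names, and delay values are all hypothetical.

```python
class FailureBackoffThrottle:
    """Hypothetical throttle: the delay grows only with consecutive failures."""

    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.base_delay = base_delay      # seconds to wait after the first failure (assumed value)
        self.max_delay = max_delay        # upper bound on the delay (assumed value)
        self.consecutive_failures = 0

    def record_success(self):
        # Any success clears the backoff, so healthy traffic is never slowed.
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1

    def current_delay(self):
        # No failures means no throttling at all.
        if self.consecutive_failures == 0:
            return 0.0
        # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
        delay = self.base_delay * 2 ** (self.consecutive_failures - 1)
        return min(delay, self.max_delay)
```

The property that matters here is that `current_delay()` returns zero whenever there are no recent failures. A throttle that engages without a failure signal, as described above, would limit throughput even when the system is healthy.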

CircleCI engineers manually forced builds through and rapidly developed new tools to automate batch processing of the backlog. They also added new metrics and updated the throttling code to improve behavior. By Thursday 14:20 UTC, the last of the leftover builds were processed, and the system returned to handling new inbound traffic normally.
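One way to drain a backlog in batches without flooding the next queue is to cap each batch by the room left downstream. The sketch below assumes two in-memory queues and a hypothetical `promote_in_batches` helper. The report does not describe CircleCI's actual tools, so every name and parameter here is illustrative.

```python
from collections import deque


def promote_in_batches(blocked, runnable, max_runnable_depth, batch_size):
    """Move builds from `blocked` to `runnable` without overfilling `runnable`.

    Returns the number of builds promoted in this call (possibly zero).
    """
    # Room left in the downstream queue before it reaches its depth limit.
    room = max(0, max_runnable_depth - len(runnable))
    # Promote the smallest of: the batch size, the room left, the builds waiting.
    count = min(batch_size, room, len(blocked))
    for _ in range(count):
        runnable.append(blocked.popleft())  # oldest blocked build goes first
    return count


# Usage: call repeatedly (for example, on a timer) while workers drain `runnable`.
blocked = deque(f"build-{i}" for i in range(10))
runnable = deque()
promoted = promote_in_batches(blocked, runnable, max_runnable_depth=4, batch_size=3)
```

Because each call checks the downstream depth first, the backlog drains only as fast as the runnable queue empties, rather than all at once.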

The incident exposed a lack of ready-made tools for managing this kind of situation, which forced on-the-fly development and left builds in inconsistent states. CircleCI committed to investing in better tools for rapid incident response. Architecturally, the incident reinforced ongoing efforts to reduce database strain as critical for future reliability: a central scheduler, migration of data to separate databases, and custom-tuned deployments for different data types.

Keywords

linux, build queue, database, ci/cd, throttling, backlog, circleci