Delay in starting Docker jobs; Machine and remote Docker environments blocked
CircleCI · job execution
On May 21, 2021, starting at 13:30 UTC, CircleCI customers experienced significant delays or complete blocking of jobs across Docker, Machine executor (including Windows, Mac, Arm, and GPU), and remote Docker environments. This widespread disruption to job execution continued until May 22, 2021, at 01:18 UTC.
The incident traced back to a routine RabbitMQ upgrade on May 20, which moved the system from version 3.8.9 to 3.8.16. The upgrade introduced a 15-minute consumer acknowledgment timeout, which the changelog incorrectly documented as affecting only quorum queues. It was later discovered to affect all queue types.
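To make the timeout behavior concrete, here is a minimal sketch of the check a broker-side acknowledgment timeout implies: a delivery that stays unacknowledged longer than the limit is considered overdue. This is an illustrative model, not RabbitMQ's implementation. The function names and the `unacked` mapping are hypothetical; only the 15-minute value comes from the incident description.

```python
from datetime import datetime, timedelta, timezone

# 15 minutes, as described in the incident summary.
ACK_TIMEOUT = timedelta(minutes=15)


def ack_timeout_exceeded(delivered_at, now, timeout=ACK_TIMEOUT):
    """Return True if a delivery has gone unacknowledged for longer than `timeout`."""
    return now - delivered_at > timeout


def overdue_deliveries(unacked, now, timeout=ACK_TIMEOUT):
    """Return the delivery tags whose acknowledgment is overdue, in ascending order.

    `unacked` is a hypothetical mapping of delivery tag -> delivery time.
    """
    return sorted(
        tag
        for tag, delivered_at in unacked.items()
        if ack_timeout_exceeded(delivered_at, now, timeout)
    )


if __name__ == "__main__":
    now = datetime(2021, 5, 21, 13, 30, tzinfo=timezone.utc)
    unacked = {
        1: now - timedelta(minutes=20),  # held longer than the limit
        2: now - timedelta(minutes=5),   # still within the limit
    }
    print(overdue_deliveries(unacked, now))
```

A consumer that holds messages for a long time before acknowledging them, as a slow cleanup worker might, is the kind of consumer such a timeout would catch.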
Because of this misdocumented change, consumers on the VM destroyer service queue were disconnected after 15 minutes, and the queue eventually had zero active consumers. As a result, old virtual machines were not deleted; they accumulated and prevented the creation of new VMs. CircleCI’s systems then hit CPU quotas within Google Cloud Platform (GCP), which further limited the ability to provision new jobs.
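A queue with a growing backlog and no consumers is the failure signature described above. The following sketch flags such queues from a list of queue statistics. It assumes each entry carries `name`, `consumers`, and `messages` fields, as queue objects returned by the RabbitMQ management HTTP API do; the queue names and the `min_backlog` threshold are hypothetical.

```python
def stalled_queues(queues, min_backlog=1):
    """Return the names of queues that hold messages but have no consumers.

    `queues` is a list of dicts with `name`, `consumers`, and `messages` keys.
    Missing counters are treated as zero.
    """
    return [
        q["name"]
        for q in queues
        if q.get("consumers", 0) == 0 and q.get("messages", 0) >= min_backlog
    ]


if __name__ == "__main__":
    # Hypothetical snapshot; the queue names are illustrative only.
    snapshot = [
        {"name": "vm-destroyer", "consumers": 0, "messages": 1200},
        {"name": "vm-creator", "consumers": 4, "messages": 15},
        {"name": "idle-queue", "consumers": 0, "messages": 0},
    ]
    print(stalled_queues(snapshot))
```

An alert driven by a check like this would fire on the consumer loss itself rather than on its downstream effects, such as quota exhaustion.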
Remediation involved manually blocking Linux machine jobs, requesting increased GCP quotas, and manually deleting old VMs. The team also identified database issues, including RDS burst balance exhaustion, and addressed them by scaling up the database instance and shedding load: blocking remote-docker jobs and scaling down provisioning services.
The system was restored gradually by bringing components back online one at a time, and the incident was officially resolved at 01:18 UTC on May 22. To prevent recurrence, CircleCI plans to make VM provisioning more robust and independent, improve observability around RabbitMQ, batch GCP API requests, and refine incident response protocols. RabbitMQ has also been informed of the documentation discrepancy.
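As a rough illustration of request batching, the sketch below groups VM identifiers into fixed-size chunks and hands each chunk to a single call. The summary does not describe how CircleCI batches its GCP requests, so this is only one plausible shape: `delete_batch` is a hypothetical callable standing in for whatever client call performs the deletion, and the batch size of 50 is an arbitrary assumption.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_vms_in_batches(vm_ids, delete_batch, size=50):
    """Call `delete_batch` once per chunk of VM ids and return the number of calls.

    `delete_batch` is a hypothetical callable that accepts a list of VM ids.
    """
    calls = 0
    for chunk in chunked(list(vm_ids), size):
        delete_batch(chunk)
        calls += 1
    return calls


if __name__ == "__main__":
    # Record the batches instead of calling a real API.
    seen = []
    vm_ids = [f"vm-{i}" for i in range(120)]
    calls = delete_vms_in_batches(vm_ids, seen.append, size=50)
    print(calls, [len(batch) for batch in seen])
```

Fewer, larger requests reduce the number of API calls needed to clear a backlog, which matters most when the backlog is already large.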