{"UUID":"22e646c4-198c-4985-bb8e-e64e8a13d68e","URL":"https://discuss.circleci.com/t/postmortem-may-21-2021-delay-in-starting-docker-jobs-machine-remote-docker-environments-blocked/40274","ArchiveURL":"","Title":"Delay in starting Docker Jobs. Machine \u0026 remote Docker environments blocked","StartTime":"2021-05-21T13:30:00Z","EndTime":"2021-05-22T01:18:00Z","Categories":["automation","cloud","config-change"],"Keywords":["docker","machine executor","remote docker","rabbitmq","gcp","vm provisioning","job execution","timeout"],"Company":"CircleCI","Product":"job execution","SourcePublishedAt":"2021-06-03T14:11:05Z","SourceFetchedAt":"2026-05-04T19:51:05.703299Z","Summary":"A routine RabbitMQ upgrade from 3.8.9 to 3.8.16 introduced a 15-minute consumer ack timeout that the changelog described as scoped to quorum queues but actually applied to all queue types. Consumers on the VM-destroyer queue gradually got their channels closed until the queue had zero consumers, so VMs in one region stopped being deleted; this eventually backed up VM creation and blocked Docker, machine, Windows, Mac, Arm, GPU, and remote-Docker jobs for ~12 hours.","Description":"On May 21, 2021, starting at 13:30 UTC, CircleCI customers experienced significant delays or complete blocking of jobs across Docker, Machine executor (including Windows, Mac, Arm, and GPU), and remote Docker environments. This widespread disruption to job execution continued until May 22, 2021, at 01:18 UTC.\n\nThe incident's root cause stemmed from a routine RabbitMQ upgrade on May 20, which updated the system from version 3.8.9 to 3.8.16. A critical change introduced in this upgrade, a 15-minute consumer acknowledgment timeout, was incorrectly documented in the changelog as only affecting quorum queues. 
However, it was later discovered to impact all queue types.\n\nBecause of this misdocumented change, consumers on the VM destroyer service queue had their channels closed by the 15-minute acknowledgment timeout, one after another, until the queue had zero active consumers. Consequently, old virtual machines in one region were no longer being deleted, causing a buildup of resources and preventing the creation of new VMs. This bottleneck subsequently led to CircleCI's systems hitting CPU quotas within Google Cloud Platform (GCP), further limiting the ability to provision VMs for new jobs.\n\nRemediation involved several parallel efforts: manually blocking Linux machine jobs, requesting increased GCP quotas, and manually deleting old VMs. The team also identified and addressed database issues, such as RDS burst balance exhaustion, by scaling up the database instance and shedding load (blocking remote-docker jobs and scaling down provisioning services).\n\nThe system was gradually restored by bringing components back online one at a time, and the incident was officially resolved by 01:18 UTC on May 22. To prevent recurrence, CircleCI plans to make VM provisioning more robust and independent, improve observability around RabbitMQ, batch GCP API requests, and refine incident response protocols. RabbitMQ has also been informed about the documentation discrepancy."}