Postmortem Index

CircleCI jobs stuck in "not running" state on November 8, 2021

CircleCI · job distribution service

On Monday, November 8, 2021, customer jobs on CircleCI were blocked in a “not running” state from approximately 18:50 UTC to 19:43 UTC, with elevated queue times persisting until 20:19 UTC. This incident affected all executors, with Docker executors recovering first, followed by machine executors.

The root cause was a database schema change rolled out at 18:50 UTC to the PostgreSQL database used by the job distribution service. This change was not backwards compatible; incoming work used the new data type, while existing work used the old type, causing the distribution service’s strict schema validation to fail. A rollback was immediately performed, but this rendered data written between the initial deploy and the rollback unreadable, leading to continued distribution failures.
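The failure mode above can be sketched in a few lines. This is only an illustration under stated assumptions: the report does not name the changed field or its types, so the `priority` field, the two schema dictionaries, and the helper functions below are all hypothetical. The sketch shows why neither the new build nor the rollback could read every record once both shapes were present in the database.

```python
# Hypothetical record shapes; the real field and its types are not given in the report.
OLD_SCHEMA = {"job_id": str, "priority": int}  # shape before the schema change
NEW_SCHEMA = {"job_id": str, "priority": str}  # shape after the schema change


def validate_strict(record: dict, schema: dict) -> None:
    """Raise if a field is missing, unexpected, or has the wrong type."""
    if record.keys() != schema.keys():
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for name, expected in schema.items():
        if not isinstance(record[name], expected):
            raise TypeError(f"{name}: expected {expected.__name__}")


def count_readable(records: list, schema: dict) -> int:
    """Count the records that pass strict validation against one schema."""
    readable = 0
    for record in records:
        try:
            validate_strict(record, schema)
            readable += 1
        except (TypeError, ValueError):
            pass
    return readable


# Work queued before the deploy uses the old type.
existing_work = [{"job_id": "a", "priority": 1}, {"job_id": "b", "priority": 2}]
# Work written between the deploy and the rollback uses the new type.
written_during_deploy = [{"job_id": "c", "priority": "high"}]
queue = existing_work + written_during_deploy

after_deploy = count_readable(queue, NEW_SCHEMA)    # old records fail validation
after_rollback = count_readable(queue, OLD_SCHEMA)  # new records fail validation
```

With mixed data in the queue, each schema version rejects the records written under the other one, so rolling back moves the failures rather than removing them.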

To mitigate the issue, a modified build was quickly developed and manually deployed, allowing the distribution system to ignore the problematic field. Concurrently, the primary compute cluster was manually scaled to handle the anticipated influx of work. However, new nodes took longer than expected to join the fleet, and machine executors experienced slower provisioning due to provider rate limiting, prolonging the recovery.
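One way to picture a build that ignores the problematic field is a tolerant reader: validate only what distribution needs and drop the field whose type is in doubt. The sketch below is an assumption for illustration, not CircleCI's actual fix; the `priority` field name, the `job_id` key, and `decode_tolerant` are hypothetical.

```python
# Hypothetical name for the field whose type changed; not taken from the report.
PROBLEM_FIELD = "priority"


def decode_tolerant(record: dict) -> dict:
    """Check only the field distribution needs and drop the problematic one."""
    job_id = record.get("job_id")
    if not isinstance(job_id, str):
        raise ValueError("job_id is required and must be a string")
    # Copy every other field through unchanged, whatever its type.
    return {key: value for key, value in record.items() if key != PROBLEM_FIELD}


old_shape = {"job_id": "a", "priority": 1}       # written before the deploy
new_shape = {"job_id": "c", "priority": "high"}  # written during the deploy

decoded = [decode_tolerant(old_shape), decode_tolerant(new_shape)]
```

Both record shapes decode to a dictionary without the disputed field, so work written before and during the deploy can be distributed by the same build. The trade-off is that the reader no longer detects corruption in that field, which is acceptable only if distribution does not depend on it.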

Customers saw stuck jobs and significantly longer queue times across the platform. The incident was resolved once job queues had drained and normal processing was observed. To prevent similar incidents and limit their impact, CircleCI is changing its testing strategy, modifying how the distributor reacts to unexpected data, and improving its recovery mechanisms.

Keywords

jobs, stuck, database, schema change, postgresql, distribution service, executors, queue times