
CircleCI workflows latency and failures on April 4, 2025

CircleCI · workflows

On April 4, 2025, CircleCI customers experienced increased latency and failures when starting and canceling workflows and jobs. This incident, which lasted from 22:08 UTC to 23:45 UTC, resulted in delays and difficulty viewing workflows in the UI, with some jobs being dropped due to retry exhaustion.

The incident began after a blue/green deployment, initiated around 22:00 UTC, to upgrade the database behind the service responsible for workflows. The deployment completed and queries were initially served, but by 22:17 UTC increased latency was identified in the workflows service, leading to errors and job failures.

Investigation revealed that all queries on the newly promoted database were hitting disk, indicating a problem with database statistics tables. Initial attempts to mitigate included upsizing the database, disabling non-critical operations, and scaling down the workflows service to a single pod to allow the database to recover while statistics were rebuilt.

However, these measures did not improve database performance. The team ultimately decided to re-enable writes on the old (blue) database and reinstate its primary status to restore service. This work was completed by 23:29 UTC, and by 23:45 UTC, workflow queues returned to normal operating levels.

The root cause was determined to be that the ANALYZE operation, intended to rebuild the database’s statistics table for indexes, was executed too early. A subsequent major version upgrade within the same deployment then rendered these statistics stale, causing the database query planner to operate inefficiently.
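The ordering problem described above can be expressed as a simple freshness check: statistics only count if they were gathered after the last major version upgrade finished. The following is a minimal sketch, not CircleCI's tooling. The function name and the timestamps are hypothetical, and it assumes the database records when each table was last analyzed (PostgreSQL, for example, exposes this as `last_analyze` in `pg_stat_user_tables`; the report does not name the engine).

```python
from datetime import datetime, timezone
from typing import Optional


def stats_are_stale(last_analyze: Optional[datetime],
                    upgrade_finished_at: datetime) -> bool:
    """Return True when planner statistics cannot be trusted.

    Statistics are treated as stale if they were never gathered
    (last_analyze is None) or were gathered before the major
    version upgrade finished.
    """
    return last_analyze is None or last_analyze < upgrade_finished_at


# Hypothetical timestamps, for illustration only.
upgrade_done = datetime(2025, 4, 4, 21, 30, tzinfo=timezone.utc)
analyzed_too_early = datetime(2025, 4, 4, 21, 0, tzinfo=timezone.utc)
analyzed_after = datetime(2025, 4, 4, 21, 45, tzinfo=timezone.utc)

early_is_stale = stats_are_stale(analyzed_too_early, upgrade_done)
after_is_stale = stats_are_stale(analyzed_after, upgrade_done)
never_is_stale = stats_are_stale(None, upgrade_done)
```

In this sketch, only the run that happened after the upgrade yields usable statistics; the early run and the missing run are both flagged.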

To prevent recurrence, CircleCI has updated its blue/green database deployment procedures to run ANALYZE after every major version change. The team is also adding automated tests and manual checkpoints before future migrations so that issues are identified and resolved prior to cutover.
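One way to picture such a pre-cutover checkpoint is a check over the planned migration steps that refuses any plan in which a major version upgrade is not followed by an analysis step. This is an illustrative sketch only. The step names (`major_upgrade`, `analyze`, `cutover`) and the function are hypothetical and do not come from CircleCI's runbook.

```python
from typing import List, Optional, Sequence


def upgrades_missing_analyze(steps: Sequence[str]) -> List[int]:
    """Return the indexes of 'major_upgrade' steps that are not followed
    by an 'analyze' step before the next upgrade, the cutover, or the
    end of the plan."""
    missing: List[int] = []
    pending: Optional[int] = None  # index of an upgrade still awaiting analyze

    for i, step in enumerate(steps):
        if step == "major_upgrade":
            if pending is not None:
                missing.append(pending)
            pending = i
        elif step == "analyze":
            pending = None
        elif step == "cutover":
            if pending is not None:
                missing.append(pending)
            pending = None

    if pending is not None:
        missing.append(pending)
    return missing


# Hypothetical plans: the first mirrors the failure mode described above.
bad_plan = ["create_green", "analyze", "major_upgrade", "cutover"]
good_plan = ["create_green", "major_upgrade", "analyze", "cutover"]

bad_result = upgrades_missing_analyze(bad_plan)
good_result = upgrades_missing_analyze(good_plan)
```

An empty result means every upgrade in the plan is covered; any returned index points at the upgrade step that would leave statistics stale at cutover.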

Keywords

workflows, database, blue/green, upgrade, latency, statistics