{"UUID":"8afe308d-b0aa-4792-b2f1-170cd8428d8e","URL":"https://discuss.circleci.com/t/post-incident-report-april-4-2025-delays-in-starting-workflows/53113","ArchiveURL":"","Title":"CircleCI workflows latency and failures on April 4, 2025","StartTime":"2025-04-04T22:08:00Z","EndTime":"2025-04-04T23:45:00Z","Categories":["automation","config-change"],"Keywords":["workflows","database","blue/green","upgrade","latency","statistics"],"Company":"CircleCI","Product":"workflows","SourcePublishedAt":"2025-04-16T13:41:43Z","SourceFetchedAt":"2026-05-04T19:52:00.768714Z","Summary":"A blue/green upgrade of the workflows database succeeded mechanically, but the post-cutover database was running every query against disk because its statistics tables had not been updated. The team ran `ANALYZE` early in the upgrade procedure, but a second major-version upgrade in the same deployment then made those statistics stale, leaving the query planner unable to plan queries efficiently after the cutover. Workflows latency spiked, jobs were dropped after exhausting their 10-minute retry, and the team eventually re-promoted the old (blue) database to recover.","Description":"On April 4, 2025, CircleCI customers experienced increased latency and failures when starting and canceling workflows and jobs. This incident, which lasted from 22:08 UTC to 23:45 UTC, resulted in delays and difficulty viewing workflows in the UI, with some jobs being dropped due to retry exhaustion.\n\nThe incident began following a blue/green deployment initiated around 22:00 UTC to upgrade the database for the service responsible for workflows. While the deployment completed and queries were initially served, increased latency was identified in the workflows service by 22:17 UTC, leading to errors and job failures.\n\nInvestigation revealed that all queries on the newly promoted database were hitting disk, indicating a problem with database statistics tables. 
Initial attempts to mitigate included upsizing the database, disabling non-critical operations, and scaling down the workflows service to a single pod to allow the database to recover while statistics were rebuilt.\n\nHowever, these measures did not improve database performance. The team ultimately decided to re-enable writes on the old (blue) database and reinstate its primary status to restore service. This work was completed by 23:29 UTC, and by 23:45 UTC, workflow queues returned to normal operating levels.\n\nThe root cause was that the `ANALYZE` operation, intended to rebuild the database's statistics table for indexes, ran too early in the upgrade procedure. A subsequent major version upgrade within the same deployment then rendered these statistics stale, causing the database query planner to operate inefficiently.\n\nTo prevent recurrence, CircleCI has updated blue/green database deployment procedures to run `ANALYZE` after every major version change. The team is also adding automated tests and manual checkpoints before future migrations to identify and resolve issues prior to cutover."}