{"UUID":"14024599-5ca4-479d-a537-d36c393f97a6","URL":"https://discuss.circleci.com/t/postmortem-march-26-april-10-workflow-delay-incidents/30060","ArchiveURL":"","Title":"CircleCI Workflow Delay Incidents March 26 - April 10, 2019","StartTime":"2019-03-26T21:29:00Z","EndTime":"2019-04-10T15:00:00Z","Categories":["config-change"],"Keywords":["mongodb","jvm","workflow","queue","api","datastore","performance","incident"],"Company":"CircleCI","Product":"MongoDB","SourcePublishedAt":"2019-04-29T17:50:13Z","SourceFetchedAt":"2026-05-04T19:51:05.70343Z","Summary":"Slow queries on the MongoDB replica sets backing the build queue caused workflows to back up over a two-week run of incidents. A roughly concurrent minor-version JVM upgrade enabled Docker-awareness by default, which silently shrank thread and connection pools across most JVM services and constrained throughput, masking the underlying MongoDB capacity problem. Tuning thread/connection pools and upsizing MongoDB stabilized the platform after multiple cascading outages on March 26, April 2, April 3, and April 10.","Description":"Between March 26 and April 10, 2019, CircleCI experienced multiple incidents involving workflow delays and platform instability. The issues began on March 26 with workflow processing delays, followed by further incidents on April 2 (job start delays), April 3 (API degradation and Out of Memory errors), April 4 (unresponsive MongoDB replicas), April 5 (MongoDB primary saturation), and April 10 (workflow delays leading to a failover).\n\nThe core problem stemmed from the datastore backing the builds queue, specifically MongoDB replica sets, which suffered from slow queries, degradation, and stalls. This led to jobs and workflows being unable to process, and public APIs experiencing degradation. Initially, a concurrent minor JVM upgrade, intended for security, inadvertently reduced thread and connection pool sizes across services due to default Docker-awareness, constraining throughput and masking the underlying MongoDB capacity issues.\n\nThe primary root cause was identified as severe contention within MongoDB, leading to the exhaustion of \"tickets\" (internal read/write concurrency limits). A significant contributing factor was the application's practice of redundantly declaring indexes on MongoDB startup. Although idempotent, this operation required database-level locks, which, when performed frequently by a large fleet of service instances, caused significant contention and exacerbated ticket unavailability, leading to frequent and pervasive stalls.\n\nCustomers experienced significant delays in workflow processing and job starts. The public and private APIs became degraded, with some API service containers crashing. The cumulative effect was a prolonged period of instability where the platform struggled to process builds, leading to a backlog of jobs and confusion when incidents were declared \"fixed\" while queues were still draining.\n\nRemediation involved scaling up MongoDB capacity, disabling Docker detection for JVMs, tuning thread and connection pools, and critically, removing the unnecessary index creation logic. CircleCI also engaged MongoDB support, applying TCMalloc-related tuning parameters and planning for a version upgrade. Further steps include improving application-level data access, enhancing monitoring with tools like Datadog and Honeycomb.io to detect leading indicators of degradation, and implementing failover procedures for slow MongoDB processes to minimize customer impact."}