Postmortem Index

Explore incident reports from various companies

CircleCI Workflow Delay Incidents March 26 - April 10, 2019

CircleCI · MongoDB

2019-03-26 – 2019-04-10 config-change

Between March 26 and April 10, 2019, CircleCI experienced multiple incidents involving workflow delays and platform instability. The issues began on March 26 with workflow processing delays, followed by further incidents on April 2 (job start delays), April 3 (API degradation and Out of Memory errors), April 4 (unresponsive MongoDB replicas), April 5 (MongoDB primary saturation), and April 10 (workflow delays leading to a failover).

The core problem stemmed from the datastore backing the builds queue, specifically MongoDB replica sets, which suffered from slow queries, degradation, and stalls. As a result, jobs and workflows could not be processed, and the public APIs degraded. Initially, a concurrent minor JVM upgrade, applied for security reasons, inadvertently reduced thread and connection pool sizes across services: newer JVMs are Docker-aware by default, and inside CPU-limited containers they report fewer available processors, shrinking any pool sized from that count. This constrained throughput and masked the underlying MongoDB capacity issues.
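To illustrate the pool-sizing effect (a minimal sketch, not CircleCI's actual code; `MIN_POOL_SIZE` and the doubling heuristic are hypothetical), the snippet below sizes a worker pool from the processor count the JVM reports. In a container with CPU limits, a Docker-aware JVM (JDK 8u191+/10+) can report far fewer processors than the host has, silently shrinking pools sized this way:

```java
public class PoolSizing {
    // Hypothetical floor to preserve throughput when the JVM under-reports CPUs.
    static final int MIN_POOL_SIZE = 16;

    // Common pattern: derive pool size from the reported processor count.
    static int poolSize(int reportedProcessors) {
        return Math.max(MIN_POOL_SIZE, reportedProcessors * 2);
    }

    public static void main(String[] args) {
        // Inside a CPU-limited container, a Docker-aware JVM may return a
        // small number here, quietly constraining every pool sized from it.
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("JVM sees " + cpus + " CPUs; pool size = " + poolSize(cpus));
        // Container detection can also be disabled at launch:
        //   java -XX:-UseContainerSupport ...
    }
}
```

Disabling Docker detection, as the remediation below describes, restores the host-level processor count and thus the original pool sizes.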

The primary root cause was identified as severe contention within MongoDB, leading to the exhaustion of “tickets” (the storage engine's internal limits on concurrent read and write operations). A significant contributing factor was the application’s practice of redundantly declaring indexes on startup. Although idempotent, each declaration takes database-level locks, and when performed frequently by a large fleet of service instances, this caused significant contention, exacerbated ticket unavailability, and led to frequent, pervasive stalls.
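The fix described below, removing the unconditional index declarations, can be sketched as a check-before-create guard (an illustrative sketch only; the index names and helper are hypothetical, and in a real driver `existing` would come from `listIndexes`). Only missing indexes trigger a `createIndex` call, so a fleet of instances restarting against an already-indexed database takes no locks at all:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IndexGuard {
    // Compare the indexes the application wants against those the database
    // already has; only the missing ones need a (lock-taking) createIndex.
    static List<String> missingIndexes(Set<String> existing, List<String> desired) {
        List<String> toCreate = new ArrayList<>();
        for (String name : desired) {
            if (!existing.contains(name)) {
                toCreate.add(name);
            }
        }
        return toCreate;
    }

    public static void main(String[] args) {
        // Hypothetical index names for illustration.
        Set<String> existing = new HashSet<>(Arrays.asList("_id_", "build_num_1"));
        List<String> desired = Arrays.asList("_id_", "build_num_1", "project_id_1");
        System.out.println(missingIndexes(existing, desired)); // [project_id_1]
    }
}
```

The point is that idempotence at the data level does not imply zero cost at the locking level; skipping the call entirely is what removes the contention.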

Customers experienced significant delays in workflow processing and job starts. The public and private APIs became degraded, with some API service containers crashing. The cumulative effect was a prolonged period of instability where the platform struggled to process builds, leading to a backlog of jobs and confusion when incidents were declared “fixed” while queues were still draining.

Remediation involved scaling up MongoDB capacity, disabling Docker detection for JVMs, tuning thread and connection pools, and critically, removing the unnecessary index creation logic. CircleCI also engaged MongoDB support, applying TCMalloc-related tuning parameters and planning for a version upgrade. Further steps include improving application-level data access, enhancing monitoring with tools like Datadog and Honeycomb.io to detect leading indicators of degradation, and implementing failover procedures for slow MongoDB processes to minimize customer impact.

Keywords

mongodb · jvm · workflow · queue · api · datastore · performance · incident