Postmortem Index


High queue times on OS X builds (.com and .org)

TravisCI · OS X builds

2015-08-04 – 2015-08-06 · config-change

On August 4th, 2015, Travis CI experienced significant instability and high queue times for OS X builds on both open source and private repositories, which caused a large backlog and disrupted users. The incident began with elevated re-queue rates and VM creation errors, and Travis CI opened a public status incident at 17:45 UTC.

Investigation revealed that the vsphere-janitor service, which cleans up build VMs, was failing to authenticate with the vSphere API because its configuration had not been updated after a password rotation on July 31st. As a result, the service “leaked” over 6,000 virtual machines onto the Xserve cluster, exhausting resources and preventing new VMs from powering on. A defect in vsphere-janitor also masked the problem: the service kept reporting stale metrics, so no alerts fired.
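The stale-metrics failure mode described above can be guarded against by treating an old metric as its own alert condition. The following is a minimal sketch under stated assumptions, not the vsphere-janitor implementation: the names GaugeSample and evaluate_gauge, the age limit, and the threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GaugeSample:
    """One reported metric value plus the time it was last refreshed (hypothetical type)."""
    value: float
    updated_at: float  # Unix timestamp of the last successful refresh


def evaluate_gauge(sample: Optional[GaugeSample], now: float,
                   max_age_s: float, threshold: float) -> str:
    """Return 'ok', 'alert', or 'stale' for a single gauge.

    A missing or old sample is reported as 'stale' rather than 'ok', so a
    service that has stopped refreshing its metrics cannot hide behind the
    last healthy value it published.
    """
    if sample is None or now - sample.updated_at > max_age_s:
        return "stale"
    return "alert" if sample.value > threshold else "ok"
```

With a check of this shape, a gauge last refreshed days ago would surface as "stale" even if its last value looked healthy, which is the condition that went unnoticed here.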

Travis CI paused all OS X builds at 18:46 UTC on August 4th, fixed the password configuration, and restarted the vsphere-janitor service. After an initial cleanup, builds resumed at 21:01 UTC. However, the large number of leaked VMs, some of them in an unexpected state, required a more aggressive cleanup, and builds were paused again at 22:10 UTC.

The aggressive cleanup removed 6,326 VMs, but when builds resumed at 22:56 UTC, new VMs immediately failed to boot, indicating a deeper infrastructure issue. Travis CI escalated to its infrastructure provider, MacStadium, at 23:28 UTC. The cause was determined to be a misconfiguration in vSphere’s CPU Reservations that prevented the maximum number of builds from booting, a limitation exposed only by the high load from the incident’s backlog.
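To illustrate why a CPU reservation setting can cap the number of VMs that boot, here is a simplified sketch of reservation-based admission: a VM powers on only if its reserved CPU still fits within the capacity left unreserved. The function name and all numbers are hypothetical; the report summarized here does not give the actual reservation values, and real vSphere admission control considers more factors than this.

```python
def max_powered_on_vms(capacity_mhz: int, reservation_mhz_per_vm: int) -> int:
    """Upper bound on concurrently powered-on VMs under a per-VM CPU reservation.

    Simplified model: every VM reserves the same amount of CPU, and a VM may
    power on only while the sum of reservations stays within capacity.
    """
    if capacity_mhz < 0 or reservation_mhz_per_vm <= 0:
        raise ValueError("capacity must be >= 0 and reservation must be > 0")
    return capacity_mhz // reservation_mhz_per_vm


# Hypothetical figures: halving the per-VM reservation doubles the bound.
oversized = max_powered_on_vms(capacity_mhz=24_000, reservation_mhz_per_vm=2_000)
adjusted = max_powered_on_vms(capacity_mhz=24_000, reservation_mhz_per_vm=1_000)
```

Under this model, a limit of this kind stays invisible at normal load and only appears when demand pushes the number of concurrent VMs up to the bound, which matches how the backlog exposed the problem.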

Travis CI ran at reduced capacity from 00:13 UTC on August 5th while it processed the backlog. At 03:43 UTC on August 6th, it applied MacStadium’s recommended configuration changes to the CPU reservations. Load testing later that day confirmed stability and increased throughput, and the incident was considered resolved, with the backlog fully cleared by 04:17 UTC on August 6th.

Travis CI plans to implement more robust load testing for OS X infrastructure, improve metrics collection and reporting under failure conditions, and enhance build agent software to minimize impact during future incidents.

Keywords

osx, builds, vm, vsphere, macstadium, configuration, password, cpu reservations