Travis CI GCE base image deletion
Travis CI · Travis CI (GCE builds)
On August 9, 2016, Travis CI’s GCE-backed build platform, used for jobs that need sudo or Docker-in-VM, went down because an internal cleanup service deleted the platform’s stable base VM images. The outage itself lasted about 90 minutes, but the recovery promoted hastily tested replacement images, and the resulting build breakage continued for weeks.
Travis ran builds on three platforms: Docker-in-EC2 (default), VMware vSphere on dedicated hardware (for macOS/Objective-C builds), and Linux VMs on Google Compute Engine (for builds requiring sudo or Docker). Storage on GCE was limited, so an automated cleanup service deleted images that had been removed from Travis’s internal image catalog.
The team was shipping a new image-provisioning process and had been publishing many more candidate images than usual for testing. The cleanup service had been briefly disabled to debug a potential race condition. When it was re-enabled, it queried the catalog with a hard-coded page size of 100, sorted newest-first. Because the development push had produced more than 100 fresh images, the stable images — which were the oldest — fell off the end of that page and never appeared in the cleanup service’s view of the catalog. The service concluded the stable images were no longer referenced and deleted them from GCE. Builds stopped immediately.
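The failure mode is easy to reproduce in miniature. The following is a minimal sketch, not Travis's actual code: the catalog is modeled as an in-memory list, and `Image`, `list_catalog`, `images_to_delete`, and the image names are all hypothetical. Only the page size of 100 and the newest-first sort come from the incident description.

```python
from dataclasses import dataclass

PAGE_SIZE = 100  # hard-coded page size from the incident description


@dataclass(frozen=True)
class Image:
    name: str
    created_at: int  # larger means newer; hypothetical timestamp


def list_catalog(catalog, limit=PAGE_SIZE):
    """Return one page of catalog entries, newest first (hypothetical API)."""
    newest_first = sorted(catalog, key=lambda img: img.created_at, reverse=True)
    return newest_first[:limit]


def images_to_delete(gce_images, catalog):
    """Buggy cleanup: treats the first page as if it were the whole catalog."""
    referenced = {img.name for img in list_catalog(catalog)}
    return sorted(name for name in gce_images if name not in referenced)


# Two old stable images plus 120 newer development images, all still cataloged.
stable = [Image(f"stable-{i}", created_at=i) for i in range(2)]
dev = [Image(f"dev-{i}", created_at=1000 + i) for i in range(120)]
catalog = stable + dev
gce_images = [img.name for img in catalog]

# Nothing here should be deleted, yet everything past the first page is.
doomed = images_to_delete(gce_images, catalog)
print(len(doomed), "stable-0" in doomed, "stable-1" in doomed)
```

In this toy setup every image is still in the catalog, so a correct cleanup would delete nothing. Because only the 100 newest entries are ever seen, the two stable images (and the 20 oldest development images) are treated as unreferenced.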
The team could not exactly recreate the deleted images: their provisioning code had drifted in the nine months since the stable images were last built. They rolled forward to the available development images instead, which got builds running about 90 minutes after the deletion and cleared the backlog within another two hours.
The new images broke a long tail of customer builds. The first version shipped without docker-compose, for example, even though Docker workloads are precisely the use case GCE builds exist for. Travis halted feature development for over a week and put every engineer on customer support to absorb the resulting flood. Committed countermeasures: a reproducible image-provisioning process, so that any image can be rebuilt exactly, and more frequent image updates, so that any future regression is small. The cleanup service’s bug (a “delete things not in catalog X” job that paged through only the first 100 entries of catalog X) was likewise tracked for fixing.
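One plausible shape for the cleanup fix is to walk every page of the catalog before deciding what is unreferenced. This is a sketch under assumptions, not the fix Travis shipped: `fetch_page`, `all_catalog_names`, and the offset-based paging scheme are hypothetical stand-ins for whatever the real catalog API offers.

```python
def fetch_page(catalog, offset, limit):
    """Hypothetical paged catalog query: (name, created_at) tuples, newest first."""
    newest_first = sorted(catalog, key=lambda entry: entry[1], reverse=True)
    return newest_first[offset:offset + limit]


def all_catalog_names(catalog, page_size=100):
    """Collect names from every page, stopping only when a page comes back empty."""
    names = set()
    offset = 0
    while True:
        page = fetch_page(catalog, offset, page_size)
        if not page:
            break
        names.update(name for name, _created_at in page)
        offset += len(page)
    return names


def images_to_delete(gce_images, catalog, page_size=100):
    """Delete candidates are images absent from the complete catalog listing."""
    referenced = all_catalog_names(catalog, page_size)
    return sorted(name for name in gce_images if name not in referenced)


# 2 stable + 120 development images in the catalog, plus one true orphan on GCE.
catalog = [("stable-0", 0), ("stable-1", 1)]
catalog += [(f"dev-{i}", 1000 + i) for i in range(120)]
gce_images = [name for name, _created_at in catalog] + ["orphan-0"]

doomed = images_to_delete(gce_images, catalog)
print(doomed)
```

With the full listing, only the image that is genuinely missing from the catalog is selected. Offset-based paging can still miss entries if the catalog changes during the scan, which this sketch does not address.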