Travis CI production database truncation
TravisCI · travis-ci.com
On Tuesday, March 13, 2018, travis-ci.com experienced a major outage that started at 12:14 UTC and lasted approximately 5.5 hours, with a build backlog persisting for another 3.5 hours. The incident began when a database query was accidentally run against the production database, truncating all tables. The query was blocked for about 10 minutes after it was issued and then executed at 12:14 UTC.
The root cause was a developer running a test suite that uses the Database Cleaner gem in an old terminal window. That window, part of a tmux session, still had the DATABASE_URL environment variable set to the production database from earlier inspection work, and the developer was not aware of it. The tooling and processes at the time also made it easier to connect to the primary database with write access than to a read-only follower.
For customers, travis-ci.com was non-operational. For about 30 minutes after the truncation, the API remained up but was connected to an almost empty database, so user profiles appeared blank. Users who logged in during this window had new user records created for them, which led to mismatched tokens and users being logged in as the wrong account once the database was restored. All affected tokens were revoked by 14:22 UTC on March 14. In addition, the cron scheduler was not restarted, which caused errors with scheduled jobs. Some customer build logs were potentially exposed, and the affected users were contacted.
Remediation steps included revoking the truncate permission on the databases, patching internal spec helpers to check the DATABASE_URL environment variable, and adding a shell prompt warning when DATABASE_URL is set. A pull request was submitted to the Database Cleaner gem to prevent similar incidents. To avoid compounding problems, an alias for the follower database was created, and database failover and maintenance were automated. Travis CI was able to recover the entire production database with only about 15 minutes of data loss.
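
The spec-helper check and the prompt warning described above can be sketched roughly as follows. This is an illustration in Python, not the Ruby patch Travis CI actually applied: the names assert_safe_database, prompt_warning, UnsafeDatabaseError, and SAFE_HOSTS are hypothetical, and the rule that only local hosts count as safe is an assumption. The idea is that a test helper calls the guard before any destructive cleanup, and the guard fails closed when DATABASE_URL points anywhere else.

```python
import os
from urllib.parse import urlparse

# Hypothetical allow-list: hosts assumed safe for destructive test cleanup.
SAFE_HOSTS = {"localhost", "127.0.0.1", "::1"}


class UnsafeDatabaseError(RuntimeError):
    """Raised when a destructive test helper would touch a non-local database."""


def assert_safe_database(env=None):
    """Raise unless DATABASE_URL is unset or points at a host in SAFE_HOSTS."""
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL")
    if not url:
        return  # nothing set: the suite is assumed to use its local default
    host = urlparse(url).hostname
    # Fail closed: a URL whose host cannot be parsed is also rejected.
    if host not in SAFE_HOSTS:
        raise UnsafeDatabaseError(
            f"DATABASE_URL points at {host!r}; refusing to run destructive cleanup"
        )


def prompt_warning(env=None):
    """Return a prefix to show in the shell prompt while DATABASE_URL is set."""
    env = os.environ if env is None else env
    return "[DATABASE_URL SET] " if env.get("DATABASE_URL") else ""
```

In this sketch a stale DATABASE_URL left over in an old terminal would make the test run abort before any table is touched, and the prompt prefix would make the leftover variable visible in every shell line. Revoking the truncate permission remains the stronger safeguard, since it does not depend on every helper calling the check.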