Flowdock outage and cross-organization data leak

A surge in remote-work traffic during the early COVID-19 pandemic spiked CPU on Flowdock’s application database and caused it to hang at ~13:30 UTC on April 21, 2020. The team restarted the application and database services, ending the first hard-down period.

The restart silently corrupted the table that maintains user-ID sequences, resetting it to a value almost 80 days old. As users logged back in over the next eight hours, the application reissued IDs that had already been allocated to users in other organizations. Customers reported missing collaborators, login failures, and, most seriously, users from unrelated organizations appearing inside their own flows. Sessions established before the crash were unaffected.
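The failure mode above, a sequence rewound below IDs that are already in use, can be caught with a simple invariant check before the application starts handing out IDs again. The following is a minimal sketch, not Flowdock's actual code: the function names, the exception, and the idea of comparing the sequence against the highest allocated ID are all assumptions made for illustration.

```python
class StaleSequenceError(RuntimeError):
    """Raised when the sequence would reissue an ID that is already in use."""


def check_sequence(sequence_value: int, max_allocated_id: int) -> None:
    # Hypothetical startup guard: a healthy sequence always points at or
    # beyond the highest ID that has already been handed out.
    if sequence_value < max_allocated_id:
        raise StaleSequenceError(
            f"sequence at {sequence_value}, but IDs up to {max_allocated_id} exist"
        )


def repaired_sequence(sequence_value: int, max_allocated_id: int) -> int:
    # Move the sequence forward only; never rewind it.
    return max(sequence_value, max_allocated_id)
```

Running a check like this after any database restart would turn a silent rewind into a startup failure, which is easier to notice than reissued IDs.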

Flowdock was taken down manually at ~21:30 UTC on April 21 to investigate the cross-organizational leakage. Engineers restored a pre-crash snapshot of the primary database (which also repaired the corrupted sequence table) and brought the service back online. Between 04:30 and 06:30 UTC on April 22 a few users still saw cross-org flows, because the leaked user IDs were still warm in the application cache. Flowdock was taken down a second time at ~06:30 UTC, the pre-crash snapshot was restored again, and all active sessions were invalidated. End-to-end tests then confirmed no remaining cross-org access. The application was brought back at ~21:30 UTC on April 22.

The two snapshot restores wiped user activity performed during the “Incident Window” (April 21 13:30–21:30 UTC and April 22 04:30–06:30 UTC): new flows, new and updated users in an org, and new members added to existing flows, including 1-1 conversations. Inbound integration calls made during the window (e.g. from Rally) failed, and the affected integrations were silently removed from their flows. Log analysis reported that no .doc, .xls, .pdf, or .csv attachments were downloaded during the window.

Committed follow-ups: more headroom on the database server beyond the increase already in place, finer-grained monitoring of database response time, an automated failover to a secondary database for unrecoverable primary failures, adding a DBA to the emergency change approval committee so database failures are evaluated before application restart, and a mechanism to invalidate logged-in sessions and cached data with advance notification when the application restarts.
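One way to picture the last follow-up, invalidating sessions and cached data on restart, is a global "epoch" counter that every session is stamped with. The sketch below is an assumption-laden illustration, not the mechanism Flowdock built: `SessionStore`, `epoch`, and the in-memory dictionaries are hypothetical stand-ins for a real session store and application cache, and the advance-notification step is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class SessionStore:
    """Hypothetical in-memory store; `epoch` is bumped on restart or restore."""

    epoch: int = 0
    # token -> (user_id, epoch at login)
    sessions: Dict[str, Tuple[int, int]] = field(default_factory=dict)
    # user_id -> cached profile data
    cache: Dict[int, str] = field(default_factory=dict)

    def login(self, token: str, user_id: int) -> None:
        # Stamp the session with the epoch that was current at login.
        self.sessions[token] = (user_id, self.epoch)

    def resolve(self, token: str) -> Optional[int]:
        # A session is valid only if it was created in the current epoch.
        entry = self.sessions.get(token)
        if entry is None or entry[1] != self.epoch:
            self.sessions.pop(token, None)
            return None
        return entry[0]

    def invalidate_all(self) -> None:
        # Bumping the epoch invalidates every existing session at once;
        # cached user data is dropped so stale IDs cannot be served.
        self.epoch += 1
        self.cache.clear()
```

Calling `invalidate_all()` as part of a restart or snapshot restore forces every user to log in again, so no session or cache entry created before the restore can survive it.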
