Honeycomb total outage on July 25th, 2023
Honeycomb · Honeycomb
On July 25th, 2023, Honeycomb experienced a total outage impacting all user-facing components from 1:40 p.m. UTC to 2:48 p.m. UTC. The incident began with a routine switch between Retriever (storage and query engine) clusters on July 24th to avoid a bug. Hours later, an internal SLO for Shepherd (ingest service) started burning, indicating performance degradation.
Engineers discovered that the new query engine cluster was failing to write the updates that drive schema refresh, which undermined the ingest cache and caused the performance issues. An attempt to fix this by toggling a feature flag to flip writes back to the old cluster failed because of a subtle implementation bug: once a host had been told to stop writing, it never tried again, and only a full reboot would make it resume. This flaw had been hidden by deployment mechanisms, and it eventually led to a complete cessation of writes.
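The "told to stop, never tried again" behavior can be illustrated with a minimal sketch. This is not Honeycomb's code: the class names, the `tick` loop, and the flag callback are all hypothetical, and the sketch only assumes that the buggy writer latched into a stopped state while a correct one would re-evaluate the flag on every cycle. It also suggests one way such a flaw could stay hidden: if processes are restarted often, the latched state is reset before anyone notices.

```python
from typing import Callable, Iterable, Tuple


class OneShotWriter:
    """Hypothetical writer that stops for good the first time the flag is off."""

    def __init__(self, flag_enabled: Callable[[], bool]) -> None:
        self._flag_enabled = flag_enabled
        self._stopped = False
        self.writes = 0

    def tick(self) -> None:
        if self._stopped:
            return  # the flag is never consulted again until a restart
        if not self._flag_enabled():
            self._stopped = True
            return
        self.writes += 1


class RecheckingWriter:
    """Hypothetical writer that re-evaluates the flag on every tick."""

    def __init__(self, flag_enabled: Callable[[], bool]) -> None:
        self._flag_enabled = flag_enabled
        self.writes = 0

    def tick(self) -> None:
        if self._flag_enabled():
            self.writes += 1


def simulate(flag_sequence: Iterable[bool]) -> Tuple[int, int]:
    """Run both writers against the same sequence of flag values."""
    state = {"on": True}
    one_shot = OneShotWriter(lambda: state["on"])
    rechecking = RecheckingWriter(lambda: state["on"])
    for value in flag_sequence:
        state["on"] = value
        one_shot.tick()
        rechecking.tick()
    return one_shot.writes, rechecking.writes
```

Calling `simulate([True, False, True, True])` returns `(1, 3)`: the one-shot writer performs a single write and stays stopped after the flag is flipped back on, while the rechecking writer resumes as soon as the flag allows it.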
The system ultimately crashed when the main database seized up and ran out of connections. The root cause was identified as a non-deterministic deadlock within MySQL’s internals, likely triggered by the increased read load that hit the database once the ingest cache stopped functioning. Contributing factors included indirect dependencies between services, through which the cache failure overloaded the database, and the unexpected interaction of seemingly good practices such as feature flags and frequent deploys.
During the outage, no data could be processed or accessed, and any data users sent during that window was lost unless it was buffered. The outage also had knock-on effects on alerting and querying capabilities. It was the most severe incident Honeycomb has had since it began serving paying customers.
Recovery involved setting up circuit breakers to protect the database, failing over to a replica, manually updating schema timestamps to force a cache reload, and restarting Retriever hosts. Corrective actions include strengthening the cache, reducing database contention during schema updates, and stabilizing performance. In the short term, the likelihood of this specific failure mode has been drastically reduced by completed migrations and code removal, and a faster database failover response is now expected.
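For readers unfamiliar with the pattern, a circuit breaker in front of a database can be sketched as follows. This is a generic illustration, not the mechanism Honeycomb deployed: the `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are all assumptions made for the example. The idea is that after a run of failures, callers are rejected immediately for a cool-down period instead of piling more connections onto a struggling database.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the database."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("database calls are short-circuited")
            # Cool-down elapsed: allow a trial call, but reopen on one failure.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

A caller would wrap each database query, for example `breaker.call(lambda: run_query(sql))`, where `run_query` stands in for whatever client function is in use, and treat `CircuitOpenError` as a signal to serve stale data or fail fast rather than retry.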