Postmortem Index

Explore incident reports from various companies

Slack’s Incident on 2-22-22

Slack

On February 22, 2022, just after 6 a.m. Pacific Time, Slack experienced a major incident where many users were unable to connect to the service. The primary symptom was the failure of client boot operations, which prevented users from fetching essential data like channel listings and preferences, rendering Slack unusable.

The incident was triggered by a percentage-based rollout (PBR) of Consul agent upgrades. As Consul agents on Memcached nodes restarted, they temporarily deregistered and re-registered. Mcrib, Slack’s cache control plane, replaced the restarting nodes with empty spares as designed, causing a significant drop in the cache hit rate across the system.

This cache degradation exposed an underlying inefficiency in a “scatter query” used for Group Direct Message (GDM) conversations. Because the underlying table was sharded by user, a cache miss forced this query to fan out to every shard in the Vitess database. With a high cache miss rate, the database was overwhelmed by these superlinear read loads, causing queries to time out and preventing caches from refilling, leading to a cascading failure.
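The scatter-on-miss pattern can be illustrated with a minimal sketch. All names here (`SHARDS`, `query_shard`, `fetch_gdm_ids`) are hypothetical stand-ins, not Slack’s actual code; the point is how a single cache miss multiplies into one read per shard.

```python
# Illustrative sketch of a "scatter on miss" query; not Slack's real code.

SHARDS = [f"shard-{i}" for i in range(8)]  # stand-in for Vitess shards
cache = {}          # stand-in for Memcached
shard_queries = 0   # counts database reads to make the fan-out visible

def query_shard(shard: str, user_id: str) -> list:
    """Placeholder per-shard read; returns that shard's GDM ids for the user."""
    global shard_queries
    shard_queries += 1
    return [f"{shard}:gdm:{user_id}"]

def fetch_gdm_ids(user_id: str) -> list:
    key = f"gdm_ids:{user_id}"
    if key in cache:
        return cache[key]  # hit: one cheap cache lookup, no database reads
    # Miss: the table is sharded by user, but a GDM spans many users,
    # so the query must fan out to every shard. When the hit rate
    # collapses, this fan-out multiplies load across the whole database.
    result = [gdm for shard in SHARDS for gdm in query_shard(shard, user_id)]
    cache[key] = result
    return result

fetch_gdm_ids("alice")  # cold: one query per shard (8 here)
fetch_gdm_ids("alice")  # warm: served from cache, no new queries
```

With a healthy cache, almost all requests take the cheap path; once the hit rate drops, nearly every request pays the full per-shard cost, which is why the load grew superlinearly.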

Initial mitigation involved throttling client boot requests to reduce database load; this helped users whose clients had already booted but blocked new connections. The Consul agent restart operation was paused. Subsequently, the problematic scatter query was modified to read only the missing data from Vitess and to use replicas, which allowed caches to refill and database load to decrease.
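The query-side remediation can be sketched as follows. Names here (`cache`, `replica_fetch_gdm_ids`, `fetch_for_users`) are hypothetical; the sketch only shows the two changes the summary describes: serving cache hits without touching the database, and sending a batched read for just the missing users to a replica.

```python
# Hedged sketch of the remediation: fetch only cache-missing data,
# and route those reads to a replica. Not Slack's actual code.

cache = {}
replica_calls = []  # records what each replica read asked for

def replica_fetch_gdm_ids(user_ids: list) -> dict:
    """Placeholder for a batched read against a Vitess *replica*,
    scoped to only the requested users."""
    replica_calls.append(list(user_ids))
    return {uid: [f"gdm:{uid}"] for uid in user_ids}

def fetch_for_users(user_ids: list) -> dict:
    results, missing = {}, []
    for uid in user_ids:
        key = f"gdm_ids:{uid}"
        if key in cache:
            results[uid] = cache[key]  # hits never touch the database
        else:
            missing.append(uid)
    if missing:
        # One read, for only the missing users, against a replica:
        # primaries are shielded and caches can refill under load.
        for uid, ids in replica_fetch_gdm_ids(missing).items():
            cache[f"gdm_ids:{uid}"] = ids
            results[uid] = ids
    return results

fetch_for_users(["alice", "bob"])    # cold: replica read for both users
fetch_for_users(["alice", "carol"])  # "alice" is cached; only "carol" hits the replica
```

Narrowing each read to the missing keys is what breaks the feedback loop: refilled cache entries shrink subsequent database reads instead of every miss re-triggering a full scatter.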

These remediations enabled engineers to slowly increase the client boot rate limit back to normal levels, gradually restoring full service. The incident highlighted complex interactions between the application, Vitess datastores, caching system, and service discovery, leading to process changes for Consul rollouts and modifications to the problematic query.

Keywords

slack, vitess, memcached, consul, database, caching, client boot, gdm