Postmortem Index

Explore incident reports from various companies


“At approximately 14:01, a Redis instance acting as the primary for a highly-available cluster used by Discord’s API services was migrated automatically by Google’s Cloud Platform. This migration caused the node to incorrectly drop offline, forcing the cluster to rebalance and trigger known issues with the way Discord API instances handle Redis failover. After resolving this partial outage, unnoticed issues on other services caused a cascading failure through Discord’s real time system. These issues caused enough critical impact that Discord’s engineering team was forced to fully restart the service, reconnecting millions of clients over a period of 20 minutes.”
