Unavailable Guilds & Connection Issues

On October 13, 2017, at 14:01 PDT, a migration initiated by Google Cloud Platform (GCP) took a primary Redis instance offline. This triggered a known bug in Discord’s API instances, which failed to handle the Redis failover properly, leading to a partial outage. Engineers performed rolling restarts of the API instances and escalated the issue to GCP support.

Following the initial Redis issue, engineers discovered a misconfigured edge caching rule for an expensive API route, which was degrading performance and overloading a Cassandra cluster. The rule was corrected, the Cassandra cluster recovered, and further API instance restarts were performed. By 14:57 PDT, these issues were believed to be resolved, and the API had returned to a nominal state.

However, at 15:41 PDT, new anomalies emerged, with users reporting unavailable guilds and connection problems. A misbehaving node in the “guilds” cluster was identified, followed by issues in the “sessions” cluster. These cascading failures prompted Discord engineers to initiate a full service restart at 16:07 PDT, which involved rebooting various components and correcting an improperly configured API setting for client reconnection.

The root cause of the initial partial outage was a known bug in how Discord API instances handle Redis failover, which was triggered by the GCP migration. The exact cause of the subsequent cascading full system failure was not fully understood at the time of the postmortem, but it was theorized that the initial failure caused other service nodes to misbehave, run out of memory, and trigger further failures despite safeguards. A misconfigured edge caching rule also contributed to API degradation.
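The summary does not describe the failover bug itself, so the following is only an illustrative sketch of one common way a client can tolerate a primary going away: discard the dead connection, wait briefly, reconnect, and retry. It is not Discord's fix. All names here (`ConnectionLost`, `call_with_reconnect`, `connect`, `operation`) are hypothetical and do not belong to any real Redis client library.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical error raised when the connection to the store drops."""


def call_with_reconnect(connect, operation, attempts=3, base_delay=0.1,
                        sleep=time.sleep):
    """Run operation(conn); on ConnectionLost, reconnect and retry.

    connect   -- callable returning a fresh connection (assumed to resolve
                 the current primary after a failover)
    operation -- callable taking a connection and returning a result
    attempts  -- total number of tries before giving up
    """
    conn = connect()
    for attempt in range(attempts):
        try:
            return operation(conn)
        except ConnectionLost:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Exponential backoff so retries do not hammer a recovering store.
            sleep(base_delay * (2 ** attempt))
            conn = connect()  # drop the stale connection, open a new one
```

The key design choice in this sketch is that a stale connection is never reused after an error. A process that keeps holding a connection to the old primary would stay broken until restarted, which matches the symptom that rolling restarts were needed here.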

The incident resulted in unavailable guilds and connection issues for users, requiring millions of clients to reconnect over a 20-minute period after the full service restart. Remediation included multiple rolling restarts, correcting the caching rule, recovering Cassandra, and the full service restart. Discord committed to increasing the priority of fixing Redis failover issues, modifying caching behavior, and adding monitoring to detect cascading failures earlier.
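To illustrate why reconnecting millions of clients takes on the order of 20 minutes rather than seconds, here is a minimal sketch of spreading reconnects across a window with random jitter. The summary does not say how Discord paced the reconnects, so the uniform-jitter scheme, the function names, and the client count used in the usage note are assumptions for illustration only.

```python
import random

WINDOW_SECONDS = 20 * 60  # the 20-minute period mentioned in the summary


def reconnect_delay(window_seconds=WINDOW_SECONDS, rng=random):
    """Pick a uniformly random delay so clients do not reconnect all at once."""
    return rng.uniform(0, window_seconds)


def average_reconnect_rate(clients, window_seconds=WINDOW_SECONDS):
    """Average reconnects per second if clients are spread evenly over the window."""
    return clients / window_seconds
```

For a hypothetical 1,200,000 clients, `average_reconnect_rate(1_200_000)` is 1000.0 reconnects per second, whereas the same clients reconnecting within a single second would present the entire load at once.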
