Postmortem Index

Spotify Popcount service outage of April 2013

Spotify · Popcount

2013-04-27

Spotify experienced a major outage for European users in April 2013, impacting music playback and login functionality. This incident was preceded by a similar issue two months prior involving Popcount, a backend service storing playlist subscriber lists. Popcount was designed to fail fast, but a legacy desktop client component lacked this behavior.

The legacy client retried Popcount requests continuously, with no exponential backoff, and overwhelmed the service. The backlog of pending requests then made recovery difficult. Developers deployed a change that made Popcount fail fast and return empty lists, which temporarily resolved the issue. However, a permanent fix for the root cause in the client was not prioritized.
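The missing client behavior can be illustrated with a minimal sketch of retrying with capped exponential backoff and random jitter. This is not Spotify's client code: the names `backoff_delay` and `fetch_with_backoff`, the default delays, and the use of `ConnectionError` to signal a failed request are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, List, Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Upper bound, in seconds, on the wait after failed attempt `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def fetch_with_backoff(
    fetch: Callable[[], List[str]],
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Call `fetch`, waiting longer after each failure; give up with an empty list.

    `fetch` is a hypothetical callable standing in for a subscriber-list request.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                break  # out of attempts: do not sleep again
            # Full jitter: a random wait in [0, bound] spreads retries out so
            # that many clients do not hit the service at the same instant.
            sleep(rng.uniform(0.0, backoff_delay(attempt, base, cap)))
    # Degrade gracefully instead of retrying forever.
    return []
```

With the defaults, the wait bound doubles from 0.5 s after the first failure up to the 60 s cap, so a struggling service sees fewer requests over time rather than a constant stream.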

On April 27th, Popcount became unhealthy again. A new “Discovery” feature (the Bartender service) had introduced a dependency on Popcount that went unnoticed, increasing its load. The previous fast-fail logic was insufficient. In addition, excessive logging in the Accesspoints, intended for debugging, made them unresponsive because of I/O issues. This compounded the problem while the faulty client retry behavior continued.
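One general way to keep debug logging from saturating I/O during an error storm is to cap how many records pass per time window. The sketch below uses the standard `logging.Filter` hook; the class name `RateLimitFilter`, its parameters, and the injectable `clock` are hypothetical, and nothing here describes how the Accesspoints were actually changed.

```python
import logging
import time
from typing import Callable


class RateLimitFilter(logging.Filter):
    """Let at most `max_records` log records through per `window_seconds`."""

    def __init__(
        self,
        max_records: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        super().__init__()
        self.max_records = max_records
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.count = 0
        self.dropped = 0  # total suppressed records, useful as a metric

    def filter(self, record: logging.LogRecord) -> bool:
        now = self.clock()
        # Start a fresh window once the current one has expired.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.count < self.max_records:
            self.count += 1
            return True
        self.dropped += 1
        return False
```

Attaching it with `handler.addFilter(RateLimitFilter(100, 1.0))` would bound that handler to roughly 100 records per second, however often the failing code path runs.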

Together, these factors degraded the service, and most Accesspoints became unreachable or extremely slow. To restore service, engineers firewalled off the unresponsive Accesspoints, which forced clients into their exponential backoff logic. This allowed the Accesspoints to recover, and service was restored within minutes.

Key lessons included the importance of prioritizing root cause fixes, the dangers of excessive logging, and the need to test extreme conditions thoroughly. Post-incident remediations included fixing the client’s faulty retry behavior, implementing static caching for Discovery service data, optimizing Accesspoint logging, improving syslog flushing, and educating the whole company about the incident.
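The report does not say how the static caching for Discovery data was built. As one possible reading, the sketch below serves the last successfully loaded value when the backing service fails, so the caller never blocks on or retries against an unhealthy dependency. The class `StaleOnErrorCache`, the `load` callable, and the use of `ConnectionError` are assumptions for illustration only.

```python
from typing import Callable, Dict, List


class StaleOnErrorCache:
    """Return fresh data when available, else the last known good value."""

    def __init__(self, load: Callable[[str], List[str]]) -> None:
        # `load` is a hypothetical function that queries the backing service.
        self._load = load
        self._last_good: Dict[str, List[str]] = {}

    def get(self, key: str) -> List[str]:
        try:
            value = self._load(key)
        except ConnectionError:
            # Backend is unhealthy: fall back to cached data, or to an
            # empty list if this key has never loaded successfully.
            return self._last_good.get(key, [])
        self._last_good[key] = value
        return value
```

A production version would also need expiry and a size bound; those are omitted to keep the fallback path easy to see.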

Keywords

spotify, popcount, outage, european users, client bug, exponential backoff, microservice, cascading failure