Postmortem Index

Explore incident reports from various companies

Stack Exchange Network outage of July 20, 2016

Stack Exchange · Stack Exchange Network

2016-07-20

On July 20, 2016, the Stack Exchange Network experienced a 34-minute outage starting at 14:44 UTC. The incident involved a rapid response, with 10 minutes to identify the cause, 14 minutes to develop a fix, and 10 minutes to deploy it, restoring Stack Overflow’s availability.

The direct cause of the outage was a malformed post that triggered high CPU consumption by a regular expression on the web servers. This post was present in the homepage list, causing the expensive regex to be executed with every homepage view. As a result, the homepage became unresponsive.

Since the homepage is used by the load balancer for health checks, its unresponsiveness led the load balancer to remove the affected servers from rotation. This action made the entire Stack Exchange Network unavailable to users.

The root cause was a specific regular expression, ^[\s\u200c]+|[\s\u200c]+$, designed to trim unicode whitespace. A malformed post containing approximately 20,000 consecutive whitespace characters exploited a performance characteristic of the backtracking regex engine. This caused the engine to perform an O(n²) number of character checks, leading to excessive CPU usage.

To resolve the immediate issue, the problematic regular expression was replaced with a substring function. As follow-up actions, Stack Exchange plans to audit other regular expressions, improve post validation workflows, add controls to the load balancer to disable health checks during outages, and create a comprehensive outage checklist to streamline future incident responses.

Keywords

stack exchangeoutageregexcpuload balancerhealth checkbacktrackingstack overflow