Stack Exchange network outage due to StackEgg on March 31, 2015
Stack Exchange · Stack Exchange network
On March 31, 2015, enabling the StackEgg feature for all logged-in users across the Stack Exchange network sharply increased load on the primary load balancers. The surge caused a 6-minute public outage of Stack Overflow and general slowness across the rest of the network.
The root cause was an unexpected increase in concurrent sessions. The initial StackEgg deployment added AJAX requests to every page load. With HTTP Keep-Alive tuned to 15 seconds, each of those connections stayed open after its request completed, so the load balancers had to try to sustain over 51,000 additional concurrent sessions. That exceeded the HAProxy frontend’s limit of 40,000 sessions.
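The link between request rate, keep-alive duration, and concurrent sessions can be approximated with Little's law: open sessions ≈ new connections per second × seconds each connection is held open. The Python sketch below is a back-of-envelope illustration, not the analysis performed during the incident. The rate of 3,400 connections per second is a hypothetical figure chosen so that a 15-second keep-alive yields the roughly 51,000 sessions cited above, and the model ignores request service time and connection reuse.

```python
# Back-of-envelope estimate of concurrent sessions held open by keep-alive.
# Assumption: every new connection stays open for the full keep-alive
# window (Little's law: sessions = arrival_rate * hold_time).

FRONTEND_SESSION_LIMIT = 40_000  # HAProxy frontend limit cited in the report


def estimate_concurrent_sessions(new_conns_per_sec: float, keep_alive_sec: float) -> float:
    """Approximate steady-state open sessions for a given arrival rate."""
    return new_conns_per_sec * keep_alive_sec


def exceeds_limit(sessions: float, limit: int = FRONTEND_SESSION_LIMIT) -> bool:
    """Return True when the estimated sessions do not fit under the limit."""
    return sessions > limit


# Hypothetical rate: 3,400 extra connections per second (not from the report).
extra_sessions = estimate_concurrent_sessions(3_400, 15)
print(extra_sessions, exceeds_limit(extra_sessions))
```

Under this simplified model, the extra sessions alone exceed the frontend limit, before counting the sessions the network was already serving.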
Pingdom reported Stack Overflow offline for 6 minutes, starting at 13:41:15 UTC. Before and during this period, sites across the Stack Exchange network were slow because the load balancers struggled with the increased session count, which left TCP connections queued and could cause timeouts.
Initial remediation had two parts: raising the concurrent session limit on the HAProxy frontend, and reducing StackEgg’s request load. The request load was reduced by embedding the necessary data directly into pages and relying more on WebSockets for game data, which significantly cut the number of HTTP requests.
A subsequent traffic increase on April 1st prompted further action. The HTTP Keep-Alive duration for StackEgg traffic was temporarily lowered from 15 seconds to 5 seconds. This adjustment successfully reduced concurrent sessions from approximately 60,000 to around 19,000, stabilizing the network.
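As a rough sanity check on these figures: if concurrent sessions scale linearly with the keep-alive duration, cutting keep-alive from 15 to 5 seconds should leave about a third of the sessions. The Python sketch below applies that proportional model. Linear scaling is an assumption: it ignores request service time and any change in traffic, and it treats all sessions as subject to the new setting. The function name is illustrative only.

```python
def scale_sessions(observed_sessions: float,
                   old_keep_alive_sec: float,
                   new_keep_alive_sec: float) -> float:
    """Project the session count after a keep-alive change, assuming linear scaling."""
    if old_keep_alive_sec <= 0:
        raise ValueError("old_keep_alive_sec must be positive")
    return observed_sessions * new_keep_alive_sec / old_keep_alive_sec


# Figures from the report: about 60,000 sessions at a 15-second keep-alive.
projected = scale_sessions(60_000, 15, 5)
print(projected)  # 20000.0
```

The projection of 20,000 is close to the roughly 19,000 sessions reported, which suggests that idle keep-alive time accounted for most of each session's lifetime.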