Stack Exchange SQL Server bugcheck outage January 2017

Stack Exchange · sql server

2017-01-24

On January 24, 2017, starting at 17:53 UTC, the Stack Exchange network experienced system degradation, entering a read-only state for approximately 5 minutes. This was followed by a complete site outage that lasted for 12 minutes.

The incident was triggered when the primary SQL Server, NY-SQL02, initiated a bugcheck on its SQL Server process. This initially forced the database into a read-only state.

The outage resulted from two factors together: the SQL Server bugcheck and a critical bug in the application-level failover logic. The system was designed to switch to standby SQL servers in read-only mode during such events, but this bug left the failover mechanism disabled, so the network went down completely instead of degrading gracefully. The underlying cause of the bugcheck was unknown at the time of the report, though logs suggested a possible bad DIMM.
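The intended degradation path can be illustrated with a minimal sketch. This is not Stack Exchange's actual failover code, which the report does not show; the `Server` type, the `choose_mode` function, and the standby name `standby-1` are hypothetical, and only NY-SQL02 comes from the report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Mode(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"
    OFFLINE = "offline"


@dataclass
class Server:
    name: str
    healthy: bool   # server answers queries at all
    writable: bool  # server currently accepts writes


def choose_mode(primary: Server, standbys: List[Server]) -> Tuple[Mode, Optional[Server]]:
    """Pick the mode and the server the application should use (hypothetical logic)."""
    if primary.healthy and primary.writable:
        return Mode.READ_WRITE, primary
    if primary.healthy:
        # Primary still answers reads: stay on it, but refuse writes.
        return Mode.READ_ONLY, primary
    # Primary is gone: degrade to the first healthy standby, read-only.
    for standby in standbys:
        if standby.healthy:
            return Mode.READ_ONLY, standby
    # Nothing can serve reads: this is the full-outage case.
    return Mode.OFFLINE, None


primary = Server("NY-SQL02", healthy=False, writable=False)
standbys = [Server("standby-1", healthy=True, writable=False)]
mode, target = choose_mode(primary, standbys)
print(mode.value, target.name if target else None)
```

In this sketch a failed primary with a healthy standby yields read-only service. During the incident the equivalent fallback was disabled by the bug, so the sites went offline rather than staying read-only.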

Customer impact included the Stack Exchange network being inaccessible for 12 minutes after an initial 5-minute period of read-only access. Approximately 3.5 seconds of data may have been lost due to uncommitted transactions being rolled back.
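The timeline implied by these durations can be worked out as follows. This is a rough sketch that treats the reported "approximately 5 minutes" as exactly 5 minutes, so the derived times are estimates rather than figures stated in the report.

```python
from datetime import datetime, timedelta, timezone

# Reported values: start time and the two phase durations.
start = datetime(2017, 1, 24, 17, 53, tzinfo=timezone.utc)
read_only = timedelta(minutes=5)   # approximate, per the report
outage = timedelta(minutes=12)

# Derived values (estimates, not stated in the report).
outage_start = start + read_only
service_back = outage_start + outage
total_impact = read_only + outage

print("outage began around", outage_start.strftime("%H:%M UTC"))
print("service returned around", service_back.strftime("%H:%M UTC"))
print("total impact in minutes:", int(total_impact.total_seconds() // 60))
```

Under that assumption, the full outage began around 17:58 UTC and service returned around 18:10 UTC, for roughly 17 minutes of total impact.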

Remediation involved restarting the SQL service on NY-SQL02, which brought the network back online in read-only mode. After a sanity check, the sites were restored to read-write operation. NY-SQL02 has since been taken out of production for thorough testing, including memory diagnostics. In addition, the SQL cluster was updated to SQL Server 2016 SP1 CU1.

Keywords

sql server, bugcheck, read-only, outage, failover, ny-sql02, stack exchange, database