Postmortem Index

Explore incident reports from various companies

EVE Online long downtime on July 15th, 2015

CCP Games · EVE Online Tranquillity cluster

On July 15th, 2015, the EVE Online Tranquillity (TQ) cluster experienced an extended downtime following a patch deployment for Aegis Sovereignty. The deployment tool initially failed, and subsequent startup attempts at 11:42 UTC revealed that multiple server nodes were getting stuck in the “-1” state, preventing the cluster from becoming ready. This led to repeated restarts and investigations by the operations and development teams.

Initial troubleshooting involved ruling out environmental issues, hardware problems, and code defects by testing rollbacks and disabling network interfaces. A significant breakthrough occurred when clearing sovereignty campaign data from the database allowed the cluster to start successfully at 16:33 UTC. This indicated that the new sovereignty data or its processing during startup was somehow involved in the issue.

Further live console experiments on the running cluster, after temporarily disabling campaign loading, revealed the true culprit. When campaign loading functions were executed with their logging operations enabled, the cluster became unresponsive, and nodes began to die. However, when the same functions were run with logging disabled, they completed instantly and the cluster remained healthy. This pointed to a specific logging channel used by the campaign system on TQ as the cause of the performance degradation.

The root cause was identified as a problematic logging channel within the sovereignty campaign system. When a large volume of log messages were generated during the startup sequence, this channel severely degraded node performance, causing nodes to become unresponsive and drop out of the cluster. This issue was specific to the Tranquillity server and was not reproducible on test environments.

To resolve the incident, a hotfix (Hotfix #5) was deployed at 22:07 UTC, which entirely removed all logging within the campaign system. This change allowed the TQ cluster to start successfully by 22:22 UTC. After VIP checks to ensure data integrity, EVE Online was reopened to all players at 22:41 UTC, resulting in a total downtime of 699 minutes.

Keywords

eve onlinedowntimeclusterloggingsovereigntypatchnodesstartup