Postmortem Index

Roblox 73-hour outage due to Consul and BoltDB issues (October 2021)

Roblox

2021-10-28 – 2021-10-31 · automation, config-change, hardware, time

Roblox experienced a 73-hour outage from October 28th to October 31st, 2021, impacting its 50 million daily players. The incident began with degraded Vault performance and high CPU load on a Consul server, eventually leading to a complete system outage as critical services like service discovery failed.

The core issue was an unhealthy Consul cluster. Roblox relies on Consul for service discovery, health checks, and session locking, so when Consul became unhealthy, dependent services such as Nomad and Vault stopped functioning, which prevented container scheduling and secret retrieval. Early remediation attempts included replacing hardware and resetting Consul’s state, but neither worked because the underlying causes were still in place.
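The cascade described above can be sketched as a walk over a dependency graph. This is a minimal illustration, not Roblox's tooling: only the Nomad-to-Consul and Vault-to-Consul edges come from the summary, while `app-containers`, `app-secrets`, and the `impacted_by` helper are hypothetical names introduced for the example.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it directly depends on.
# Only the nomad -> consul and vault -> consul edges come from the summary;
# "app-containers" and "app-secrets" are illustrative placeholders.
DEPENDS_ON = {
    "consul": [],
    "nomad": ["consul"],
    "vault": ["consul"],
    "app-containers": ["nomad"],
    "app-secrets": ["vault"],
}


def impacted_by(failed, depends_on):
    """Return every service that transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first search outward from the failed service.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


print(sorted(impacted_by("consul", DEPENDS_ON)))
```

In this toy graph a Consul failure reaches every other service, whereas a Vault failure would reach only the secret-dependent workload. That asymmetry is why a shared component like Consul is a single point of failure.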

Two primary root causes were identified. First, a relatively new streaming feature in Consul had been enabled, and under Roblox’s unusually high read and write load it caused excessive contention and poor performance, which servers with higher core counts made worse. Disabling this feature significantly improved Consul’s health. Second, Roblox’s specific load conditions triggered a pathological performance issue in BoltDB, which Consul uses to store its write-ahead logs. The issue lay in how BoltDB maintains its freelist.
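The summary does not spell out the freelist mechanism, so the following is only a toy cost model and not BoltDB's code. It assumes, for illustration, that every commit writes its payload plus the entire list of free page IDs, and that each ID takes 8 bytes. The function names and the 16 KiB payload are likewise assumptions.

```python
PAGE_ID_BYTES = 8  # assumed size of one serialized page ID in this toy model


def bytes_written_per_commit(payload_bytes, free_page_count):
    """Toy cost model: each commit writes its payload plus the whole freelist."""
    return payload_bytes + free_page_count * PAGE_ID_BYTES


def write_amplification(payload_bytes, free_page_count):
    """Ratio of total bytes written to useful payload bytes."""
    return bytes_written_per_commit(payload_bytes, free_page_count) / payload_bytes


# Compare a small assumed payload against freelists of increasing size.
for free_pages in (0, 1_000, 1_000_000):
    print(free_pages, write_amplification(16 * 1024, free_pages))
```

Under these assumptions the cost of a small write is dominated by the freelist once the number of free pages grows large. This is the general shape of a pathological slowdown that depends on accumulated state rather than on request size.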

The outage resulted in 73 hours of downtime for Roblox players, though no user data was lost or accessed by unauthorized parties. Remediation involved disabling the problematic Consul streaming feature, implementing workarounds for slow Consul leaders, and redeploying the caching system. Roblox also committed to accelerating engineering efforts to improve monitoring, remove circular dependencies in its observability stack, and move towards multiple availability zones and data centers to prevent similar future incidents.

Keywords

roblox, outage, consul, boltdb, streaming, service discovery, hashicorp, 73 hours