Postmortem Index

Explore incident reports from various companies

Platform.sh EU region outage of August 2016

Platform.sh · Platform.sh EU region

On August 18, 2016, Platform.sh experienced a 4-hour downtime in its EU region. A scheduled maintenance window began at 19:00 UTC to upgrade the orchestration software, initially affecting only git servers and the UI. However, at 19:30 UTC, websites also went down, and all services were restored by 23:30 UTC.

The incident occurred due to a series of cascading issues. During the orchestration software upgrade, gateways were prematurely restarted before the orchestration software had fully started. This caused the gateways to lose their application list and be unable to fetch a new one, leading to website downtime.

The orchestration software itself failed to start correctly due to connection drops to ZooKeeper. Investigation revealed a bug in the Kazoo library, which dropped connections when its internal pipe buffer became full (64k queries). After a fix for this was deployed, a subsequent issue was discovered: one ZooKeeper node exceeded its maximum allowed size, preventing full recovery.

Remediation involved several steps. Automated checks were implemented to ensure the orchestration software is fully started before proceeding with further maintenance. A semaphore lock was added to the Kazoo client to prevent pipe buffer overflow, and a pull request is being prepared for the upstream Kazoo project. Finally, the ZooKeeper max buffer size was increased, and monitoring for ZooKeeper node sizes was implemented to prevent future occurrences.

Keywords

platform.sheu regionzookeeperkazooorchestration softwaredowntimeaugust 2016maintenance