Elastic Cloud AWS us-east-1 outage of February 2019

Elastic · Elastic Cloud

On February 4, 2019, at approximately 02:50 UTC, Elastic Cloud customers with deployments in the AWS us-east-1 region experienced degraded access to their clusters. The incident was triggered during a routine patching procedure for the coordination layer (ZooKeeper) in that region. Although documented procedures were followed, the patching led to unanticipated instability and an outage of the coordination services.

The primary root cause was identified as a failure in the coordination layer. The metrics available during host replacement were insufficient and did not accurately reflect the health of individual hosts or of the coordination layer as a whole. This resulted in instability and a loss of quorum within the ZooKeeper ensemble. A contributing factor was a previously unknown runc bug that caused CPU softlocks and system unresponsiveness on ZooKeeper ensemble members.
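For readers unfamiliar with the term, the loss of quorum can be illustrated with a small calculation. ZooKeeper uses majority quorums by default, so an ensemble stays available only while more than half of its voting members are healthy. The following is a minimal sketch of that rule; the function names and the ensemble size of 5 are illustrative assumptions and are not taken from the incident report.

```python
def quorum_size(voting_members: int) -> int:
    """Smallest number of voting members that forms a strict majority."""
    if voting_members < 1:
        raise ValueError("an ensemble needs at least one voting member")
    return voting_members // 2 + 1


def has_quorum(voting_members: int, healthy_members: int) -> bool:
    """True if enough voting members are healthy to form a majority."""
    return healthy_members >= quorum_size(voting_members)


# Hypothetical 5-member ensemble (the real ensemble size is not stated).
ENSEMBLE = 5
for healthy in range(ENSEMBLE, -1, -1):
    state = "quorum" if has_quorum(ENSEMBLE, healthy) else "NO quorum"
    print(f"{healthy}/{ENSEMBLE} healthy: {state}")
```

Under this rule, replacing a host while other members are unresponsive (for example, because of a softlock) can push the healthy count below the majority, which is why per-host health metrics matter during patching.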

Customer impact included partial or complete unavailability for Elasticsearch Service deployments in AWS us-east-1 between 02:50 and 09:28 UTC. Kibana access was disrupted for most customers from 02:50 to 09:28 UTC, with some experiencing degraded access until 18:44 UTC. The Elastic Cloud User Console also saw increased timeouts and was in a degraded state from 02:50 to 07:17 UTC.

Remediation involved re-establishing quorum within the ZooKeeper ensemble by reducing client load, and stabilizing the ZooKeeper observer layer by increasing the initLimit setting. The extended Kibana issues were resolved by restarting internal proxying containers and Kibana instances, and by applying sysctl limits to prevent recurrence. These steps also addressed the connection leaks and HTTP request amplification bugs identified in Kibana.
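To show why raising initLimit can help, recall that ZooKeeper expresses initLimit in ticks: the time a follower or observer is allowed to connect and sync with the leader is initLimit multiplied by tickTime. The sketch below computes that window. The values used are illustrative assumptions only; the report summary does not state the settings Elastic used before or after the change.

```python
def init_sync_timeout_ms(tick_time_ms: int, init_limit: int) -> int:
    """Time allowed for the initial connect-and-sync phase, in milliseconds.

    ZooKeeper measures initLimit in ticks, so the wall-clock window is
    tickTime * initLimit.
    """
    if tick_time_ms <= 0 or init_limit <= 0:
        raise ValueError("tickTime and initLimit must be positive")
    return tick_time_ms * init_limit


# Illustrative values only, not the values from the incident.
TICK_TIME_MS = 2000
for init_limit in (10, 30):
    window_s = init_sync_timeout_ms(TICK_TIME_MS, init_limit) / 1000
    print(f"initLimit={init_limit}: sync window of {window_s:.0f} s")
```

A larger window gives members more time to load and sync a large dataset before being dropped, which fits with the later action item of reducing the ZooKeeper dataset size.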

Elastic has since implemented several action items, including reducing ZooKeeper dataset size, optimizing proxy health-checks, and improving ZooKeeper visibility. Ongoing efforts include a ground-up rewrite of the proxy layer, improving Kibana resiliency and addressing connection leaks, and formalizing maintenance procedures to prevent similar incidents.

Keywords

elastic cloud, aws, us-east-1, zookeeper, elasticsearch, kibana, runc, patching