Honeycomb operational burden and scaling issues in September and October
During September and early October, Honeycomb experienced a period of significant operational burden: more than 20 internal issues and five publicly declared incidents. The surge was driven primarily by accelerated growth; data ingestion increased by 40% over a few weeks, pushing multiple components to their scaling limits at roughly the same time.
The incidents spanned a range of internal failures, including a stuck Kafka auto-balancer, EXT4 filesystem corruption on retriever instances, a missing Lambda deploy artifact, an undersized RDS instance, and an under-provisioned dogfood ingestion pipeline. A large customer query spike on September 16 overloaded the dogfood environment, delaying ingestion and crashing Kafka metrics reporters in a way that mimicked a production outage. Beagle, the service that processes SLO data, also fell behind, at times surfacing as a public five-minute data delay for users and a declared SLO processing delay outage.
The root causes were multifaceted but all stemmed from the rapid growth. Key contributors included previously unknown AWS network limits affecting Kafka, limitations in the stability of automation, and unexpected interactions between the production and dogfood environments. Specific technical problems included the stuck Kafka auto-balancer, a Linux EXT4 bug, the undersized RDS instance, and a critical bottleneck in the Sarama library's Kafka consumer group implementation, which funneled all fetches through a single connection per partition leader, capping throughput and worsening the Beagle processing delays.
Remediation included manually rebalancing Kafka, scaling up RDS, vertically scaling Kafka brokers for more network capacity, and tuning Sarama library settings. Retriever instances were also scaled horizontally, though an initial attempt was hampered by an incorrect runbook entry. Honeycomb's takeaways were that scaling each component requires a deep understanding of its distinct scaling pattern, and that a balanced operational tempo is crucial to avoid both constant firefighting and losing touch with the system's limits.
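The account doesn't specify which Sarama settings were adjusted. As an illustration only, these are the knobs Sarama's `Config` exposes that most directly govern consumer fetch throughput; the values are placeholders, not Honeycomb's:

```go
package main

import "github.com/Shopify/sarama" // third-party Kafka client used by Beagle

// consumerConfig sketches the fetch-related settings one would tune when a
// single connection per leader is the bottleneck: larger fetches amortize
// each round trip on that one connection. All values are illustrative.
func consumerConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	cfg.Consumer.Fetch.Default = 4 << 20 // bytes requested per fetch (placeholder)
	cfg.Consumer.Fetch.Max = 16 << 20    // upper bound per fetch (placeholder)
	cfg.ChannelBufferSize = 1024         // buffered messages per partition channel
	return cfg
}

func main() {
	_ = consumerConfig()
}
```

Raising the fetch sizes trades memory for fewer, larger round trips, which is the usual lever when the connection count itself can't be increased.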