Postmortem Index

Explore incident reports from various companies

CircleCI jobs not starting due to Kubernetes networking failure

CircleCI · jobs

2023-03-14 – 2023-03-15 cloud

On March 14, 2023, at approximately 18:00 UTC, CircleCI experienced significant delays in starting customer jobs, eventually leading to jobs not running. This incident, which lasted until 01:27 UTC on March 15th, stemmed from an ongoing Kubernetes upgrade. The remediation efforts for this primary incident also triggered two subsequent incidents related to RabbitMQ, causing further delays in pipelines and GitHub checks.

The core issue manifested as widespread networking failures and service-to-service communication problems across CircleCI’s main production Kubernetes cluster. Engineers observed mounting communication errors and job-processing delays, causing job queues to back up. Initial investigations considered kube-proxy, CoreDNS, and the node-local DNS cache as potential culprits; kube-proxy was ultimately identified as the primary source of the problem.

The root cause was a version mismatch between kubelet and kube-proxy components during a staged Kubernetes upgrade. Changes in kube-proxy’s ruleset format between versions meant that as pods churned and Endpoints objects changed, kube-proxy’s Proxier.syncProxyRules() method repeatedly failed to execute iptables-restore. This led to corrupted iptables rulesets on nodes, silently breaking internal service-to-service routing.
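The failure mode described above can be illustrated with a small sketch. This is hypothetical code, not CircleCI's or Kubernetes' actual implementation: it only models why an atomic, repeatedly failing rule restore leaves a node routing on stale rules while pods churn underneath it.

```python
# Hypothetical sketch (NOT kube-proxy source): models the interaction between
# an atomic batch restore and a periodic sync loop that swallows failures.

def iptables_restore(new_rules):
    """Mimic iptables-restore: reject the whole batch if any line is in an
    unrecognized format, so one bad rule fails everything atomically."""
    for line in new_rules:
        if not line.startswith(("-A", "-D")):
            raise ValueError(f"bad rule format: {line!r}")
    return list(new_rules)

def sync_proxy_rules(current_rules, desired_rules):
    """Mimic kube-proxy's periodic sync: if the restore fails, the node
    silently keeps whatever rules it already had."""
    try:
        return iptables_restore(desired_rules), True
    except ValueError:
        return current_rules, False  # stale rules stay in place, unnoticed

# A newer kube-proxy emits one rule in a format the node's tooling rejects,
# so NONE of the updated Service/Endpoint rules land on the node:
stale = ["-A KUBE-SVC-OLD -j KUBE-SEP-1"]
desired = ["-A KUBE-SVC-NEW -j KUBE-SEP-2", "--unknown-flag foo"]
rules, ok = sync_proxy_rules(stale, desired)
print(ok, rules)  # False ['-A KUBE-SVC-OLD -j KUBE-SEP-1']
```

Because every sync fails the same way, the gap between the node's actual rules and the cluster's real Endpoints widens with each pod restart, which is why the breakage appeared gradually rather than all at once.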

The incident severely impacted all customer pipelines: jobs either failed to start or experienced significant delays, and the CircleCI UI and API were also affected. The primary remediation was a full node-by-node restart of the Kubernetes cluster, which restored networking. Subsequent issues with RabbitMQ required further restarts and monitoring before all related degradations were fully resolved.

CircleCI identified that the incident was not due to a single trigger but rather an interaction of multiple events during a risky Kubernetes upgrade. Future prevention efforts include enhancing the Kubernetes upgrade process, improving risk mitigation strategies, and adding specific observability metrics and alerts for kube-proxy sync errors. They are also evaluating strategies to improve RabbitMQ infrastructure resilience.
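On the observability point, one plausible shape for such an alert is sketched below. This assumes a Prometheus deployment scraping kube-proxy's metrics endpoint; the `kubeproxy_sync_proxy_rules_iptables_restore_failures_total` counter exists in recent kube-proxy releases, but metric names vary by version, so verify against the cluster being monitored.

```yaml
# Hypothetical Prometheus alerting rule, not CircleCI's actual configuration.
groups:
  - name: kube-proxy
    rules:
      - alert: KubeProxyIptablesRestoreFailing
        # Fires when any node's kube-proxy has failed an iptables-restore
        # recently; these failures are otherwise silent and corrupt routing.
        expr: increase(kubeproxy_sync_proxy_rules_iptables_restore_failures_total[15m]) > 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "kube-proxy on {{ $labels.instance }} is failing to apply iptables rules"
```

An alert like this would have surfaced the failing `Proxier.syncProxyRules()` calls long before the corrupted rulesets degraded service-to-service traffic cluster-wide.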

Keywords

kubernetes · kube-proxy · iptables · networking · jobs · pipelines · upgrade · rabbitmq