{"UUID":"874ed11e-fa23-4e90-b306-febd62f27a1d","URL":"https://status.circleci.com/incidents/dcqb3fykhgvg","ArchiveURL":"","Title":"CircleCI jobs not starting due to Kubernetes networking failure","StartTime":"2023-03-14T18:00:00Z","EndTime":"2023-03-15T01:27:00Z","Categories":["cloud"],"Keywords":["kubernetes","kube-proxy","iptables","networking","jobs","pipelines","upgrade","rabbitmq"],"Company":"CircleCI","Product":"jobs","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T19:52:00.108526Z","Summary":"A staged Kubernetes upgrade of CircleCI's main production cluster left `kube-proxy` and `kubelet` at incompatible versions. The change between versions altered the format of `kube-proxy`'s `iptables` rulesets, so as pods churned and `Endpoints` objects changed, `kube-proxy`'s `Proxier.syncProxyRules()` (an `iptables-save` / `iptables-restore` read-modify-write) repeatedly hit \"Sync failed\" errors, leaving the per-node iptables in a corrupted state and silently breaking service-to-service routing across the cluster. Recovery required a full node-by-node cluster restart and triggered two follow-on incidents.","Description":"On March 14, 2023, at approximately 18:00 UTC, CircleCI experienced significant delays in starting customer jobs, eventually leading to jobs not running. This incident, which lasted until 01:27 UTC on March 15th, stemmed from an ongoing Kubernetes upgrade. The remediation efforts for this primary incident also triggered two subsequent incidents related to RabbitMQ, causing further delays in pipelines and GitHub checks.\n\nThe core issue manifested as widespread networking failures and service-to-service communication problems across CircleCI's main production Kubernetes cluster. Engineers observed increasing communication errors and job processing delays, leading to job queues backing up. Initial investigations considered `kube-proxy`, `core-dns`, or `node-local-dns-cache` as potential culprits, with `kube-proxy` eventually identified as the primary source of the problem.\n\nThe root cause was a version mismatch between `kubelet` and `kube-proxy` components during a staged Kubernetes upgrade. Changes in `kube-proxy`'s ruleset format between versions meant that as pods churned and `Endpoints` objects changed, `kube-proxy`'s `Proxier.syncProxyRules()` method repeatedly failed to execute `iptables-restore`. This led to corrupted `iptables` rulesets on nodes, silently breaking internal service-to-service routing.\n\nThe incident severely impacted all customer pipelines, causing jobs to not start or experience significant delays. There were also impacts on the CircleCI UI and API access. The primary remediation involved a full node-by-node restart of the Kubernetes cluster, which restored networking. Subsequent issues with RabbitMQ required further restarts and monitoring to fully resolve all related degradations.\n\nCircleCI identified that the incident was not due to a single trigger but rather an interaction of multiple events during a risky Kubernetes upgrade. Future prevention efforts include enhancing the Kubernetes upgrade process, improving risk mitigation strategies, and adding specific observability metrics and alerts for `kube-proxy` sync errors. They are also evaluating strategies to improve RabbitMQ infrastructure resilience."}