Postmortem Index

Google Compute Engine, Cloud VPN, and Network Load Balancer connectivity issues

Google · Google Compute Engine

2017-01-30 · cloud · config-change

The incident began on Monday, January 30, 2017, at 10:54 US/Pacific and was fully mitigated by 12:50 US/Pacific. Newly created Google Compute Engine instances, Cloud VPNs, and network load balancers experienced connectivity issues for a reported 2 hours and 8 minutes.

Any GCE instances, Cloud VPN tunnels, or GCE network load balancers created or live-migrated between 10:36 and 12:42 US/Pacific were unreachable via their public IP addresses. Affected instances also could not send outbound traffic, and load balancing health checks against them failed. Previously created resources that were not live-migrated were unaffected.
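The impact window above can be turned into a simple check for whether a given resource event falls inside it. This is a minimal sketch, not a Google tool: the function name is hypothetical, and it assumes the stated times are Pacific local time at UTC-8, since standard time is in effect in late January.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the report's times are Pacific local time at UTC-8.
PACIFIC_STANDARD = timezone(timedelta(hours=-8), name="PST")

# Impact window taken from the summary: 10:36 to 12:42 on 2017-01-30.
WINDOW_START = datetime(2017, 1, 30, 10, 36, tzinfo=PACIFIC_STANDARD)
WINDOW_END = datetime(2017, 1, 30, 12, 42, tzinfo=PACIFIC_STANDARD)


def possibly_affected(event_time: datetime) -> bool:
    """Return True if a creation or live-migration event falls in the window.

    event_time must be timezone-aware; naive datetimes are rejected so that
    a local-time value is never silently compared against the window.
    """
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    return WINDOW_START <= event_time <= WINDOW_END
```

Because the comparison uses timezone-aware values, timestamps recorded in UTC can be passed in directly without manual conversion.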

The root cause was identified in the shared layer 2 load balancers responsible for inbound networking for GCE instances, load balancers, and VPN tunnels. A large set of updates applied to a rarely used load balancing configuration exposed an inefficient code path. This caused a canary deployment to time out, subsequently queuing all public addressing changes behind these stalled updates.
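The failure mode described above is a form of head-of-line blocking: one stalled update at the front of a serialized queue holds back every change behind it. The following toy model illustrates the idea. It is a sketch under stated assumptions, not Google's implementation: `Change`, `drain_serialized`, the change names, and the simulated costs are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    canary_cost: float  # simulated seconds the canary needs to validate this change


def drain_serialized(changes, canary_timeout):
    """Apply changes strictly in order, stopping at the first canary timeout.

    Returns (applied, blocked). A change whose simulated canary cost exceeds
    the timeout stays at the head of the queue, so it and everything queued
    behind it are reported as blocked.
    """
    queue = deque(changes)
    applied = []
    while queue:
        head = queue[0]
        if head.canary_cost > canary_timeout:
            # Canary timed out: the head is not removed, so nothing behind it runs.
            break
        applied.append(queue.popleft().name)
    blocked = [change.name for change in queue]
    return applied, blocked


# Hypothetical workload: a slow bulk update sits ahead of a new public IP.
changes = [
    Change("lb-a", 1.0),
    Change("rare-config-bulk", 120.0),
    Change("new-vm-ip", 1.0),
]
applied, blocked = drain_serialized(changes, canary_timeout=30.0)
# applied == ["lb-a"]; blocked == ["rare-config-bulk", "new-vm-ip"]
```

In this model, raising `canary_timeout` above the slow change's cost lets the whole queue drain, which mirrors why a longer canary timeout avoids a complete stoppage even though it does not fix the slow code path itself.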

To resolve the issue, Google engineers restarted the jobs responsible for programming changes to the network load balancers. This allowed the problematic changes to be processed in a batch, bypassing the inefficient code path, and normal traffic resumed. The fix was applied zone by zone between 11:36 and 12:42.

For short-term prevention, canary timeouts are being increased to prevent complete stoppage of network changes. Long-term, the inefficient code path is being improved, new tests are being written, and work on replacing global address configuration propagation with decentralized routing is being accelerated. Additionally, new metrics and alerting are being developed for earlier identification and faster resolution of similar issues.
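The summary does not say which metrics or alerts were built. As one illustration of the kind of signal that could surface a stalled change queue early, the sketch below alerts on the age of the oldest pending change. The function names and the five-minute threshold are assumptions for the example only.

```python
from datetime import datetime, timedelta


def oldest_pending_age(pending_enqueue_times, now):
    """Return how long the oldest pending change has been waiting.

    pending_enqueue_times is an iterable of datetimes at which each
    still-unapplied change was enqueued. An empty queue has age zero.
    """
    times = list(pending_enqueue_times)
    if not times:
        return timedelta(0)
    return now - min(times)


def should_alert(pending_enqueue_times, now, threshold=timedelta(minutes=5)):
    """Return True when the oldest pending change has waited past the threshold."""
    return oldest_pending_age(pending_enqueue_times, now) > threshold
```

A metric based on the oldest item, rather than on queue length, still fires when only a few changes are stuck, which is the situation a stalled head-of-queue update produces.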

Keywords

google compute engine · cloud vpn · network load balancer · connectivity · public ip · canary · load balancing · inefficient code path