Postmortem Index

BigQuery Storage WriteAPI elevated error rates in US Multi-Region

Google · BigQuery Storage WriteAPI

2022-10-13 – 2022-10-14 · automation · cascading-failure · cloud · time

On October 13, 2022, at 23:30 US/Pacific, Google BigQuery’s Storage WriteAPI experienced elevated error rates in the US Multi-Region, affecting customers for approximately 5 hours until 04:30 US/Pacific on October 14. The issue manifested as increased connection failures for users making calls to the Write API and a slight increase in failures for the InsertAll API.

The incident was triggered by an unexpected increase in incoming and logging traffic to the BigQuery Storage Write API. This traffic surge, combined with a bug in Google’s internal streaming RPC library, caused a deadlock and overloaded the Write API Streaming frontend. Automated systems attempted to scale up instances, but a separate bug in the Write API prevented the existing instances from recovering. Because traffic was still load-balanced across those stuck instances, error rates remained elevated.
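
The effect of scaling up while stuck instances remain in rotation can be illustrated with a little arithmetic. The sketch below is only an illustration under stated assumptions: the postmortem does not describe the balancing policy or any fleet sizes, so the even traffic split, the function name `error_fraction`, and the instance counts are all hypothetical.

```python
def error_fraction(stuck: int, healthy: int) -> float:
    """Fraction of requests that fail, ASSUMING traffic is spread evenly
    over all instances and every request to a stuck instance fails.
    Both assumptions are illustrative, not taken from the postmortem."""
    total = stuck + healthy
    if total == 0:
        raise ValueError("need at least one instance")
    return stuck / total


if __name__ == "__main__":
    # Hypothetical fleet: 10 stuck instances, autoscaler adds healthy ones.
    for added in (0, 10, 30, 90):
        print(f"added={added:3d}  error fraction={error_fraction(10, added):.2f}")
```

Under these assumptions, adding healthy instances dilutes the error rate but never drives it to zero; only removing or restarting the stuck instances does.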

Customers utilizing the Write API in the US Multi-Region observed increased levels of connection failures. Additionally, customers using the InsertAll API in the US region may have experienced a slight increase in failures due to the subsequent traffic increase.

Google engineers were alerted at 23:47 US/Pacific and began investigating. Initial automated and manual scaling attempts did not resolve the issue. Engineers then manually restarted the stuck instances between 03:20 and 04:30 US/Pacific on October 14, and the issue was fully mitigated by 04:30 US/Pacific.
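
The intervals implied by these timestamps can be worked out with a short script. This is a minimal sketch: the variable names are made up for illustration, and the times are treated as naive local times, which assumes no daylight-saving change in US/Pacific during the incident window.

```python
from datetime import datetime

# Timestamps from the summary, all US/Pacific (naive datetimes by assumption).
impact_start = datetime(2022, 10, 13, 23, 30)
engineers_alerted = datetime(2022, 10, 13, 23, 47)
restart_window_start = datetime(2022, 10, 14, 3, 20)
mitigated = datetime(2022, 10, 14, 4, 30)


def minutes_between(earlier: datetime, later: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((later - earlier).total_seconds() // 60)


time_to_alert = minutes_between(impact_start, engineers_alerted)
alert_to_restarts = minutes_between(engineers_alerted, restart_window_start)
restart_window = minutes_between(restart_window_start, mitigated)
total_impact = minutes_between(impact_start, mitigated)

if __name__ == "__main__":
    print(f"time to alert:      {time_to_alert} min")
    print(f"alert to restarts:  {alert_to_restarts} min")
    print(f"restart window:     {restart_window} min")
    print(f"total impact:       {total_impact} min")
```

By this reckoning, most of the five-hour impact elapsed before the restarts began.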

To prevent recurrence, Google has fixed both the bug in its internal RPC library and the bug in the Write API that caused the cascading deadlock. Google is also deploying additional automation in the Write API backend to balance load automatically based on concurrent connections, along with improved error handling.
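
Balancing on concurrent connections is commonly implemented as a "least connections" policy. The sketch below shows that general technique only; it is not Google's implementation, and the names `Backend` and `LeastConnectionsBalancer` and the tie-breaking rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    active: int = 0  # number of currently open connections


class LeastConnectionsBalancer:
    """Hypothetical balancer: each new connection goes to the backend
    with the fewest open connections (ties broken by name)."""

    def __init__(self, names):
        self.backends = {n: Backend(n) for n in names}

    def acquire(self) -> str:
        # Pick the least-loaded backend and count the new connection.
        chosen = min(self.backends.values(), key=lambda b: (b.active, b.name))
        chosen.active += 1
        return chosen.name

    def release(self, name: str) -> None:
        # Called when a connection closes.
        backend = self.backends[name]
        if backend.active == 0:
            raise ValueError(f"no open connections on {name}")
        backend.active -= 1


if __name__ == "__main__":
    lb = LeastConnectionsBalancer(["a", "b"])
    print([lb.acquire() for _ in range(3)])  # "a" now holds two connections
    lb.release("b")                          # "b" drains; "a" holds on to its own
    print([lb.acquire() for _ in range(2)])  # new connections favor "b"
```

With long-lived streaming connections, a policy like this steers new connections away from a backend whose existing connections are not draining, which a plain round-robin split would not do.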

Keywords

bigquery, storage writeapi, streaming api, us multi-region, rpc, deadlock, connection failures