Postmortem Index


Cloudflare Control Plane and Analytics Outage due to Flexential Power Failure

Cloudflare · control plane, analytics services

2023-11-02 – 2023-11-04 · cascading-failure, hardware

On November 2, 2023, at 11:43 UTC, Cloudflare’s control plane and analytics services went down. Most control plane services were restored at a disaster recovery facility by 17:57 UTC the same day, but the remaining services, including those dependent on the affected data center, were not fully restored until November 4, 2023, at 04:25 UTC.

The incident stemmed from a power failure at Flexential’s PDX-DC04 data center in Oregon, which houses Cloudflare’s largest analytics cluster and a significant portion of its high-availability cluster. An unplanned maintenance event by Portland General Electric (PGE) affected one power feed, leading Flexential to activate generators. A subsequent ground fault on a PGE transformer then shut down both utility feeds and the generators, causing a complete power loss. Operational problems at Flexential, including poor communication, a failed access control system, insufficient overnight staffing, and faulty circuit breakers, severely hampered efforts to restore power.

Cloudflare’s internal systems contributed to the impact: some critical services, particularly Kafka and ClickHouse for log processing and analytics, had unaddressed dependencies on PDX-DC04 and were not fully integrated into the high-availability cluster. Newer products also lacked robust disaster recovery procedures. Cloudflare’s global network and security services continued to operate, but customers could not make configuration changes via the dashboard or APIs, and analytics data and raw logs had significant gaps or were lost.

In response, Cloudflare has initiated a “Code Orange” program to prioritize control plane reliability. Key remediation steps include removing dependencies on core data centers for control plane configuration, ensuring the control plane keeps functioning even if core data centers are offline, mandating high availability for all generally available products, rigorously testing disaster recovery plans and system blast radii, comprehensively auditing data centers, and developing a logging and analytics disaster recovery plan to prevent data loss.

Keywords

cloudflare, control plane, analytics, outage, flexential, power, data center, oregon