Cloudflare global outage due to router configuration error
Cloudflare · Cloudflare edge network
On March 3, 2013, at 09:47 UTC, Cloudflare experienced a global outage affecting all its services, including DNS and web proxy. Customers encountered DNS errors and “No Route to Host” messages, rendering Cloudflare-protected sites inaccessible. Services were fully restored by 10:49 UTC.
The incident was caused by a system-wide failure of Cloudflare’s edge routers across its 23 data centers. The immediate trigger was a router rule, generated by an internal attack profiler, intended to mitigate a DDoS attack targeting a customer’s DNS servers.
The problematic rule specified an impossible packet length range (99,971 to 99,985 bytes), well beyond the 65,535-byte maximum that an IPv4 packet can ever reach. When this rule was applied via Juniper's Flowspec protocol, the edge routers consumed all available RAM and crashed. While some routers automatically rebooted, many became unresponsive, requiring manual intervention.
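A rule like this could have been caught before deployment with a simple bounds check: since IPv4's Total Length field is 16 bits, no real packet can exceed 65,535 bytes, so any rule matching on a larger length can never fire and indicates a bug in the generator. The following is a hypothetical sketch of such a validator (the function name and structure are illustrative, not Cloudflare's actual tooling):

```python
# Hypothetical sanity check for auto-generated filter rules. The IPv4
# Total Length field is 16 bits, so 65,535 bytes is a hard upper bound
# on packet size; a range above it (like 99,971-99,985) is unmatchable.
IPV4_MAX_PACKET_LEN = 65_535

def validate_packet_length_rule(low: int, high: int) -> None:
    """Reject packet-length ranges that no real packet could satisfy."""
    if low > high:
        raise ValueError(f"empty range: {low} > {high}")
    if low < 0 or high > IPV4_MAX_PACKET_LEN:
        raise ValueError(
            f"range {low}-{high} is outside valid packet sizes "
            f"(0-{IPV4_MAX_PACKET_LEN})"
        )

validate_packet_length_rule(512, 1500)  # plausible DDoS signature: passes
try:
    validate_packet_length_rule(99_971, 99_985)  # the faulty rule's range
except ValueError as e:
    print(f"rejected: {e}")
```

Such a check rejects the rule at generation time, before it ever reaches a router via Flowspec.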
Cloudflare’s operations and network teams detected the incident immediately. They identified the crashing routers, removed the faulty rule, and coordinated physical reboots of unresponsive routers across their global data centers, restoring service within approximately 30 minutes of the response effort beginning.
Cloudflare is investigating the issue with Juniper to understand the underlying bug and plans more extensive testing of Flowspec filters. They are also evaluating ways to apply new rules to individual data centers first, rather than deploying them network-wide, to contain similar failures. Service credits will be issued to affected customers under SLAs.
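The staged-deployment remediation can be sketched as a canary rollout: apply a new rule to one data center, verify health, and only then proceed, rolling back everywhere if a check fails. This is a minimal illustration under assumed `apply` and `healthy` interfaces; none of these names reflect Cloudflare's actual systems:

```python
# Hypothetical canary-style rollout for filter rules. `apply(dc, rule)`
# installs a rule in one data center (None removes it); `healthy(dc)`
# reports whether that data center's routers are still responsive.
import time

def deploy_rule_staged(rule, datacenters, apply, healthy, soak_seconds=60):
    """Apply `rule` one data center at a time; roll back all on failure."""
    applied = []
    for dc in datacenters:
        apply(dc, rule)
        applied.append(dc)
        time.sleep(soak_seconds)  # let the rule take effect before checking
        if not healthy(dc):
            for done in applied:  # undo everywhere the rule reached
                apply(done, None)
            raise RuntimeError(f"rule failed health check in {dc}; rolled back")
    return applied
```

Under this scheme, a rule that crashes routers would take down only the first data center it touches and be withdrawn, instead of propagating to all 23 at once.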