Cloudflare has apologized for an outage on Tuesday morning, tracing the problem to a configuration change during network upgrades.
Cloudflare is one of today’s major content delivery networks (CDNs). The US firm also provides distributed denial-of-service protection to online domains, speed optimization, and various cybersecurity services.
The company counts millions of customers worldwide, including major enterprise firms.
On Tuesday morning, a number of websites and online services suddenly went down, including Feedly, Cloudflare itself, blogs, cryptocurrency services, and more. Ironically, down detectors (websites used to check the status of a domain you are having trouble connecting to) went offline as well.
The outage caused widespread disruption. Given the scope and scale of Cloudflare’s operations, when the firm’s network goes down, much of the internet feels the impact.
In an update at 7.43 am, Cloudflare’s service team said they were investigating “widespread issues with our services and/or network.” According to the Cloudflare status page, the Cloudflare API went offline.
The company said:
“Users may experience errors or timeouts reaching Cloudflare’s network or services. We will update this status page to clarify the scope of impact as we continue the investigation.”
In a further update roughly 30 minutes later, Cloudflare said: “The issue has been identified and a fix is being implemented.”
Cloudflare has classified the outage as a “critical P0” incident, a label loosely meaning an urgent, first-priority problem. Furthermore, the company said the incident impacted connectivity in Cloudflare’s network in “broad regions,” leading to HTTP 500 errors.
“The incident impacts all data plane services in our network,” the company added.
By 8.20 am, Cloudflare said it had rolled out a fix and was “monitoring the results.” At this point, service had been restored to some of the websites taken offline by the problem in Cloudflare’s network.
By 9.13 am, Cloudflare’s update page showed that all services were operational.
Cloudflare told ZDNet:
“Earlier today, Cloudflare saw an outage across parts of our network. This was not the result of an attack. A network change in some of our data centers caused a portion of our network to be unavailable.
Due to the nature of the incident, customers may have had difficulty reaching websites and services that rely on Cloudflare from approximately 6.28 – 7.20 am UTC. Cloudflare was working on a fix within minutes, and the network is running normally now. Given Cloudflare’s scale and the percentage of the Internet that relies on our network, when we have problems it is vital that we are open and transparent about what happened, why it happened, and what we’re doing to ensure it doesn’t happen again.”
Update 2.13 pm: Cloudflare has published a blog post with a post-mortem of the disruption. According to the company, the outage impacted 19 data centers, which “handle a significant proportion of our global traffic.”
The cause was a network configuration change affecting prefixes, which left many IP addresses unreachable.
“This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations,” Cloudflare says. “We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.”
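For readers wondering how a change to prefixes can make IP addresses unreachable, the idea can be illustrated with a minimal Python sketch. It is a simplified model, not Cloudflare’s actual configuration: the `reachable` helper is hypothetical, and the prefixes are reserved documentation ranges rather than real Cloudflare address space. An address counts as reachable only if some advertised prefix covers it, so dropping a prefix cuts off every address inside it.

```python
import ipaddress


def reachable(address, advertised_prefixes):
    """Return True if any advertised prefix covers the given address."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(prefix) for prefix in advertised_prefixes)


# Hypothetical prefixes taken from reserved documentation ranges,
# not Cloudflare's real address space.
before_change = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]

# Assumed faulty change: two of the three prefixes are no longer advertised.
after_change = ["203.0.113.0/24"]

for addr in ["192.0.2.10", "198.51.100.7", "203.0.113.5"]:
    print(addr, reachable(addr, before_change), reachable(addr, after_change))
```

In this toy model, the servers behind the first two addresses could be running normally, yet traffic would have no route to them once their prefixes disappear.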