The Cloudflare outage that affected ChatGPT, X, and Spotify
The bug was straightforward, but recovery was delayed by sporadic "recoveries" that confused engineers.
Major websites became inaccessible after Cloudflare was hit by an outage on Tuesday evening. What is Cloudflare, and why was the impact so outsized?
The outage has since been resolved. But why were so many seemingly unrelated sites affected?
What happened?
According to news reports, the affected sites included ChatGPT, Spotify, Canva, and X.
The outage began around 7:20pm SGT. "Core traffic" was largely restored within three hours, whilst complete restoration took another three hours. In total, the outage lasted around six hours.
So what happened? Cloudflare says the outage wasn't caused by a cyberattack or malicious activity. Instead, it stemmed from an internal configuration error that caused its core software to fail.
The details
In a detailed blog post, Cloudflare CEO Matthew Prince explained exactly what happened.
The bug: A change to the permissions of one of Cloudflare's database systems caused it to output duplicate entries into a "feature file" used by its Bot Management system. The result was a much larger feature file that propagated to all machines. The software that relies on this file to route traffic could not handle the oversized file and failed.
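To make the failure mode concrete, here is a minimal sketch in Python of how a hard size limit can turn an oversized file into an outage. It is not Cloudflare's code: the names (load_features, load_features_safely, FeatureFileTooLarge), the limit of 200 entries, and the 120-entry file are all assumptions chosen for illustration.

```python
MAX_FEATURES = 200  # hypothetical hard limit, chosen for illustration


class FeatureFileTooLarge(Exception):
    """Raised when a feature file has more entries than the limit allows."""


def load_features(entries, limit=MAX_FEATURES):
    """Strict loader: reject any file with more entries than the limit."""
    if len(entries) > limit:
        raise FeatureFileTooLarge(
            f"{len(entries)} entries exceeds the limit of {limit}"
        )
    return list(entries)


def load_features_safely(entries, last_known_good, limit=MAX_FEATURES):
    """Defensive loader: keep the previous good file if the new one is rejected."""
    try:
        return load_features(entries, limit)
    except FeatureFileTooLarge:
        return last_known_good


# A normal file fits comfortably under the limit.
good_file = [f"feature_{i}" for i in range(120)]

# A query that returns every row twice doubles the file size.
bad_file = good_file + good_file

active = load_features(good_file)
# The strict loader would raise here; the defensive one keeps the old file.
active = load_features_safely(bad_file, last_known_good=active)
```

In this sketch the strict loader fails outright on the doubled file, while the defensive loader keeps serving with the last known good file. Whether such a fallback suits a given system is a design decision, since stale data carries its own risks.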
The delay: Whilst the bug was straightforward compared to the complex one that downed AWS last month, the fix was delayed because systems kept recovering intermittently, which led engineers to suspect a massive DDoS attack.
It turned out that a good feature file was occasionally generated by an unaffected part of the cluster. Cloudflare's systems would then work again, only to be downed by the next bad update. Updates went out every five minutes.
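The on-and-off pattern is easier to see in a small simulation. The Python sketch below is a simplification under stated assumptions: the four-node cluster, the round-robin choice of which node generates each file, and the size limit of 200 are all hypothetical, as are the function names. Only the five-minute update interval comes from the account above.

```python
def generate_file(node_affected):
    """Return a feature file; an affected node emits every entry twice."""
    base = [f"feature_{i}" for i in range(120)]
    return base + base if node_affected else base


def simulate(nodes_affected, cycles, limit=200, interval_minutes=5):
    """Record whether the network is up or down after each update cycle."""
    timeline = []
    for cycle in range(cycles):
        # Assumption: nodes take turns generating the file (round-robin).
        node = cycle % len(nodes_affected)
        entries = generate_file(nodes_affected[node])
        status = "up" if len(entries) <= limit else "down"
        timeline.append((cycle * interval_minutes, status))
    return timeline


# Three of four hypothetical nodes are affected; one still produces good files.
timeline = simulate([True, False, True, True], cycles=8)
for minute, status in timeline:
    print(f"t+{minute:>2} min: {status}")
```

Each time the unaffected node takes its turn, the status flips back to "up" for one cycle before the next bad file brings it down again. From the outside, that flapping can look more like an attack coming in waves than a configuration error.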
Role of Cloudflare
What does Cloudflare do? Simply put, it provides content delivery network (CDN) services that serve as a cache of websites to reduce load times. The result is better user experiences.
Over time, it gained other capabilities. These include defending businesses against DDoS attacks, reducing the load on their servers, and an "always on" feature that keeps sites available when their servers are down, though only for static content.
This results in better reliability and uptime.
Other options
Should companies drop Cloudflare? It's worth noting that Cloudflare has a massive network handling an estimated 20% of global web traffic. It can shrug off attacks that even most telcos can't.
It competes against the likes of Akamai, Fastly, in-house CDNs from the public cloud giants, and scores of smaller competitors.
And whilst options exist, it takes time to switch to another provider. In many cases, tight integration and the use of proprietary security-related capabilities make a quick transition impractical.
My take? Organisations will simply keep using Cloudflare, especially given the quick rectification and transparency.