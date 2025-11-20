On November 18, 2025, a large part of the Internet suddenly cried out and went silent, as Cloudflare’s infrastructure suffered the software equivalent of a cardiac arrest. After much panicked debugging and troubleshooting, engineers were able to coax things back to life, setting the stage for the subsequent investigation. The results of said investigation show how a mangled input file caused an unhandled error, what Rust calls a panic, in the Rust-based FL2 proxy, which then returned HTTP 5xx errors and stopped proxying customer traffic. Customers who were on the old FL proxy did not see this error.
The input file in question was the features file, which is generated dynamically depending on the customer’s settings related to e.g. bot traffic. A change here caused said features file to contain duplicate rows, pushing the number of features from the typical 60 or so to over 200, which is a problem since the proxy pre-allocates memory to contain this feature data.
While in the FL proxy code this situation was apparently cleanly detected and handled, the new FL2 code happily chained the processing functions and ingested an error value that caused the exception. This cascaded unimpeded upwards until panic set in:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
The Rust code in question was the following:
The obvious problem here is that an error condition did not get handled, which is one of the most basic kinds of mistakes. The other basic mistake seems to be a lack of input validation, as apparently the oversized features file doesn’t cause an issue until the code attempts to stuff it into the pre-allocated memory section.
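To make those two mistakes concrete, here is a minimal sketch in Rust of checking a features list before it gets anywhere near a fixed-size buffer, and keeping the last known-good list when the new one is rejected. This is not Cloudflare’s code: `MAX_FEATURES`, `FeatureFileError`, `validate_features` and `reload_features` are hypothetical names, and the limit of 200 is an assumption based on the figures mentioned above.

```rust
use std::collections::HashSet;
use std::fmt;

/// Assumed cap on pre-allocated feature slots (hypothetical value).
const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq)]
enum FeatureFileError {
    TooManyFeatures { found: usize, limit: usize },
    DuplicateFeature(String),
}

impl fmt::Display for FeatureFileError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FeatureFileError::TooManyFeatures { found, limit } => {
                write!(f, "{} features exceeds the limit of {}", found, limit)
            }
            FeatureFileError::DuplicateFeature(name) => {
                write!(f, "duplicate feature row: {}", name)
            }
        }
    }
}

impl std::error::Error for FeatureFileError {}

/// Validate the feature names before they touch any pre-allocated storage.
fn validate_features(names: &[String]) -> Result<(), FeatureFileError> {
    // Reject oversized input up front instead of discovering it mid-copy.
    if names.len() > MAX_FEATURES {
        return Err(FeatureFileError::TooManyFeatures {
            found: names.len(),
            limit: MAX_FEATURES,
        });
    }
    // Reject duplicate rows, the failure mode described in the article.
    let mut seen = HashSet::with_capacity(names.len());
    for name in names {
        if !seen.insert(name.as_str()) {
            return Err(FeatureFileError::DuplicateFeature(name.clone()));
        }
    }
    Ok(())
}

/// Swap in the candidate list only if it validates; otherwise keep the
/// current list and hand the error back to the caller to log or alert on.
fn reload_features(
    current: Vec<String>,
    candidate: Vec<String>,
) -> (Vec<String>, Option<FeatureFileError>) {
    match validate_features(&candidate) {
        Ok(()) => (candidate, None),
        Err(e) => (current, Some(e)),
    }
}
```

The point of `reload_features` is that a bad file degrades to "keep running on the old configuration" rather than to a panic; whether stale configuration is acceptable is a design decision for the service in question.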
As we have pointed out in the past, the biggest cause of CVEs and similar is poor input validation and error handling. Just because you’re writing in a shiny new language that never misses an opportunity to crow about how memory safe it is doesn’t mean that you can skip due diligence on input validation, checking every return value and writing exception handlers for even the most unlikely of situations.
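As a rough illustration of that last point in Rust terms, where the closest thing to an exception handler is a panic boundary, the sketch below wraps a handler in `std::panic::catch_unwind` so that a stray `unwrap()` fails one request rather than the whole worker. `handle_request` and `handle_request_guarded` are hypothetical stand-ins, not anything from the FL2 code base, and this only works when the binary is built with unwinding panics rather than `panic = "abort"`.

```rust
use std::panic;

/// Hypothetical handler that does what the panic message describes:
/// it calls unwrap() on a Result that may be an Err.
fn handle_request(parsed_config: Result<u32, String>) -> u32 {
    parsed_config.unwrap()
}

/// Run the handler behind a panic boundary and turn a panic into an
/// ordinary error value that the caller has to deal with.
fn handle_request_guarded(parsed_config: Result<u32, String>) -> Result<u32, String> {
    panic::catch_unwind(|| handle_request(parsed_config))
        .map_err(|_| "worker panicked; request failed".to_string())
}
```

Calling `handle_request_guarded(Err("bad file".to_string()))` returns an `Err` instead of taking the thread down. This is a backstop, not a substitute for handling the `Result` properly in the first place.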
We hope that Cloudflare has rolled everyone back to the clearly bulletproof FL proxy and is having a deep rethink about doing a rewrite of code that clearly wasn’t broken.
7 thoughts on “How One Uncaught Rust Exception Took Out Cloudflare”
That we know of.
Or, to put it another way, assume any given code has bugs until formally proven otherwise.
Hello World, might be safe.
These major outages have always happened and will always happen, but they do seem to be becoming more problematic as services centralise and become more integrated into everything. The recent outages at AWS, Cloudflare, OVH, CrowdStrike and others have all had far more impact than similar outages last century.
And this risk grows as companies start to build their entire mission-critical workflows around BigAI without thinking about “and what do we do when ChatGPT also fails for hours or days?”, let alone “what do we do if the bubble bursts and ChatGPT goes out of business?”
The thing many beancounters and C-suite types forget is business continuity. If you use a third-party service, any third-party service, you need to know what you will do when (not if) it fails, and you also need to test those plans.
As for Cloudflare, I am pleased to see they are open and have published so much information about the root cause. Too many companies think they need to hide this stuff, when the opposite is true.
Rule #1 of writing any code that’s interacting with other systems (or people) is that it has to accept any junk thrown at it at any rate that the interface can deliver it (ideally, somewhat higher — I’m used to network testing and since ‘line rate’ implies well formed and timed packets you need to test interfaces with badly formed and ill-timed traffic — anything, in fact). If the interface generates error messages on that interface then they’re not allowed to add to the mayhem (i.e. don’t signal overload conditions by generating network traffic).
Although this rule was promulgated for network interfaces (not by me) it applies to any interface. In this case the code should have been tested with not just legal files but also badly formed files, even some that were pure junk. Unfortunately because the module reading the file was written in an idiot proof language the programmers just assumed that it was proof against idiots.
People write code that doesn’t even test for commonly existing names, eg those with apostrophes. Like those persons of Irish extraction with O'[anything] as their last name. Compounded by a certain large vendor of software who decided in their version of SQL to delimit text/strings with apostrophes instead of quotation marks. What did they think people were using their database for? Did they even try it with real data, like names?
It was understandable when crap was on a mainframe with very limited resources. After 30 years, it’s gotten a little old. Kudos to the poor sap at Pizza Hut some years ago who put in at least half a dozen backslashes to escape the apostrophe in my name, so I could order a pizza online and not have to call.
One of the things that’s most annoying actually is trying to determine whether there’s a condition in which panic!() could be called in embedded Rust code. If it is possible then that’s another error condition you need to handle, but it’s not very intuitive to find out. The guide I saw a while back resorted to looking for the panic symbol in the resulting binary to see if it could be called.
The error was not “cleanly detected and handled” in FL, it failed silently.
Blaming Rust the language as a whole for the Cloudflare outage is like blaming Michelin when you crash your car for driving too fast in the rain.
You are absolutely right that they should have been verifying inputs, handling errors correctly, and probably testing extreme cases like this in such a critical code path. This was a rookie mistake on Cloudflare’s part, and they should know better, but this kind of error can and does happen in every language and paradigm, so blaming it on Rust the language is a bit of a reach, and feels more like flame-bait and “see-i-was-right” journalism, given Maya’s vocal history of distaste for Rust.
Rust evangelists are annoying, but Rust anti-vangelists are the flip side of the same coin. It has pros and cons, as all languages and tools do. Everyone decides what features are important based on the ones their favorite tool does best, and by George are they gonna tell you about it. This is true of programming languages, OSes, car brands, political parties, religions, and even grocery stores. Confirmation bias is virtually unavoidable.
“There are only two kinds of languages: the ones people complain about and the ones nobody uses.”
