Every system administrator worth their salt knows that the right way to coax changes to network infrastructure onto a production network is to first validate them on a Staging network: a replica of the Production (Prod) network. Meanwhile, all the developers working on upcoming changes are safely kept in their own padded rooms in the form of Test, Dev and similar, where Test tends to be the pre-staging phase and Dev is for new-and-breaking changes. This is what anyone should use, and yet, judging by its latest outage, Cloudflare apparently deems itself too cool for such a rational, time-tested approach.
In their post-mortem on the December 5th outage, they describe how they began rolling out a change to React Server Components (RSC), allowing a 1 MB buffer to be used as part of addressing the critical CVE-2025-55182 in RSC. During this roll-out on Prod, it was discovered that an internal testing tool didn't support the increased buffer size, and it was decided to globally disable it, bypassing the gradual roll-out mechanism.
This follows on the recent implosion at Cloudflare when their brand-new, Rust-based FL2 proxy keeled over when it encountered a corrupted input file. This time, disabling the testing tool created a condition in the original Lua-based FL1 proxy where a nil value was encountered, after which requests through this proxy began to fail with HTTP 500 errors. The one saving grace here is that the issue was detected and corrected fairly quickly, unlike when the FL2 proxy fell over due to another issue elsewhere in the network and it took much longer to diagnose and fix.
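For those who have never had the pleasure of debugging this class of failure, the pattern looks roughly like the following – a made-up Python sketch rather than Cloudflare's actual Lua, with every name invented – where a feature that is 'globally disabled' becomes an unexpected nil/None instead of a state the request path was written to handle:

# Rough, hypothetical Python analogue of the failure mode (the real FL1 code is Lua
# and none of these names are real): a globally disabled feature becomes None, and a
# handler that assumes it is always present turns every request into an HTTP 500.

FEATURES = {"body_scanner": None}          # None == the tool was globally disabled

def handle_request(path: str) -> int:
    scanner = FEATURES["body_scanner"]
    # assumes the scanner always exists; raises AttributeError when it is None,
    # which the surrounding server surfaces as a 500
    return 200 if scanner.inspect(path) == "clean" else 403

def handle_request_defensively(path: str) -> int:
    scanner = FEATURES["body_scanner"]
    if scanner is None:                    # 'disabled' treated as an expected state
        return 200                         # fail open (or closed) by policy, not by crash
    return 200 if scanner.inspect(path) == "clean" else 403

The second variant is the boring, defensive version that degrades by policy rather than by crash.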
Aside from Cloudflare clearly having systemic issues with actually testing code and validating configurations prior to ‘testing’ on Prod, this ought to serve as a major warning to anyone else who feels that a ‘quick deployment on Prod’ isn’t such a big deal. Many of us have dealt with companies where testing and development happened on Staging, and the real staging on Prod. Even if it’s management-enforced, that doesn’t help much once stuff catches on fire and angry customers start lighting up the phone queue.

I wish there was a way for me to test and validate on Stage. But when you have to support over 1,000 different software packages that usually play fast and loose with the specifications I build to, stuff always ends up breaking anyway despite my best intentions. Then I have to share the specifications with the angry customer to let them know they were never doing things right in the first place.
The latest fiasco was when I made changes that were spec-compliant, only to find out that a very large international company who will remain unnamed didn't even know the specs existed in the first place. They hacked together a solution that didn't even LOOK at the spec sheets, and built an entire internal software stack around that hack. And then they demanded the change be reversed so they could keep using their non-spec solution. I said no, and they eventually got spec-compliant and things worked again.
Other customers who are following the specs are fine though at least. Sigh.
I suspect the fact that they were patching a critical and actively exploited (I think) vulnerability may have swayed their judgement. “Sorry we couldn’t protect you and you got hacked because it took 5 days to update our test tool, and we refuse to make exceptions even for critical issues” is also bad for PR – and the bottom line.
Reacting quickly to patch stuff is kinda Cloudflare's selling point? If they're slower than everyone else to fix stuff, why use them?
From decades of professional experience: bypassing thorough testing in a staged environment will eventually end in a disaster, major or minor. The same lazy mindset that skips cautious, staged deployment of critical system updates seldom thinks to keep backups of the original files handy to restore functionality. Ask me how I know that someone else violating said guidelines while you're on vacation can lead to a trashed vacation.
I’ve started calling them ClownF@rt when I talk to other retired IT folk.
Old-school thinking. Make me suffer through a days-long outage vs. getting hacked for 3 days w/o notice? I'll choose CF!
I've administered mission-critical systems for almost 28 years, and never, absolutely NEVER, in all those years of applying vulnerability fixes, have I beaten my chest and considered applying patches directly in the production environment.
Use a DR environment at the very least, or something blue/green, canary, etc., but never all at once in production.
A security fix not yet applied, but still within the SLA of the patch-management cycle, would be enough to cover reporting any kind of "intrusion" to customers.
My view is that the company's dev, staging and/or QA environments are not faithful to the production environment.
From decades of professional experience: strictly adhering to the testing protocols will sooner or later lead to a situation where you can't roll anything out to Prod for ages because of the changes required on Staging to make it work. And then you'll end up needing yet another pre-prod environment. And suddenly what was supposed to be a quick fix turns into a resource drain.
It's an executive decision whether to strictly follow the protocols or not. Either way, they should be ready to take the blame.
And there is the problem: it's not an executive decision, but a technical one.
The executive can decide the imperative to ship before X. It can weigh “Risk a hack” vs “crash the business” vs “have a planned outage of X hours”. It can decide to put together a budget, limited or not, and have people on hot standby, or in front of the consoles during deployment.
The means and risks involved in doing the tests in some specific way, or deploying according to a specific plan, are best judged by the professionals who know the machine. They are also responsible for evaluating the cost of the different scenarios. The decision to change from the regular test-and-deploy procedure to some other procedure is mostly a technical one, except when there is more than one solution that is technically equivalent but has different business effects.
Indeed. This wasn't your ordinary feature or bugfix, and thus required extraordinary measures.
I wonder if Cloudflare etc. should be looped into the CVE process earlier, to give them time to run the standard testing procedure ahead of the CVE being published.
Agreed, CNAs need to inform all those cloud providers before publishing such bugs.
Vendors, especially large and critical ones, are involved early when that is an option.
I don't know how CVEs get published, but I think CNAs need to inform all cloud providers before publishing an insane bug like that.
Or, we don’t give special treatment to big corporations. Should WalMart get access to this year’s flu shot before the rest of us because they employ so many people?
I have neither network experience, nor much programming expertise. I just came to say that the writing in this short article is brilliant. It’s informative, concise, snappy, witty, entertaining, and accessible for folks like me who aren’t entirely plugged in to the subject matter. Awesome!
Well, in my four decades of experience, bypassing the staged development system leads to problems at best and disaster at worst. Usually the people who take the route of expediency aren't prepared with good backup or rollback procedures, leaving others to clean up the mess.
Can't believe Cloudflare couldn't do better.
+1
In the end, circumventing tests in non-prod environments just because it's a critical bugfix (a) sounds like a lame excuse and (b) indicates that there might be other process issues lurking in the dark.
How it is even possible to do that at all is an interesting question the CIO should prepare for before the next external audit.
It's only a lame excuse if the critical bug isn't that bad, so that you can live with it for at least a few days of testing. If your house is already really, really on fire, you are not going to worry that a change might make the roof leaky or collapse if it's going to control or put out the flames – you hope to save more than it costs…
So while I agree that pushing changes straight into the production environment, or even just skimping on testing, is as a rule a very bad idea, it is possible that this time it really was required.
In my experience this comes down to management (after being sold on the idea that the DevOps model is cheaper, faster, and more reliable no matter what you're developing) wanting to cut the time and cost represented by the bake time for a full test battery.
This leads to ill-considered metrics like trying to write the fastest and cheapest tests to hit X% coverage in terms of source lines exercised during the test. This may be OK for programs which are both trivial and insular, but it tends to miss subtle things like indirect interactions between modules in real world scenarios (e.g. cascade failure, error amplification) as well as stuff which shows up over time (slow leaks, fragmentation, rare-but-nasty race conditions).
Then there’s the large class of issues that don’t even show up on code coverage to begin with (e.g. table driven state machines and parsers tend to have essentially all data flow and no control flow at their hearts).
So it’s often the case that things are tested outside production, but the moral hazard of trying to make that testing fast and cheap at the expense of being truly representative is very strong and bean counters rarely resist it.
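To make the coverage point concrete, here is a small, entirely hypothetical Python sketch of a table-driven scanner: two tiny tests execute every line, so line coverage reads 100%, while the actual bug lives in the transition table – data, not control flow – where no coverage metric will ever point at it:

# Hypothetical table-driven tokenizer: the bug is a wrong entry in the table,
# which no amount of line coverage will highlight.

TRANSITIONS = {
    ("start", "digit"):  "number",
    ("number", "digit"): "number",
    ("number", "dot"):   "number",   # bug: should move to a separate "fraction" state
}

def classify(ch: str) -> str:
    if ch.isdigit():
        return "digit"
    return "dot" if ch == "." else "other"

def scan(text: str) -> str:
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)), "error")
    return state

assert scan("123") == "number"    # these two tests execute every line of the code above,
assert scan("1.5") == "number"    # so line coverage is 100%...
# ...yet scan("1.2.3") also returns "number": the bad table entry never shows up.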
I'd fully agree with you (with only 3.5 decades of experience) if we were talking about business applications with a front-end, a back-end, and a database. CI/CD is intended to prevent issues in those kinds of applications.
However, Cloudflare is not a typical business application but a large-scale distributed system that serves a large part of the internet (else it would not make headlines, right? ;-))
At the scale of Cloudflare it's just not possible to have a copy of Prod in pre-prod, hence they use rolling upgrades, where gradually more servers get the update and, in case of problems, the upgrade is rolled back. This is quite a common approach in large-scale distributed systems, and for such systems it is far more reliable than changing everything at once, as is done with the pre-prod/prod setup used for typical business apps.

The post-mortem by Cloudflare shows the problem occurred as an unrelated side effect of a config change to disable an internal test system. Anyone who writes that this shows "systemic issues with actually testing code and validating configurations prior to 'testing' on Prod" does not grasp the complexity of the engineering involved, imo. E.g. see https://blog.cloudflare.com/building-our-maintenance-scheduler-on-workers/
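For what it's worth, the gradual roll-out pattern described above can be sketched in a few lines of Python; the wave sizes, error budget and the deploy/rollback/check_error_rate helpers are all hypothetical stand-ins, not anything from Cloudflare's actual tooling:

# Minimal sketch of a staged roll-out with a health gate and automatic rollback.
WAVES = [0.01, 0.05, 0.25, 1.00]     # fraction of the fleet updated per stage
ERROR_BUDGET = 0.002                 # abort if the 5xx rate exceeds 0.2%

def rollout(version: str, fleet: list[str],
            deploy, rollback, check_error_rate) -> bool:
    done: list[str] = []
    for frac in WAVES:
        for host in fleet[: int(len(fleet) * frac)]:
            if host not in done:
                deploy(host, version)
                done.append(host)
        if check_error_rate(done) > ERROR_BUDGET:    # canary health gate
            for host in done:                        # automatic rollback
                rollback(host)
            return False
    return True

The point of the pattern is exactly what the outage demonstrated by omission: a change that bypasses the waves also bypasses the health gate.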
Agreed.
Based on the complexity and sheer size of Cloudflare, if they had systemic problems they'd be down daily.
Also a reminder that overly simplifying a problem is its own problem…
Staging is a good idea. We see a lot of success stories from slow corporate development practices that we never notice as such. I noticed one of these recent Cloudflare outages, but it's been about a decade since I had a credit card fail, and that was because of poor testing and diagnostic practices at my local bank.
But i think part of the story here is the growth in complexity of cloudflare’s internal infrastructure. Complexity makes it harder to get right on the first try, harder to test, harder to ensure the test is the same as the production, and harder to bring back once it hits the fan. It also tends to expand your attack surface.
It reminds me of the attitude that updates cause security. No, actually it’s simplicity that causes security. There’s of course trade-offs in every direction but our fundamental challenge is managing complexity. Updates and staging are both good but they can’t make up for a failure to manage complexity.
I think there is a nugget of truth to "updates cause security" too – moving the goalposts has a tendency to break or twist the existing flaws so the bad actor has to start again. At least as long as it's not a really, really awful update, or worse a malicious update that has slipped in, it does provide some temporary security.
Not that I disagree that complexity, and managing it well, is very important to not having goalposts that need moving in the first place…
Is there anything that isn’t really really awful?
Exact replication of a production environment is often inaccurate or infeasible, and that only gets worse with scale. The worst case is usually just some wasted time, but I've also seen inaccurate tests lead to overconfidence. I'm not saying don't test outside of production, but they probably think they did do that.
There’s no test like production.
God, this is so true. LOL
Love it! So true. Everyone thinks this tiny change won't affect anything. I've heard "I've tested it in my VM". Having worked in a user acceptance testing lab for 6 years at a major utility company, a formal change control process where people sign off on changes is the best process. Always load the previous code on the device and then apply the change. Never start with new code on a new machine or new hardware; that isn't duplicating production.
Fortunately, our company doesn’t do such evil things. Instead, our production environment runs on the test server!
(The URL to our citrix replacement remote desktop service is desktoptest..com)
haha – classic!
management: “We need to be sure Test is a perfect replica of Prod”
engineers: “Say no more!”
Luckily, unlike the author, the cool kids actually read and understand the original post mortem. Which shows that the situation is a bit different than portrayed in this article ;-)
Who still doesn’t understand that CICD is a project that never finishes, so all the developers have a job for life (as long as they want it)?
It's a bit of the attitude that "fixing this buffer overrun shouldn't need testing because the fix is obvious" where it's, say, an off-by-one in a loop. But it really does need testing, because something else references it.
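A toy, entirely hypothetical Python example of that trap – the fix is obviously right in isolation, but a caller was written against the old, buggy behaviour:

def read_records(buf: bytes, n: int) -> list[bytes]:
    # old loop used range(n - 1) and silently dropped the last 4-byte record;
    # the "obvious" off-by-one fix reads all n of them
    return [buf[i * 4:(i + 1) * 4] for i in range(n)]

def summarize(buf: bytes, n: int) -> int:
    records = read_records(buf, n)
    # written against the old, buggy behaviour: it re-adds the "missing" last
    # record itself, so after the fix that record is counted twice
    records.append(buf[(n - 1) * 4:n * 4])
    return len(records)

The one-line fix is correct; the system is still wrong until the compensating hack in summarize() is found, which is what testing is for.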
I'll put in my two cents. At my job we're told to always test our features, but we have different servers with different OSes on different versions. Our test server runs Ubuntu 24 (which I upgraded, because nobody else had the guts), one of our prod servers runs an old CentOS version, and another runs AlmaLinux.
As I'm in charge of system automation, I can say that merely using different Linux distros can land you in situations where you need to test in production.
For example, on Ubuntu the Apache service is apache2 (managed via apachectl), while on CentOS it's httpd.
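A rough sketch of the kind of branching such automation ends up carrying around (assuming systemd on all of the distros; the service table and the restart_apache helper are purely illustrative):

import platform
import subprocess

# Apache's service name differs per distro family (Debian-family systems use
# apache2/apachectl, RHEL-family systems use httpd).
APACHE_SERVICE = {
    "ubuntu": "apache2",
    "centos": "httpd",
    "almalinux": "httpd",
}

def restart_apache() -> None:
    distro = platform.freedesktop_os_release().get("ID", "").lower()  # Python 3.10+
    service = APACHE_SERVICE.get(distro)
    if service is None:
        raise RuntimeError(f"unknown distro {distro!r}; refusing to guess")
    subprocess.run(["systemctl", "restart", service], check=True)

Multiply that by every service, path, and package name that differs between distros and the test matrix stops resembling any single staging box.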
I have a shirt that says “I don’t always test my code, but when I do, I do it in production.” I tell people that I am not crazy, I don’t use my production environment, yours will do just fine!
“sure, let’s pipe much of the internet’s traffic through one censorious corp’s service, what could go wrong?”
It blows my mind that a very large portion of the self hosting community relies 100% on cloudflare and tailscale for accessing their homelabs. If you are going to rely on some big corp to access your own equipment then why bother having the equipment at all? Just throw it all on AWS like the rest of the world.
Gee, what’s with the “I don’t get it but I will write something anyway” attitude, Maya? Your style and level of comprehension would be a perfect match for a boulevard paper.
Some things cannot be simulated in the lab but need to run in production, and a partial roll-out system is perfectly suited to exactly that. As seen in this instance, some things only cause problems at scale.
IIRC – read the post mortem some weeks ago – the real problem was they did not check their IPC input but relied on stable data formats to increase processing speed.
Normally that should be a roll-back. No questions.
With a competent team, this indicates there were other pressing needs. The question is what those pressing needs were, and how much importance was placed on them, to have the sysop talk to change management, change management talk to management, and the decision sent back down the line.
It sounds like one of those cases – which always happen – where the Swiss cheese didn't work.
Improvements are made, the manuals get updated, the procedures change, lessons are learned, and on we go to the next one.
The outlier event will always happen.
I approve of Cloudflare failing; I don't like every damn site in the world having a single US-based mofo sitting in between.
Yeah, I know there are the bots and hackers, but to me as a user Cloudflare is more of the same.
It's no longer possible to run a significant site of any kind without a company like Cloudflare in the mix. DDoS attacks are ridiculously common.
I had to deal with a concentrated attack intended to bring our site down without tools like Cloudflare, and it's damn near impossible. Cloudflare, Akamai, etc. are necessary to deal with the level of traffic that can be brought to bear, combined with the sheer size of the hacks that can be brought to the mix.
Dozens of vulnerable systems are scanned for from random IPs, preventing any reasonable server side mitigations.
Even a tiny personal site can benefit from these services in front of them. Sad reality, but reality nonetheless
And yet, there are sites that don’t use goddamn cloudflare.
I just hope they continue with this policy of hacking themselves and as a result bankrupt themselves.
Oh and did anybody check if they perhaps are encouraging DDOS attacks against sites that don’t use Cloudflare? That would befit modern US practices surely.
I know there are big companies that get into that kind of thing and are already beyond the law, but only "just a tiny bit", which is deemed OK when it's against foreigners, it seems.
3.9 decades of experience in development, system engineering, and incident management says: while testing in pre-prod is necessary, if you believe that your pre-prod environment "mirrors" your production environment, you're just fooling yourself. There are always differences, always edge cases. Eventually those will bite you regardless of how much testing you do and in what environments; that's where incident management comes in. History says most of the time is not spent finding the solution, it's spent finding the right guy who understands and can fix the problem. That right guy is not a new hire; that right guy is built up from experience and a broad understanding of the specific environment.
Want to not end up as an incident-management case? Design more robust code. Understand edge cases. Do full and proper error handling. Don't assume anything. Again, this is where experience and a broad understanding of the specific environment come into play.
To the companies, how do you build experience and broad understanding if you keep cycling people in and out every 3-5 years?
I've been in the change management world. It's all about risk and reward. Nothing is super clear-cut. Test environments never completely replicate production. Test tools sometimes don't work correctly. In this case my questions would have been:
How long to fix the test tool?
How long before this CVE causes damage?
If this change goes wrong, how long will we be down?
What is the rollback plan?
Based on those inputs you have to make the call. Changes you make under fire require that you sometimes take chances.