Cloudflare has gotten more active in its efforts to identify and block unauthorized bots and AI crawlers that don’t respect boundaries. Their solution? AI Labyrinth, which uses generative AI to efficiently create a diverse maze of data as a defensive measure.
This is an evolution of efforts to thwart bots and AI scrapers that don’t respect things like “no crawl” directives, and which account for an ever-growing amount of traffic. Last year we saw Cloudflare step up their game in identifying and blocking such activity, but the whole thing is akin to an arms race. Those intent on hoovering up all the data they can are constantly shifting tactics in response to mitigations, and simply identifying bad actors with honeypots and blocking them doesn’t really do the job any more. In fact, blocking requests mainly just alerts the baddies to the fact they’ve been identified.
Instead of blocking requests, Cloudflare goes in the other direction and creates an all-you-can-eat sprawl of linked AI-generated content, luring crawlers into wasting their time and resources as they happily process an endless buffet of diverse facts unrelated to the site being crawled, all while Cloudflare learns as much about them as possible.
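To make the idea concrete, here is a minimal sketch of an endless link maze. It is not Cloudflare's implementation, which isn't detailed here; the names `maze_page` and `crawl` and the `/maze` path are hypothetical, and the real Labyrinth serves AI-generated text rather than placeholder titles. Each page is derived from a hash of its own URL, so nothing needs to be stored, and every page links to more pages that don't exist until they are requested.

```python
import hashlib
import random


def maze_page(path: str, links_per_page: int = 3) -> dict:
    """Return a fake page for `path`: a title plus links deeper into the maze."""
    # Seed a private RNG from the path so the same URL always yields the same page.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    base = path.rstrip("/")
    links = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(links_per_page)]
    return {"title": f"Page {seed % 10000}", "links": links}


def crawl(start: str, max_pages: int) -> int:
    """Simulate a naive breadth-first crawler; it stops only because we cap it."""
    queue, seen = [start], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        queue.extend(maze_page(url)["links"])
    return len(seen)
```

A crawler that follows every link never runs out of pages: `crawl("/maze", 50)` visits 50 pages only because of the cap, and raising the cap just raises the count.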
That’s an important point: the content generated by the Labyrinth might be pointless and irrelevant, but it isn’t nonsense. After all, the content generated by the Labyrinth can plausibly end up in training data, and fraudulent data would essentially be increasing the amount of misinformation online as a side effect. For that reason, the human-looking data making up the Labyrinth isn’t wrong, it’s just useless.
It’s certainly a clever method of dealing with crawlers, but the way things are going it’ll probably be rendered obsolete sooner rather than later, as the next move in the arms race gets made.
 
I doubt this will work. The first defense a crawler designer would implement is to detect that the general HTML tree is not that of a real website. Real websites have diverse CSS layouts, lots of scripts and are just more than text. This site here for example seems to use a font called Proxima Nova.
What does Cloudflare’s AI labyrinth even look like? Right, if they would immediately show us it would probably be apparent how to counter-detect it. Don’t tell, show, Cloudflare. See you on the other side of the AI bubble. It’s gonna burst soon.
Disclaimer: I enjoyed the article regardless. Just a bit salty about too much smoke and mirrors on the side of the company.
Erm… it’s pretty trivial to drop LLM-generated junk into the body of a real website, CSS, layouts, and all. Especially if you’re Cloudflare and are serving up the site.
Yeah and JWZ did this on his blog decades ago. You can check his site. I think this wonder technology is called a Markov chain. AI was never needed for this.
Still generative AI, just oooooold gen AI.
It’s a stochastic process and used for Markov chain Monte Carlo (MCMC), thus if you find meaning in random outputs, I’m a bit puzzled.
There is no training model or weights. I can just pull random subjects, verbs, objects.
“The dog ate the sun.” There I did manual Markov.
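For readers unfamiliar with the technique being argued over, here is a minimal word-level Markov chain text generator. It is only a sketch; the function names `build_chain` and `generate` are made up for illustration, and it is not the code from any site mentioned above. The chain does build a small lookup table from sample text, recording which words follow which, and then walks that table at random.

```python
import random
from collections import defaultdict


def build_chain(text: str) -> dict:
    """Map each word to the list of words observed directly after it."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain: dict, start: str, length: int, rng: random.Random) -> str:
    """Walk the chain from `start`, picking each next word at random."""
    word, out = start, [start]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out)


chain = build_chain("the dog ate the sun")
sentence = generate(chain, "the", 5, random.Random())
```

With such a tiny sample the output can only recombine "the", "dog", "ate" and "sun"; fed a larger corpus, the same table produces text that looks locally plausible but carries no meaning.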
Wouldn’t it be rather easy to just use the existing site’s layouts, and generate script references that look legitimate? AI also isn’t a bubble. This stuff is here to stay.
It’s a bubble as in it’s over-valued. A lot of big investments are going to be written off in the next 5 years. We’re already seeing OpenAI under threat from far cheaper competitors.
The products of the dot-com bubble are also here to stay, but that doesn’t invalidate it from being a bubble. A “bubble” doesn’t mean nothing underlying is real, it only means it is undervalued. We currently have a MASSIVE real-estate bubble, but the properties are real and they will remain long after the bubble bursts, with people living in them and everything.
*overvalued whoops. I blame a rogue LLM for attacking my brainstem at the opportune moment
If you wanted to identify malicious content like this, all you would have to do is look in robots.txt and that will tell you directly how to avoid the honeypot. They can’t make it any clearer than that, right?
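As a rough sketch of what honoring robots.txt looks like from the crawler's side, Python's standard `urllib.robotparser` can answer "may I fetch this URL?" before any request is made. The robots.txt contents, the `/maze/` path, the `ExampleBot` user agent and the `example.com` URLs below are all made up for illustration; nothing here reflects what Cloudflare actually publishes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /maze/
"""


def allowed(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved crawler checks before every request and skips disallowed paths.
for url in ("https://example.com/blog/post", "https://example.com/maze/abc"):
    print(url, "->", "fetch" if allowed(url) else "skip")
```

A crawler that performs this check never enters a disallowed path, which is the commenter's point: only bots that ignore the rules end up in the trap.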
Enforcing robots.txt by vigilante justice. It’s like the wild west of the internet all over again!
Just what the world needs, more garbage data for the LLMs to steal and be regurgitated as true facts by those who can’t think for themselves.
Can’t wait ’til the bubble bursts
I expected it to burst as well. But someone explained to me how much money from huge companies is involved in this bubble right now, and they will do everything to prevent it from bursting.
So it will more likely slowly fizzle out, just like VR did.
But it’s all just a large assumption that “AI is a bubble” in the first place, and one that is little more than shooting from the hip. Flip a coin. Maybe you’re right, but I don’t see it (though I’m biased, working on implementing AI into our infrastructure at a large tech company). Companies like OpenAI are overbought and oversold, sure. AI as it’s being implemented into products and into infrastructure and processes isn’t going away and is only increasing in scope.
I think VR isn’t the best comparison as well, as VR was a misunderstood thing. I think most of us envisioned VR as doing the same type of gaming as traditional gaming, but once you start using it, it becomes abundantly clear that it’s not better or even as well suited for those things. It requires a completely different concept on the design of games. It’s not a replacement for traditional gaming, and so with the associated costs and economic reality, it will invariably be a relatively niche product for quite a while still. I don’t think it will necessarily fizzle out, but it’s going to be a long road with a lot of technological development and economies of scale that might not come to fruition.
AI has a bit of a parallel to that if you only consider basic GenAI stuff like LLMs and image generation, but how agentic AI is being utilized clearly has a domain that is partially unique to itself and will also replace some existing pipelines. It doesn’t have the limited scope that VR has, and what’s happened with agentic AI has already cemented it into the landscape.
AI, specifically LLMs and related, is massively oversold.
They are not a spark of new thought, just needing a couple of decades to perfect.
They are a ball of statistical plagiarism, fine tuned and as good as they are going to get.
It’s not that there aren’t good applications for neural net based solutions.
It is that those are niche, aren’t always obvious, depend on carefully curated training sets and won’t work outside well trained scenarios.
ChatGPT etc is also a brilliant way to ‘poison the well’ at industrial scale, on VCs’ dime.
The idea of feeding a chatbot the entire internet and thinking you’re going to get something useful is insane.
More noise, less signal, every day.
I think you only read half the article?
“That’s an important point: the content generated by the Labyrinth might be pointless and irrelevant, but it isn’t nonsense. After all, the content generated by the Labyrinth can plausibly end up in training data, and fraudulent data would essentially be increasing the amount of misinformation online as a side effect. For that reason, the human-looking data making up the Labyrinth isn’t wrong, it’s just useless.”
What you are describing has happened via carbon-based equivalents to LLMs for thousands of years. I wouldn’t hold your breath. The current AI craze will most likely hit a lull, but this phenomenon is here to stay.
Well done holding back the future of humanity!
An internet that’s crippled by the excessive traffic of LLMs and AI stealing IP and content is the future of humanity?
Is that you Cypher?
I feel that the future you describe is already mostly here. This point can be illustrated quite compellingly by the act of looking up a recipe.
Actually yes. It’s not what you WANT it to be, but that’s what it is
How is disregarding someone’s preference to not have their webpage scraped moving humanity forward?
You seem to have the common confusion between the actual direction of history as it exists and the concept of progress as an inexorable advancement of what we think is good. The latter is not real
When I’ve ‘contract scraped’ it was ‘moving humanity forward’…
I wanted the easy fat money and the site’s competitor wanted deniability.
No crime was committed, free to contract…
Robots, schmobots.
Have I told you about the philosophy of ‘personal utilitarianism’?
The outcome that provides the most utility, to me, is the moral and ethical one.
I guess someone had to offer a Markov tarpit as a service at some point. Though, I guess this might be a LLM tarpit as a service?
A very early application for generative AI was always destined to be this sort of thing… An eternal purgatory-trap like the simulation from I Have No Mouth and I Must Scream but the other way around: a torture-prison for undesirable AIs
Hot take: this LLM training / anti-training fight has been cooked up and fuelled by the clandestine actions of the book binding cartel, who hope that if the internet degenerates into untrustworthy AI-mush, people will return to buying physical books.
Search engines were the original guilty party. Robots.txt is a way of saying “don’t scrape this page” and it is abundantly clear that LLM training companies do not care about copyrights or safety or scraping etiquette. Feeding an LLM content which was generated by another LLM is a poison pill and will cause a sort of neural degenerative effect on the model. Ironically the best use of LLMs is to destroy other LLMs
Knowing publishers for college text books and the dirty things they get up to, you’re probably on to something.
I like the joke but the publishing industry as it exists today is so clueless and hosed that there is no hope there. I’m sure they already have AIs writing physical printed MFA-bait books of the nothing sort that you find by the entrance of Barnes and Noble, called something like “My Spirit Spoke through Dragonfly’s Wings” by Xavier Mbouti, number of legitimate readers: zero.
I like physical books. In fact I prefer them for most ‘readable’ content. Reading manuals and such online is a pain. On-line is ‘not’ where I want books in general to go. Some things ‘ok’. Like searching for a snip of code or a how-to. But otherwise…. Give me a book. Latest computer book I bought was on x86_64 assembly language on Linux, as it has been a long time since I’ve had to deal with x86_64 assembly and my references dated back to the 2000 era. Nice to kick back in a chair and read…
AI (which isn’t really AI but so it goes) used for ‘specific’ tasks is ok. Just another algorithm for the tool box. A specific task like PCB routing. Or chemistry where looking for viable compounds and materials. But for general use…. Forget it. Waste of energy in my mind. A nice fast searchable database of validated ‘facts and figures’ is where we should be going. Not having a computer that infers something and makes stuff up as it goes and writes homework for kids/adults. Or tries to write code :rolleyes: .
As someone who has a website that gets pummeled by bots of all kinds: fantastic.
I’m actually a general fan of AI, but the AI bots don’t play by the rules at all, they just increase workload of the sites.
I find it a bit galling they would classify the generated data as “real and related to scientific facts”, when it is in fact being generated by LLMs, which are already known to hallucinate, generating plausible-looking but false information.
Setting aside for the moment the tacit acknowledgment by a large network services provider that they know the major AI players are STILL unethically scraping the internet for training data: feeding LLM-generated data back into an LLM for training is inevitably going to end up reinforcing internal associations, with such data drowning out whatever actual real life public consensus exists on the topic, so it is still going to have a negative effect on any trained AI models despite Cloudflare’s assertions.
And for my own part, as someone whose life is made very difficult by “is real and related to scientific facts” information that I KNOW to be wrong and that the latest research is showing to be incorrect, reinforcing any such preexisting associations in the face of new and emerging information is only going to serve to set back the progress of advancement in nearly every field of study, harming our advancement as a society.
Inb4 this is used exclusively on humans and bots get right through it as usual
A bit late for Cloudflare being used on humans. I run into that #*@&ing “just a moment” screen of theirs, all the damned time. And can never get past it.
Seriously sucks when you’re trying to look up information and there’s no other site with it.
Cloudflare should go rot in a do-loop of their own making.
Cloudflare -is- the problem.
It creates a walled garden that only Bing, Google, etc. have access to. Go research why Gigablast gave up trying to offer an independent search engine.
Many people aren’t ready for it, but here I say: for so-called AI to become an actually good tool, it must reliably protect itself from garbage.
Poisoning weak models is a way to get there faster, as the big shots will need to invest in ways to make it more robust and trustable.
After all, it’s all robots… And like all robots, safety measures must be applied to make it useful.