Cloudflare’s AI Labyrinth Wants Bad Bots To Get Endlessly Lost

March 24, 2025 by Donald Papp 37 Comments

Cloudflare has gotten more active in its efforts to identify and block unauthorized bots and AI crawlers that don’t respect boundaries. Their solution? AI Labyrinth, which uses generative AI to efficiently create a diverse maze of data as a defensive measure.

This is an evolution of efforts to thwart bots and AI scrapers that don’t respect things like “no crawl” directives, which accounts for an ever-growing amount of traffic. Last year we saw Cloudflare step up their game in identifying and blocking such activity, but the whole thing is akin to an arms race. Those intent on hoovering up all the data they can are constantly shifting tactics in response to mitigations, and simply identifying bad actors with honeypots and blocking them doesn’t really do the job any more. In fact, blocking requests mainly just alerts the baddies to the fact they’ve been identified.

Instead of blocking requests, Cloudflare goes in the other direction and creates an all-you-can-eat sprawl of linked AI-generated content, luring crawlers into wasting their time and resources as they happily process an endless buffet of diverse facts unrelated to the site being crawled, all while Cloudflare learns as much about them as possible.

That’s an important point: the content generated by the Labyrinth might be pointless and irrelevant, but it isn’t nonsense. After all, the content generated by the Labyrinth can plausibly end up in training data, and fraudulent data would essentially be increasing the amount of misinformation online as a side effect. For that reason, the human-looking data making up the Labyrinth isn’t wrong, it’s just useless.

It’s certainly a clever method of dealing with crawlers, but the way things are going it’ll probably be rendered obsolete sooner rather than later, as the next move in the arms race gets made.

Cloudflare Adds Block For AI Scrapers And Similar Bots

July 4, 2024 by Maya Posch 19 Comments

It’s no big secret that a lot of the internet traffic today consists out of automated requests, ranging from innocent bots like search engine indexers to data scraping bots for LLM and similar generative AI companies. With enough customers who are less than amused by this boost in useless traffic, Cloudflare has announced that it’s expanding its blocking feature for the latter category of scrapers. Initially this block was only for ‘poorly behaving’ scrapers, but now it apparently targets all of such bots.

The block seems to be based around a range of characteristics, including the user agent string. According to Cloudflare’s data on its network, over 40% of identified AI bots came from ByteDance (Bytespider), followed by GPTBot at over 35% and ClaudeBot with 11% and a whole gaggle of smaller bots. Assuming that Imperva’s claims of bots taking up over half of today’s internet traffic are somewhat correct, that means that even if these bots follow robots.txt, that is still a lot of bandwidth being drained and the website owner effectively subsidizing the training of some company’s models. Unsurprisingly, Cloudflare notes that many website owners have already taken measures to block these bots in some fashion.

Naturally, not all of these scraper bots are well-behaved. Spoofing the user agent is an obvious way to dodge blocks, but scraper bot activity has many tell-tale signs which Cloudflare uses, as well as statistical data across its global network to compute a ‘bot score‘ for any requests. Although it remains to be seen whether false positives become an issue with Cloudflare’s approach, it’s definitely a sign of the times that more and more website owners are choosing to choke off unwanted, AI-related traffic.

AI Helps Make Web Scraping Faster And Easier

May 8, 2024 by Donald Papp 12 Comments

Web scraping is usually only a first step towards extracting meaningful data. Once you’ve got everything pulled down, you’ve still got to process it into something useful. Here to assist with that is Scrapegraph-ai, a Python tool that promises to automate the process using a selection of large language models (LLMs).

Scrapegraph-ai is able to accept a URL as well as a prompt, which is a plain-English instruction on what to do with the data. Examples include summarizing, describing images, and more. In other words, gathering the data and analyzing or formatting it can now be done as one.

The project is actually pretty flexible in terms of the AI back-end. It’s able to work with locally-installed AI tools (via ollama) or with API keys for services like OpenAI and more. If you have an OpenAI API key, there’s an online demo that will show you the capabilities pretty effectively. Otherwise, local installation is only a few operations away.

This isn’t the first time we have seen the flexibility of AI tools like large language models leveraged to ease the notoriously-fiddly task of web scraping, and it’s great to see the results have only gotten better.

Tired Of Web Scraping? Make The AI Do It

April 9, 2023 by Donald Papp 19 Comments

[James Turk] has a novel approach to the problem of scraping web content in a structured way without needing to write the kind of page-specific code web scrapers usually have to deal with. How? Just enlist the help of a natural language AI. Scrapeghost relies on OpenAI’s GPT API to parse a web page’s content, pull out and classify any salient bits, and format it in a useful way.

What makes Scrapeghost different is how data gets organized. For example, when instantiating scrapeghost one defines the data one wishes to extract. For example:

from scrapeghost import SchemaScraper
scrape_legislators = SchemaScraper(
schema={
"name": "string",
"url": "url",
"district": "string",
"party": "string",
"photo_url": "url",
"offices": [{"name": "string", "address": "string", "phone": "string"}],
}
)

The kicker is that this format is entirely up to you! The GPT models are very, very good at processing natural language, and scrapeghost uses GPT to process the scraped data and find (using the example above) whatever looks like a name, district, party, photo, and office address and format it exactly as requested.

It’s an experimental tool and you’ll need an API key from OpenAI to use it, but it has useful features and is certainly a novel approach. There’s a tutorial and even a command-line interface, so check it out.

“Lazier” Web Scraping Is Better Web Scraping

February 7, 2022 by Donald Papp 13 Comments

Ever needed to get data from a web page? Parsing the content for data is called web scraping, and [Doug Guthrie] has a few tips for making the process of digging data out of a web page simpler and more efficient, complete with code examples in Python. He uses getting data from Yahoo Finance as an example, because it’s apparently a pretty common use case judging by how often questions about it pop up on Stack Overflow. The general concepts are pretty widely applicable, however.

[Doug] shows that while parsing a web page for a specific piece of data (for example, a stock price) is not difficult, there are sometimes easier and faster ways to go about it. In the case of Yahoo Finance, the web page most of us look at isn’t really the actual source of the data being displayed, it’s just a front end.

One way to more efficiently scrape data is to get to the data’s source. In the case of Yahoo Finance, the data displayed on a web page comes from a JavaScript variable that is perfectly accessible to the end user, and much easier to parse and work with. Another way is to go one level lower, and retrieve JSON-formatted data from the same place that the front-end web page does; ignoring the front end altogether and essentially treating it as an unofficial API. Either way is not only easier than parsing the end result, but faster and more reliable, to boot.

How does one find these resources? [Doug] gives some great tips on how exactly to do so, including how to use a web browser’s developer tools to ferret out XHR requests. These methods won’t work for everything, but they are definitely worth looking into to see if they are an option. Another resource to keep in mind is woob (web outside of browsers), which has an impressive list of back ends available for reading and interacting with web content. So if you need data for your program, but it’s on a web page? Don’t let that stop you!

Hack The Web Without A Browser

January 17, 2022 by Al Williams 20 Comments

It is a classic problem. You want data for use in your program but it is on a webpage. Some websites have an API, of course, but usually, you are on your own. You can load the whole page via HTTP and parse it. Or you can use some tools to “scrape” the site. One interesting way to do this is woob — web outside of browsers.

The system uses a series of backends tailored at particular sites. There’s a collection of official backends, and you can also create your own. Once you have a backend, you can configure it and use it from Python. Here’s an example of finding a bank account balance:

>>> from woob.core import Woob
>>> from woob.capabilities.bank import CapBank
>>> w = Woob()
>>> w.load_backends(CapBank)
{'societegenerale': <Backend 'societegenerale'>, 'creditmutuel': <Backend 'creditmutuel'>}
>>> pprint(list(w.iter_accounts()))
[<Account id='7418529638527412' label=u'Compte de ch\xe8ques'>,
<Account id='9876543216549871' label=u'Livret A'>,
<Account id='123456789123456789123EUR' label=u'C/C Eurocompte Confort M Roger Philibert'>]
>>> acc = next(iter(w.iter_accounts()))
>>> acc.balance
Decimal('87.32')

The list of available backends is impressive, but eventually, you’ll want to create your own modules. Thankfully, there’s plenty of documentation about how to do that. The framework allows you to post data to the website and easily read the results. Each backend also has a test which can detect if a change in the website breaks the code, which is a common problem with such schemes.

We didn’t see a Hackaday backend. Too bad. There are, however, many application examples, both console-based and using QT. For example, you can search for movies, manage recipes, or dating sites.

Of course, there are many approaches possible to this problem. Maybe you need to find out when the next train is leaving.

Job Application Script Automates The Boring Stuff With Python

May 30, 2020 by Sven Gregori 15 Comments

Job hunting can certainly require a good amount of hoop-jumping in today’s age. Even if you’re lucky enough to have your application read by an actual human, there’s no guarantee the person on the other end has much of an understanding about your skill set. Oftentimes, the entire procedure is futile from the start, and as a recent graduate, [harshibar] is well aware of the soul-crushing experience investing a lot of time in it can be. Well, as the saying goes: if you can’t beat them, join them — and if you can’t join them, automate the hell out of the application process.

As the final piece of a “5 Python Projects in 5 Days” challenge [harshibar] set for herself — which also spawned a “Tinder for Netflix” for the web development section of it — she essentially created a web-scraper that gathers job openings for a specific search term, and automatically sends an application to each and every one of them. Using Beautiful Soup to parse the scraped pages of a certain job portal, Selenium’s browser automation functionality to fill out the online application forms, she can get all her information into the form saving countless hours in comparison to the manual alternative. The program even hits the apply button.

While the quantity-over-quality approach may not be for everyone, there’s of course room for more filtering and being more selective about the job openings beforehand, which [harshibar] also addresses in her video about the project (embedded below). And while this won’t fix the application process itself, we can definitely see the satisfaction a beating-them-at-their-own-game might provide — plus, it can’t have a worse miss rate than your typical LinkedIn “recruiter”. Still, if you’re looking for a more systematic approach, have a look at [Lewin Day]’s view on the subject, he even has advice job hunting is still further down the road for you.

Continue reading “Job Application Script Automates The Boring Stuff With Python” →