
Hackaday Links: March 22, 2026

On Friday, Reuters reported that Amazon is going to try to get into the smartphone game…again. The Fire Phone was perhaps Amazon’s biggest commercial misstep, and was only on the market for about a year before it was discontinued in the summer of 2015. But now industry sources are saying that a new phone code-named “Transformer” is in the works from the e-commerce giant.

At this point, there’s no word on how much the phone would cost or when it would hit the market. The only information Reuters was able to squeeze out of their contacts was that the device would feature AI heavily. Real shocker there — anyone with an Echo device in their kitchen could tell you that Amazon is desperate to get you talking to their gadgets, presumably so they can convince you to buy something. While a smartphone with even more AI features we didn’t ask for certainly won’t be on our Wish List, if history is any indicator, we might be able to pick these things up cheap on the second-hand market.

On the subject of AI screwing everything up, earlier this week, the Electronic Frontier Foundation reported that The New York Times had started blocking the Internet Archive’s crawlers, citing concerns over their content being scraped up by bots for training data. The EFF likens this to a newspaper asking libraries to stop storing copies of their old editions, and warns that in an era where most people get their news via the Internet, not having an archived copy of sites like The Times will put holes in the digital record. They also point out that mirroring web pages for the purposes of making them more easily searchable is a widely accepted practice (ask Google) and has been legally recognized as fair use in court.

Assuming we take the NYT’s side of the story at face value, there’s a tiny part of our cold robotic heart that feels some sympathy for them. Over the last year or so, we’ve noticed some suspicious activity that we believe to be bots siphoning up content from the blog and Hackaday.io, and it’s resulted in a few technical headaches for us. On the other hand, what’s Hackaday here for if not to share information? Surely the same could be said for any newspaper, be it the local rag or The New York Times. If a chatbot learning some new phrases from us is the cost of doing business in 2026, so be it. Can’t stop the signal.


Cloudflare’s AI Labyrinth Wants Bad Bots To Get Endlessly Lost

Cloudflare has gotten more active in its efforts to identify and block unauthorized bots and AI crawlers that don’t respect boundaries. Their solution? AI Labyrinth, which uses generative AI to efficiently create a diverse maze of data as a defensive measure.

This is an evolution of efforts to thwart bots and AI scrapers that don’t respect things like “no crawl” directives, activity that accounts for an ever-growing share of traffic. Last year we saw Cloudflare step up their game in identifying and blocking such activity, but the whole thing is akin to an arms race. Those intent on hoovering up all the data they can are constantly shifting tactics in response to mitigations, and simply identifying bad actors with honeypots and blocking them doesn’t really do the job anymore. In fact, blocking requests mainly just alerts the baddies to the fact that they’ve been identified.

Instead of blocking requests, Cloudflare goes in the other direction and creates an all-you-can-eat sprawl of linked AI-generated content, luring crawlers into wasting their time and resources as they happily process an endless buffet of diverse facts unrelated to the site being crawled, all while Cloudflare learns as much about them as possible.
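Cloudflare hasn’t described its implementation in detail, but the core trick, an endless maze of interlinked pages that costs the defender almost nothing to serve, can be sketched in a few lines of Python. The path scheme, link count, and page contents here are all invented for illustration:

```python
import hashlib

def labyrinth_page(path: str, links_per_page: int = 3) -> str:
    """Return an HTML page whose links lead ever deeper into the maze.

    The page is derived deterministically from the requested path, so the
    'maze' needs no storage: every URL always renders the same page, and
    every page links onward to fresh, equally plausible URLs.
    """
    seed = hashlib.sha256(path.encode()).hexdigest()
    links = []
    for i in range(links_per_page):
        chunk = seed[i * 8:(i + 1) * 8]  # carve link slugs out of the hash
        links.append(f'<a href="{path}/{chunk}">more facts</a>')
    return (f"<html><body><p>Filler article {seed[:12]}</p>"
            f"{' '.join(links)}</body></html>")

# A crawler that follows any of these links simply descends forever.
page = labyrinth_page("/trap")
```

Because the pages are generated on demand from the URL alone, the defender spends a hash per request while the crawler burns time and bandwidth on every hop.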

That’s an important point: the content generated by the Labyrinth might be pointless and irrelevant, but it isn’t nonsense. Pages like these could plausibly end up in training data, and stuffing them with false facts would only add to the misinformation already circulating online. For that reason, the human-looking data making up the Labyrinth isn’t wrong, it’s just useless.

It’s certainly a clever method of dealing with crawlers, but the way things are going it’ll probably be rendered obsolete sooner rather than later, as the next move in the arms race gets made.

Cloudflare Adds Block For AI Scrapers And Similar Bots

It’s no big secret that a lot of today’s internet traffic consists of automated requests, ranging from innocent bots like search engine indexers to data-scraping bots run by LLM and other generative AI companies. With enough customers who are less than amused by this flood of useless traffic, Cloudflare has announced that it’s expanding its blocking feature for the latter category of scrapers. Initially this block only covered ‘poorly behaving’ scrapers, but now it apparently targets all such bots.

The block seems to be based around a range of characteristics, including the user agent string. According to Cloudflare’s data on its network, over 40% of identified AI bots came from ByteDance (Bytespider), followed by GPTBot at over 35% and ClaudeBot at 11%, plus a whole gaggle of smaller bots. Assuming that Imperva’s claim that bots make up over half of today’s internet traffic is even roughly accurate, that is still a lot of bandwidth being drained even when these bots respect robots.txt, with the website owner effectively subsidizing the training of some company’s models. Unsurprisingly, Cloudflare notes that many website owners have already taken measures to block these bots in some fashion.

Naturally, not all of these scraper bots are well-behaved. Spoofing the user agent is an obvious way to dodge blocks, but scraper bot activity has many tell-tale signs, which Cloudflare combines with statistical data from across its global network to compute a ‘bot score’ for each request. Although it remains to be seen whether false positives become an issue with Cloudflare’s approach, it’s definitely a sign of the times that more and more website owners are choosing to choke off unwanted, AI-related traffic.
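Cloudflare hasn’t published how its score is actually computed, but the general idea of boiling several request signals down to one number can be sketched with a toy heuristic. Every signal, weight, and threshold below is invented for illustration:

```python
# Toy illustration of scoring a request. Cloudflare's real "bot score"
# is proprietary; the signals and weights here are made up for the sketch.
KNOWN_AI_BOTS = ("Bytespider", "GPTBot", "ClaudeBot", "CCBot")

def bot_score(user_agent: str, requests_per_minute: float) -> int:
    """Return 0-100, where low scores mean 'probably a bot'."""
    score = 100
    if any(bot in user_agent for bot in KNOWN_AI_BOTS):
        score -= 80   # self-identified AI crawler
    if requests_per_minute > 60:
        score -= 30   # faster than any human browses
    if "Mozilla" not in user_agent:
        score -= 20   # not even pretending to be a browser
    return max(score, 0)
```

A site could then serve, throttle, or block based on where the score falls; the hard part in practice is that spoofed agents force the scorer to lean on the behavioral signals rather than the self-reported ones.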

AI Helps Make Web Scraping Faster And Easier

Web scraping is usually only a first step towards extracting meaningful data. Once you’ve got everything pulled down, you’ve still got to process it into something useful. Here to assist with that is Scrapegraph-ai, a Python tool that promises to automate the process using a selection of large language models (LLMs).

Scrapegraph-ai accepts a URL as well as a prompt, which is a plain-English instruction on what to do with the data. Examples include summarizing, describing images, and more. In other words, gathering the data and analyzing or formatting it can now be done in one step.

The project is actually pretty flexible in terms of the AI back-end. It’s able to work with locally-installed AI tools (via ollama) or with API keys for services like OpenAI and more. If you have an OpenAI API key, there’s an online demo that will show you the capabilities pretty effectively. Otherwise, local installation is only a few operations away.
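To give a feel for the interface, here is a minimal sketch built around the project’s SmartScraperGraph class, assuming a locally running Ollama model. The exact config keys have shifted between scrapegraph-ai releases and the URL is a placeholder, so treat this as illustrative rather than copy-paste ready:

```python
# Illustrative sketch of scrapegraph-ai usage; config keys vary between
# releases, and a local Ollama server is assumed to be running.
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "base_url": "http://localhost:11434",  # local Ollama endpoint
    },
}

smart_scraper = SmartScraperGraph(
    prompt="List the title and author of every article on the page",
    source="https://example.com/blog",  # placeholder URL
    config=graph_config,
)

result = smart_scraper.run()  # the extracted data, shaped by the prompt
print(result)
```

The prompt does the work that page-specific parsing code used to do, which is exactly the appeal: the same few lines handle wildly different sites.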

This isn’t the first time we have seen the flexibility of AI tools like large language models leveraged to ease the notoriously fiddly task of web scraping, and it’s great to see the results have only gotten better.

Tired Of Web Scraping? Make The AI Do It

[James Turk] has a novel approach to the problem of scraping web content in a structured way without needing to write the kind of page-specific code web scrapers usually have to deal with. How? Just enlist the help of a natural language AI. Scrapeghost relies on OpenAI’s GPT API to parse a web page’s content, pull out and classify any salient bits, and format it in a useful way.

What makes Scrapeghost different is how the data gets organized: when instantiating it, one defines the schema of the data one wishes to extract. For example:

from scrapeghost import SchemaScraper

scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)

The kicker is that this format is entirely up to you! The GPT models are very, very good at processing natural language, and scrapeghost uses GPT to process the scraped data and find (using the example above) whatever looks like a name, district, party, photo, and office address and format it exactly as requested.

It’s an experimental tool and you’ll need an API key from OpenAI to use it, but it has useful features and is certainly a novel approach. There’s a tutorial and even a command-line interface, so check it out.

“Lazier” Web Scraping Is Better Web Scraping

Ever needed to get data from a web page? Pulling it out of the markup is called web scraping, and [Doug Guthrie] has a few tips for making the process simpler and more efficient, complete with code examples in Python. He uses Yahoo Finance as his example, apparently a pretty common use case judging by how often questions about it pop up on Stack Overflow. The general concepts are pretty widely applicable, however.

[Doug] shows that while parsing a web page for a specific piece of data (for example, a stock price) is not difficult, there are sometimes easier and faster ways to go about it. In the case of Yahoo Finance, the web page most of us look at isn’t really the actual source of the data being displayed, it’s just a front end.

One way to more efficiently scrape data is to get to the data’s source. In the case of Yahoo Finance, the data displayed on a web page comes from a JavaScript variable that is perfectly accessible to the end user, and much easier to parse and work with. Another way is to go one level lower, and retrieve JSON-formatted data from the same place that the front-end web page does; ignoring the front end altogether and essentially treating it as an unofficial API. Either way is not only easier than parsing the end result, but faster and more reliable, to boot.
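The “go to the source” trick can be demonstrated with nothing but the standard library. Suppose a page embeds its data in a JavaScript variable; the variable name and the miniature HTML below are invented for illustration:

```python
import json
import re

# A miniature stand-in for a page like Yahoo Finance: the markup the user
# sees is rendered from a JSON blob embedded in a <script> tag.
html = """
<html><body>
<div id="price">Loading...</div>
<script>var pageData = {"symbol": "HACK", "price": 13.37, "currency": "USD"};</script>
</body></html>
"""

# Instead of parsing the rendered markup, pull the JSON straight out of
# the script and hand it to the json module.
match = re.search(r"var pageData = (\{.*?\});", html)
data = json.loads(match.group(1))
```

Once the blob is in hand, `data["price"]` is a real number rather than a string fished out of a `<div>`, which is precisely why this route tends to be faster and more reliable.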

How does one find these resources? [Doug] gives some great tips on how exactly to do so, including how to use a web browser’s developer tools to ferret out XHR requests. These methods won’t work for everything, but they are definitely worth looking into to see if they are an option. Another resource to keep in mind is woob (web outside of browsers), which has an impressive list of back ends available for reading and interacting with web content. So if you need data for your program, but it’s on a web page? Don’t let that stop you!

Hack The Web Without A Browser

It is a classic problem. You want data for use in your program but it is on a webpage. Some websites have an API, of course, but usually, you are on your own. You can load the whole page via HTTP and parse it. Or you can use some tools to “scrape” the site. One interesting way to do this is woob — web outside of browsers.

The system uses a series of backends tailored to particular sites. There’s a collection of official backends, and you can also create your own. Once you have a backend, you can configure it and use it from Python. Here’s an example of finding a bank account balance:

>>> from pprint import pprint
>>> from woob.core import Woob
>>> from woob.capabilities.bank import CapBank
>>> w = Woob()
>>> w.load_backends(CapBank)
{'societegenerale': <Backend 'societegenerale'>, 'creditmutuel': <Backend 'creditmutuel'>}
>>> pprint(list(w.iter_accounts()))
[<Account id='7418529638527412' label=u'Compte de ch\xe8ques'>,
<Account id='9876543216549871' label=u'Livret A'>,
<Account id='123456789123456789123EUR' label=u'C/C Eurocompte Confort M Roger Philibert'>]
>>> acc = next(iter(w.iter_accounts()))
>>> acc.balance
Decimal('87.32')

The list of available backends is impressive, but eventually you’ll want to create your own modules. Thankfully, there’s plenty of documentation about how to do that. The framework allows you to post data to the website and easily read the results. Each backend also has a test that can detect if a change in the website breaks the code, a common problem with such schemes.

We didn’t see a Hackaday backend. Too bad. There are, however, many application examples, both console-based and Qt-based: you can search for movies, manage recipes, or browse dating sites.

Of course, there are many possible approaches to this problem. Maybe you need to find out when the next train is leaving.