Trap Naughty Web Crawlers In Digestive Juices With Nepenthes

In the olden days of the WWW you could just put a robots.txt file in the root of your website, and crawling bots from search engines and kin would (generally) respect the rules in it. These days, however, web crawlers from large language model (LLM) companies in particular happily ignore such signs on the lawn before proceeding to hoover up every scrap of content on a website. Naturally this makes a lot of people very angry, but what can you do about it? The answer from [Aaron B] is Nepenthes, described on the project page as a ‘tar pit for catching web crawlers’.

More commonly known as ‘pitcher plants’, Nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. The idea behind this Lua-based project is roughly the same. Configured as a trap behind a web server (e.g. at /nepenthes), it presents any web crawler that wanders in with an endless series of (randomly generated) pages, each containing many more URLs to follow. Page generation is deliberately kept slow so that it doesn’t soak up significant CPU time, while still giving the LLM scrapers plenty of random nonsense to chew on.
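Nepenthes itself is written in Lua, but the core trick is small enough to sketch in just about anything. As a rough, hypothetical illustration only (not the project’s actual code), a Flask version might look something like this, with the /nepenthes path, word list, link count, and delays all invented for the example:

```python
# Hypothetical tar-pit sketch in Python/Flask; Nepenthes itself is Lua, and
# the paths, words, and timings here are made up for illustration.
import random
import string
import time

from flask import Flask

app = Flask(__name__)

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "flux", "quanta", "widget"]

def random_slug(n=8):
    # Random path segment, so every generated link leads deeper into the trap.
    return "".join(random.choices(string.ascii_lowercase, k=n))

@app.route("/nepenthes/", defaults={"slug": ""})
@app.route("/nepenthes/<slug>")
def tarpit(slug):
    # Deliberately slow response: wastes the crawler's time, not our CPU.
    time.sleep(random.uniform(2, 10))
    paragraph = " ".join(random.choices(WORDS, k=200))
    links = "".join(
        f'<a href="/nepenthes/{random_slug()}">{random_slug()}</a> '
        for _ in range(20)
    )
    return f"<html><body><p>{paragraph}</p>{links}</body></html>"

if __name__ == "__main__":
    app.run()
```

Every page a crawler fetches under the trap path only ever links to more trap pages, so a bot that ignores robots.txt can wander in there indefinitely.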

Considering that these web crawlers deemed adhering to the friendly sign on the lawn to be beneath them, the least we can do in response is to hasten model collapse by feeding these LLM scrapers whatever rolls out of a simple (optionally Markov-based) text generator.
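The project’s own generator isn’t reproduced here, but to give a sense of how little it takes to produce plausible-looking word salad, here is a minimal first-order Markov chain sketch in Python, with the corpus and function names invented for the example:

```python
# Minimal first-order Markov text generator; a sketch, not Nepenthes' code.
import random
from collections import defaultdict

def build_chain(text):
    # Map each word to the list of words observed to follow it.
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def babble(chain, length=60):
    # Walk the chain, picking a random successor at each step;
    # jump to a random word when a dead end is reached.
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        successors = chain.get(word)
        word = random.choice(successors) if successors else random.choice(list(chain))
        out.append(word)
    return " ".join(out)

if __name__ == "__main__":
    corpus = "the cat sat on the mat and the dog sat on the cat"  # feed it any corpus you like
    print(babble(build_chain(corpus)))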

This Machine Learning Algorithm Is Meta

Suppose you ran a website releasing many articles per day about various topics, all following a general theme. And suppose that your website allowed for a comments section for discussion on those topics. Unless you are brand new to the Internet, you can also imagine that the comments section needs at least a little bit of moderation to filter out spam, off-topic, or even toxic comments. If you don’t want to employ people for this task, you could try this machine learning algorithm instead.

[Ladvien] goes through a general overview of how to set up a convolutional neural network (CNN), which can be programmed to do many things; this one crawls a web page, gathers data, and makes decisions based on that data. In this case the task is to identify toxic comments, but the goal is not to forge the sharpest sword in the comment moderator’s armory, rather to learn more about how CNNs work.
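The exact architecture is in [Ladvien]’s write-up; purely as a rough sketch of what a 1D convolutional text classifier commonly looks like in Keras, something like the following would do the job, with the vocabulary size, sequence length, and layer widths below being placeholders rather than his numbers:

```python
# Hypothetical 1D-CNN toxic-comment classifier in Keras; sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # number of distinct tokens kept from the training data
MAX_LEN = 200        # comments padded/truncated to this many tokens

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 128),         # token IDs -> dense word vectors
    layers.Conv1D(64, 5, activation="relu"),   # slide 5-token filters over the comment
    layers.GlobalMaxPooling1D(),               # keep the strongest response per filter
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),     # probability the comment is toxic
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```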

Written in Python, the guide walks through the code itself and how it behaves, then moves on to setting up a small server to host the neural network, and finally to creating the web service. As with any machine learning project, you need a reliable dataset to use for training, and this one came from Wikipedia comments previously flagged by humans. Trolling nuance is thrown aside, as the example homes in on blatant insults and vulgarity.
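To give a rough idea of what that final web service step can look like, here is a hypothetical Flask endpoint that loads a trained model and tokenizer and returns a toxicity score; the file names, route, and port are assumptions for the example, not the guide’s actual code:

```python
# Hypothetical scoring endpoint; file names and the /score route are assumptions.
import pickle

from flask import Flask, jsonify, request
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

app = Flask(__name__)
model = load_model("toxic_cnn.h5")        # trained classifier from the previous step
with open("tokenizer.pkl", "rb") as f:    # tokenizer fitted on the training comments
    tokenizer = pickle.load(f)

MAX_LEN = 200  # must match the sequence length used during training

@app.route("/score", methods=["POST"])
def score():
    comment = request.get_json()["comment"]
    seq = pad_sequences(tokenizer.texts_to_sequences([comment]), maxlen=MAX_LEN)
    toxicity = float(model.predict(seq)[0][0])
    return jsonify({"toxicity": toxicity})

if __name__ == "__main__":
    app.run(port=5000)
```

A POST to /score with a JSON body like {"comment": "..."} comes back with a number between 0 and 1, which the comment system can then threshold however it likes.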

While [Ladvien] notes that his guide isn’t meant to be comprehensive, but rather to fill in some gaps he noticed in other guides like it, we find it to be an interesting read. He also mentions that, in theory, this tool could be used to predict the number of comments an article like this very one will attract, based on the language in the article itself. We’ll leave that one as an academic exercise for now, probably.