In the olden days of the WWW you could just put a robots.txt file in the root of your website, and crawling bots from search engines and kin would (generally) respect the rules in it. These days, however, web crawlers, especially those from large language model (LLM) companies, happily ignore such signs on the lawn before proceeding to hoover up every scrap of content on websites. Naturally this makes a lot of people very angry, but what can you do about it? The answer by [Aaron B] is Nepenthes, described on the project page as a ‘tar pit for catching web crawlers’.
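For anyone who has never had to write one, that friendly sign on the lawn is just a plain-text file served from the site root; a minimal example (the user-agent names and paths here are purely illustrative) looks like this:

```
# https://example.com/robots.txt : well-behaved crawlers fetch this before anything else
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
```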
More commonly known as ‘pitcher plants’, Nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. In the case of this Lua-based project the idea is roughly the same. Configured as a trap behind a web server (e.g. at /nepenthes), any web crawler that accesses it is presented with an endless number of (randomly generated) pages containing many URLs to follow. Page generation is deliberately slow so as not to soak up significant CPU time, while still giving the LLM scrapers plenty of random nonsense to chew on.
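Nepenthes itself is written in Lua, but the mechanism is simple enough to sketch in a few lines. The following Python toy (the path, delay, and link count are arbitrary choices, not values taken from the project) serves every request under the trap prefix slowly and fills the response with links that lead straight back into the trap:

```python
# Rough sketch of the tarpit idea (not the actual Lua implementation):
# every request below the trap path gets a slow, randomly generated page
# whose links all point back into the trap.
import random
import string
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP_PREFIX = "/nepenthes"   # mount point, as in the article's example
DELAY_SECONDS = 2            # deliberately slow, but cheap on CPU
LINKS_PER_PAGE = 10

def random_slug(length=8):
    return "".join(random.choices(string.ascii_lowercase, k=length))

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith(TRAP_PREFIX):
            self.send_error(404)
            return
        time.sleep(DELAY_SECONDS)  # drip-feed the crawler instead of burning CPU
        links = "".join(
            f'<p><a href="{TRAP_PREFIX}/{random_slug()}">{random_slug()}</a></p>'
            for _ in range(LINKS_PER_PAGE)
        )
        body = f"<html><body>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```

In a real deployment the trap sits behind the existing web server as its own location, so ordinary visitors never wander into it unless they follow a link no human would normally see.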
Considering that these web crawlers deemed adhering to the friendly sign on the lawn to be beneath them, the least we can do in response is to hasten model collapse by feeding these LLM scrapers whatever rolls out of a simple (optionally Markov-based) text generator.
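The Markov option boils down to a plain word-level chain trained on whatever corpus you care to feed it. A toy version (the corpus file name and chain order are placeholders, not part of the project) could look like this:

```python
# Toy word-level Markov babbler: train on any text you like, then emit
# plausible-looking nonsense for the tarpit pages to serve.
import random
from collections import defaultdict

def train(text, order=2):
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain

def babble(chain, length=50):
    key = random.choice(list(chain.keys()))
    out = list(key)
    for _ in range(length):
        choices = chain.get(tuple(out[-len(key):]))
        if not choices:
            key = random.choice(list(chain.keys()))
            out.extend(key)
            continue
        out.append(random.choice(choices))
    return " ".join(out)

if __name__ == "__main__":
    # Supply your own corpus; the project deliberately ships without one
    # so that every tarpit babbles differently.
    corpus = open("corpus.txt").read()
    print(babble(train(corpus)))
```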
Thumbs up!
This…is…awesome….
Bonus points if it hands the crawler a .zip bomb.
Yawn, easy to bypass with timing and semantic analysis. I just fed the module source into a local model and within a few iterations it was able to ascertain with reasonable reliability whether a sequence of pages was served via the module, then classify the sourced content as untrustworthy, which can then be fed back to data scientists for fine-tuning.
Calling BS on you.
From project page:
The Markov feature requires a trained corpus to babble from. One was intentionally omitted because, ideally, everyone’s tarpits should look different to evade detection.
—
What did you do? Train your local model against every possible corpus? Lol.
You state “classify the sourced content as untrustworthy.” Doesn’t that achieve the purpose of implementing such a trap in the first place? To discourage LLM scrapers from harvesting the content on your website?
Sure, if you already know you have to defend against this specific attack.
That’s the problem with publishing it: once it’s public, it’s easy to work out countermeasures. Like other public attempts at poisoning LLM training sets, you need to build completely novel measures and STFU about them in order for them to have a chance to survive contact with the “enemy”.