Trap Naughty Web Crawlers In Digestive Juices With Nepenthes

In the olden days of the WWW you could just put a robots.txt file in the root of your website, and crawling bots from search engines and kin would (generally) respect the rules in it. These days, however, web crawlers, especially those from large language model (LLM) companies, happily ignore such signs on the lawn before proceeding to hoover up every scrap of content on websites. Naturally this makes a lot of people very angry, but what can you do about it? The answer by [Aaron B] is Nepenthes, described on the project page as a ‘tar pit for catching web crawlers’.

More commonly known as ‘pitcher plants’, Nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. In the case of this Lua-based project the idea is roughly the same. Configured as a trap behind a web server (e.g. at /nepenthes), it presents any web crawler that wanders in with an endless series of randomly generated pages, each stuffed with more URLs to follow. Page generation is deliberately kept slow so that the trap doesn’t soak up significant CPU time, while still giving LLM scrapers plenty of random nonsense to chew on.
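To get a feel for the mechanism, here is a minimal Python sketch of the same endless-links idea. It is an illustration of the concept only, not the actual Lua implementation; the trap path, delay, and port are arbitrary example values.

```python
# Sketch of the tarpit concept: every request under the trap path gets a
# slow, randomly generated page whose links all lead back into the trap.
# Illustration only; not the actual Nepenthes (Lua) code.
import random
import string
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP_PREFIX = "/nepenthes"   # example mount point behind your real web server
DELAY_SECONDS = 2            # deliberately slow, so the trap costs little CPU
LINKS_PER_PAGE = 10

def random_slug(n: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase, k=n))

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith(TRAP_PREFIX):
            self.send_error(404)
            return
        time.sleep(DELAY_SECONDS)  # trickle the page out slowly
        links = "".join(
            f'<li><a href="{TRAP_PREFIX}/{random_slug()}">{random_slug()}</a></li>'
            for _ in range(LINKS_PER_PAGE)
        )
        body = f"<html><body><ul>{links}</ul></body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```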

Considering that these web crawlers have deemed adhering to the friendly sign on the lawn to be beneath them, the least we can do in response is to hasten model collapse by feeding these LLM scrapers whatever rolls out of a simple (optionally Markov-based) text generator.
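The ‘random nonsense’ half of such a trap can be as simple as a word-level Markov chain babbling from whatever text you feed it. Below is a toy Python sketch of that general idea; it is not the project’s own generator, and the corpus file name is just a placeholder.

```python
# Toy word-level Markov babbler: learns word-to-word transitions from a
# corpus, then emits plausible-looking nonsense. Illustration only.
import random
from collections import defaultdict

def train(corpus: str) -> dict:
    words = corpus.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def babble(chain: dict, length: int = 50) -> str:
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end (word only ever appeared last): jump somewhere random.
        word = random.choice(followers) if followers else random.choice(list(chain))
        out.append(word)
    return " ".join(out)

if __name__ == "__main__":
    text = open("corpus.txt").read()   # placeholder: supply your own corpus
    print(babble(train(text)))
```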

44 thoughts on “Trap Naughty Web Crawlers In Digestive Juices With Nepenthes”

    1. The main effect of this software will be crawlers quickly deciding the site’s not worth crawling (if it works at all). A crawler so dumb that it actually gets stuck is unlikely to be running at the kind of scale that would ever reach your website.

      I’m not convinced it’d even work (most of the time you only want a link or two from a website, which is what places like HaD showcase).

      More importantly, this will make Google etc. consider your website “spam AI nonsense”. No one will read your content, and some of those crappy browser plugins will probably mark your site as up to no good.

      The price of this? You DoS your own website…

      It’s a fun idea, but I doubt it’ll work in practice. I couldn’t see any evidence of what actually happens.

      1. Google respects robots.txt. This is for crawlers that don’t respect it.

        And it works. It generates so much fake content that the real content of your site is just a fraction of the total, and the crawler will probably miss it in the flood of nonsense.

        1. Seems like you should read the actual information. I did, and the software’s author lists the two (clear and obvious) problems I mentioned as major concerns.

          i.e. expect to be delisted by search engines, and expect continuous high load.

          It’s also just common sense: spam websites full of useless AI-generated junk have existed forever, including ones built on on-the-fly Markov chains.

          Again, cute idea; in practice not really useful.

          1. It’s not meant to be useful, it’s to annoy crawlers.

            I have one server with an SSH daemon listening on port 22 that rejects any and all authentication. Costs me money, CPU, bandwidth, but I do it anyway. I have an Apache identifying as IIS. I have OpenSSH running on Ubuntu identifying as Windows 9. If it’s fun it isn’t useless.

            And if you configure robots.txt correctly, Google and any crawler that honors it won’t visit the poison pit, and those who don’t will be fed garbage.

        2. So, I don’t understand the point of it… If you can detect a crawler – which is apparently required in order to know when to generate random content to serve to it – why don’t you just disconnect the crawler by killing its connections? That would save your bandwidth/processing better than letting it continue to slurp down content slowly. Your connections are like a door to your house. You lock and unlock the door, selectively. You open it to let your friends and family in, but close and lock it to keep strangers out. This solution to me seems like a) overkill, b) anger and/or malice, c) vengeance and/or punishment, much like road rage. “I don’t like what that guy just did so I’m going to teach him a lesson!” Every reckless act has a price.

      2. If you WANT to be search indexed, then obviously you would not employ this technique. This is for when you DON’T want a bunch of undesired attention. Personal portals for example. It’s the opposite of SEO.

  1. yawn, easy to bypass with timing and semantic analysis. just fed the module source into a local model and within a few iterations it was able to ascertain with reasonable reliability whether or not a sequence of pages was served via the module, then classify the sourced content as untrustworthy, which can then be fed back to data scientists for fine tuning.

    1. Calling BS on you.

      From project page:

      The Markov feature requires a trained corpus to babble from. One was intentionally omitted because, ideally, everyone’s tarpits should look different to evade detection.

      What did you do? Train your local model against every possible corpus? Lol.

    2. You state “classify the sourced content as untrustworthy.” Doesn’t that achieve the purpose of implementing such a trap in the first place? To discourage LLM scrapers from harvesting the content on your website?

      1. That’s the problem with publishing it, once it’s public, it’s easy to work out countermeasures. Like other public attempts at poisoning LLM sets, you need to build completely novel measures and STFU about them in order for them to have a chance to survive contact with the “enemy”.

    3. To be honest, we all suffer this when we get Rickrolled.
      I doubt it would take more than several iterations of fine-tuning the trash pages, using existing LLM crawlers and page access stats, to get a positive outcome for the hosts.

  2. That is a tad naive and out of date; the AI training data gold rush has moved on to actual user interactions with AI systems, because watching humans guide LLMs toward useful real-world solutions is incredibly valuable.

  3. Common sense states: If you don’t want something out there, then don’t put it on the web for public display.

    Now I know I’m missing something as I just don’t understand (since WWW just isn’t my level of expertise). So I ask it here hoping that someone can explain it to me in simple words WHY it is a problem that webscrapers scrape the web.

    1. It is a problem because LLM scrapers in particular are incredibly aggressive, soaking up a lot of bandwidth, internet traffic and server CPU time, yet few of them will be deterred by a simple robots.txt rule.

      As explained in the summary (and in the linked article and on the project page), this is a defense mechanism for bots that are not playing nice and costing incredible amounts of money for the hosted website owner.

    1. Not really useful if you do not provide a link or even a name to search for?
      But you missed the point: it was not to prevent webscrapers at all, it was to hide his own content in a sea of gibberish, so much so that the site would be excluded from the training data.

    2. imagine believing you’re smart and not realizing that nowadays you can spin up 10 thousand crawlers, each with a different IP, from 10 different cloud providers all over the world

      good luck blocking that

      you should apologize for being such a smug idiot

    1. The issue is not the bandwidth, it’s plain copyright & attribution.
      An LLM is a derived work of all its training data (it is debatable if this is fair use or not), so at least it should attribute, but as a creator, I’d want a piece of the pie too.

      1. “Art is a derived work of all its training data (it is debatable if this is fair use or not), so at least it should attribute, but as a creator, I’d want a piece of the pie too.”

        Fixed it for you; please cite every piece of media you’ve ever consumed. Yes, AI is trash, and computers doing art while humans work is not the future I was hoping for.

        1. fwiw a lot of people earnestly wrestle with this exact question. artists are all the time publicly acknowledging and debating their influences. and sometimes artists are maligned for obviously ripping off something, but refusing to list that something as one of their influences when they’re directly asked about it. and if you go looking for it, there are a ton of interviews with musicians who are confronted with a bit of riff or melody that showed up in their work and they say “oh! i didn’t realize that’s where i got it, but now i hear it, you’re absolutely right.” it’s a super well-known phenomenon in music that there’s nothing new under the sun, everyone’s borrowing whether they notice it at the time or not. and most are pretty honest about it i think. “i was sure i invented that, it was just stuck in my head when i got out of the shower one morning and i put a song together around it! now i know”

    2. meh…i have a good amount of content up that is not well-advertised but is also not exactly hidden. ‘security through obscurity’. it’s good enough for me because i don’t care if any one particular person sees it — if someone stalks me and bothers to find it, i’m perfectly fine with that. but i don’t want it indexed by a search engine and i suppose i don’t want AI to be spewing it back at people. so robots.txt has actually served me just fine in this role, and i’ll be a little bummed if robots.txt becomes useless in the future (i may have to do actual security!), so i’m glad people are working on countermeasures.

      my point is that there are legitimate gray areas where security or privacy matters ‘only a little bit’, and relying on robots.txt is ‘good enough so far’ even though it’s obviously not real security or real throttling or real anything.

  4. Years ago I made a trap using a very similar idea: an endless maze of random links:
    https://bicyclesonthemoon.info/git-projects/?p=botm/www-trap
    (added to git in 2022, but was created much earlier)
    It just generates a page with links to more pages forever.
    I listed the URL only in the robots.txt file and did not link it from anywhere else (a minimal example of that robots.txt is sketched below).
    This way it will:
    – not affect bots that respect robots.txt (it is forbidden there)
    – not affect bots that ignore robots.txt (they will never find it)
    – affect bots that disrespect robots.txt on purpose.
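    A minimal robots.txt illustrating that layout might look like the following, where /trap/ stands in for wherever the maze actually lives; the robots.txt entry is the only place the path ever appears:

```
User-agent: *
Disallow: /trap/
```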

  5. There’s no reason for all these AI bots to be crawling websites.
    If you’re getting multiple requests from the same IP over and over, it’s probably software.
    I remember reading a story on cankles. That night, while I was reading the story, a news reporter was reading the exact same story word for word, as if it were breaking news or the start of World War III. You look at the front of Yahoo and most if not all of the stories are either stuff from years ago or AI generated. Now add all the online ads for stuff you’ll never buy and the spam that you know is just garbage and not specifically targeted to you, and you have one large mess. And let’s not forget the emails from the Nigerian prince and all the ministers of finance who want to give you a million dollars. AI would be great at stopping that nonsense.

  6. If a bot is detected, the site should serve copyrighted content from a source that will enforce its rights. You’d need to own a license, and you’d need to be sure not to serve it to any default agents. The bot is hacking the site and soaking up fair-use content without permission.

  7. I tried nepenthes, but it wasn’t quite what I needed, so I created my own in PHP. It returns an endless chunked reply at a very slow trickle (a rough sketch of the idea follows at the end of this comment). I originally made it to trap the relentless bots unsuccessfully attempting to spam my web feedback form, but I’ve since expanded it to catch web scrapers, particularly petalbot, turnitin, and bytespider.

    It traps some of them for days. I’m amazed how many of them don’t appear to possess any timeout functionality.

    I also run a PHP tarpit on an xinetd-invoked port to do basically the same thing, then I redirect any SSH bruteforcers to that port with dynamic iptables rules handled by a daemon written in PHP.

    PHP is ridiculously useful. For efficiency and speed I’ll rewrite them in C eventually, I’m just lazy and they’re doing the job. 🤷‍♂️
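    For anyone curious what an ‘endless chunked reply at a very slow trickle’ could look like, here is a rough, self-contained Python sketch of the same idea. The tarpit described above is PHP, so this is only an illustration, and the address, payload, and pacing are made-up example values.

```python
# Rough sketch of an endless chunked HTTP response served at a slow trickle.
# Illustration of the idea only; the commenter's actual tarpit is PHP.
import socket
import time

HOST, PORT = "127.0.0.1", 8081   # arbitrary example values
CHUNK = b"lorem ipsum "          # filler payload, sent one small chunk at a time

def serve_one(conn: socket.socket) -> None:
    conn.recv(4096)  # read (part of) the request and ignore it
    conn.sendall(
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"Transfer-Encoding: chunked\r\n\r\n"
    )
    try:
        while True:                       # never send the terminating chunk
            conn.sendall(b"%x\r\n%s\r\n" % (len(CHUNK), CHUNK))
            time.sleep(10)                # keep the client waiting, cheaply
    except OSError:
        pass                              # the client finally gave up
    finally:
        conn.close()

if __name__ == "__main__":
    with socket.socket() as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen()
        while True:
            client, _ = srv.accept()
            serve_one(client)   # single-threaded sketch: one victim at a time
```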
