Trap Naughty Web Crawlers In Digestive Juices With Nepenthes

In the olden days of the WWW you could just put a robots.txt file in the root of your website, and crawling bots from search engines and kin would (generally) respect the rules in it. These days, however, web crawlers, especially those from large language model (LLM) companies, happily ignore such signs on the lawn before proceeding to hoover up every scrap of content on websites. Naturally this makes a lot of people very angry, but what can you do about it? The answer by [Aaron B] is Nepenthes, described on the project page as a ‘tar pit for catching web crawlers’.

More commonly known as ‘pitcher plants’, Nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. In the case of this Lua-based project the idea is roughly the same. Configured as a trap behind a web server (e.g. at /nepenthes), it presents any web crawler that wanders in with an endless series of randomly generated pages, each stuffed with more URLs to follow. Page generation is deliberately kept slow so that the trap doesn’t soak up significant CPU time, while still giving LLM scrapers plenty of random nonsense to chew on.
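To get a feel for the mechanism, here is a minimal Python sketch of the same endless-links idea. It is an illustration of the concept only, not the actual Lua implementation; the trap path, delay, and port are arbitrary example values.

```python
# Sketch of the tarpit concept: every request under the trap path gets a
# slow, randomly generated page whose links all lead back into the trap.
# Illustration only; not the actual Nepenthes (Lua) code.
import random
import string
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP_PREFIX = "/nepenthes"   # example mount point behind your real web server
DELAY_SECONDS = 2            # deliberately slow, so the trap costs little CPU
LINKS_PER_PAGE = 10

def random_slug(n: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase, k=n))

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith(TRAP_PREFIX):
            self.send_error(404)
            return
        time.sleep(DELAY_SECONDS)  # trickle the page out slowly
        links = "".join(
            f'<li><a href="{TRAP_PREFIX}/{random_slug()}">{random_slug()}</a></li>'
            for _ in range(LINKS_PER_PAGE)
        )
        body = f"<html><body><ul>{links}</ul></body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```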

Considering that these web crawlers have deemed adhering to the friendly sign on the lawn to be beneath them, the least we can do in response is to hasten model collapse by feeding these LLM scrapers whatever rolls out of a simple (optionally Markov-based) text generator.
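The ‘random nonsense’ half of such a trap can be as simple as a word-level Markov chain babbling from whatever text you feed it. Below is a toy Python sketch of that general idea; it is not the project’s own generator, and the corpus file name is just a placeholder.

```python
# Toy word-level Markov babbler: learns word-to-word transitions from a
# corpus, then emits plausible-looking nonsense. Illustration only.
import random
from collections import defaultdict

def train(corpus: str) -> dict:
    words = corpus.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def babble(chain: dict, length: int = 50) -> str:
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end (word only ever appeared last): jump somewhere random.
        word = random.choice(followers) if followers else random.choice(list(chain))
        out.append(word)
    return " ".join(out)

if __name__ == "__main__":
    text = open("corpus.txt").read()   # placeholder: supply your own corpus
    print(babble(train(text)))
```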

44 thoughts on “Trap Naughty Web Crawlers In Digestive Juices With Nepenthes”

    1. The main effect of this software will be crawlers quickly deciding the site’s not worth crawling (if it works at all). A crawler so dumb that it actually gets stuck is unlikely to be running at the kind of scale that would ever reach your website.

      I’m not convinced it’d even work (most of the time you only want a link or two from a website, which is what places like HaD showcase).

      More importantly, this will make Google etc. consider your website “spam AI nonsense”. No one will read your content, and some of those crappy browser plugins will probably mark your site as up to no good.

      The price of this? You DoS your own website…

      It’s a fun idea, but I doubt it’ll work in practice. I couldn’t see any evidence of what actually happens.

      1. Google respects robots.txt. This is for crawlers that don’t respect it.

        And it works. It generates so much fake content that the real content of your site is just a fraction of the total, and the crawler will probably miss it in the flood of nonsense.

        1. Seems like you should read the actual information. I did, and the software’s author lists the two (clear and obvious) problems I mentioned as major concerns.

          i.e. expect to be delisted by search engines, and expect continuous high load.

          It’s also just common sense: spam websites full of useless AI-generated junk have existed forever, including ones built on on-the-fly Markov chains.

          Again, cute idea; in practice not really useful.

          1. It’s not meant to be useful, it’s to annoy crawlers.

            I have one server with an SSH daemon listening on port 22 that rejects any and all authentication. Costs me money, CPU, bandwidth, but I do it anyway. I have an Apache identifying as IIS. I have OpenSSH running on Ubuntu identifying as Windows 9. If it’s fun it isn’t useless.

            And if you configure robots.txt correctly, Google and any crawler that honors it won’t visit the poison pit, and those who don’t will be fed garbage.

        2. So, I don’t understand the point of it… If you can detect a crawler – which is apparently required in order to know when to generate random content to serve to it – why don’t you just disconnect the crawler by killing its connections? That would save your bandwidth/processing better than letting it continue to slurp down content slowly. Your connections are like a door to your house. You lock and unlock the door, selectively. You open it to let your friends and family in, but close and lock it to keep strangers out. This solution to me seems like a) overkill, b) anger and/or malice, c) vengeance and/or punishment, much like road rage. “I don’t like what that guy just did so I’m going to teach him a lesson!” Every reckless act has a price.

      2. If you WANT to be search indexed, then obviously you would not employ this technique. This is for when you DON’T want a bunch of undesired attention. Personal portals for example. It’s the opposite of SEO.

  1. yawn, easy to bypass with timing and semantic analysis. just fed the module source into a local model and within a few iterations it was able to ascertain with reasonable reliability whether or not a sequence of pages was served via the module, then classify the sourced content as untrustworthy, which can then be fed back to data scientists for fine tuning.

    1. Calling BS on you.

      From project page:

      The Markov feature requires a trained corpus to babble from. One was intentionally omitted because, ideally, everyone’s tarpits should look different to evade detection.

      What did you do? Train your local model against every possible corpus? Lol.

    2. You state “classify the sourced content as untrustworthy.” Doesn’t that achieve the purpose of implementing such a trap in the first place? To discourage LLM scrapers from harvesting the content on your website?

      1. That’s the problem with publishing it, once it’s public, it’s easy to work out countermeasures. Like other public attempts at poisoning LLM sets, you need to build completely novel measures and STFU about them in order for them to have a chance to survive contact with the “enemy”.

    3. To be honest, we all suffer this when we get Rickrolled.
      I doubt it would take more than several iterations of fine-tuning the trash pages, using existing LLM crawlers and page access stats, to get a positive outcome for the hosts.

  2. That is a tad naive and out of date; the AI training data gold rush has moved on to actual user interactions with AI systems, because watching humans guide LLMs toward useful real-world solutions is incredibly valuable.

  3. Common sense states: If you don’t want something out there, then don’t put it on the web for public display.

    Now I know I’m missing something as I just don’t understand (since WWW just isn’t my level of expertise). So I ask it here hoping that someone can explain it to me in simple words WHY it is a problem that webscrapers scrape the web.

    1. It is a problem because LLM scrapers in particular are incredibly aggressive, soaking up a lot of bandwidth, internet traffic and server CPU time, yet few of them will be deterred by a simple robots.txt rule.

      As explained in the summary (and in the linked article and on the project page), this is a defense mechanism for bots that are not playing nice and costing incredible amounts of money for the hosted website owner.

    1. Not really useful if you do not provide a link or even a name to search for?
      But you missed the point: it was not to prevent webscrapers at all, it was to hide his own content in a sea of gibberish, so much so that the site would be excluded from the training data.

    2. imagine believing you’re smart and not realizing that nowadays you can spin up 10 thousand crawlers, each with a different IP, from 10 different cloud providers all over the world

      good luck blocking that

      you should apologize for being such a smug idiot

    1. The issue is not the bandwidth, it’s plain copyright & attribution.
      An LLM is a derived work of all its training data (it is debatable if this is fair use or not), so at least it should attribute, but as a creator, I’d want a piece of the pie too.

      1. “Art is a derived work of all its training data (it is debatable if this is fair use or not), so at least it should attribute, but as a creator, I’d want a piece of the pie too.”

        Fixed it for you; please cite every piece of media you’ve ever consumed. Yes, AI is trash, and computers doing art while humans work is not the future I was hoping for.

        1. fwiw a lot of people earnestly wrestle with this exact question. artists are all the time publicly acknowledging and debating their influences. and sometimes artists are maligned for obviously ripping off something, but refusing to list that something as one of their influences when they’re directly asked about it. and if you go looking for it, there are a ton of interviews with musicians who are confronted with a bit of riff or melody that showed up in their work and they say “oh! i didn’t realize that’s where i got it, but now i hear it, you’re absolutely right.” it’s a super well-known phenomenon in music that there’s nothing new under the sun, everyone’s borrowing whether they notice it at the time or not. and most are pretty honest about it i think. “i was sure i invented that, it was just stuck in my head when i got out of the shower one morning and i put a song together around it! now i know”

    2. meh…i have a good amount of content up that is not well-advertised but is also not exactly hidden. ‘security through obscurity’. it’s good enough for me because i don’t care if any one particular person sees it — if someone stalks me and bothers to find it, i’m perfectly fine with that. but i don’t want it indexed by a search engine and i suppose i don’t want AI to be spewing it back at people. so robots.txt has actually served me just fine in this role, and i’ll be a little bummed if robots.txt becomes useless in the future (i may have to do actual security!), so i’m glad people are working on countermeasures.

      my point is that there are legitimate gray areas where security or privacy matters ‘only a little bit’, and relying on robots.txt is ‘good enough so far’ even though it’s obviously not real security or real throttling or real anything.

  4. Years ago I made a trap using a very similar idea: an endless maze of random links:
    https://bicyclesonthemoon.info/git-projects/?p=botm/www-trap
    (added to git in 2022, but was created much earlier)
    It just generates a page with links to more pages forever.
    I listed the URL only in the robots.txt file and did not link it from anywhere else (a minimal example of that robots.txt is sketched below).
    This way it will:
    – not affect bots that respect robots.txt (it is forbidden there)
    – not affect bots that ignore robots.txt (they will never find it)
    – affect bots that disrespect robots.txt on purpose.
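    A minimal robots.txt illustrating that layout might look like the following, where /trap/ stands in for wherever the maze actually lives; the robots.txt entry is the only place the path ever appears:

```
User-agent: *
Disallow: /trap/
```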

  5. There’s no reason for all these AI bots to be crawling websites.
    If you’re getting multiple requests from the same IP over and over, it’s probably software.
    I remember reading a story on cankles. That night, while I was reading the story, a news reporter was reading the exact same story word for word, as if it were breaking news or the start of World War III. You look at the front of Yahoo and most if not all of the stories are either stuff from years ago or AI generated. Now add all the online ads for stuff you’ll never buy and the spam that you know is just garbage and not specifically targeted to you, and you have one large mess. And let’s not forget the emails from the Nigerian prince and all the ministers of finance who want to give you a million dollars. AI would be great at stopping that nonsense.

  6. If a bot is detected, the site should serve copyrighted content from a source that will enforce its rights. You’d need to own a license, and you’d need to be sure not to serve it to any default agents. The bot is hacking the site and soaking up fair-use content without permission.

  7. I tried nepenthes, but it wasn’t quite what I needed, so I created my own in PHP. It returns an endless chunked reply at a very slow trickle (a rough sketch of the idea follows at the end of this comment). I originally made it to trap the relentless bots unsuccessfully attempting to spam my web feedback form, but I’ve since expanded it to catch web scrapers, particularly petalbot, turnitin, and bytespider.

    It traps some of them for days. I’m amazed how many of them don’t appear to possess any timeout functionality.

    I also run a PHP tarpit on an xinetd-invoked port to do basically the same thing, then I redirect any SSH bruteforcers to that port with dynamic iptables rules handled by a daemon written in PHP.

    PHP is ridiculously useful. For efficiency and speed I’ll rewrite them in C eventually, I’m just lazy and they’re doing the job. 🤷‍♂️
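    For anyone curious what an ‘endless chunked reply at a very slow trickle’ could look like, here is a rough, self-contained Python sketch of the same idea. The tarpit described above is PHP, so this is only an illustration, and the address, payload, and pacing are made-up example values.

```python
# Rough sketch of an endless chunked HTTP response served at a slow trickle.
# Illustration of the idea only; the commenter's actual tarpit is PHP.
import socket
import time

HOST, PORT = "127.0.0.1", 8081   # arbitrary example values
CHUNK = b"lorem ipsum "          # filler payload, sent one small chunk at a time

def serve_one(conn: socket.socket) -> None:
    conn.recv(4096)  # read (part of) the request and ignore it
    conn.sendall(
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"Transfer-Encoding: chunked\r\n\r\n"
    )
    try:
        while True:                       # never send the terminating chunk
            conn.sendall(b"%x\r\n%s\r\n" % (len(CHUNK), CHUNK))
            time.sleep(10)                # keep the client waiting, cheaply
    except OSError:
        pass                              # the client finally gave up
    finally:
        conn.close()

if __name__ == "__main__":
    with socket.socket() as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen()
        while True:
            client, _ = srv.accept()
            serve_one(client)   # single-threaded sketch: one victim at a time
```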
