The Internet is fighting over whether robots.txt applies to AI agents. It all started when Cloudflare published a blog post detailing what the company was seeing from Perplexity crawlers. Automated web crawling is, of course, part of how the modern Internet works, and almost immediately after the first web crawler was written, one managed to DoS (Denial of Service) a web site back in 1994. That incident is what prompted the design of the robots.txt file.
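To see how robots.txt works in practice, here is a minimal sketch using Python’s standard-library `urllib.robotparser`. The rules and bot names are purely illustrative, not anything Perplexity or Cloudflare actually publishes:

```python
from urllib import robotparser

# Hypothetical robots.txt that disallows one crawler entirely.
rules = """\
User-agent: ExampleBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler asks before fetching; nothing enforces the answer.
print(rp.can_fetch("ExampleBot", "https://example.com/page"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/page"))  # True
```

Note that `can_fetch()` only reports what the site asked for; a crawler that never calls it fetches the page anyway, which is exactly the limitation discussed below.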
Make no mistake, robots.txt on its own is nothing more than a polite request for someone else on the Internet not to index your site. The more aggressive approach is to add rules to a Web Application Firewall (WAF) that detect and block a web crawler based on its user-agent string and source IP address. Cloudflare makes the case that Perplexity is not only intentionally ignoring robots.txt, but also actively disguising its web crawling traffic by making these requests from IP addresses outside its normal range.
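The WAF approach described above can be sketched as a simple match on the two signals mentioned: the user-agent string and the source IP. This is a hypothetical rule, using documentation-reserved IP ranges, not Cloudflare’s actual detection logic:

```python
import ipaddress

# Assumed values for illustration: a crawler name to match, and the
# crawler's published IP range (TEST-NET-3, reserved for documentation).
BLOCKED_AGENT = "ExampleBot"
CRAWLER_RANGE = ipaddress.ip_network("203.0.113.0/24")

def should_block(user_agent: str, source_ip: str) -> bool:
    """Block if the request identifies itself as the crawler, or
    arrives from the crawler's known address range."""
    if BLOCKED_AGENT.lower() in user_agent.lower():
        return True
    return ipaddress.ip_address(source_ip) in CRAWLER_RANGE

print(should_block("Mozilla/5.0 (compatible; ExampleBot/1.0)", "198.51.100.7"))  # True
print(should_block("Mozilla/5.0", "203.0.113.9"))  # True: IP in the known range
print(should_block("Mozilla/5.0", "198.51.100.7"))  # False
```

The last case shows why the alleged behavior matters: a crawler that sends a generic user-agent from an IP outside its published range slips past both checks, which is the kind of evasion Cloudflare describes.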
This isn’t the first time Perplexity has landed in hot water over its web scraping and AI training endeavors. But Perplexity has published a blog post explaining that this is different!
And there’s genuinely an interesting argument to be made: robots.txt is aimed at indexing and AI training traffic, and agentic AI requests are a different category. Put simply, Perplexity’s bots ignore robots.txt when a live user asks them to fetch a page. Is that bad behavior, or exactly what we should expect? This question will have to be settled as AI agents become more common.
Continue reading “This Week In Security: Perplexity V Cloudflare, GreedyBear, And HashiCorp”