Billy Hoffman has built a site crawler that can hide its activity within normal web traffic. Crawling a website is one of the easiest ways to find exploitable pages, but the systematic nature of the crawl makes it stand out in logs. Billy set out to design a crawler that would behave like a normal web browser. It follows more popular links first (think “news”, not “legal notice”) and it doesn’t hit deep-linked pages directly without first creating an appropriate Google referrer. There are tons of other tricks involved in making the crawler look “human”, which you’ll find in Billy’s slides over at SPI Labs. You can also read about the talk on Wired News.
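To give a feel for what that kind of traffic shaping involves, here is a rough Python sketch of a crawler that tries to look like a person: it weights content-like links over boilerplate ones, fakes a Google search referrer before visiting deep pages, and waits irregular amounts of time between requests. This is only an illustration of the ideas described above, not Billy's actual code; the `popularity()` heuristic, the headers, and every URL in it are assumptions made up for the example.

```python
# Illustrative sketch only; not Billy Hoffman's implementation.
import random
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, quote_plus

import requests

HEADERS = {
    # Present a realistic browser identity instead of a bot User-Agent.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
}

class LinkParser(HTMLParser):
    """Collect (anchor text, href) pairs from a page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href:
            self.links.append((data.strip(), self._href))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

def popularity(text):
    """Crude stand-in for 'follow popular links first': prefer content-like
    anchors over boilerplate like 'legal notice' or 'privacy policy'."""
    boring = ("legal", "privacy", "terms", "sitemap")
    return 1 if any(word in text.lower() for word in boring) else 10

def fetch(url, referer=None):
    headers = dict(HEADERS)
    if referer:
        headers["Referer"] = referer
    return requests.get(url, headers=headers, timeout=10)

def crawl(start_url, max_pages=10):
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        # Deep links get a plausible Google search as their referrer,
        # as if the visitor arrived from a search result.
        referer = None
        if url != start_url:
            referer = "https://www.google.com/search?q=" + quote_plus(url)
        resp = fetch(url, referer=referer)
        seen.add(url)
        parser = LinkParser()
        parser.feed(resp.text)
        candidates = [(text, urljoin(url, href)) for text, href in parser.links if href]
        if candidates:
            texts, urls = zip(*candidates)
            weights = [popularity(t) for t in texts]
            # Weighted random choice: popular-looking links tend to go first.
            queue.extend(random.choices(urls, weights=weights, k=min(3, len(urls))))
        # Pause a human-ish, variable amount of time between requests.
        time.sleep(random.uniform(2, 15))
    return seen
```

A real tool would obviously need robots.txt handling, error handling, and a far better popularity model; the point is only that nothing in the request stream screams "spider".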
Looks to me like the “most commented on (past 60 days)” list isn’t working properly. As of today, it’s been 4 months since the PSP 2.0 to 1.5 downgrade was posted, and no one has commented on it since Oct 16th, 2005.
Also, very interesting article!
No, it isn’t broken, because I see at least two comments show up in my email every day. What is broken is the post not showing more than 250 comments. Sounds like a good enough excuse to me to lock the thread.
Is there a legitimate use for this?
I am not usually one of those people who criticise stuff like this, but do we really want to make email address harvesting and exploit finding easier?
cuba: No point worrying about that now, is there?
A lot of people have legitimate needs to crawl a site. Think about a site that carries the text of a book but has a strict “no spiders” policy (so they can shut you off when you stop paying, for example). If you’re a legitimate user but need an offline copy of the book (for field work or whatever), you’re out of luck. Their server will spot a spider instantly and shut you down.
But if you have a smart spider that skips around, reads chapters here and pages there, then they’re not likely to notice you or ban you. And you can still get the text of the book you need. Just make sure you delete your local copies of the book once your subscription has ended.
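A minimal sketch of that kind of “reading pattern” crawl, assuming an invented example.com chapter URL scheme, might look something like this:

```python
# Purely illustrative: fetch chapter pages in a shuffled order with long,
# irregular pauses instead of marching through them sequentially.
# The site and URL pattern are made up for the example.
import random
import time

import requests

chapter_urls = [f"https://example.com/book/chapter-{n}" for n in range(1, 21)]
random.shuffle(chapter_urls)  # no tell-tale sequential access pattern

for url in chapter_urls:
    page = requests.get(url, timeout=10)
    filename = url.rsplit("/", 1)[-1] + ".html"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(page.text)
    time.sleep(random.uniform(60, 600))  # minutes between pages, like a reader
```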