Scraping Blogs For Fun And Profit

June 13, 2012

Sometimes when you’re working on a problem, a solution is thrown right at your face. We found ourselves in this exact situation a few days ago while putting together Hackaday’s new retro edition; a way to select a random Hackaday article was needed and [Alexander van Teijlingen] of codepanel.net just handed us the solution.

To grab every Hackaday URL ever, [Alex] wrote a small Python script using the Beautiful Soup screen scraping library. The program starts on Hackaday’s main page and grabs every link to a Hackaday post before going to the next page. It’s not a terribly complex build, but we’re gobsmacked a solution to a problem we’re working on would magically show up in our inbox.
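
For the curious, the crawl described above can be sketched in a few lines of Python. This is not [Alex]'s script: it swaps Beautiful Soup for the standard library's html.parser so it runs without extra packages, the permalink pattern (/YYYY/MM/DD/slug/) is our assumption about what a post URL looks like, and fetch is a hypothetical caller-supplied function that returns the HTML for page N, or None when the pages run out.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: a post permalink looks like http://hackaday.com/YYYY/MM/DD/slug/
POST_URL = re.compile(r"^https?://hackaday\.com/\d{4}/\d{2}/\d{2}/[^/?#]+/?$")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def post_links(html, base_url="http://hackaday.com/"):
    """Return the post permalinks found in one page, in order, without duplicates."""
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if POST_URL.match(url) and url not in found:
            found.append(url)
    return found


def crawl(fetch, max_pages):
    """Walk pages 1..max_pages, stopping early when fetch(page) returns None."""
    urls = []
    for page in range(1, max_pages + 1):
        html = fetch(page)
        if html is None:
            break
        for url in post_links(html):
            if url not in urls:
                urls.append(url)
    return urls
```

Keeping the page fetching outside of crawl means the link extraction can be tried on saved HTML before it is ever pointed at a live site.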

Thanks to [Alex], writing a cron job to automatically update our new retro edition just got a whole lot easier. If you’d like to check out a list of every Hackaday post ever (or at least through two days ago), you can grab the 10,693-line text file here.
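
Picking a random article out of a list like that takes very little code. Here is a minimal sketch, assuming a text file with one URL per line; the function name and the file name in the usage note are hypothetical, not what the retro edition actually runs.

```python
import random


def random_post_url(list_path, rng=random):
    """Return one URL chosen at random from a file holding one URL per line."""
    with open(list_path, encoding="utf-8") as handle:
        # Skip blank lines and strip trailing newlines.
        urls = [line.strip() for line in handle if line.strip()]
    if not urls:
        raise ValueError("no URLs found in " + list_path)
    return rng.choice(urls)
```

A cron job could call something like random_post_url("hackaday_urls.txt") on each run and hand the result to whatever builds the page.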

23 thoughts on “Scraping Blogs For Fun And Profit”

RobinJood says:

June 13, 2012 at 7:28 am

Not to put a downer on this but there’s a tonne of apps out there that do this exact same thing. Many of them open source and/or free.

Did you try google?

Oliver Heaviside says:

June 13, 2012 at 7:53 am

Brian, and associated Gentlemen of Hackaday:

Lately I’ve noticed a growing number of off-topic posts in threads. Perhaps the forum is broken?

These posts are not off-topic as in I-hate-X or that-was-the-style-in-my-day posts, but posts which talk about something that might belong on HAD but has no possible relevance to the thread at hand. For example:

http://hackaday.com/2012/06/12/hackaday-links-june-12-2012/comment-page-1/#comment-680010

This is a post talking about the onyx, in a recent story thread that has no relationship to it whatsoever. I started to wonder if it was a spammer setup in progress, or if you have a broken indexing system that puts posts in the wrong thread by accident.

Spammers often establish history on a site for several months before they start adding subtle link nonsense.

I urge you to make full and complete backups of your databases and copy them off the site – each and every social media site, from slashdot all the way up to stack exchange seems to have run one-by-one into the same problems and at least one major database corruption that lost a bunch of information.

PS – Good luck on the monetization process, I have high hopes that HAD will be a success for you and make all your efforts pay off.

1. Caleb Kraft says:
  
  June 13, 2012 at 7:56 am
  
  Interesting.
  
  Just a clarification: we’re not adding monetization. We survive on the existing unobtrusive ads that we have. Don’t worry. The only way we can get more is to get more readers!
  
GarethC says:

June 13, 2012 at 8:04 am

Is there a reason you can’t just look at the amount of entries in your database, generate a random number in that range and then display that article?

Am I missing something?

John Bokma says:

June 13, 2012 at 8:16 am

Sounds like a lot of work for data that should be extractable from your database, assuming you’re using a database (otherwise it’s just running a small script over your fs).

1. Deg says:
  
  June 13, 2012 at 8:25 am
  
  Indeed, look at the bottom of the page:
  Powered by WordPress.com VIP
  WordPress uses MySQL. Should be fairly easy to query that for a list of posts!!
  
soopergooman says:

June 13, 2012 at 8:30 am

you should send him a t shirt and buttons and stickers.

Eliot says:

June 13, 2012 at 1:27 pm

http://codex.wordpress.org/Function_Reference/get_posts#Random_posts

You’re also WordPress VIP; just file a support ticket.

1. Caleb Kraft says:
  
  June 13, 2012 at 2:44 pm
  
  The retro site is on another server. Though I haven’t verified it, I doubt that VIP would be keen to let us connect to their MySQL remotely. Besides, the retro site has a very niche purpose that I think it fulfills as a static site. Any more than that is simply icing on the cake.
  
  1. metai says:
    
    June 17, 2012 at 10:40 am
    
    Still no need to go brute-force over the HTML.
    
    If you’re just exporting all posts once (or, at least, manually), WordPress offers a handy export function. Parse the resulting XML for permalinks, presto.
    
    If you want to do this on a regular basis, even automated, raping hundreds of HTML pages is an even worse solution. Some very minor additions to hackaday’s theme would allow hackaday-retro to retrieve (and cache) all or select pages via a special JSON or XML feed from the main site. This would even qualify as template “hacking”, to some extent.
    
    1. Brian Benchoff says:
      
      June 17, 2012 at 11:03 am
      
      I’ve got the exported HaD XML sitting in front of me right now, and *yes*, it’s possible, but I think I can do cooler stuff by raping the HTML.
      
      I’m running a Python script to scrape the actual, unique posts from each HaD post (and thus getting the full text of every HaD post ever). Those are going into unique HTML files, so it’ll be REALLY easy to run a script to automagically update the retro site.
      
      A neat bonus to doing it this way is I have a full-text archive of everything ever posted on HaD. Ever wonder when the first mention of Arduino was? I’ll be able to tell you in a few hours.
      
      I’ll release all this data this week. Should be interesting.
      
  2. metai says:
    
    June 17, 2012 at 2:50 pm
    
    Are you being serious right now? You seriously prefer scraping frontend output from thousands of single web pages to a single-file, neatly formatted, categorized data file containing everything the web pages contain, and a great deal more?
    
    If you’re going for a proof of concept, really, there’s nothing to learn here, just about everybody is able to scrape websites (be it with Python, Perl, PHP … hell, even a set of small shell scripts). Ignoring that you have access to the source data proves just one thing: That you are able to ignore you have access to the source data.
    
    However, if you are going about it this way because you are unfamiliar with WordPress and its API, drop me a note, I’ll be happy to give you some pointers.
    
Oliver Heaviside says:

June 13, 2012 at 1:53 pm

ABOUT THE RETRO SITE..

You might want to replace the character you’re using for apostrophes with the usual html entity, or just use the normal ‘ – some users will see question marks in strings like “what’s this?”

Secondly, Python, much as I love it, may not be the beast for this job. WordPress is PHP, and you can easily display random stories with a simple MySQL query.

Rather than random pages, have you considered simply pulling up 5 stories in increasing sequence from the beginning of the site?

And maybe checking the link to make sure it exists? Lots of bit rot in the hacker world.

barryronaldo says:

June 14, 2012 at 3:46 am

For profit? I thought that “back-patting” was the blogosphere’s bread and butter lol. HaD isn’t bad about it, but BoingBoing is a horrid offender of having to go through four “sister blogs” before ever getting to the original article, all getting their clicky$. It reminds me of a highway crew where 4 people stand around watching 1 guy dig a hole lol. Why not just pay the one guy and send the other four to substance abuse and risk reduction programs?
Anyway, best of luck with the retro page :) I am excited to hopefully return to when less was more :)

Christopher Mitchell says:

June 14, 2012 at 9:24 am

Is there something in the rulebook against using whatever form of “SELECT * FROM article_db ORDER BY RAND() LIMIT 0,1” is applicable to your database setup? I don’t see why you need to scrape your own site to build a database of articles.

1. Christopher Mitchell says:
  
  June 14, 2012 at 9:26 am
  
  Colleagues of mine were kind enough to point to Caleb’s post above. With that in mind, wouldn’t it be easier to scrape the RSS feed? Or is this about collecting the back-archives of the website, in which case I withdraw any snarky objections?
  
nes says:

June 14, 2012 at 9:42 am

If I include a link in a comment on HaD, invariably within minutes it gets scraped by an oddball client which reports itself as Safari 1.3 on Linux, resolution 30720 x 768. The address block belongs to a server rental outfit in Tx.

This sort of content scraping is not unusual, it’s just that evidently someone has seen fit to actively monitor the comments pages on here, and it isn’t Google. Perhaps it’s WordPress themselves doing it for some reason.

1. Caleb Kraft says:
  
  June 14, 2012 at 9:59 am
  
  WordPress does (Akismet, the spam scanner, does too).
  
nes says:

June 14, 2012 at 11:30 am

Ah, I see. That’s fair enough. I wonder why the unlikely combination of OS, client and resolution though.

Joe says:

June 14, 2012 at 8:09 pm

I think it bears repeating. Use the CMS, not brute force. You say you don’t want the other site connecting to your database? Why not just export the URL list? You could do it from the commandline in one command. Wrong approach to the wrong problem.

Oh, heck, make a public service that serves random HaD URLs and let your servers and others use it.

bunedoggle says:

June 15, 2012 at 9:10 am

while language==”Python”:
whitespace = “suddenly matters”
braces = “notably missing”
semicolons.whereAreYou
sadness++

1. anonumus says:
  
  June 15, 2012 at 7:23 pm
  
  Like… +1
  
Drone says:

June 16, 2012 at 8:11 am

PERL
