Scraping blogs for fun and profit

Sometimes when you’re working on a problem, a solution is thrown right at your face. We found ourselves in this exact situation a few days ago while putting together Hackaday’s new retro edition; a way to select a random Hackaday article was needed and [Alexander van Teijlingen] of codepanel.net just handed us the solution.

To grab every Hackaday URL ever, [Alex] wrote a small Python script using the Beautiful Soup screen-scraping library. The program starts on Hackaday’s main page and grabs every link to a Hackaday post before going to the next page. It’s not a terribly complex build, but we’re gobsmacked that a solution to a problem we’re working on would magically show up in our inbox.
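
For the curious, here is a rough sketch of the crawl described above. It is not [Alex]’s script: it swaps Beautiful Soup for Python’s built-in html.parser so it runs without extra installs. The permalink pattern (hackaday.com/YYYY/MM/DD/slug/), the /page/N/ pagination scheme, and the names LinkCollector, extract_post_links, and crawl are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: post permalinks look like http://hackaday.com/YYYY/MM/DD/slug/
POST_URL = re.compile(r"^https?://hackaday\.com/\d{4}/\d{2}/\d{2}/[^/#?]+/?$")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_post_links(html):
    """Return the unique post permalinks found in one page, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        if POST_URL.match(href) and href not in links:
            links.append(href)
    return links


def crawl(fetch, max_pages):
    """Walk listing pages 1..max_pages and gather post links.

    `fetch` is any callable that maps a URL to HTML text (hypothetical; a thin
    wrapper around urllib.request.urlopen would do). Stops at the first page
    that yields no post links.
    """
    links = []
    for page in range(1, max_pages + 1):
        # Assumption: WordPress-style pagination at /page/N/
        if page == 1:
            url = "http://hackaday.com/"
        else:
            url = f"http://hackaday.com/page/{page}/"
        found = extract_post_links(fetch(url))
        if not found:
            break
        links.extend(link for link in found if link not in links)
    return links
```

Passing the fetcher in as an argument keeps the link-extraction logic separate from the network, so it can be tried against saved pages before pointing it at a live site.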

Thanks to [Alex], writing a cron job to automatically update our new retro edition just got a whole lot easier. If you’d like to check out a list of every Hackaday post ever (or at least through two days ago), you can grab the 10,693-line text file here.
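
With a list like that in hand, picking a random article is only a few lines of Python. This is a sketch rather than the retro edition’s actual code; it assumes the text file holds one URL per line, and the function names and the file path passed in are hypothetical.

```python
import random


def load_urls(text):
    """Split the contents of the URL list into non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def pick_random_post(urls, rng=random):
    """Return one URL chosen uniformly at random from the list."""
    return rng.choice(urls)


def pick_from_file(path):
    """Read a one-URL-per-line text file (assumed format) and pick a post."""
    with open(path, encoding="utf-8") as handle:
        return pick_random_post(load_urls(handle.read()))
```

A cron job could call pick_from_file with the path to the downloaded list and write the result wherever the retro page expects it.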

Comments

  1. RobinJood says:

    Not to put a downer on this but there’s a tonne of apps out there that do this exact same thing. Many of them open source and/or free.

    Did you try google?

  2. Oliver Heaviside says:

    Brian, and associated Gentlemen of Hackaday:

    Lately I’ve noticed a growing number of off-topic posts in threads. Perhaps the forum is broken?

    These posts are not off-topic as in I-hate-X or that-was-the-style-in-my-day posts, but posts which talk about something that might belong on HAD but has no possible relevance to the thread at hand. For example:

    http://hackaday.com/2012/06/12/hackaday-links-june-12-2012/comment-page-1/#comment-680010

    This is a post talking about the onyx, in a recent story thread that has no relationship to it whatsoever. I started to wonder if it was a spammer setup in progress, or if you have a broken indexing system that puts posts in the wrong thread on accident.

    Spammers often establish history on a site for several months before they start adding subtle link nonsense.

    I urge you to make full and complete backups of your databases and copy them off the site – each and every social media site, from slashdot all the way up to stack exchange seems to have run one-by-one into the same problems and at least one major database corruption that lost a bunch of information.

    PS – Good luck on the monetization process, I have high hopes that HAD will be a success for you and make all your efforts pay off.

  3. GarethC says:

    Is there a reason you can’t just look at the number of entries in your database, generate a random number in that range and then display that article?

    Am I missing something?

  4. John Bokma says:

    Sounds like a lot of work for data that should be extractable from your database, assuming you’re using a database (otherwise it’s just running a small script over your fs).

  5. soopergooman says:

    you should send him a t shirt and buttons and stickers.

  6. Eliot says:

    http://codex.wordpress.org/Function_Reference/get_posts#Random_posts

    You’re also WordPress VIP; just file a support ticket.

    • Caleb Kraft says:

      The retro site is on another server. Though I haven’t verified it, I doubt that VIP would be keen to let us connect to their MySQL remotely. Besides, the retro site has a very niche purpose that I think it fulfills as a static site. Any more than that is simply icing on the cake.

      • metai says:

        Still no need to go brute-force over the HTML.

        If you’re just exporting all posts once (or, at least, manually), WordPress offers a handy export function. Parse the resulting XML for permalinks, presto.

        If you want to do this on a regular basis, even automated, scraping hundreds of HTML pages is an even worse solution. Some very minor additions to hackaday’s theme would allow hackaday-retro to retrieve (and cache) all or select pages via a special JSON or XML feed from the main site. This would even qualify as template “hacking”, to some extent.

        • I’ve got the exported HaD XML sitting in front of me right now, and *yes*, it’s possible, but I think I can do cooler stuff by scraping the HTML.

          I’m running a Python script to scrape the actual, unique content from each HaD post (and thus getting the full text of every HaD post ever). Those are going into unique HTML files, so it’ll be REALLY easy to run a script to automagically update the retro site.

          A neat bonus to doing it this way is I have a full-text archive of everything ever posted on HaD. Ever wonder when the first mention of Arduino was? I’ll be able to tell you in a few hours.

          I’ll release all this data this week. Should be interesting.

      • metai says:

        Are you being serious right now? You seriously prefer scraping frontend output from thousands of single web pages to a single-file, neatly formatted, categorized data file containing everything the web pages contain, and a great deal more?

        If you’re going for a proof of concept, really, there’s nothing to learn here, just about everybody is able to scrape websites (be it with Python, Perl, PHP … hell, even a set of small shell scripts). Ignoring that you have access to the source data proves just one thing: That you are able to ignore you have access to the source data.

        However, if you are going about it this way because you are unfamiliar with WordPress and its API, drop me a note, I’ll be happy to give you some pointers.

  7. Oliver Heaviside says:

    ABOUT THE RETRO SITE..

    You might want to replace the character you’re using for apostrophes with the usual HTML entity, or just use the normal ' – some users will see question marks in strings like “what’s this?”

    Secondly, Python, much as I love it, may not be the beast for this job. WordPress is PHP, and you can easily display random stories with a simple MySQL query.

    Rather than random pages, have you considered simply pulling up 5 stories in increasing sequence from the beginning of the site?

    And maybe checking the link to make sure it exists? Lots of bit rot in the hacker world.

  8. barryronaldo says:

    For profit? I thought that “back-patting” was the blogosphere’s bread and butter lol. HaD isn’t bad about it, but BoingBoing is a horrid offender of having to go through four “sister blogs” before ever getting to the original article, all getting their clicky$. It reminds me of a highway crew where 4 people stand around watching 1 guy dig a hole lol. Why not just pay the one guy and send the other four to substance abuse and risk reduction programs?
    Anyway, best of luck with the retro page :) I am excited to hopefully return to when less was more :)

  9. Is there something in the rulebook against using whatever form of “SELECT * FROM article_db ORDER BY RAND() LIMIT 0,1” is applicable to your database setup? I don’t see why you need to scrape your own site to build a database of articles.

    • Colleagues of mine were kind enough to point to Caleb’s post above. With that in mind, wouldn’t it be easier to scrape the RSS feed? Or is this about collecting the back-archives of the website, in which case I withdraw any snarky objections?

  10. nes says:

    If I include a link in a comment on HaD, invariably within minutes it gets scraped by an oddball client which reports itself as Safari 1.3 on Linux, resolution 30720 x 768. The address block belongs to a server rental outfit in Tx.

    This sort of content scraping is not unusual, it’s just that evidently someone has seen fit to actively monitor the comments pages on here, and it isn’t Google. Perhaps it’s WordPress themselves doing it for some reason.

  11. nes says:

    Ah, I see. That’s fair enough. I wonder why the unlikely combination of OS, client and resolution though.

  12. Joe says:

    I think it bears repeating. Use the CMS, not brute force. You say you don’t want the other site connecting to your database? Why not just export the URL list? You could do it from the command line in one command. Wrong approach to the wrong problem.

    Oh, heck, make a public service that serves random HaD URLs and let your servers and others use it.

  13. bunedoggle says:

    while language==”Python”:
    whitespace = “suddenly matters”
    braces = “notably missing”
    semicolons.whereAreYou
    sadness++

  14. Drone says:

    PERL
