Scraping Blogs For Fun And Profit

Sometimes when you’re working on a problem, a solution is thrown right in your face. We found ourselves in this exact situation a few days ago while putting together Hackaday’s new retro edition; we needed a way to select a random Hackaday article, and [Alexander van Teijlingen] of codepanel.net just handed us the solution.

To grab every Hackaday URL ever, [Alex] wrote a small Python script using the Beautiful Soup screen-scraping library. The program starts on Hackaday’s main page and grabs every link to a Hackaday post before moving on to the next page. It’s not a terribly complex build, but we’re gobsmacked that a solution to a problem we’re working on would magically show up in our inbox.
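
Here’s roughly what that looks like. To be clear, this is not [Alex]’s actual script, just a sketch of the idea: it assumes the archive pages live at hackaday.com/page/N/, that post permalinks carry a date in the URL, and it uses the requests library alongside Beautiful Soup.

```python
# Rough sketch of the scraping approach described above -- not [Alex]'s
# actual script. Assumes archive pages live at hackaday.com/page/N/ and
# that post permalinks contain a /YYYY/MM/DD/ date component.
import re
import requests
from bs4 import BeautifulSoup

POST_LINK = re.compile(r"^https?://hackaday\.com/\d{4}/\d{2}/\d{2}/")

def scrape_post_urls(max_pages=5):
    """Collect post permalinks from the first few archive pages."""
    urls = set()
    for page in range(1, max_pages + 1):
        resp = requests.get("https://hackaday.com/page/%d/" % page)
        if resp.status_code != 200:
            break
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            if POST_LINK.match(a["href"]):
                urls.add(a["href"].split("#")[0])
    return sorted(urls)

if __name__ == "__main__":
    for url in scrape_post_urls():
        print(url)
```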

Thanks to [Alex], writing a cron job to automatically update our new retro edition just got a whole lot easier. If you’d like to check out a list of every Hackaday post ever (or at least through two days ago), you can grab the 10,693-line text file here.

23 thoughts on “Scraping Blogs For Fun And Profit”

  1. Brian, and associated Gentlemen of Hackaday:

    Lately I’ve noticed a growing number of off-topic posts in threads. Perhaps the forum is broken?

    These posts are not off-topic as in I-hate-X or that-was-the-style-in-my-day posts, but posts which talk about something that might belong on HAD but has no possible relevance to the thread at hand. For example:

    http://hackaday.com/2012/06/12/hackaday-links-june-12-2012/comment-page-1/#comment-680010

    This is a post talking about the onyx, in a recent story thread that has no relationship to it whatsoever. I started to wonder if it was a spammer setup in progress, or if you have a broken indexing system that puts posts in the wrong thread by accident.

    Spammers often establish history on a site for several months before they start adding subtle link nonsense.

    I urge you to make full and complete backups of your databases and copy them off the site – each and every social media site, from Slashdot all the way up to Stack Exchange, seems to have run, one by one, into the same problems and at least one major database corruption that lost a bunch of information.

    PS – Good luck on the monetization process, I have high hopes that HAD will be a success for you and make all your efforts pay off.

  2. Is there a reason you can’t just look at the number of entries in your database, generate a random number in that range, and then display that article?

    Am I missing something?
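
    (For the record, that suggestion boils down to something like the sketch below. It assumes the stock WordPress schema (wp_posts) and direct MySQL access through mysql.connector, which is just one of several ways to talk to MySQL from Python; as the reply below notes, the retro site doesn’t necessarily have that access.)

    ```python
    # Sketch of the "count rows, pick a random offset" idea from the
    # comment above. Credentials and database name are placeholders.
    import random
    import mysql.connector  # one of several MySQL drivers for Python

    conn = mysql.connector.connect(user="wp", password="secret", database="wordpress")
    cur = conn.cursor()

    cur.execute(
        "SELECT COUNT(*) FROM wp_posts "
        "WHERE post_status = 'publish' AND post_type = 'post'"
    )
    (total,) = cur.fetchone()

    # Pick a random row by offset and hand back its URL-ish guid column.
    offset = random.randrange(total)
    cur.execute(
        "SELECT guid FROM wp_posts "
        "WHERE post_status = 'publish' AND post_type = 'post' "
        "LIMIT 1 OFFSET %s",
        (offset,),
    )
    (random_url,) = cur.fetchone()
    print(random_url)

    cur.close()
    conn.close()
    ```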

    1. The retro site is on another server. Though I haven’t verified it, I doubt that VIP would be keen to let us connect to their MySQL remotely. Besides, the retro site has a very niche purpose that I think it fulfills as a static site. Anything more than that is simply icing on the cake.

      1. Still no need to go brute-force over the HTML.

        If you’re just exporting all posts once (or, at least, manually), WordPress offers a handy export function. Parse the resulting XML for permalinks, presto (a sketch of this follows below).

        If you want to do this on a regular basis, even automated, raping hundreds of HTML pages is an even worse solution. Some very minor additions to hackaday’s theme would allow hackaday-retro to retrieve (and cache) all or select pages via a special JSON or XML feed from the main site. This would even qualify as template “hacking”, to some extent.
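
        (A WordPress WXR export is just RSS with a few extra namespaced elements, so pulling the permalinks out takes a few lines of ElementTree. The filename and the WXR namespace version used here are assumptions; check the <rss> element of the actual export for the exact namespace URI.)

        ```python
        # Sketch of the "parse the WXR export for permalinks" suggestion above.
        import xml.etree.ElementTree as ET

        # The WXR namespace version varies by export; 1.2 is an assumption here.
        WP_NS = "{http://wordpress.org/export/1.2/}"

        tree = ET.parse("hackaday.wordpress.xml")  # placeholder filename
        for item in tree.getroot().iter("item"):
            post_type = item.findtext(WP_NS + "post_type")
            status = item.findtext(WP_NS + "status")
            if post_type == "post" and status == "publish":
                print(item.findtext("link"))
        ```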

        1. I’ve got the exported HaD XML sitting in front of me right now, and *yes*, it’s possible, but I think I can do cooler stuff by raping the HTML.

          I’m running a Python script to scrape the actual, unique posts from each HaD post (and thus getting the full text of every HaD post ever). Those are going into unique HTML files, so it’ll be REALLY easy to run a script to automagically update the retro site.

          A neat bonus to doing it this way is I have a full-text archive of everything ever posted on HaD. Ever wonder when the first mention of Arduino was? I’ll be able to tell you in a few hours (see the sketch below).

          I’ll release all this data this week. Should be interesting.
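
          (That kind of query really is a few lines over a directory of per-post files. A sketch, assuming one HTML file per post in an archive/ directory with filenames that sort chronologically – both assumptions about how the archive ends up on disk:)

          ```python
          # Find the earliest archived post that mentions a term.
          import glob

          def first_mention(term, pattern="archive/*.html"):
              """Return the first archived post file that mentions `term`."""
              for path in sorted(glob.glob(pattern)):
                  with open(path, errors="ignore") as f:
                      if term.lower() in f.read().lower():
                          return path
              return None

          print(first_mention("arduino"))
          ```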

      2. Are you being serious right now? You seriously prefer scraping frontend output from thousands of single web pages to a single-file, neatly formatted, categorized data file containing everything the web pages contain, and a great deal more?

        If you’re going for a proof of concept, really, there’s nothing to learn here; just about everybody is able to scrape websites (be it with Python, Perl, PHP … hell, even a set of small shell scripts). Ignoring that you have access to the source data proves just one thing: that you are able to ignore that you have access to the source data.

        However, if you are going about it this way because you are unfamiliar with WordPress and its API, drop me a note, I’ll be happy to give you some pointers.

  3. ABOUT THE RETRO SITE..

    You might want to replace the character you’re using for apostrophes with the usual HTML entity, or just use a plain ASCII apostrophe – some users will see question marks in strings like “what’s this?”

    Secondly, Python, much as I love it, may not be the best tool for this job. WordPress is PHP, and you can easily display random stories with a simple MySQL query.

    Rather than random pages, have you considered simply pulling up 5 stories in increasing sequence from the beginning of the site?

    And maybe checking the link to make sure it exists? Lots of bit rot in the hacker world.

  4. For profit? I thought that “back-patting” was the blogosphere’s bread and butter lol. HaD isn’t bad about it, but BoingBoing is a horrid offender: you have to go through four “sister blogs” before ever getting to the original article, all of them getting their clicky$. It reminds me of a highway crew where 4 people stand around watching 1 guy dig a hole lol. Why not just pay the one guy and send the other four to substance abuse and risk reduction programs?
    Anyway, best of luck with the retro page :) I am excited to hopefully return to when less was more :)

  5. If I include a link in a comment on HaD, invariably within minutes it gets scraped by an oddball client which reports itself as Safari 1.3 on Linux, resolution 30720 x 768. The address block belongs to a server rental outfit in Tx.

    This sort of content scraping is not unusual, it’s just that evidently someone has seen fit to actively monitor the comments pages on here, and it isn’t Google. Perhaps it’s WordPress themselves doing it for some reason.

  6. I think it bears repeating. Use the CMS, not brute force. You say you don’t want the other site connecting to your database? Why not just export the URL list? You could do it from the command line in one command. Wrong approach to the wrong problem.

    Oh, heck, make a public service that serves random HaD URLs and let your servers and others use it.
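
    (Such a service could be tiny. A sketch using only the standard library’s wsgiref; the filename is a placeholder for the URL list linked in the post above:)

    ```python
    # Minimal WSGI app that returns one random Hackaday URL per request.
    import random
    from wsgiref.simple_server import make_server

    # Placeholder filename for the scraped URL list linked in the post above.
    with open("hackaday_urls.txt") as f:
        URLS = [line.strip() for line in f if line.strip()]

    def app(environ, start_response):
        """Serve a single random URL as plain text."""
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [random.choice(URLS).encode("utf-8")]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()
    ```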
