Web scraping tutorial

Web scraping is the act of programmatically harvesting data from a webpage. It consists of finding a way to format the URLs to pages containing useful information, and then parsing the DOM tree to get at the data. It’s a bit finicky, but our experience is that this is easier than it sounds. That’s especially true if you take some of the tips from this web scraping tutorial.

It is more of an intermediate tutorial as it doesn’t feature any code. But if you can bring yourself up to speed on using BeautifulSoup and Python the rest is not hard to implement by trial and error. [Hartley Brody] discusses investigating how the GET requests are formed on your webpage of choice. Once that URL syntax has been figured out just look through the source code for tags (css or otherwise) that can be used as hooks to get at your target data.

So what can this be used for? A lot of things. We’d suggest reading the Reddit comments as there are several real world uses discussed there. But one that immediately pops to mind is the picture harvesting [Mark Zuckerburg] used when he created Facemash.

Hacking Hack a Day with Greasemonkey


Ever since Hack a Day first emerged on the scene in 2004, the site’s design has been pretty consistent. The black background with its green and white text, while a bit dubious looking at work, is fine by me. For others however, the site’s design is a constant eyesore both figuratively and literally. [James Litton] is one of those readers, and he wrote in to share a tip that helps him read up on the latest hacks without killing his eyes.

[James] uses Firefox to browse the web, so he whipped up a small Greasemonkey script that tweaks Hack a Day’s style sheet once it reaches his browser. His script inverts the background while changing a few other items, making for a much more comfortable read. Overall we found the change to be pretty reasonable, but go ahead and judge for yourself – you can see the before and after screen shots in greater detail on his site.

[James] also points out that the script should work just fine in Chrome, for those of you who prefer that browser instead.

So if your eyes are a bit on the sensitive side, feel free to grab his script and customize away – I don’t think we’ll be changing the theme any time soon.