Web Scraping Tutorial

Web scraping is the act of programmatically harvesting data from a webpage. It consists of finding a way to format the URLs to pages containing useful information, and then parsing the DOM tree to get at the data. It’s a bit finicky, but our experience is that this is easier than it sounds. That’s especially true if you take some of the tips from this web scraping tutorial.

It is more of an intermediate tutorial as it doesn’t feature any code. But if you can bring yourself up to speed on BeautifulSoup and Python, the rest is not hard to implement by trial and error. [Hartley Brody] discusses investigating how the GET requests are formed for your webpage of choice. Once that URL syntax has been figured out, just look through the source code for tags (CSS classes or otherwise) that can be used as hooks to get at your target data.
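That workflow can be sketched in a few lines of Python with BeautifulSoup. Everything here is invented for illustration (the URL, the GET parameters, and the class names), and the network fetch is stubbed out with a canned snippet so the sketch stands on its own:

```python
from bs4 import BeautifulSoup

# Hypothetical search URL whose GET parameters were worked out by
# watching the address bar while paging through results in a browser;
# with the requests library the fetch would be roughly:
#   html = requests.get("https://example.com/products",
#                       params={"category": "widgets", "page": 1}).text

# Stand-in for the fetched page so the sketch runs offline:
html = """
<div class="product-listing">
  <h2 class="title">Widget A</h2><span class="price">$9.99</span>
</div>
<div class="product-listing">
  <h2 class="title">Widget B</h2><span class="price">$19.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS classes spotted in the page source act as hooks to the target data.
results = []
for item in soup.select("div.product-listing"):
    name = item.select_one("h2.title").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    results.append((name, price))

print(results)
```

In practice the hard part is usually the first step: finding URL parameters and markup hooks that stay stable from page to page.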

So what can this be used for? A lot of things. We’d suggest reading the Reddit comments as there are several real-world uses discussed there. But one that immediately pops to mind is the picture harvesting [Mark Zuckerberg] used when he created Facemash.

36 thoughts on “Web Scraping Tutorial”

    1. Are you referring to grabbing the text out of the image (computationally expensive) or just grabbing the image? Grabbing the image is so basic I wouldn’t really consider that parsing and scraping (you’re just transferring a file that is already formatted in a precise way).

    2. You are referring to scraping any webpage that loads content through additional ajax calls. This is also trivial to do (sometimes even easier) when you have tools such as Firebug that let you see all of the ajax calls a page performs (and their POST format, if present)…
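Once you've spotted the ajax call in the network panel, you can often skip the HTML entirely and hit the same endpoint yourself. A stdlib-only sketch; the endpoint URL and the response shape are made up, and the live request is commented out in favor of a canned response so it runs offline:

```python
import json

# Hypothetical JSON endpoint spotted in the browser's network panel
# while the page loaded more results; the live call would be roughly:
#   import urllib.request
#   payload = json.load(urllib.request.urlopen(
#       "https://example.com/api/items?page=2", timeout=10))

# Canned stand-in for the server's response:
payload = json.loads('{"items": [{"name": "foo", "id": 1},'
                     ' {"name": "bar", "id": 2}]}')

# The "scraping" is now just walking a data structure, which is often
# easier than parsing the rendered HTML ever was.
names = [item["name"] for item in payload["items"]]
print(names)
```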

    1. I didn’t include any code samples cause I didn’t want to limit the techniques to any one language community. Some people chimed in in the comments with libraries for all sorts of languages I’ve never used.

      I’d be happy to share some code samples if you’d like to read some. In fact, here’s a sample to start with: https://github.com/hartleybrody/py/blob/master/get_inc_5000.py#L66

      It’s a bit complex cause there’s a lot going on (it was my first time using RabbitMQ), but around line 66, there’s some pretty easy to read scraping code. Hope that helps!

      1. Don’t pay any mind to them calling you a “startup lover”… everybody has to start dating some time, and when they do, they’re ALL startup Lovers.

        Now… back in the day, before this “Ajax” thing…

        ASPTear — yes, it was cheating, I didn’t write it myself, but the son-of-a-gun worked and worked well. I wound up making my own “news page”, scraping comics and stock quotes and tech stories…

        A little ASP, a little HTML, and some Regex, and voila…

        Wound up actually using it for WORK, which took a lot of the fun out of it. But still, it was scraping.

        Don’t judge me!

  1. You can actually prevent people from scraping your website. All you have to do is add some PHP or jQuery code to generate a random number and append random numbers to your classes and IDs. Amazon does that. The number can change with every refresh, and you can push it into your CSS so it matches your PHP output.

    1. If that prevented you from scraping a site, you’re not really good at scraping, I’d say. It’s something like security through obscurity… there are always ways to find the data you’re looking for (CSS classes, XPath, regex, or HTML parsers like Beautiful Soup)…
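One such way, sketched with BeautifulSoup on invented markup: ignore the per-refresh randomized class names entirely and anchor on something stable instead, like a label's text, then walk the tree to the value next to it.

```python
from bs4 import BeautifulSoup

# Markup whose class names change on every page load (invented example):
html = """
<div class="x9f3a">
  <span class="k2d81">Price:</span>
  <span class="q77mc">$4.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The class names are noise, but the label text "Price:" is stable.
# Find the label, then take the sibling element that holds the value.
label = soup.find("span", string="Price:")
price = label.find_next_sibling("span").get_text(strip=True)
print(price)
```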

    1. Sometimes, regular expressions work great to get some information from an HTML page.
      As long as I don’t want to do something where the regex approach doesn’t work, I don’t see any reason not to use it.
      And so far it has always worked for me.
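A minimal sketch of that approach, on an invented snippet; fine for a one-off against markup you've inspected, but brittle once the HTML starts to vary:

```python
import re

# Invented fragment of a page we want one number out of:
html = '<td class="temp">Temperature: <b>23.5</b> &deg;C</td>'

# A regex is the shortest path when the surrounding markup is predictable.
match = re.search(r"Temperature: <b>([\d.]+)</b>", html)
if match:
    print(match.group(1))
```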

    2. What is the obsession with Python? There is no one right tool for this job, but the best tool IMO is Perl with LWP and the likes of HTML::Parser. Beautiful Soup is really just an abstraction, if not an excuse for not properly learning and using regular expressions. Properly applied, regex will cleverly parse even malformed markup mixed among the good. O’Reilly has a nice “Perl & LWP” book for free download that’ll get you started in next to no time.

    1. Except trying to learn Goutte is next to impossible for the beginner. The only tutorial I can find is their README.md. I have been searching everywhere for how to properly use Goutte: what options are available, how to tell it to fill in a radio button on a form, how to know when you are at the next page of a submitted form. Nothing anywhere. Goutte may be great for people who already know it or are involved in it, but not so good for the person trying to figure out how to use it.

  2. The company I work for scrapes websites for commercial purposes. Let’s see: the law is publicly viewable, but the official website sucks. Poor search results (not everything is even found), slow responses, and exporting to PDF/XML only kinda works. Lawyers are allowed to use a laptop, but no interwebs. So take all the books, or download the complete website? No, they come to us and buy a CD. We scrape all the content, wrap it in our XML database, do some magic (the kind of thing most of you understand, but lawyers and judges mostly don’t) and profit :D. A CD with XML data, viewable in a browser straight from the disc, searchable, very quick, and with access to our online version as well :D. And we crawl every night, so online is always up to date (yes, if their website is down, we are not down ;) ). This has been done for more than 10 years already and is very awesome to do. Just wanted to share this: there is money to be made from free info, and it’s legal.

  3. Hell, I’ve had my boss walk in and ask me for all of the data from a competitor’s site before, and he wanted it yesterday, in Excel, and with the ability to update it himself.

    It taught me that you can even scrape specific sites with VBScript through an Excel form, if that is your kick and you have 10 minutes to get it done.

  4. I’m going to be the oddest one here probably. I semi-regularly scrape sites for info and the language I’ve started to use the most is not Ruby, Python, PHP, C/C++ or anything close. I’ve been using REXX, the old mainframe language, actually ooREXX to be specific. Its parser does exactly what I need and does it fast.

  5. Probably the cheapest way I’ve ever done scraping is through a combination of lynx scripts and awk…BY FAR not the best way to do it, but it was stupid simple and did the trick at the time.
