Web scraping tutorial


Web scraping is the act of programmatically harvesting data from a webpage. It boils down to two steps: working out how the URLs of the pages containing useful information are formatted, and then parsing the DOM tree to get at the data. It’s a bit finicky, but in our experience it’s easier than it sounds. That’s especially true if you take some of the tips from this web scraping tutorial.

It’s more of an intermediate tutorial, as it doesn’t feature any code. But if you can bring yourself up to speed on Python and BeautifulSoup, the rest is not hard to implement by trial and error. [Hartley Brody] discusses investigating how the GET requests for your webpage of choice are formed. Once that URL syntax has been figured out, just look through the page source for tags (CSS classes or otherwise) that can be used as hooks to get at your target data.

So what can this be used for? A lot of things. We’d suggest reading the Reddit comments as there are several real-world uses discussed there. But one that immediately pops to mind is the picture harvesting [Mark Zuckerberg] used when he created Facemash.
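The fetch-then-parse workflow described above can be sketched in a few lines of Python with BeautifulSoup. Everything below is a made-up illustration, not from any real site: in practice the HTML would come from an HTTP GET (e.g. `requests.get(url).text`), and the class names used as hooks would come from reading that page’s source.

```python
from bs4 import BeautifulSoup

# A small inline snippet stands in for a fetched page here.
html = """
<div class="product"><h2 class="title">Widget</h2><span class="price">$9.99</span></div>
<div class="product"><h2 class="title">Gadget</h2><span class="price">$19.99</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# The CSS classes are the "hooks": find them by reading the page source.
items = []
for product in soup.select("div.product"):
    name = product.select_one("h2.title").get_text(strip=True)
    price = product.select_one("span.price").get_text(strip=True)
    items.append((name, price))

print(items)
```

The same loop works unchanged once the inline snippet is swapped for a real page, which is the trial-and-error part: tweak the selectors until the printed list matches what you see in the browser.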

Comments

  1. joshuac says:

    Amusingly “any content that can be viewed as a webpage” is embedded as a graphic, which is difficult/computationally very expensive to scrape.

  2. mohonri says:

    It’s all nice and easy (and I’ve done it a number of times) unless the site loads the page dynamically, at which point it can become considerably more difficult.

    • joshuac says:

      Are you referring to grabbing the text out of the image (computationally expensive) or just grabbing the image? Grabbing the image is so basic I wouldn’t really consider that parsing and scraping (you’re just transferring a file that is already formatted in a precise way).

    • g19fanatic says:

      You are referring to scraping a webpage that loads content through additional AJAX calls. This is also trivial to do (sometimes even easier) when you have tools such as Firebug that let you see all of the AJAX calls a page performs (and the format of their POST data, if present)…

  3. SoMuchYiff! says:

    I sure do love how-to programming tutorials with zero code boy-howdy!

    I guess that’s what to be expected from a “growth hacker” or “startup lover”…

    • hartleybrody says:

      I didn’t include any code samples because I didn’t want to limit the techniques to any one language community. Some people chimed in in the comments with libraries for all sorts of languages I’ve never used.

      I’d be happy to share some code samples if you’d like to read some. In fact, here’s a sample to start with: https://github.com/hartleybrody/py/blob/master/get_inc_5000.py#L66

      It’s a bit complex because there’s a lot going on (it was my first time using RabbitMQ), but around line 66 there’s some pretty easy-to-read scraping code. Hope that helps!

      • JoeSponge says:

        Don’t pay any mind to them calling you a “startup lover”… everybody has to start dating some time, and when they do, they’re ALL startup Lovers.

        Now… back in the day, before this “Ajax” thing…

        ASPTear — yes, it was cheating, I didn’t write it myself, but the son-of-a-gun worked and worked well. I wound up making my own “news page”, scraping comics and stock quotes and tech stories…

        A little ASP, a little HTML, and some Regex, and voila…

        Wound up actually using it for WORK, which took a lot of the fun out of it. But still, it was scraping.

        Don’t judge me!

  4. You can actually prevent people from scraping your website. All you have to do is add some PHP or jQuery code to generate a random number and add random numbers to your classes and IDs. Amazon does that. The number can change with every refresh, and you can push it to your CSS to match your PHP file.

  5. kitsune361 says:

    Good, he mentions using BeautifulSoup and not regex. As you may already know, using regexp to parse HTML is a sure fire way to summon the Great Old Ones from their slumber eternal.

    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

    • T4b says:

      Sometimes, regular expressions work great for getting some information out of an HTML page.
      As long as I don’t want to do something where the regex approach doesn’t work, I don’t see any reason not to use it.
      And so far it has always worked for me.

    • Drone says:

      What is the obsession with Python? There is no one right tool for this job, but the best tool IMO is Perl with LWP and the likes of HTML::Parser. Beautiful Soup is really just an abstraction, if not an excuse for not properly learning and using regular expressions. Properly applied regex will cleverly parse even malformed markup mixed in among the good. O’Reilly has a nice “Perl & LWP” book available for free download that’ll get you started in next to no time.

  6. jc says:

    try goutte, it’s awesome for web scraping

    • Wayne says:

      Except trying to learn Goutte is next to impossible for the beginner. The only tutorial I can find is their readme.md. I have been searching everywhere for how to properly use Goutte. What options are available? How do you tell it to fill in a radio button on a form? How do you know when you are at the next page of a submitted form? Nothing anywhere. Goutte may be great for people who already know it or are involved in it, but not so good for the person trying to figure out how to use it.

  7. gr0wlithe says:

    Yahoo’s YQL is also an awesome tool for dynamically scraping pages and exporting the results as a JSON feed :)

  8. This doesn’t actually show you how to do it; it would be good if it did.

  9. James says:

    On the fairly unusual occasion where I’ve had to screen-scrape something, I’ve found SimpleHTMLDom (PHP) to be handy; it parses (potentially invalid/badly formed) HTML on a best-effort basis and presents a DOM-like object structure.

    http://sourceforge.net/projects/simplehtmldom/

    Easier than messing about digging with various regexps.

  10. COde says:

    The company I work for scrapes websites for commercial purposes. Let’s see: the law is publicly viewable, but the official website sucks. Search results are poor, and not everything is even found. Responses are slow, and exporting to PDF/XML only kind of works. Lawyers are allowed to use a laptop in court, but no internet. So do they take all the books, or download the complete website? No, they come to us and buy a CD. We scrape all the content, wrap it in our XML database, do some magic (the kind of thing most of you understand, but lawyers and judges mostly don’t), and profit :D. A CD with XML data (viewable in a browser, straight from the CD), searchable, very quick, plus access to our online version as well :D. We crawl every night, so the online version is always up to date (and yes, if their website is down, we are not ;) ). This has been going on for more than 10 years already and is very awesome to do. Just wanted to share this: there is money to be made from free info, and it’s legal.

  11. hardcorefs says:

    Python for web scraping…….. LOL
    It would be hard to choose a slower language.

  12. Morden Tral says:

    Hell, I’ve had my boss walk in and ask for all of the data from a competitor’s site before, and he wanted it yesterday, in Excel, and with the ability to update it himself.

    It taught me that you can even scrape specific sites with VBScript through an Excel form, if that is your kick and you have 10 minutes to get it done.

  13. ehrichweiss says:

    I’m going to be the oddest one here probably. I semi-regularly scrape sites for info and the language I’ve started to use the most is not Ruby, Python, PHP, C/C++ or anything close. I’ve been using REXX, the old mainframe language, actually ooREXX to be specific. Its parser does exactly what I need and does it fast.

  14. jpspadaro says:

    Probably the cheapest way I’ve ever done scraping is through a combination of lynx scripts and awk…BY FAR not the best way to do it, but it was stupid simple and did the trick at the time.

  15. sahil says:

    There is a nice eBook on web scraping for PHP programmers here:

    http://codediesel.com/products/web-scraping/

  16. Michael Moranto says:

    Just found that there is a new website for PHP web scraping and spidering

    http://php8legs.com/

  17. solarscourge says:

    You could always use an online tool like GrabzIt (http://grabz.it/scraper); it comes with a powerful toolset to help you get and format your data!

  18. Den Ryan says:

    Web scraping, industrialized: the following video demonstrates how easy it is to pull financial data from the Internet with no coding.

    https://appliedalgo.com/appliedalgoweb/doc/AppliedAlgo%20Application%20Scenario%20Internet%20Download.pdf

  19. I found something like this. See this blog post on data scraping using cURL in PHP:

    http://www.codefire.org/blogs/item/data-scraping-using-curl-in-php.html
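On the dynamically loaded pages that [mohonri] and [g19fanatic] discuss above: once the browser’s network panel (Firebug, in the era of this post) reveals the AJAX endpoint a page calls, you can skip the HTML entirely and hit that endpoint yourself. A minimal sketch; the endpoint, parameters, and field names are all hypothetical, and a canned response stands in for the network call:

```python
import json

# In practice you would replay the call the page makes, e.g.:
#   resp = requests.post("https://example.com/api/search", data={"q": "fpga"})
#   payload = resp.json()
# Here a canned string stands in for resp.json() so the parsing is runnable.
payload = json.loads('{"results": [{"id": 42, "title": "FPGA dev board"}], "total": 1}')

# JSON endpoints hand you structured data directly: no DOM, no selectors.
titles = [hit["title"] for hit in payload["results"]]
print(payload["total"], titles)
```

This is why commenters call the AJAX case “sometimes even easier”: the endpoint returns machine-readable data by design, so there is nothing to parse beyond the JSON itself.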
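On the regex debate above: for narrowly scoped extraction a regular expression can be enough, as [T4b] says, while the Stack Overflow answer [kitsune361] links explains why a real parser is the safer default for general HTML. A tiny illustration of the narrow case, using made-up markup:

```python
import re

# Fine for well-behaved, narrowly scoped markup; brittle on messy
# real-world HTML, where a parser like BeautifulSoup is the safer bet.
html = '<a href="/hacks/1">one</a> <a href="/hacks/2">two</a>'
links = re.findall(r'href="([^"]+)"', html)
print(links)
```

The pattern works here because the attribute quoting is uniform; nested tags, unquoted attributes, or embedded quotes are exactly where this approach starts to break down.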
