Web scraping tutorial


Web scraping is the act of programmatically harvesting data from a webpage. It boils down to two steps: working out how the URLs of the pages containing useful information are formatted, and then parsing the DOM tree to get at the data. It’s a bit finicky, but in our experience it’s easier than it sounds. That’s especially true if you take some of the tips from this web scraping tutorial.

It’s more of an intermediate tutorial, as it doesn’t feature any code. But if you can bring yourself up to speed on Python and BeautifulSoup, the rest is not hard to implement by trial and error. [Hartley Brody] discusses investigating how the GET requests for your webpage of choice are formed. Once that URL syntax has been figured out, just look through the page source for tags (CSS classes or otherwise) that can be used as hooks to get at your target data.

So what can this be used for? A lot of things. We’d suggest reading the Reddit comments as there are several real-world uses discussed there. But one that immediately pops to mind is the picture harvesting [Mark Zuckerberg] used when he created Facemash.
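The fetch-then-parse workflow described above can be sketched in a few lines of Python with BeautifulSoup. Everything below is a made-up illustration, not from any real site: in practice the HTML would come from an HTTP GET (e.g. `requests.get(url).text`), and the class names used as hooks would come from reading that page’s source.

```python
from bs4 import BeautifulSoup

# A small inline snippet stands in for a fetched page here.
html = """
<div class="product"><h2 class="title">Widget</h2><span class="price">$9.99</span></div>
<div class="product"><h2 class="title">Gadget</h2><span class="price">$19.99</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# The CSS classes are the "hooks": find them by reading the page source.
items = []
for product in soup.select("div.product"):
    name = product.select_one("h2.title").get_text(strip=True)
    price = product.select_one("span.price").get_text(strip=True)
    items.append((name, price))

print(items)
```

The same loop works unchanged once the inline snippet is swapped for a real page, which is the trial-and-error part: tweak the selectors until the printed list matches what you see in the browser.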

Comments

  1. joshuac says:

    Amusingly “any content that can be viewed as a webpage” is embedded as a graphic, which is difficult/computationally very expensive to scrape.

  2. mohonri says:

    It’s all nice and easy (and I’ve done it a number of times) unless the site loads the page dynamically, at which point it can become considerably more difficult.

    • joshuac says:

      Are you referring to grabbing the text out of the image (computationally expensive) or just grabbing the image? Grabbing the image is so basic I wouldn’t really consider that parsing and scraping (you’re just transferring a file that is already formatted in a precise way).

    • g19fanatic says:

      You are referring to scraping a webpage that loads content through additional AJAX calls. This is also trivial to do (sometimes even easier) when you have tools such as Firebug that let you see all of the AJAX calls a page performs (and the format of their POST data, if present)…

  3. SoMuchYiff! says:

    I sure do love how-to programming tutorials with zero code boy-howdy!

    I guess that’s what to be expected from a “growth hacker” or “startup lover”…

    • hartleybrody says:

      I didn’t include any code samples because I didn’t want to limit the techniques to any one language community. Some people chimed in in the comments with libraries for all sorts of languages I’ve never used.

      I’d be happy to share some code samples if you’d like to read some. In fact, here’s a sample to start with: https://github.com/hartleybrody/py/blob/master/get_inc_5000.py#L66

      It’s a bit complex because there’s a lot going on (it was my first time using RabbitMQ), but around line 66 there’s some pretty easy-to-read scraping code. Hope that helps!

      • JoeSponge says:

        Don’t pay any mind to them calling you a “startup lover”… everybody has to start dating some time, and when they do, they’re ALL startup Lovers.

        Now… back in the day, before this “Ajax” thing…

        ASPTear — yes, it was cheating, I didn’t write it myself, but the son-of-a-gun worked and worked well. I wound up making my own “news page”, scraping comics and stock quotes and tech stories…

        A little ASP, a little HTML, and some Regex, and voila…

        Wound up actually using it for WORK, which took a lot of the fun out of it. But still, it was scraping.

        Don’t judge me!

  4. You can actually prevent people from scraping your website. All you have to do is add some PHP or jQuery code to generate a random number and add random numbers to your classes and IDs. Amazon does that. The number can change with every refresh, and you can push it to your CSS to match your PHP file.

  5. kitsune361 says:

    Good, he mentions using BeautifulSoup and not regex. As you may already know, using regexp to parse HTML is a sure fire way to summon the Great Old Ones from their slumber eternal.

    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

    • T4b says:

      Sometimes, regular expressions work great for getting some information out of an HTML page.
      As long as I don’t want to do something where the regex approach doesn’t work, I don’t see any reason not to use it.
      And so far it has always worked for me.

    • Drone says:

      What is the obsession with Python? There is no one right tool for this job, but the best tool IMO is Perl with LWP and the likes of HTML::Parser. Beautiful Soup is really just an abstraction, if not an excuse for not properly learning and using regular expressions. Properly applied regex will cleverly parse even malformed markup mixed in among the good. O’Reilly has a nice “Perl & LWP” book available for free download that’ll get you started in next to no time.

  6. jc says:

    try goutte, it’s awesome for web scraping

    • Wayne says:

      Except trying to learn Goutte is next to impossible for the beginner. The only tutorial I can find is their readme.md. I have been searching everywhere for how to properly use Goutte. What options are available? How do you tell it to fill in a radio button on a form? How do you know when you are at the next page of a submitted form? Nothing anywhere. Goutte may be great for people who already know it or are involved in it, but not so good for the person trying to figure out how to use it.

  7. gr0wlithe says:

    Yahoo’s YQL is also an awesome tool for dynamically scraping pages and exporting the results as a JSON feed :)

  8. This doesn’t actually show you how to do it; it would be good if it did.

  9. James says:

    On the fairly unusual occasion where I’ve had to screen-scrape something, I’ve found SimpleHTMLDom (PHP) to be handy; it parses (potentially invalid/badly formed) HTML on a best-effort basis and presents a DOM-like object structure.

    http://sourceforge.net/projects/simplehtmldom/

    Easier than messing about digging with various regexps.

  10. COde says:

    The company I work for scrapes websites for commercial purposes. Let’s see: the law is publicly viewable, but the official website sucks. Search results are poor, and not everything is even found. Responses are slow, and exporting to PDF/XML only kind of works. Lawyers are allowed to use a laptop in court, but no internet. So do they take all the books, or download the complete website? No, they come to us and buy a CD. We scrape all the content, wrap it in our XML database, do some magic (the kind of thing most of you understand, but lawyers and judges mostly don’t), and profit :D. A CD with XML data (viewable in a browser, straight from the CD), searchable, very quick, plus access to our online version as well :D. We crawl every night, so the online version is always up to date (and yes, if their website is down, we are not ;) ). This has been going on for more than 10 years already and is very awesome to do. Just wanted to share this: there is money to be made from free info, and it’s legal.

  11. hardcorefs says:

    Python for web scraping…….. LOL
    It would be hard to choose a slower language.

  12. Morden Tral says:

    Hell, I’ve had my boss walk in and ask for all of the data from a competitor’s site before, and he wanted it yesterday, in Excel, and with the ability to update it himself.

    It taught me that you can even scrape specific sites with VBScript through an Excel form, if that is your kick and you have 10 minutes to get it done.

  13. ehrichweiss says:

    I’m going to be the oddest one here probably. I semi-regularly scrape sites for info and the language I’ve started to use the most is not Ruby, Python, PHP, C/C++ or anything close. I’ve been using REXX, the old mainframe language, actually ooREXX to be specific. Its parser does exactly what I need and does it fast.

  14. jpspadaro says:

    Probably the cheapest way I’ve ever done scraping is through a combination of lynx scripts and awk…BY FAR not the best way to do it, but it was stupid simple and did the trick at the time.

  15. sahil says:

    There is a nice eBook on web scraping for PHP programmers here:

    http://codediesel.com/products/web-scraping/

  16. Michael Moranto says:

    Just found that there is a new website for PHP web scraping and spidering

    http://php8legs.com/

  17. solarscourge says:

    You could always use an online tool like GrabzIt (http://grabz.it/scraper); it comes with a powerful toolset to help you get and format your data!

  18. Den Ryan says:

    Web scraping, industrialized: the following video demonstrates how easy it is to pull financial data from the Internet with no coding.

    https://appliedalgo.com/appliedalgoweb/doc/AppliedAlgo%20Application%20Scenario%20Internet%20Download.pdf

  19. I found something like this. See this blog post on data scraping using cURL in PHP:

    http://www.codefire.org/blogs/item/data-scraping-using-curl-in-php.html
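On the dynamically loaded pages that [mohonri] and [g19fanatic] discuss above: once the browser’s network panel (Firebug, in the era of this post) reveals the AJAX endpoint a page calls, you can skip the HTML entirely and hit that endpoint yourself. A minimal sketch; the endpoint, parameters, and field names are all hypothetical, and a canned response stands in for the network call:

```python
import json

# In practice you would replay the call the page makes, e.g.:
#   resp = requests.post("https://example.com/api/search", data={"q": "fpga"})
#   payload = resp.json()
# Here a canned string stands in for resp.json() so the parsing is runnable.
payload = json.loads('{"results": [{"id": 42, "title": "FPGA dev board"}], "total": 1}')

# JSON endpoints hand you structured data directly: no DOM, no selectors.
titles = [hit["title"] for hit in payload["results"]]
print(payload["total"], titles)
```

This is why commenters call the AJAX case “sometimes even easier”: the endpoint returns machine-readable data by design, so there is nothing to parse beyond the JSON itself.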
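On the regex debate above: for narrowly scoped extraction a regular expression can be enough, as [T4b] says, while the Stack Overflow answer [kitsune361] links explains why a real parser is the safer default for general HTML. A tiny illustration of the narrow case, using made-up markup:

```python
import re

# Fine for well-behaved, narrowly scoped markup; brittle on messy
# real-world HTML, where a parser like BeautifulSoup is the safer bet.
html = '<a href="/hacks/1">one</a> <a href="/hacks/2">two</a>'
links = re.findall(r'href="([^"]+)"', html)
print(links)
```

The pattern works here because the attribute quoting is uniform; nested tags, unquoted attributes, or embedded quotes are exactly where this approach starts to break down.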
