[Rajesh] put web scraping to good use, gathering the information that matters to him. He’s published two posts about it. One scrapes Amazon daily to see if the books he wants to read have dropped to a certain price threshold. The other scrapes Rotten Tomatoes so it can display the audience score next to the critics’ score for the top-renting movies.
Web scraping uses scripts to gather information programmatically from HTML rather than using an API to access data. We recently featured a conceptual tutorial on the topic, and even came across a hack that scraped all of our own posts. [Rajesh’s] technique is pretty much the same.
He’s using Python scripts with the Beautiful Soup module to parse the DOM tree for the information he’s after. In the case of the Amazon script, he sets a target price for a specific book and automatically gets an email when the price drops to that level. With Rotten Tomatoes, he sometimes wants to see the audience score when considering a movie, but the site’s list doesn’t show it; you have to click through to each movie. His script keeps a database so that it doesn’t repeatedly scrape the same information. The collected numbers are displayed alongside the critics’ scores, as seen above.
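The general shape of such a script is compact. Here’s a minimal sketch of the approach, not [Rajesh’s] actual code; the URL, the tag being searched for, and the price threshold are all placeholders:

```python
# Minimal sketch of the price-check idea. The URL, the tag being searched
# for, and the threshold are all placeholders; inspect the real page's
# source to find where the price actually lives.
import urllib2
from bs4 import BeautifulSoup

TARGET_PRICE = 5.99
URL = 'http://www.amazon.com/dp/EXAMPLEASIN/'  # hypothetical product page

html = urllib2.urlopen(URL).read()
soup = BeautifulSoup(html)

# 'price' is a stand-in class name; Amazon's markup changes often.
tag = soup.find('span', {'class': 'price'})
if tag is not None:
    price = float(tag.get_text().strip().lstrip('$'))
    if price <= TARGET_PRICE:
        print('Price dropped to $%.2f, time to buy' % price)
```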
I would do the same thing with NewEgg and HDD prices, so that I could easily sort by $/GB.
With Newegg, you don’t even need to scrape! There’s an unofficial API, used by the Newegg.com app to get product data, that other people have discovered and documented. This page describes it well: http://www.bemasher.net/archives/1002
http://camelcamelcamel.com does this for free
much nicer user interface too
This is the post I was going to make. While I appreciate the hackery involved with web scraping, it’s always best to let someone else do it if you can ;-)
http://scrapy.org/
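For the curious, a bare-bones Scrapy spider looks something like this (the site and CSS selectors are made up):

```python
import scrapy

class PriceSpider(scrapy.Spider):
    name = 'prices'
    # hypothetical listing page; swap in the site you actually care about
    start_urls = ['http://example.com/products']

    def parse(self, response):
        # the CSS selectors here are placeholders; match them to the markup
        for product in response.css('div.product'):
            yield {
                'title': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
            }
```

Run it with `scrapy runspider spider.py -o prices.json` and Scrapy handles the crawling, throttling, and export for you.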
Web scraping opens up an endless number of possibilities. I personally code in PHP and use Simple HTML DOM.
I’m wondering how they come up with those silly names.
Using a whole framework for scraping…
tsk… Real Gurls use plain Rexx and do all the scraping with just ONE gigantic PARSE VAR statement :-P
Considering Amazon has been known to drop or inflate prices based on browser and location, he should modify his script to pass various user-agent strings and also check the same pages through a few proxies. You never know what deals could be had.
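Something like this would work with urllib2; the agent strings are just examples and the proxy address is a placeholder:

```python
import random
import urllib2

# Example agent strings; use whichever browsers you want to impersonate.
AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/536.26',
    'Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/17.0 Firefox/17.0',
]

def fetch(url, proxy=None):
    """Fetch a page with a random User-Agent, optionally through a proxy."""
    opener = urllib2.build_opener()
    if proxy:  # e.g. '127.0.0.1:8118', a placeholder address
        opener.add_handler(urllib2.ProxyHandler({'http': proxy}))
    opener.addheaders = [('User-Agent', random.choice(AGENTS))]
    return opener.open(url).read()
```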
Do you have any reference, or is this personal experience?
They adjusted book prices on this guy pretty much right after he’d bought the book:
http://www.tbray.org/ongoing/When/201x/2012/10/17/Sandman-Pricing
They also are known to adjust prices throughout the day, with noon pricing higher than 2PM pricing, etc.
And here’s an interesting opinion from a lawyer on scraping Amazon’s prices:
http://storefrontbacktalk.com/e-commerce/window-shopping-felonies/
Not a direct answer to your question, Kris, but it’s pretty much the same thing:
http://verdict.justia.com/2012/07/03/the-orbitz-controversy-why-steering-mac-users-toward-higher-priced-hotels-is-arguably-wrong-and-what-might-be-done-about-it
You know RottenTomatoes has a free API?
http://developer.rottentomatoes.com/
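Querying it is straightforward. The endpoint and field names below follow the v1.0 docs linked above, so double-check them there; YOUR_KEY_HERE is a placeholder for a registered key:

```python
import json
import urllib2

# Sketch of an API lookup instead of scraping. The endpoint and field
# names follow the v1.0 docs linked above; double-check them there.
API_KEY = 'YOUR_KEY_HERE'  # placeholder; registration for a key is free
url = ('http://api.rottentomatoes.com/api/public/v1.0/movies.json'
       '?apikey=%s&q=Looper' % API_KEY)

data = json.loads(urllib2.urlopen(url).read())
for movie in data['movies']:
    print('%s: critics %s, audience %s' % (
        movie['title'],
        movie['ratings']['critics_score'],
        movie['ratings']['audience_score'],
    ))
```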
Camelcamelcamel does not track digital items such as Kindle books. That was the reason I made the script.
1) We took down his site.
2) I can’t seem to get the code to run in Ubuntu Server 12.04. I get this error:
Traceback (most recent call last):
File "amazon.py", line 54, in
con = title.contents
AttributeError: 'NoneType' object has no attribute 'contents'
BeautifulSoup is not returning any results on either query. I checked the data being passed, and it is indeed in there. Any advice?
Ok, I’m working on getting it back up. I just tried with 3 books and 1 item and it was working fine. Screenshot: http://imgur.com/grThF5e . I remember I had to handle the title differently than usual. I normally use element.string; for the title I had to use element.contents.
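For anyone who hasn’t run into this: Beautiful Soup’s .string comes back as None whenever a tag has more than one child, while .contents always returns the list of children. A quick illustration (the snippet of HTML is made up):

```python
from bs4 import BeautifulSoup

# A made-up snippet: the <span> has two children, text plus an <i> tag.
soup = BeautifulSoup('<span>The Elegant <i>Universe</i></span>')
tag = soup.span

print(tag.string)      # None, because the tag has more than one child
print(tag.contents)    # [u'The Elegant ', <i>Universe</i>]
print(tag.get_text())  # 'The Elegant Universe', flattened text
```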
This is the output from urllib2: http://pastebin.com/pHa58nF9
It’s the version of Beautiful Soup you are using causing that problem. Later versions are less forgiving. There’s a lot of talk about it on one of the BS lists.
Yahoo Pipes is also an easy, quick option
Hooray for scraping! (Big fan here)
Careful – while the laws are grey, some businesses hate it when people scrape their data… they tend to try to enforce "approved web browser only" clauses on their sites, although it’s difficult and expensive for them to actually pursue litigation.
And uh, mind those applications that hammer sites with multiple, simultaneous requests for data…they REALLY hate that one since it’s similar to a DoS attack. ;oP
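Spacing requests out is cheap insurance. A minimal sketch of the idea (the URLs are placeholders, and five seconds is an arbitrary choice):

```python
import time
import urllib2

# Placeholder URLs; the point is the pause between requests.
urls = ['http://example.com/page/%d' % n for n in range(1, 6)]

for url in urls:
    html = urllib2.urlopen(url).read()
    # ... parse html here ...
    time.sleep(5)  # be polite: one visitor, not a flood
```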
I’ve been using scraping for heaps of things. I wrote a price tracker for a local computer store in Australia (MSY), I made an XBMC addon that displays sports scores while watching TV, I also wrote an IMDB scraper for XBMC ratings.
The possibilities are endless… but like previously mentioned, be careful not to hammer the site in question!
just wait for the black helicopters to come pick you up to get eaten by lawyers for violating some obscure ToS
The site’s dead…
Google cache is here:
http://webcache.googleusercontent.com/search?q=cache:rawdust.com/amazon/amazon-kindle-price-alerts.htm
useful tip
http://webcache.googleusercontent.com/search?q=cache:ADDYOURSITEWITHOUTHTTP
…thinks of how many times he’s done this as a freelancer for eBay and Amazon, with and without APIs…
Google are the only ones you have to use proxies with; you used to be able to use their toolbar query with timeouts…
One does not make Beautiful Soup from Rotten Tomatoes.
Looks like it is from http://www.amazon.com/The-Elegant-Universe-Superstrings-Dimensions/dp/0375708111/ . It is not in stock at Amazon, so there is no price available to parse. Try the Kindle edition http://www.amazon.com/The-Elegant-Universe-Superstrings-ebook/dp/B001P7GGRS/ . Also, I strip all the referral information from the end of the link (here it was "ref=tmm_kin_title_0"). I don’t know if it makes a difference, but I like to work with the most direct link.
I’d love to do an app for this on eBay… I looked at their API a while ago but it seemed really annoying to use. I just wanted to type in a search and immediately pop up the average price of completed listings over the past week. Can anyone point me to a good place to learn how to do this?
Their query restrictions and query types make that only possible with a caching mechanism…
I actually wrote a PHP+curl daemon scraper using their API once, not sure if the company still uses it.
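For reference, eBay’s Finding API has a findCompletedItems call that covers sold listings. A rough sketch of what a lookup might look like; the endpoint and parameter names are from the Finding API docs, but verify them before relying on this, and the app ID and keywords are placeholders:

```python
import json
import urllib

# Rough sketch of a completed-listings lookup via eBay's Finding API.
# Parameter names are from the Finding API docs; 'YOUR-APP-ID' and the
# keywords are placeholders, and the response really is this nested.
params = urllib.urlencode({
    'OPERATION-NAME': 'findCompletedItems',
    'SERVICE-VERSION': '1.0.0',
    'SECURITY-APPNAME': 'YOUR-APP-ID',
    'RESPONSE-DATA-FORMAT': 'JSON',
    'keywords': 'thinkpad x220',
})
url = 'http://svcs.ebay.com/services/search/FindingService/v1?' + params

data = json.loads(urllib.urlopen(url).read())
items = data['findCompletedItemsResponse'][0]['searchResult'][0]['item']
prices = [float(i['sellingStatus'][0]['currentPrice'][0]['__value__'])
          for i in items]
print('Average sold price: $%.2f' % (sum(prices) / len(prices)))
```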
Another nice python alternative is http://packages.python.org/pyquery/
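pyquery gives you jQuery-style CSS selectors in Python. A minimal example (fetching a stand-in URL):

```python
from pyquery import PyQuery as pq

# pyquery fetches the page itself and exposes jQuery-style selectors.
d = pq(url='http://example.com/')  # stand-in URL
print(d('h1').text())
```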
If you prefer JavaScript, PhantomJS is a headless WebKit-based browser.