Web Scraping Amazon And Rotten Tomatoes


[Rajesh] put web scraping to good use to gather the information that matters to him, and he’s published two posts about it. One scrapes Amazon daily to see if the books he wants to read have dropped to a certain price threshold. The other scrapes Rotten Tomatoes to display the audience score next to the critics’ score for the top rental movies.

Web scraping uses scripts to gather information programmatically from HTML rather than using an API to access data. We recently featured a conceptual tutorial on the topic, and even came across a hack that scraped all of our own posts. [Rajesh’s] technique is pretty much the same.

He’s using Python scripts with the Beautiful Soup module to parse the DOM tree for the information he’s after. In the case of the Amazon script, he sets a target price for a specific book and automatically gets an email when the price reaches it. With Rotten Tomatoes he sometimes likes to see the audience score when considering a movie, but you can’t get it from the list on the website; you have to click through to each movie. His script keeps a database so that it doesn’t continually scrape the same information. The collected numbers are displayed alongside the critics’ scores as seen above.
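
The caching idea in the Rotten Tomatoes script can be sketched roughly as follows. This is not [Rajesh’s] code: it is a minimal illustration that assumes a SQLite table keyed by movie URL, and the names `open_cache`, `get_audience_score`, and `fetch_score` are hypothetical. The actual scraping (the Beautiful Soup click-through to each movie page) is left out and passed in as a function, so the sketch only shows how a database lookup avoids repeating a request.

```python
import sqlite3
from typing import Callable


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the score cache. ':memory:' keeps the demo self-contained."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS audience_scores ("
        "movie_url TEXT PRIMARY KEY, score INTEGER NOT NULL)"
    )
    return con


def get_audience_score(
    con: sqlite3.Connection,
    movie_url: str,
    fetch_score: Callable[[str], int],
) -> int:
    """Return the stored score, calling fetch_score only for movies not seen before."""
    row = con.execute(
        "SELECT score FROM audience_scores WHERE movie_url = ?", (movie_url,)
    ).fetchone()
    if row is not None:
        return row[0]  # already scraped once; no request needed

    score = fetch_score(movie_url)  # assumed to do the slow per-movie page request
    con.execute(
        "INSERT INTO audience_scores (movie_url, score) VALUES (?, ?)",
        (movie_url, score),
    )
    con.commit()
    return score


if __name__ == "__main__":
    requests_made = []

    def fake_fetch(url: str) -> int:
        # Stand-in for the real scraper; the URL and score are made up.
        requests_made.append(url)
        return 87

    con = open_cache()
    url = "https://example.com/m/some_movie"
    get_audience_score(con, url, fake_fetch)
    get_audience_score(con, url, fake_fetch)
    print(len(requests_made), "request(s) made for two lookups")
```

Audience scores change over time, so a real script would probably also want some way to expire old rows; that is left out here to keep the sketch short.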

29 thoughts on “Web Scraping Amazon And Rotten Tomatoes”

  1. Considering Amazon has been known to drop or inflate prices based on browser and location, he should modify his script to pass various user agent strings and also check the same page through a few proxies. You never know what deals could be had.

  2. 1) We took down his site.

    2) I can’t seem to get the code to run in Ubuntu Server 12.04. I get this error:

    Traceback (most recent call last):
      File "amazon.py", line 54, in <module>
        con = title.contents
    AttributeError: 'NoneType' object has no attribute 'contents'

    Beautiful Soup is not returning any results on either query. I checked the data being passed, and it is indeed in there. Any advice?

  3. Hooray for scraping! (Big fan here)

    Careful – while the laws are grey, some businesses hate when people scrape their data…they tend to try to enforce “using an approved web browser” on their sites, although it’s difficult and expensive for them to actually pursue litigation.

    And uh, mind those applications that hammer sites with multiple, simultaneous requests for data…they REALLY hate that one since it’s similar to a DoS attack. ;oP

  4. I’ve been using scraping for heaps of things. I wrote a price tracker for a local computer store in Australia (MSY), I made an XBMC addon that displays sports scores while watching TV, I also wrote an IMDB scraper for XBMC ratings.

    The possibilities are endless… but like previously mentioned, be careful not to hammer the site in question!

  5. …thinks of how many times he’s done this as a freelancer for eBay and Amazon with and without APIs…

    Google are the only ones you have to use proxies with; you used to be able to use their toolbar query with timeouts…

  6. Looks like it is from http://www.amazon.com/The-Elegant-Universe-Superstrings-Dimensions/dp/0375708111/ . It is not in stock at Amazon, so it does not have a price available to parse. Try the Kindle edition http://www.amazon.com/The-Elegant-Universe-Superstrings-ebook/dp/B001P7GGRS/ . Also, I strip all the referral information from the end of the link (it was “ref=tmm_kin_title_0”). I don’t know if it makes a difference, but I like to work with the most direct link.

  7. I’d love to do an app for this on eBay…I looked at their API a while ago but it seemed really annoying to use. I just wanted to type in a search and immediately pop up the average price of completed listings over the past week. Can anyone point me to a good place to learn how to do this?

    1. Their query restrictions and query types make that only possible with a caching mechanism…

      I actually wrote a PHP+curl daemon scraper using their API once, not sure if the company still uses it.

