Web scraping Amazon and Rotten Tomatoes


[Rajesh] put web scraping to good use, gathering the information that's important to him. He's published two posts about it. One scrapes Amazon daily to see if the books he wants to read have dropped to a certain price threshold. The other scrapes Rotten Tomatoes to display the audience score next to the critics' score for the top-renting movies.

Web scraping uses scripts to gather information programmatically from HTML rather than using an API to access data. We recently featured a conceptual tutorial on the topic, and even came across a hack that scraped all of our own posts. [Rajesh's] technique is pretty much the same.

He's using Python scripts with the Beautiful Soup module to parse the DOM tree for the information he's after. In the case of the Amazon script, he sets a target price for a specific book and gets an email automatically when the price reaches that level. With Rotten Tomatoes, he sometimes likes to see the audience score when considering a movie, but the site doesn't show it on the listing page; you have to click through to each movie. His script keeps a database so that it doesn't repeatedly scrape the same information. The collected numbers are displayed alongside the critics' scores as seen above.
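
As a rough illustration of the Amazon half, a minimal sketch of the idea might look like the following. This is not [Rajesh's] actual script; the requests library, the price-element lookup, and the mail settings are assumptions standing in for whatever his code really does.

import smtplib
from email.mime.text import MIMEText

import requests
from bs4 import BeautifulSoup

# Placeholders -- swap in the real book URL, target price, and mail addresses.
BOOK_URL = "http://www.amazon.com/dp/EXAMPLE"
TARGET_PRICE = 10.00
FROM_ADDR = "alerts@example.com"
TO_ADDR = "me@example.com"

def current_price(url):
    """Fetch the product page and pull the listed price out of the DOM."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")
    # The id used here is a guess; the element holding the price varies, and
    # out-of-stock items may have no price at all, so guard against a missing tag.
    tag = soup.find(id="priceblock_ourprice")
    if tag is None:
        return None
    return float(tag.get_text().strip().lstrip("$").replace(",", ""))

def send_alert(price):
    """Mail a one-line notification through a local SMTP server."""
    msg = MIMEText("Price dropped to $%.2f: %s" % (price, BOOK_URL))
    msg["Subject"] = "Amazon price alert"
    msg["From"] = FROM_ADDR
    msg["To"] = TO_ADDR
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)

if __name__ == "__main__":
    price = current_price(BOOK_URL)
    if price is not None and price <= TARGET_PRICE:
        send_alert(price)

Run once a day, that covers the price watch. The Rotten Tomatoes script follows the same fetch-and-parse pattern, with the addition of a database that records which scores have already been collected so the same pages aren't scraped over and over.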

Comments

  1. I would do the same thing with NewEgg and HDD prices, so that I could easily sort by $/GB.

  2. Mike says:

    http://camelcamelcamel.com does this for free
    much nicer user interface too

  3. Kelly says:

    Web scraping opens up an endless number of possibilities. I personally code in PHP and use Simple HTML DOM.

  4. Kris Lee says:

    I’m wondering how they come up with those silly names.

  5. kELal1862 says:

    Using a whole framework for scraping…
    tsk… Real Gurls use plain Rexx and do all the scraping with just ONE gigantic PARSE VAR statement :-P

  6. Trent says:

    Considering Amazon has been known to drop or inflate prices based on browser and location, he should modify his script to pass various user-agent strings and also run the same check through a few proxies. You never know what deals could be had. (A rough sketch of the idea follows the comments.)

  7. You know Rotten Tomatoes has a free API?

    http://developer.rottentomatoes.com/

  8. Rajesh Verma says:

    Camelcamelcamel does not track digital items such as Amazon Kindle books. That was the reason for making the script.

  9. 1) We took down his site.

    2) I can’t seem to get the code to run in Ubuntu Server 12.04. I get this error:

    Traceback (most recent call last):
      File "amazon.py", line 54, in
        con = title.contents
    AttributeError: 'NoneType' object has no attribute 'contents'

    BeautifulSoup is not returning any results on either query. I checked the data being passed, and it is indeed in there. Any advice?

  10. Panikos says:

    Yahoo Pipes is also an easy, quick option.

  11. supershwa says:

    Hooray for scraping! (Big fan here)

    Careful – while the laws are grey, some businesses hate when people scrape their data…they tend to try to enforce “using an approved web browser” on their sites, although it’s difficult and expensive for them to actually pursue litigation.

    And uh, mind those applications that hammer sites with multiple, simultaneous requests for data…they REALLY hate that one since it’s similar to a DoS attack. ;oP

  12. Cosmic R says:

    I’ve been using scraping for heaps of things. I wrote a price tracker for a local computer store in Australia (MSY), I made an XBMC addon that displays sports scores while watching TV, I also wrote an IMDB scraper for XBMC ratings.

    The possibilities are endless… but like previously mentioned, be careful not to hammer the site in question!

  13. fonz says:

    just wait for the black helicopters to come pick you up to get eaten by lawyers for violating some obscure ToS

  14. cianof says:
  15. xorpunk says:

    …thinks of how many times he’s done this as a freelancer for ebay and amazon with and without APIs…

    Google is the only one you have to use proxies with; you used to be able to use their toolbar query with timeouts…

  16. Willrandship says:

    One does not make Beautiful Soup from Rotten Tomatoes.

  17. Rajesh Verma says:

    Looks like it is from http://www.amazon.com/The-Elegant-Universe-Superstrings-Dimensions/dp/0375708111/ . It is not in stock at Amazon, so it does not have a price available to parse. Try the Kindle edition http://www.amazon.com/The-Elegant-Universe-Superstrings-ebook/dp/B001P7GGRS/ . Also, I strip all the referral information from the end of the link (it was "ref=tmm_kin_title_0"). I don't know if it makes a difference, but I like to work with the most direct link.

  18. Tyson says:

    I'd love to do an app for this on eBay… I looked at their API a while ago, but it seemed really annoying to use. I just wanted to type in a search and immediately pop up the average price of completed listings over the past week. Can anyone point me to a good place to learn how to do this?

    • xorpunk says:

      Their query restrictions and query types make that only possible with a caching mechanism…

      I actually wrote a PHP+curl daemon scraper using their API once, not sure if the company still uses it.

  19. George says:

    Another nice Python alternative is http://packages.python.org/pyquery/

    If you prefer JavaScript, PhantomJS is a headless WebKit-based browser.
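
For Trent's user-agent and proxy suggestion in comment 6, a rough sketch of the idea is below. The agent strings, proxy addresses, and the use of the requests library are placeholders rather than anything from the original script.

import requests

# Placeholder user-agent strings and proxies; real values would be swapped in.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/536.26",
    "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0",
]
PROXIES = [
    None,                                        # direct connection
    {"http": "http://proxy1.example.com:8080"},
    {"http": "http://proxy2.example.com:8080"},
]

def fetch_variants(url):
    """Fetch the same page once per user-agent and proxy combination,
    returning the raw HTML of each so the listed prices can be compared."""
    pages = []
    for agent in USER_AGENTS:
        for proxy in PROXIES:
            resp = requests.get(url, headers={"User-Agent": agent},
                                proxies=proxy, timeout=10)
            pages.append((agent, proxy, resp.text))
    return pages

Each fetched variant would then go through the same price-parsing step as the daily check, and, as supershwa and Cosmic R note above, pacing the requests (a time.sleep between them, for instance) keeps the script from hammering the site.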
