Web scraping is the act of programmatically harvesting data from a webpage. It consists of finding a way to format the URLs to pages containing useful information, and then parsing the DOM tree to get at the data. It’s a bit finicky, but our experience is that this is easier than it sounds. That’s especially true if you take some of the tips from this web scraping tutorial.
It is more of an intermediate tutorial, as it doesn’t feature any code. But if you can bring yourself up to speed on using BeautifulSoup and Python, the rest is not hard to implement by trial and error. [Hartley Brody] discusses investigating how the GET requests are formed for your webpage of choice. Once that URL syntax has been figured out, just look through the source code for tags (CSS classes or otherwise) that can be used as hooks to get at your target data.
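To give a flavor of what that looks like, here is a rough sketch in Python using the requests and BeautifulSoup libraries; the URL, query parameter, and class name are placeholders standing in for whatever your target page actually uses.

import requests
from bs4 import BeautifulSoup

# Hypothetical listing page -- the "page" parameter is the kind of URL
# syntax you figure out by watching the GET requests in your browser.
url = "http://example.com/listings"
response = requests.get(url, params={"page": 1})

soup = BeautifulSoup(response.text, "html.parser")

# The class name is a placeholder; in practice you read the page source
# and pick a tag or class that reliably wraps the data you want.
for item in soup.find_all("div", class_="listing-title"):
    print(item.get_text(strip=True))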
So what can this be used for? A lot of things. We’d suggest reading the Reddit comments as there are several real-world uses discussed there. But one that immediately pops to mind is the picture harvesting [Mark Zuckerberg] used when he created Facemash.
Amusingly, “any content that can be viewed as a webpage” is embedded as a graphic, which is difficult (or computationally very expensive) to scrape.
It’s all nice and easy (and I’ve done it a number of times) unless the site loads the page dynamically, at which point it can become considerably more difficult.
Are you referring to grabbing the text out of the image (computationally expensive) or just grabbing the image? Grabbing the image is so basic I wouldn’t really consider that parsing and scraping (you’re just transferring a file that is already formatted in a precise way).
You are referring to scraping any webpage that loads content through additional AJAX calls. This is also trivial to do (sometimes even easier) when you have tools such as Firebug that let you see all of the AJAX calls a page performs (and the format of their POST data, if present)…
Especially when the AJAX calls return serialized data and the page uses its own JS to format it. It’s like having an (undocumented) API.
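For example, something along these lines, with a made-up endpoint and field names, once the network tab shows you the call the page makes:

import requests

# Hypothetical JSON endpoint discovered by watching the page's XHR traffic.
api_url = "http://example.com/api/comments"
params = {"post_id": 42, "offset": 0, "limit": 25}

response = requests.get(api_url, params=params)
data = response.json()  # the "undocumented API" hands back structured data

for comment in data.get("comments", []):
    print(comment.get("author"), "-", comment.get("body"))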
Yeah, but you don’t really want to look at the traffic going on for each site you scrape, unless you’re only interested in a smallish dataset. A headless browser like PhantomJS is the way to go.
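A rough sketch of that approach, assuming Selenium’s old PhantomJS driver and a made-up URL and selector: let the headless browser render the page, then parse whatever the JavaScript filled in.

from bs4 import BeautifulSoup
from selenium import webdriver

# PhantomJS renders the page, including content loaded by JavaScript,
# so we scrape the DOM the way a real browser would see it.
driver = webdriver.PhantomJS()
driver.get("http://example.com/dynamic-page")  # placeholder URL

soup = BeautifulSoup(driver.page_source, "html.parser")
for row in soup.select(".result-row"):  # placeholder selector
    print(row.get_text(strip=True))

driver.quit()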
I sure do love how-to programming tutorials with zero code boy-howdy!
I guess that’s what to be expected from a “growth hacker” or “startup lover”…
I didn’t include any code samples cause I didn’t want to limit the techniques to any one language community. Some people chimed in in the comments with libraries for all sorts of languages I’ve never used.
I’d be happy to share some code samples if you’d like to read some. In fact, here’s a sample to start with: https://github.com/hartleybrody/py/blob/master/get_inc_5000.py#L66
It’s a bit complex cause there’s a lot going on (it was my first time using RabbitMQ), but around line 66, there’s some pretty easy to read scraping code. Hope that helps!
Don’t pay any mind to them calling you a “startup lover”… everybody has to start dating some time, and when they do, they’re ALL startup Lovers.
Now… back in the day, before this “Ajax” thing…
ASPTear — yes, it was cheating, I didn’t write it myself, but the son-of-a-gun worked and worked well. I wound up making my own “news page”, scraping comics and stock quotes and tech stories…
A little ASP, a little HTML, and some Regex, and voila…
Wound up actually using it for WORK, which took a lot of the fun out of it. But still, it was scraping.
Don’t judge me!
You can actually prevent people from scraping your website. All you have to do is add some PHP code or jQuery code to generate a random number and add random numbers to your classes and IDs. Amazon does that. This number can change with every refresh, and you can push this to your CSS to match your PHP file.
Nah, a simple re call with a splicing offset takes care of all that fuss nicely.
Then you can use other methods like finding a particular pattern of HTML elements or things along those lines.
If that prevented you from scraping a site, you’re not really good at scraping, I’d say. It’s something like security through obscurity… there are always ways to find the data you’re looking for (CSS classes, XPath, regex, or using HTML parsers like Beautiful Soup)…
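For instance, a quick sketch with BeautifulSoup that ignores randomized class suffixes by matching on a stable prefix (the markup and prefix here are invented):

import re
from bs4 import BeautifulSoup

# Made-up markup with a random suffix appended to each class name.
html = """
<div class="price-x8f2k">$19.99</div>
<div class="price-q0z7m">$4.50</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match any class starting with the stable "price-" prefix and
# ignore the random part that changes on every refresh.
for tag in soup.find_all("div", class_=re.compile(r"^price-")):
    print(tag.get_text(strip=True))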
Amazon doesn’t do this, but some airlines do. If you want to prevent scraping badly enough, you can hire someone to change your layout slightly every day or so. The Amazon solution is to actually make the data available through an API.
Good, he mentions using BeautifulSoup and not regex. As you may already know, using regex to parse HTML is a surefire way to summon the Great Old Ones from their slumber eternal.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Sometimes, regular expressions work great to get some information from an HTML page.
As long as I don’t want to do something where the regex approach doesn’t work, I don’t see any reason not to do it.
And so far it has always worked for me.
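For example, a one-off grab like this (with a made-up snippet) is perfectly serviceable, as long as the pattern stays simple and predictable:

import re

# Made-up snippet -- fine for pulling one simple, predictable value,
# just don't try to parse arbitrary nested HTML this way.
html = '<span id="temp">23.5&deg;C</span>'

match = re.search(r'<span id="temp">([\d.]+)', html)
if match:
    print("Temperature:", match.group(1))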
What is the obsession with Python? There is no one right tool for this job, but the best tool IMO is Perl with LWP and the likes of HTML::Parser. Beautiful Soup is really just an abstraction, if not an excuse for not properly learning and using regular expressions. Properly applied regex will cleverly parse even malformed markup mixed among the good. O’Reilly has a nice “Perl & LWP” book for free download that’ll get you started in next to no time.
try goutte, it’s awesome for web scraping
Except trying to learn Goutte is next to impossible for the beginner. The only tutorial that I can find is their readme.md. I have been searching everywhere for how to properly use Goutte. What options are available? How do you tell it to fill in a radio button on a form? How do you know when you are at the next page of a submitted form? Nothing anywhere. Goutte may be great for people who already know it or are involved in it, but not so good for the person trying to figure out how to use it.
Yahoo’s YQL is also an awesome tool to dynamically scrape pages and export the result as a JSON feed :)
As an addendum to this, you can paste the JSON output from YQL into https://json-csv.com and it will spit out a CSV file. You can then open the data up in Excel.
This doesn’t actually show you how to do it; it would be good if it did.
On the fairly unusual occasions where I’ve had to screen-scrape something, I’ve found SimpleHTMLDom (PHP) to be handy; it parses (potentially invalid/badly formed) HTML in a best-effort attempt and presents a DOM-like object structure.
http://sourceforge.net/projects/simplehtmldom/
Easier than messing about digging with various regexps.
The company I work for scrapes websites for commercial purposes. Let’s see: the law is publicly viewable, but the official website sucks. Poor search results, and not even all of them are found. Responses are slow, and exporting to PDF/XML only kind of works. Lawyers are allowed to use a laptop, but no interwebs. So do they take all the books or download the complete website? No, they come to us and buy a CD. We scrape all the content, wrap it in our XML database, do some magic (the kind of stuff most of you understand, but lawyers and judges mostly don’t) and profit :D. A CD with XML data (viewable in a browser, from the CD), searchable, very quick, plus access to our online version as well :D. And we crawl every night, so the online version is always up to date (yes, if the source website is down, we are not ;) ). This has been done for more than 10 years already and it’s very awesome to do. Just wanted to share this: there is money to be made from free info, and it’s legal.
Python for web scraping…….. LOL
It would be hard to choose a slower language.
If your web scraping is CPU-limited, you really have an odd hardware configuration.
Hell, I’ve had my boss walk in and ask for all of the data from a competitor’s site before, and he wanted it yesterday, and in Excel, and with the ability to update it himself.
It taught me that you can even scrape specific sites with VBScript through an Excel form, if that is your kick and you have 10 minutes to get it done.
I’m probably going to be the oddest one here. I semi-regularly scrape sites for info, and the language I’ve started to use the most is not Ruby, Python, PHP, C/C++, or anything close. I’ve been using REXX, the old mainframe language (ooREXX, to be specific). Its parser does exactly what I need and does it fast.
Probably the cheapest way I’ve ever done scraping is through a combination of lynx scripts and awk…BY FAR not the best way to do it, but it was stupid simple and did the trick at the time.
There is a nice eBook on web scraping for PHP programmers here:
http://codediesel.com/products/web-scraping/
Learn more here: https://www.udemy.com/building-a-search-engine/
Just found that there is a new website for PHP web scraping and spidering:
http://php8legs.com/
You could always use an online tool like GrabzIt (http://grabz.it/scraper); it comes with a powerful toolset to help you get and format your data!
Web scraping industrialized: the following videos demonstrate how easy it is to pull financial data from the Internet with no coding.
http://www.youtube.com/watch?v=BvMeL6c14ak
http://www.youtube.com/watch?v=L0Etul5kHuc
https://appliedalgo.com/appliedalgoweb/doc/AppliedAlgo%20Application%20Scenario%20Internet%20Download.pdf
I found something like this. You can see the blog post on data scraping using cURL in PHP:
http://www.codefire.org/blogs/item/data-scraping-using-curl-in-php.html
How do you scrape AJAX pages, for example Facebook and Twitter?