Hack The Web Without A Browser

January 17, 2022 by Al Williams 20 Comments

It is a classic problem. You want data for use in your program but it is on a webpage. Some websites have an API, of course, but usually, you are on your own. You can load the whole page via HTTP and parse it. Or you can use some tools to “scrape” the site. One interesting way to do this is woob — web outside of browsers.

The system uses a series of backends tailored at particular sites. There’s a collection of official backends, and you can also create your own. Once you have a backend, you can configure it and use it from Python. Here’s an example of finding a bank account balance:

>>> from woob.core import Woob
>>> from woob.capabilities.bank import CapBank
>>> w = Woob()
>>> w.load_backends(CapBank)
{'societegenerale': <Backend 'societegenerale'>, 'creditmutuel': <Backend 'creditmutuel'>}
>>> pprint(list(w.iter_accounts()))
[<Account id='7418529638527412' label=u'Compte de ch\xe8ques'>,
<Account id='9876543216549871' label=u'Livret A'>,
<Account id='123456789123456789123EUR' label=u'C/C Eurocompte Confort M Roger Philibert'>]
>>> acc = next(iter(w.iter_accounts()))
>>> acc.balance
Decimal('87.32')

The list of available backends is impressive, but eventually, you’ll want to create your own modules. Thankfully, there’s plenty of documentation about how to do that. The framework allows you to post data to the website and easily read the results. Each backend also has a test which can detect if a change in the website breaks the code, which is a common problem with such schemes.

We didn’t see a Hackaday backend. Too bad. There are, however, many application examples, both console-based and using QT. For example, you can search for movies, manage recipes, or dating sites.

Of course, there are many approaches possible to this problem. Maybe you need to find out when the next train is leaving.

This Week In Security: The Facebook Leak, The YouTube Leak, And File Type Confusion

April 9, 2021 by Jonathan Bennett 21 Comments

Facebook had a problem, way back in the simpler times that was 2019. Something like 533 million accounts had the cell phone number associated with the account leaked. It’s making security news this week, because that database has now been released for free in its entirety. The dataset consists of Facebook ID, cell number, name, location, birthday, bio, and email address. Facebook has pointed out that the data was not a hack or breach, but was simply scraped prior to a vulnerability being fixed in 2019.

The vulnerability was in Facebook’s contact import service, also known as the “Find Friends” feature. The short explanation is that anyone could punch a random phone number in, and get a bit of information about the FB account that claimed that number. The problem was that some interfaces to that service didn’t have appropriate rate limiting features. Combine that with Facebook’s constant urging that everyone link a cell number to their account, and the default privacy setting that lets anyone locate you by your cell number, and the data scraping was all but inevitable. The actual technique used may have been to spoof that requests were coming from the official Facebook app.

[Troy Hunt]’s Have i been pwned service has integrated this breach, and now allows searching by phone number, so go check to see if you’re one of the exposed. If you are, keep the leaked data in mind every time an email or phone call comes from someone you don’t know. Continue reading “This Week In Security: The Facebook Leak, The YouTube Leak, And File Type Confusion” →

Sticking With The Script For Cheap Plane Tickets

January 30, 2017 by James Hobson 27 Comments

When [Zeke Gabrielse] needed to book a flight, the Internet hive-mind recommended that he look into traveling with Southwest airlines due to a drop in fares late Thursday nights. Not one to stay up all night refreshing the web page indefinitely, he opted to write a script to take care of the tedium for him.

Settling on Node.js as his web scraper of choice, numerous avenues of getting the flight pricing failed before he finally had to cobble together a script that would fill out and submit the search form for him. With the numbers coming in, [Grabrielse] set up a Twilio account to text him once fares dropped below a certain price point — because, again, why not automate?

Continue reading “Sticking With The Script For Cheap Plane Tickets” →

Crowdsourcing Reference Designs From Github

November 19, 2015 by Eric Evenchick 24 Comments

A ton of open source hardware projects make their way onto Github, and Eagle is one of the most popular tools for these designs. [TomKeddie] came up with the idea of searching Github for Eagle files containing specific parts at Hacker Camp Shenzhen, and a method of scraping useful ones.

The folks over at Dangerous Prototypes used this to build the Github Hardware Search tool. Simply enter a part number, like “ATmega328P”, and you’ll receive a list of the designs using that part. You can then study the design and use it as a reference for your own project. You can also snag library files for the parts.

Of course, there are some limitations to this. The most obvious one is the lack of quality control. There’s no guarantee that the design you find works, or has even been built. Also, it only works for Eagle 6+ files, since prior versions were not XML. You can read more about the design of the tool over on Dangerous Prototypes.

Web Scraping Amazon And Rotten Tomatoes

January 23, 2013 by Mike Szczys 29 Comments

web-scraping-amazon-and-rotten-tomatos

[Rajesh] put web scraping to good use in order to gather the information important to him. He’s published two posts about it. One scrapes Amazon daily to see if the books he wants to read have reached a certain price threshold. The other scrapes Rotten Tomatoes in order to display the audience score next to the critics score for the top renting movies.

Web scraping uses scripts to gather information programmatically from HTML rather than using an API to access data. We recently featured a conceptual tutorial on the topic, and even came across a hack that scraped all of our own posts. [Rajesh’s] technique is pretty much the same.

He’s using Python scripts with the Beautiful Soup module to parse the DOM tree for the information he’s after. In the case of the Amazon script he sets a target price for a specific book he’s after and will get an email automatically when it gets there. With Rotten Tomatoes he sometimes likes to see the audience score when considering a movie, but you can’t get it on the list at the website; you have to click through to each movie. His script keeps a database so that it doesn’t continually scrape the same information. The collected numbers are displayed alongside the critics scores as seen above.