Hack The Web Without A Browser

It is a classic problem. You want data for use in your program but it is on a webpage. Some websites have an API, of course, but usually, you are on your own. You can load the whole page via HTTP and parse it. Or you can use some tools to “scrape” the site. One interesting way to do this is woob — web outside of browsers.
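
For comparison, here is a minimal sketch of the do-it-yourself route (load the page over HTTP, then parse it) using only the Python standard library. It is not how woob works internally; the `LinkCollector` class, the `extract_links` and `fetch` helpers, and the sample markup are all made up for illustration, and `fetch` assumes the page is UTF-8.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url, timeout=10):
    """Download a page and decode it (hypothetical helper, UTF-8 assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Parse a static sample so the sketch runs without network access.
SAMPLE = (
    '<html><body><a href="/a">A</a><p>text</p>'
    '<a href="https://example.com/b">B</a></body></html>'
)
print(extract_links(SAMPLE))
```

In real use you would pass the result of `fetch(url)` to `extract_links`. Every site needs its own parsing logic like this, which is the repetitive work a framework of per-site backends is meant to absorb.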

The system uses a series of backends, each tailored to a particular site. There’s a collection of official backends, and you can also create your own. Once you have a backend, you can configure it and use it from Python. Here’s an example of finding a bank account balance:

>>> from pprint import pprint
>>> from woob.core import Woob
>>> from woob.capabilities.bank import CapBank
>>> w = Woob()
>>> w.load_backends(CapBank)
{'societegenerale': <Backend 'societegenerale'>, 'creditmutuel': <Backend 'creditmutuel'>}
>>> pprint(list(w.iter_accounts()))
[<Account id='7418529638527412' label=u'Compte de ch\xe8ques'>,
<Account id='9876543216549871' label=u'Livret A'>,
<Account id='123456789123456789123EUR' label=u'C/C Eurocompte Confort M Roger Philibert'>]
>>> acc = next(iter(w.iter_accounts()))
>>> acc.balance
Decimal('87.32')

The list of available backends is impressive, but eventually, you’ll want to create your own modules. Thankfully, there’s plenty of documentation about how to do that. The framework allows you to post data to the website and easily read the results. Each backend also has a test which can detect if a change in the website breaks the code, which is a common problem with such schemes.
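
To illustrate why such a test matters, here is a rough sketch of the idea in plain Python. It does not use woob's actual test machinery; the markup, the `parse_balance` and `check_scraper` functions, and the `PageLayoutChanged` exception are hypothetical. The point is that the parser fails loudly when the expected element disappears, so a periodic check can flag a site redesign instead of silently returning bad data.

```python
import re
from decimal import Decimal


class PageLayoutChanged(Exception):
    """Raised when the expected markup is no longer present."""


# Hypothetical markup the scraper was written against:
#   <span class="balance">87.32</span>
BALANCE_RE = re.compile(
    r'<span class="balance">\s*(-?\d+(?:\.\d+)?)\s*</span>'
)


def parse_balance(html):
    """Extract the balance, or fail loudly if the layout changed."""
    match = BALANCE_RE.search(html)
    if match is None:
        raise PageLayoutChanged("balance element not found")
    return Decimal(match.group(1))


def check_scraper(html):
    """Return True if the page still parses, False if the layout drifted."""
    try:
        parse_balance(html)
    except PageLayoutChanged:
        return False
    return True


OLD_PAGE = '<div><span class="balance">87.32</span></div>'
NEW_PAGE = '<div><strong id="bal">87.32 EUR</strong></div>'  # after a redesign

print(parse_balance(OLD_PAGE))
print(check_scraper(OLD_PAGE), check_scraper(NEW_PAGE))
```

A regular expression keeps the sketch short; a real scraper would more likely use an HTML parser, but the fail-loudly principle is the same.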

We didn’t see a Hackaday backend. Too bad. There are, however, many application examples, both console-based and using Qt. For example, you can search for movies, manage recipes, or use dating sites.

Of course, there are many approaches possible to this problem. Maybe you need to find out when the next train is leaving.

20 thoughts on “Hack The Web Without A Browser”

    1. From HAD TOS:

      (e) introduce software or automated agents or scripts to the SupplyFrame Offerings so as to produce multiple accounts, generate automated searches, requests and queries, or to strip or mine data from the SupplyFrame Offerings;

      So, STOP DOING THAT!

      1. I am the author. This is MY DATA. And the goal is not to bring the server to its knees; I included random delays to prevent a DoS.

        BTW, what does that mean? “introduce software or automated agents or scripts to the SupplyFrame Offerings” Do lawyers even know computers? Is sending an HTTP request “introducing software”?

        C’mon.

        1. What I would like to see/develop is a tool to read “Hackaday Offline”. If you had an API for the blog like the one you listed above for IO, that would be fantastic. The end goal is to be able to “print” a week’s worth of previous articles at a time to a PDF that I can read from anywhere while offline.

  1. “Here’s a script I wrote… put your bank account info into it and it will do nice things for you. And let me update the script as I see fit…”

    Nice project for non-sensitive data but a big fat watering hole target for financial info.

  2. I was really excited about this, but then I looked into how it works, and each site is a Python script. There are ZERO security measures taken to ensure a module can’t just steal your info.
