Hack The Web Without A Browser

It is a classic problem. You want data for use in your program but it is on a webpage. Some websites have an API, of course, but usually, you are on your own. You can load the whole page via HTTP and parse it. Or you can use some tools to “scrape” the site. One interesting way to do this is woob — web outside of browsers.

The system uses a series of backends tailored to particular sites. There’s a collection of official backends, and you can also create your own. Once you have a backend, you can configure it and use it from Python. Here’s an example of finding a bank account balance:

>>> from pprint import pprint
>>> from woob.core import Woob
>>> from woob.capabilities.bank import CapBank
>>> w = Woob()
>>> w.load_backends(CapBank)
{'societegenerale': <Backend 'societegenerale'>, 'creditmutuel': <Backend 'creditmutuel'>}
>>> pprint(list(w.iter_accounts()))
[<Account id='7418529638527412' label=u'Compte de ch\xe8ques'>,
 <Account id='9876543216549871' label=u'Livret A'>,
 <Account id='123456789123456789123EUR' label=u'C/C Eurocompte Confort M Roger Philibert'>]
>>> acc = next(iter(w.iter_accounts()))
>>> acc.balance
Decimal('87.32')

The list of available backends is impressive, but eventually, you’ll want to create your own modules. Thankfully, there’s plenty of documentation about how to do that. The framework allows you to post data to the website and easily read the results. Each backend also ships with tests that can detect when a change to the website breaks the code, which is a common problem with such schemes.
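To get a feel for what a module’s innards look like, here’s a minimal sketch built on woob’s browser framework (PagesBrowser, URL, and HTMLPage). Consider it illustrative only: the site, URL pattern, and XPath expression are made-up placeholders, not a real backend.

from woob.browser import PagesBrowser, URL
from woob.browser.pages import HTMLPage

# Hypothetical page object; the XPath is a placeholder, not a real site's markup
class ProjectListPage(HTMLPage):
    def iter_titles(self):
        for title in self.doc.xpath('//h2[@class="entry-title"]/a/text()'):
            yield title.strip()

# Hypothetical browser mapping one URL pattern to the page class that parses it
class ExampleBrowser(PagesBrowser):
    BASEURL = 'https://example.com'
    project_list = URL(r'/projects/', ProjectListPage)

    def iter_projects(self):
        self.project_list.go()
        return self.page.iter_titles()

Real backends also declare a Module class that ties a browser like this to one of the capabilities (CapBank and friends), which is what lets woob’s generic applications drive it.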

We didn’t see a Hackaday backend. Too bad. There are, however, many application examples, both console-based and using Qt. For example, you can search for movies, manage recipes, or browse dating sites.

Of course, there are many possible approaches to this problem. Maybe you need to find out when the next train is leaving.

Extracting Data From Smart Scale Gives Rube Goldberg A Run For His Money

[Kevin Norman] got himself a smart body scale with the intention of logging data for his own analysis, but discovered that extracting data from the device was anything but easy. It turns out that the only way to access data from his scale is by viewing it in a mobile app. Screen-scraping is a time-honored method of pulling data from uncooperative systems, so [Kevin] committed to regularly taking a full-height screenshot from the app and using optical character recognition (OCR) to get the numbers, but making that work was a surprisingly long process full of dead ends.

First of all, while OCR can be reliable, it needs the right conditions. One thing that ended up being a big problem was the way the app appends units (kg, %) after the numbers. Not only are they tucked in very close, but they’re about half the height of the numbers themselves. It turns out that mixing character heights, in addition to snugging the glyphs up against one another, is tailor-made to give OCR reliability problems.

The solution for this particular issue came from an unexpected angle. [Kevin] was using an open-source OCR program called Tesseract, and joined the #tesseract IRC channel to ask for advice after exhausting his own options. The bemused members of the channel informed [Kevin] that they had nothing to do with OCR; #tesseract was actually the community for an open-source first-person shooter of the same name. But as luck would have it, one of the members actually had OCR experience and suggested the winning approach: pre-process the image with OpenCV, using cv2.findContours() to detect and create a bounding box around each element. If an element is taller than a decimal point but shorter than everything else, throw it out. With that done, there were still a few more tweaks required, but the finish line was finally in sight.
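For the curious, that pre-processing pass might look something like the sketch below. We haven’t seen [Kevin]’s actual code; the height thresholds and filenames are placeholder assumptions, and it uses OpenCV 4’s two-value findContours() return.

import cv2
import numpy as np

img = cv2.imread('screenshot.png', cv2.IMREAD_GRAYSCALE)
# Binarize to white glyphs on black so contour detection sees each element
_, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

DOT_HEIGHT = 10    # placeholder: anything this short is likely a decimal point
DIGIT_HEIGHT = 40  # placeholder: the height of the full-size digits

mask = np.zeros_like(thresh)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    # Skip anything taller than a decimal point but shorter than the digits;
    # those are the half-height unit glyphs (kg, %) confusing the OCR
    if DOT_HEIGHT < h < DIGIT_HEIGHT:
        continue
    cv2.drawContours(mask, [c], -1, 255, thickness=cv2.FILLED)

# Flip back to black-on-white before handing the image to Tesseract
cv2.imwrite('cleaned.png', cv2.bitwise_not(mask))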

Now [Kevin] can use the scale in the morning, take a screenshot, and in less than half a minute the results are imported into a database and visualizations generated. The resulting workflow might look like something Rube Goldberg would approve of, but it works!

Scraping Blogs For Fun And Profit

Sometimes when you’re working on a problem, a solution is thrown right in your face. We found ourselves in this exact situation a few days ago while putting together Hackaday’s new retro edition; we needed a way to select a random Hackaday article, and [Alexander van Teijlingen] of codepanel.net handed us the solution.

To grab every Hackaday URL ever, [Alex] wrote a small Python script using the Beautiful Soup screen-scraping library. The program starts on Hackaday’s main page and grabs every link to a Hackaday post before moving on to the next page. It’s not a terribly complex build, but we’re gobsmacked that a solution to a problem we were working on would magically show up in our inbox.
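We haven’t seen [Alex]’s exact script, but the crawl he describes boils down to something like this sketch; the CSS selectors are our guesses at WordPress-style markup, not his actual code.

import time
import requests
from bs4 import BeautifulSoup

def scrape_post_urls(start_url='https://hackaday.com/'):
    urls = []
    next_page = start_url
    while next_page:
        soup = BeautifulSoup(requests.get(next_page).text, 'html.parser')
        # Collect the permalink from each article headline on this index page
        for link in soup.select('article h2 a[href]'):
            urls.append(link['href'])
        # Follow the "older posts" link until it runs out
        older = soup.select_one('a.next')
        next_page = older['href'] if older else None
        time.sleep(1)  # be polite between requests
    return urls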

Thanks to [Alex], writing a cron job to automatically update our new retro edition just got a whole lot easier. If you’d like to check out a list of every Hackaday post ever (or at least through two days ago), you can grab the 10,693-line text file here.