Filters Are In Bloom

If you are a fan of set theory, you might agree there are two sets of people who write computer programs: those who know what a Bloom filter is and those who don’t. How could you efficiently test to see if someone is one set or another? Well, you could use a Bloom filter.  [SamWho] takes us through the whole thing in general terms that you could apply in any situation.

The Bloom filter does perform a trade-off for its speed. It is subject to false positives but not false negatives. That is, if a Bloom filter algorithm tells you that X is not part of a set, it is correct. But if it tells you it is, you may have to investigate more to see if that’s true.

If it can’t tell you that something is definitely in a set, why bother? Usually, when you use a Bloom filter, you want to reduce searching through a huge amount of data. The example in the post talks about having a 20-megabyte database of “bad” URLs. You want to warn users if they enter one, but downloading that database is prohibitive. But a Bloom filter could be as small as 1.8 megabytes. However, there would be a 1 in 1000 chance of a false positive.

Increase the database size to 3.59 megabytes, and you can reduce false positives to one in a million. Presumably, if you got a positive, you could accept the risk it is false, or you could do more work to search further.

Imagine, for example, a web cache device or program. Many web pages are loaded one time and never again. If you cache all of them, you’ll waste a lot of time and push other things out of the cache. But if you test a page URL with a Bloom filter, you can improve things quite a bit. If the URL may exist in the Bloom filter, then you’ve probably seen it before, so you might want to cache it.

If it says you haven’t, you can add it to the filter so if it is ever accessed again, it will cache. Sure, sometimes a page will show a false positive. So what? You’ll just cache the page on the first time, which is what you did before, anyway. If that happens only 0.1% of the time, you still win.

In simple terms, the Bloom filter hashes each item using three different algorithms and sets bits in an array based on the result. To test an item, you compute the same hashes and see if any of the corresponding bits are set to zero. If so, the item can’t be in the set. Of course, there’s no assurance that all three bits being set means the set contains the item. Those three bits might be set for totally different items.

Why does increasing the number of bits help? The post answers that and looks at other optimizations like a different number of hash functions and counting.

The post does a great job of explaining the filter but if you want a more concrete example in C, you might want to read this post next. Or search for code in your favorite language. We’ve talked about Python string handling with Bloom filters before. We’ve even seen a proposal to add them to the transit bus.

Steampipe: All SQL All The Time

Although modern Linux has slightly shifted, the old Unix mantra was: everything’s a file. With Steampipe, a better saying might be: everything’s a SQL table. The official tagline is “select * from cloud” which also works. The open-source program relies on plugins, and there are currently 140 sources ranging from GitHub to Google Sheets and more.

There are command line interfaces for the major platforms. You can also add the system to PostgresSQL or SQLite for even more SQL goodness. Continue reading “Steampipe: All SQL All The Time”

The Latest Windows 11 Release Might Not Work On Your Oldest Machines

Everybody knows you can’t install Windows XP on a 386, or Windows 95 on an original IBM PC. But for Windows 11, the goalposts seem to be changing with newer releases of the existing OS. As covered by The Register, it appears the latest Windows 11 24H2 update might be incompatible with older machines.

It’s all down to the POPCNT CPU instruction. As shared on Twitter by [TheBobPony], the instruction appears in a number of Windows 11 system files, including kernel and USB XHCI drivers. Thus, it appears that any CPU not able to run this instruction will not be able to boot Windows 11. POPCNT was first included in AMD’s Barcelona architecture in 2007, and Intel’s Core processors in 2008. It’s an instruction for counting set bits in a word.

Ultimately, the effect is that computers with older CPUs will no longer be able to run the latest version of Windows 11. It could be as simple as Microsoft engineers enabling more modern CPU instructions at compilation time. However, given affected hardware is more than 15 years old, it’s perhaps likely that Microsoft is perfectly willing to cut these machines off from using the latest versions of its main operating system. We’ve talked about this phenomenon before, too.

In any case, keep a close eye on Windows update if you’re running super-old hardware. Let us know if you’ll be affected in the comments.

Thanks to [Stephen Walters] for the tip!

Make Your Bookshelf Clickable

We’ll confess that we have a fondness for real books and plenty of them. So does [James], and he decided he needed a way to take a picture of his bookshelves and make each book clickable to find more information. This is one of those things that sounds fairly simple until you decide to do it. You can try an example of the results and then go back and read about the journey it took to get there.

There are several subtasks involved. First, you want to identify each book’s envelope. It wouldn’t do to click on the Joy of Cooking and get information about Remembrance of Things Past.

The next challenge is reading the title of the book. This can be tricky. Fonts differ. The book could be upside down. Some titles go cross the spine, but most go vertically. The remainder of the task is fairly easy. If you know the region and the title, you can easily find a link (for Google Books, in this case) and build an SVG overlay that maps the areas for each book to the right link.

Continue reading “Make Your Bookshelf Clickable”

Breaking Through The 1 MB Barrier In DOS With Unreal Mode And More

The memory map of the original 8086 computer with its base and extended memory made the original PC rather straightforward, but also posed countless issues for DOS-based applications as they tried to make use of memory beyond the legacy 1 MB address space. The initial ways to deal with this like EMS, XMS and UMB were rather cumbersome and often impractical, but with the arrival of the 80286 and 80386 processors more options opened up, including protected mode. More interestingly, this led to unreal mode, DOS extenders and the somewhat more obscure LOADALL instruction, as covered by [Julio Merino] in a new article.

This article builds on the first one which covered the older methods and covered the basics of protected mode. Where protected mode is convenient compared to real mode is that with the former the memory accesses go via the MMU and thus allows for access to 16 MB on the 80286 and 4 GB on the 80386. The segment descriptors and resolving of these that make this possible can be (ab)used on the 80286 and up by realizing that these segment descriptors are also used in real mode. Unreal mode is thus about switching to protected mode, loading arbitrary segment descriptors and switching back to real mode. As this is outside the original processor spec, it is commonly called ‘unreal mode’.

Continue reading “Breaking Through The 1 MB Barrier In DOS With Unreal Mode And More”

Human-Interfacing Devices: The Descriptor Heist

Today, we’ll build our own input devices. And they will be easy to create and write firmware for, they will work perfectly, and they will be cross-platform. We can do that with help of the Human Interface Device (HID) standard, and by way of introduction, so that you never get confused by what a “descriptor” means, and we’ll build our own HID device — a Human Interface Device device. The way we build them won’t require reading specifications – instead, I’ll teach your how to steal HID descriptors from existing devices, tweak them for our purposes, and use them in our devices to harness the power of HID.

For decades now, it’s been possible to build a HID mouse or keyboard by using a library or two, and it’s been a godsend for hackers all around the world. However, these libraries are typically confined to a certain template and inflexible, and we hackers often go outside of what’s expected. HID allows for much more than a simple keyboard or a mouse. That’s why today we’re building a touchscreen – something not yet covered online or by libraries.

HID lets you build devices that are friendly. They don’t need drivers, they are plug and play, and they do what you expect them to do. At its core, the HID standard is as simple as is ubiquitous. You can tunnel HID over USB, Bluetooth, I2C, and modern-day operating systems support all three of these. Today, let’s go through the basics of HID, and then build a USB touchscreen out of a SPI-connected resistive touchscreen, with help of the usual RP2040+MicroPython combo. I will also give you a toolkit for how to debug a Human Interface Device device as thoroughly as possible – specifically on Linux, showing all the HID debug and introspection capabilities that Linux gives you. But it’ll work on Windows too through the beauty of standardization.

Continue reading “Human-Interfacing Devices: The Descriptor Heist”

Need A Serial Data Plotter? Better Write Your Own

When you’re working with a development team, especially in a supporting capacity, you can often find yourself having to invent tools and support systems that are fairly involved, but don’t add to the system’s functionality. Still, without them, it’d be a dead duck. [Aidan Chandra] was clearly in a similar situation, working with a bunch of postgrads at Stanford, on an exoskeleton project, and needed an accurate data plotter to watch measurements in real-time.

This particular problem has been solved many times over, but [Aidan] laments that many solutions available seem to be too complex, hard to extend, or just have broken dependencies. This happens a lot, and it simply leads to yet another project to get going, before you can do the real work it supports. Based on Python and PyQT5, serial-plotter is a new beginning, with an emphasis on correct data acquisition and real-time data visualization with a little processing thrown in. Think, acquire data, show the raw values as well as the mean value, and RMS noise all on the same windows side-by-side, all of which is easily tweakable with a bit of programming using Numpy and Matplotlib.

One particularly important point to highlight is that of the handling of time-stamping. [Aidan] needed to ensure samples were logged together with a local MCU timestamp so that when displayed and possibly later post-processed, it was possible to accurately determine when a particular value or event occurred. With the amount of buffering, data loss and multiple-thread shenanigans, it is easy to forget that the data might get to the application in a non-deterministic way, and just relying on local CPU time is not so useful.

If you need to visualize data transported over the serial port, we have seen many projects to help. Like the highly configurable Serial Studio, for one. If your needs are a bit more complex, especially with multiple data transport methods, then a Supercon 2022 talk by [Alex Whittemore] might be a jolly good place to start.