Filters Are In Bloom

If you are a fan of set theory, you might agree there are two sets of people who write computer programs: those who know what a Bloom filter is and those who don’t. How could you efficiently test to see if someone is one set or another? Well, you could use a Bloom filter.  [SamWho] takes us through the whole thing in general terms that you could apply in any situation.

The Bloom filter does perform a trade-off for its speed. It is subject to false positives but not false negatives. That is, if a Bloom filter algorithm tells you that X is not part of a set, it is correct. But if it tells you it is, you may have to investigate more to see if that’s true.

If it can’t tell you that something is definitely in a set, why bother? Usually, when you use a Bloom filter, you want to reduce searching through a huge amount of data. The example in the post talks about having a 20-megabyte database of “bad” URLs. You want to warn users if they enter one, but downloading that database is prohibitive. But a Bloom filter could be as small as 1.8 megabytes. However, there would be a 1 in 1000 chance of a false positive.

Increase the database size to 3.59 megabytes, and you can reduce false positives to one in a million. Presumably, if you got a positive, you could accept the risk it is false, or you could do more work to search further.

Imagine, for example, a web cache device or program. Many web pages are loaded one time and never again. If you cache all of them, you’ll waste a lot of time and push other things out of the cache. But if you test a page URL with a Bloom filter, you can improve things quite a bit. If the URL may exist in the Bloom filter, then you’ve probably seen it before, so you might want to cache it.

If it says you haven’t, you can add it to the filter so if it is ever accessed again, it will cache. Sure, sometimes a page will show a false positive. So what? You’ll just cache the page on the first time, which is what you did before, anyway. If that happens only 0.1% of the time, you still win.

In simple terms, the Bloom filter hashes each item using three different algorithms and sets bits in an array based on the result. To test an item, you compute the same hashes and see if any of the corresponding bits are set to zero. If so, the item can’t be in the set. Of course, there’s no assurance that all three bits being set means the set contains the item. Those three bits might be set for totally different items.

Why does increasing the number of bits help? The post answers that and looks at other optimizations like a different number of hash functions and counting.

The post does a great job of explaining the filter but if you want a more concrete example in C, you might want to read this post next. Or search for code in your favorite language. We’ve talked about Python string handling with Bloom filters before. We’ve even seen a proposal to add them to the transit bus.

The Python documentation for str.strip().

Faster String Processing With Bloom Filters

At first, string processing might seem very hard to optimize. If you’re looking for a newline in some text, you have to check every character in the string against every type of newline, right? Apparently not, as [Abhinav Upadhyay] tells us how CPython does some tricks in string processing.

The trick in question is based on bloom filters, used here to quickly tell whether a character possibly matches any in a predefined set. A bloom filter works by condensing a set of more complex data to a couple of bits in an array. When an element is added, a bit is set, the index of which is determined by a hash function. To test whether an element might be in the filter, the same is done but by testing the bit instead of setting it. This effectively allows a fast check of whether an element might be in the filter.

CPython doesn’t stop optimizing there: instead of a complicated hash function, it simply uses the lowest 6 bits. It also has a relatively small bit array at only 64 bits which allows it to avoid memory all together, which in turn makes the comparisons much faster. [Abhinav] goes far into more detail in his article, definitely worth a read for any computer scientists among us.

Nowadays there is ever increasing amounts of talk about AI (specifically large language models), so why not apply an LLM to Python to fix the bugs for you?

Bus Stop Bloom Filter

Imagine you’re sitting on a nice bench, the sun shines warmly, and a bus pulls up. You’re headed to Stendal from Osnabrück, how can you tell if you should get on that bus? [Julian Vanecek] is trying to turn that from an O(n) problem to an O(1) one with a Bloom filter right at the bus stop.

In [Julian’s] sample code, each stop is a 3-bit number that can be encoded into a 192-bit array. Your ticket is just that 3-bit number encoded, so you can look at the graphic on the side of the incoming bus, match it against your ticket, and hop on. Gone are the days of waiting for the little LED screen to cycle through all the stops, waiting for yours to come up. Your ticket should have just a few boxes filled in so it is relatively quick to search against the bus’s graphic.

Of course, there is a potential for a false positive rate. [Julian] points out that this can be tuned to prevent errors and has achieved a < 0.5% false positive rate using the Deutsche Bahn bus system. The code is written in Python and available on GitHub. Perhaps buses could have a large flip-dop display on the side, to adjust to show which stops they’re headed to next. Additionally, it doesn’t encode which stops are next, just which stops the bus will eventually go to.

In the video after the break, [Julian] explains how the system works. Whether it would be ultimately adopted is somewhat beside the point. We love the seeing people re-imagining ideas and trying to apply new techniques to improve the things around them.

Continue reading “Bus Stop Bloom Filter”