Faster String Processing With Bloom Filters

The Python documentation for str.strip().

At first, string processing might seem very hard to optimize. If you’re looking for a newline in some text, you have to check every character in the string against every type of newline, right? Apparently not, as [Abhinav Upadhyay] tells us how CPython does some tricks in string processing.

The trick in question is based on bloom filters, used here to quickly tell whether a character possibly matches any in a predefined set. A bloom filter works by condensing a set of more complex data to a couple of bits in an array. When an element is added, a bit is set, the index of which is determined by a hash function. To test whether an element might be in the filter, the same is done but by testing the bit instead of setting it. This effectively allows a fast check of whether an element might be in the filter.

CPython doesn’t stop optimizing there: instead of a complicated hash function, it simply uses the lowest 6 bits. It also has a relatively small bit array at only 64 bits which allows it to avoid memory all together, which in turn makes the comparisons much faster. [Abhinav] goes far into more detail in his article, definitely worth a read for any computer scientists among us.

Nowadays there is ever increasing amounts of talk about AI (specifically large language models), so why not apply an LLM to Python to fix the bugs for you?

8 thoughts on “Faster String Processing With Bloom Filters

    1. oh, wow. they are arcane; agreed. however when I finally bit the bullet I found them to be such a godsend for things of any appreciable complexity. But I suppose text parsing is one of my least favorite things in life. (well, that and keyboard debouncing)
      if you find yourself in need of trying them out again, I do recommend a tool like “regex101 (dot) com”, where you can build them interactively and test them against positive and negative cases. It’s a great crutch, and along the way you wind up learning how to do them better.
      but yeah like any tool there is an interval of applicability. they may be overkill for simple string splitting, and underkill for complex context-sensitive alternative cases. But a useful tool in the box.

    1. Alex, I’ll take “Why you shouldn’t use the ‘is’ operator to check for equivalence in Python” for $100.

      The ‘is’ operator checks to see if the underlying objects are one and the same. If they refer to different objects, it doesn’t check whether they are equivalent. For that, the ‘==’ operator should be used.

      1. Similar fun things can happen in JS (though the specific example undersampled gave doesn’t give issue) with ‘==’ and ‘===’, except usually object equivalence (‘===’) is actually what you’re looking for.

        This is why I like strongly typed languages. Really helps avoid issues like this, because if you’re working with UTF-8 you’re working with UTF-8 and not something that may or may not be UTF-8 (and similar for other text encodings).

Leave a Reply

Please be kind and respectful to help make the comments section excellent. (Comment Policy)

This site uses Akismet to reduce spam. Learn how your comment data is processed.