Someone once told me that when it comes to Computer Science white papers, don’t bother with reading the whole thing. According to him as long as he understands the intuition behind it, he’ll know where to look if he ever needed to know more. This post is my attempt at giving the absolute bare bones intuition of how Bloom Filters work.
At the highest level, a Bloom Filter is used to let you know if an object exists in a set. If you are asking the question “Have I ever seen this thing before?”, then a Bloom Filter is a powerful tool to estimate the answer to that question.
The little yellow flag in the previous statement is of course the word “estimate”. What do I mean by that? Bloom Filters can guarantee no false negatives but provides no guarantees against false positives. In other words, I can tell you for sure if we haven’t seen that object before, but I can’t tell you for sure if I have. Why this characteristic? At the core is a CS 101 concept: hash collisions.
Anyone who takes a CS data structures 101 course will learn that hashes are a central component of computer science. Hashes are essentially the ID of an object. With hash maps you usually learn that if you want to create a dictionary or map of key-values, then you will start with using a hash map. With a hash map your keys are “hashed” into a memory address in a block of memory, effectively creating a 2-column table with each key-value inserting into individual rows in the “table”. The catch here is that the “hashing” can have what is called collisions, or in other words two different objects can end up hashing to the same memory address. This is because unless your key is already a memory address, transforming an arbitrary object into a memory address can never be perfectly distributed across your address space. In practice, this looks like trying to come up with some transformation scheme where your objects can be transformed or “hashed” into an integer (or hash code).
Now the core concept of bloom filters is hashing. Computers are binary systems which means numbers or integers are represented in memory as binary bits (aka 0’s and 1’s). Now imagine that every object you add to your Bloom Filter set is hashed, and the resulting number (represented as bits) are OR’ed together. Essentially what this means is that the bits in this OR result is keeping track of what hash codes you have seen before. This means that when you want to know if you’ve ever seen an object before, you just have to hash it, and ask the OR bit set result it has ever seen this hash code before. With this book keeping bit set, you can see why it can be sure if you haven’t seen an object before since if that object’s hash code has 1’s in its binary representation that aren’t set in the bit set, then obviously you’ve never seen that hash code before. But why can’t we be sure that we HAVE seen an object before?
This is where hash collisions come in. As I mentioned before, since it is possible that two different objects can end up with the same hash code, you can’t rely on hash code for exclusivity which means for the Bloom Filter even if you found that the hash code bits have all been set before, you can’t know for certain if that was the exact object that you saw last time. Therefore you can’t really guarantee that you have seen this object before.
That’s basically it, there are of course many other layers of complexity that can be added to tune accuracy, space/time constraints, etc. But at the core you can see that using the power of constant-time hashing, Bloom Filters give you a quick and simple way to estimate what objects you have seen before. Due to the inherent limitation of hashing, specifically hash collisions, Bloom Filters can only tell you for sure if it hasn’t ever seen an object, but it can’t tell you for sure if it has.