When Bloom filters don't bloom

I've known about Bloom filters (named after Burton Bloom) since university, but I haven't had an opportunity to use them in anger. Last month this changed - I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters.

While doing research about IP spoofing, I needed to examine whether the source IP addresses extracted from packets reaching our servers were legitimate, depending on the geographical location of our data centers. For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian datacenter. This problem might sound simple, but in the ever-evolving landscape of the internet this is far from easy. Suffice it to say I ended up with many large text files with data like this:

This reads as: the IP 192.0.2.1 was recorded reaching Cloudflare data center number 107 with a legitimate request. This data came from many sources, including our active and passive probes, logs of certain domains we own (like cloudflare.com), public sources (like BGP table), etc. The same line would usually be repeated across multiple files.

I ended up with a gigantic collection of data of this kind. At some point I counted 1 billion lines across all the harvested sources. I usually write bash scripts to pre-process the inputs, but at this scale this approach wasn't working. For example, removing duplicates from this tiny file of a meager 600MiB and 40M lines, took... about an eternity:

Enough to say that deduplicating lines using the usual bash commands like 'sort' in various configurations (see '--parallel', '--buffer-size' and '--unique') was not optimal for such a large data set.

Bloom filters to the rescue

Image by David Eppstein Public Domain

Then I had a brainwave - it's not necessary to sort the lines! I just need to remove duplicated lines - using some kind of "set" data structure should be much faster. Furthermore, I roughly know the cardinality of the input file (number of unique lines), and I can live with some data points being lost - using a probabilistic data structure is fine!

Bloom-filters are a perfect fit!

While you should go and read Wikipedia on Bloom Filters, here is how I look at this data structure.

How would you implement a "set"? Given a perfect hash function, and infinite memory, we could just create an infinite bit array and set a bit number 'hash(item)' for each item we encounter. This would give us a perfect "set" data structure. Right? Trivial. Sadly, hash functions have collisions and infinite memory doesn't exist, so we have to compromise in our reality. But we can calculate and manage the probability of collisions. For example, imagine we have a good hash function, and 128GiB of memory. We can calculate the probability of the second item added to the bit array colliding would be 1 in 1099511627776. The probability of collision when adding more items worsens as we fill up the bit array.

Furthermore, we could use more than one hash function, and end up with a denser bit array. This is exactly what Bloom filters optimize for. A Bloom filter is a bunch of math on top of the four variables:

'n' - The number of input elements (cardinality)
'm' - Memory used by the bit-array
'k' - Number of hash functions counted for each input
'p' - Probability of a false positive match

Given the 'n' input cardinality and the 'p' desired probability of false positive, the Bloom filter math returns the 'm' memory required and 'k' number of hash functions needed.

Check out this excellent visualization by Thomas Hurst showing how parameters influence each other:

https://hur.st/bloomfilter/

mmuniq-bloom

Guided by this intuition, I set out on a journey to add a new tool to my toolbox - 'mmuniq-bloom', a probabilistic tool that, given input on STDIN, returns only unique lines on STDOUT, hopefully much faster than 'sort' + 'uniq' combo!

Here it is:

'mmuniq-bloom.c'

For simplicity and speed I designed 'mmuniq-bloom' with a couple of assumptions. First, unless otherwise instructed, it uses 8 hash functions k=8. This seems to be a close to optimal number for the data sizes I'm working with, and the hash function can quickly output 8 decent hashes. Then we align 'm', number of bits in the bit array, to be a power of two. This is to avoid the pricey % modulo operation, which compiles down to slow assembly 'div'. With power-of-two sizes we can just do bitwise AND. (For a fun read, see how compilers can optimize some divisions by using multiplication by a magic constant.)

We can now run it against the same data file we used before:

Oh, this is so much better! 12 seconds is much more manageable than 2 minutes before. But hold on... The program is using an optimized data structure, relatively limited memory footprint, optimized line-parsing and good output buffering... 12 seconds is still eternity compared to 'wc -l' tool:

What is going on? I understand that counting lines by 'wc' is easier than figuring out unique lines, but is it really worth the 26x difference? Where does all the CPU in 'mmuniq-bloom' go?

It must be my hash function. 'wc' doesn't need to spend all this CPU performing all this strange math for each of the 40M lines on input. I'm using a pretty non-trivial 'siphash24' hash function, so it surely burns the CPU, right? Let's check by running the code computing hash function but not doing any Bloom filter operations:

This is strange. Counting the hash function indeed costs about 2s, but the program took 12s in the previous run. The Bloom filter alone takes 10 seconds? How is that possible? It's such a simple data structure...

A secret weapon - a profiler

It was time to use a proper tool for the task - let's fire up a profiler and see where the CPU goes. First, let's fire an 'strace' to confirm we are not running any unexpected syscalls:

Everything looks good. The 10 calls to 'mmap' each taking 4ms (3971 us) is intriguing, but it's fine. We pre-populate memory up front with 'MAP_POPULATE' to save on page faults later.

What is the next step? Of course Linux's 'perf'!

Then we can see the results:

Right, so we indeed burn 87.2% of cycles in our hot code. Let's see where exactly. Doing 'perf annotate process_line --source' quickly shows something I didn't expect.

You can see 26.90% of CPU burned in the 'mov', but that's not all of it! The compiler correctly inlined the function, and unrolled the loop 8-fold. Summed up that 'mov' or 'uint64_t v = *p' line adds up to a great majority of cycles!

Clearly 'perf' must be mistaken, how can such a simple line cost so much? We can repeat the benchmark with any other profiler and it will show us the same problem. For example, I like using 'google-perftools' with kcachegrind since they emit eye-candy charts:

The rendered result looks like this:

Allow me to summarise what we found so far.

The generic 'wc' tool takes 0.45s CPU time to process 600MiB file. Our optimized 'mmuniq-bloom' tool takes 12 seconds. CPU is burned on one 'mov' instruction, dereferencing memory....

Image by Jose Nicdao CC BY/2.0

Oh! I how could I have forgotten. Random memory access is slow! It's very, very, very slow!

According to the general rule "latency numbers every programmer should know about", one RAM fetch is about 100ns. Let's do the math: 40 million lines, 8 hashes counted for each line. Since our Bloom filter is 128MiB, on our older hardware it doesn't fit into L3 cache! The hashes are uniformly distributed across the large memory range - each hash generates a memory miss. Adding it together that's...

That suggests 32 seconds burned just on memory fetches. The real program is faster, taking only 12s. This is because, although the Bloom filter data does not completely fit into L3 cache, it still gets some benefit from caching. It's easy to see with 'perf stat -d':

Right, so we should have had at least 320M LLC-load-misses, but we had only 280M. This still doesn't explain why the program was running only 12 seconds. But it doesn't really matter. What matters is that the number of cache misses is a real problem and we can only fix it by reducing the number of memory accesses. Let's try tuning Bloom filter to use only one hash function:

Ouch! That really hurt! The Bloom filter required 64 GiB of memory to get our desired false positive probability ratio of 1-error-per-10k-lines. This is terrible!

Also, it doesn't seem like we improved much. It took the OS 22 seconds to prepare memory for us, but we still burned 11 seconds in userspace. I guess this time any benefits from hitting memory less often were offset by lower cache-hit probability due to drastically increased memory size. In previous runs we required only 128MiB for the Bloom filter!

Dumping Bloom filters altogether

This is getting ridiculous. To get the same false positive guarantees we either must use many hashes in Bloom filter (like 8) and therefore many memory operations, or we can have 1 hash function, but enormous memory requirements.

We aren't really constrained by available memory, instead we want to optimize for reduced memory accesses. All we need is a data structure that requires at most 1 memory miss per item, and use less than 64 Gigs of RAM...

While we could think of more sophisticated data structures like Cuckoo filter, maybe we can be simpler. How about a good old simple hash table with linear probing?

Image by Vadims Podāns

Welcome mmuniq-hash

Here you can find a tweaked version of mmuniq-bloom, but using hash table:

'mmuniq-hash.c'

Instead of storing bits as for the Bloom-filter, we are now storing 64-bit hashes from the 'siphash24' function. This gives us much stronger probability guarantees, with probability of false positives much better than one error in 10k lines.

Let's do the math. Adding a new item to a hash table containing, say 40M, entries has '40M/2^64' chances of hitting a hash collision. This is about one in 461 billion - a reasonably low probability. But we are not adding one item to a pre-filled set! Instead we are adding 40M lines to the initially empty set. As per birthday paradox this has much higher chances of hitting a collision at some point. A decent approximation is 'n^2/2m', which in our case is '(40M^2)/(2*(264))'. This is a chance of one in 23000. In other words, assuming we are using good hash function, every one in 23 thousand random sets of 40M items, will have a hash collision. This practical chance of hitting a collision is non-negligible, but it's still better than a Bloom filter and totally acceptable for my use case.

The hash table code runs faster, has better memory access patterns and better false positive probability than the Bloom filter approach.

Don't be scared about the "hash conflicts" line, it just indicates how full the hash table was. We are using linear probing, so when a bucket is already used, we just pick up the next empty bucket. In our case we had to skip over 0.7 buckets on average to find an empty slot in the table. This is fine and, since we iterate over the buckets in linear order, we can expect the memory to be nicely prefetched.

From the previous exercise we know our hash function takes about 2 seconds of this. Therefore, it's fair to say 40M memory hits take around 4 seconds.

Lessons learned

Modern CPUs are really good at sequential memory access when it's possible to predict memory fetch patterns (see Cache prefetching). Random memory access on the other hand is very costly.

Advanced data structures are very interesting, but beware. Modern computers require cache-optimized algorithms. When working with large datasets, not fitting L3, prefer optimizing for reduced number loads, over optimizing the amount of memory used.

I guess it's fair to say that Bloom filters are great, as long as they fit into the L3 cache. The moment this assumption is broken, they are terrible. This is not news, Bloom filters optimize for memory usage, not for memory access. For example, see the Cuckoo Filters paper.

Another thing is the ever-lasting discussion about hash functions. Frankly - in most cases it doesn't matter. The cost of counting even complex hash functions like 'siphash24' is small compared to the cost of random memory access. In our case simplifying the hash function will bring only small benefits. The CPU time is simply spent somewhere else - waiting for memory!

One colleague often says: "You can assume modern CPUs are infinitely fast. They run at infinite speed until they hit the memory wall".

Finally, don't follow my mistakes - everyone should start profiling with 'perf stat -d' and look at the "Instructions per cycle" (IPC) counter. If it's below 1, it generally means the program is stuck on waiting for memory. Values above 2 would be great, it would mean the workload is mostly CPU-bound. Sadly, I'm yet to see high values in the workloads I'm dealing with...

Improved mmuniq

With the help of my colleagues I've prepared a further improved version of the 'mmuniq' hash table based tool. See the code:

'mmuniq.c'

It is able to dynamically resize the hash table, to support inputs of unknown cardinality. Then, by using batching, it can effectively use the 'prefetch' CPU hint, speeding up the program by 35-40%. Beware, sprinkling the code with 'prefetch' rarely works. Instead, I specifically changed the flow of algorithms to take advantage of this instruction. With all the improvements I got the run time down to 2.1 seconds:

The end

Writing this basic tool which tries to be faster than 'sort | uniq' combo revealed some hidden gems of modern computing. With a bit of work we were able to speed it up from more than two minutes to 2 seconds. During this journey we learned about random memory access latency, and the power of cache friendly data structures. Fancy data structures are exciting, but in practice reducing random memory loads often brings better results.

The Cloudflare Blog

When Bloom filters don't bloom

Bloom filters to the rescue

mmuniq-bloom

A secret weapon - a profiler

Dumping Bloom filters altogether

Welcome mmuniq-hash

Lessons learned

Improved mmuniq

The end

Launching Cloudflare’s Gen 13 servers: trading cache for cores for 2x edge compute performance

Inside Gen 13: how we built our most powerful server yet

Code Mode: give agents an entire API in 1,000 tokens

A deep dive into BPF LPM trie performance and optimization