For future reference (mostly for myself), here's a little summary of how to use Bloom filters in real world applications.
Most references are terse and vague on how to pick the hash functions for bloom filters, so here's some detail about that: For small filters, just use a boring and fast hash function like the djb hash function and split up the 32bit result into smaller independent chunks for each of the k hash indexes you'll need. Often those 32 bits already provide enough hash bits to get enough independent bloom filter indexes. And if they don't you basically have three options:
- Use multiple different hash functions, and then MurmurHash seems to be a very good choice. It's simple, readily usable code (even in C, though the reference implementation claims to be C++), and properly licensed. It is a hash function that takes a seed parameter which can be used to create as many independent hash functions as needed.
- Use a cryptographic hash function. Most of them can be implemented really fast on modern CPUs and are already available in some library you use anyway. SHA512 for example outputs plenty bits you can split into k chunks as you need them for your k bloom filter indexes. (Of course, if you are afraid of US export regulations this might be a choice you want to avoid.)
- Use two independent hash functions and combine them linearly.
The size of the bloom filter and the number of hash functions you should be using depending on your application can be calculated using the formulas on the Wikipedia page:
m = -n*ln(p)/(ln(2)^2)
This will tell you the number of bits m to use for your filter, given the number n of elements in your filter and the false positive probability p you want to achieve. All that for the ideal number of hash functions k which you can calculate like this:
k = 0.7*m/n
And that's already everything you need to know to build good bloom filters. If you know the p and n for your use case the above will tell you the m and k, and how to choose the k hash functions.
Bloom filters are a really really useful tool, and given their simplicity something every developer should be aware of.
(And in case you were wondering what this all is about, Kay Sievers and I were discussing using bloom filters in the libudev netlink BSD socket filters, to allow monitoring a certain subset of devices that is orthogonal to the usual subsystem hierarchy, and all that in a way where the number of wakeups in listening clients is minimized)