From Wikipedia:
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.
In other words, given a set of elements, bloom filter can tell you that: A) given element is definitely not in the set, or B) given element is maybe in the set. It can give you a false positive: it can say that an element is in the set when it in fact is not. But it will never give you a false negative: it will never say that an element is not in the set when in fact it is.
In my past life I used bloom filters to check whether or not I should perform an expensive database index search 🙂
In the following example I construct a bloom filter given a set of strings set1, I then verify that each element of set1 is in the set according to the bloom filter. Finally I try a different set of elements set2, and test what the bloom filter says about those elements. Given big enough bloom filter I get 100% correct answers (that non of the elements in set2 are present). Here’s the code (bloom.cpp):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
#include <iostream> #include <string> #include "bloom.hpp" using namespace std; int main() { string set1[] = {"Martin", "Vorbrodt", "C++", "Blog"}; string set2[] = {"Not", "In", "The", "Set"}; bloom_filter<string> bloom(128); for(auto& s : set1) bloom.add(s); for(auto& s : set1) cout << "Contains \"" << s << "\" : " << bloom.contains(s) << endl; for(auto& s : set2) cout << "Contains \"" << s << "\" : " << bloom.contains(s) << endl; } |
Contains “Martin” : 1
Program output.
Contains “Vorbrodt” : 1
Contains “C++” : 1
Contains “Blog” : 1
Contains “Not” : 0
Contains “In” : 0
Contains “The” : 0
Contains “Set” : 0
My implementation of the bloom filter is primitive: it uses only one hashing function which increases false positives, but it is very short and clean and can be built upon; maybe at some point I’ll write a full blown implementation; though there’s plenty of examples online; I want this post to be an introduction to the topic.
In any case, here’s my implementation of the bloom filter (bloom.hpp):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
#pragma once #include <vector> #include <functional> #include <cstddef> template<typename key, typename hash = std::hash<key>> class bloom_filter { public: bloom_filter(std::size_t size) : m_bits(size) {} void add(const key& data) { m_bits[hash{}(data) % m_bits.size()] = true; } bool contains(const key& data) { return m_bits[hash{}(data) % m_bits.size()]; } private: std::vector<bool> m_bits; }; |