DEV Community

Mihir Kulkarni
Mihir Kulkarni

Posted on

Building a demonstration of the Flajolet-Martin algorithm in Python

I was chilling in my Big Data Analysis class the other day when I ran into an interesting topic. The instructor was talking about algorithms that are used to operate on data streams. One of those algorithms is called the Flajolet-Martin algorithm, and it is used to find the number of distinct elements in a data stream. Wikipedia: https://en.wikipedia.org/wiki/Flajolet%E2%80%93Martin_algorithm

To those of you who are not in the know when it comes to data streams, you cannot do something like:

demoStream = [1,2,2,3,4]
print(len(set(demoStream)))

This is because, since we are operating on a data stream, we don't have the option of storing stuff into a list (although I did that in my code xD).

Since we're operating on a large data stream with no storage facilities to help us, what we resort to, in most cases, is probabilistic algorithms. Or, in this specific case, an approximation algorithm.At first glance, the Flajolet-Martin algorithm looked like witchcraft to me.
Alt Text
Basically, you take 'n' hash functions, of the form:

H(x) = (ax+b)%c
Where a,b,c are constant coefficients and,
x is an individual stream element.

Then, you CONVERT the output of each stream element to binary, and count the number of trailing zeros. THEN you find the maximum number of trailing zeros in that data stream. Your approximation for that specific hash is:

R = 2^r
Where R is the number of elements approximated and,
r is the maximum number of trailing zeros for that specific hash.

BUT WAIT THERE'S MORE! This next step is needed to improve accuracy. You then compute 'R' for all hashes, then bunch those together into different lists, find the mean of each list, and find the MEDIAN value of those means. THAT is the approximate number of elements in the data stream.

The Wikipedia article took away some of the "It's withcraft!" vibes I was getting, but it's still a weird algorithm nevertheless.


Okay, what happened next is probably predictable. I got bored one day and tried to write a demo of that algorithm. It became a Hacktoberfest submission to a code repo a few days after the fact. My first submission, actually. (First-timer here heheheh). You can see it here: https://github.com/sukhdeepg/Hacktoberfest/tree/master/Python

Most of the implementation was not too hard, but some parts were really fun to code. One of them being:

binElems.append(str(bin((i[0]*j+i[1])%i[2])).split("b")[1])
Alt Text
The iterator i is a list comprising each hash's coefficients (i[0] = a, i[1] = b and so on).
The iterator j is basically each element in the stream (x).

Handling binary elements in Python was a first for me, and if you're like me, here's a taste. The binary for 2 in Python is expressed as:

0b010

I had two things I needed to do here:

  1. Compute the hash
  2. Convert it to binary
  3. Put it into a list for computation (this can also be done by just processing it then and there, and only keeping the max count of trailing zeros, but I kept it for later, for the sake of readability)

But due to the binary format, I had to break the steps down to:

  1. Compute the hash
  2. Convert it to binary
  3. Convert the binary to a string
  4. Separate that string so that you get the binary you want
  5. Put it into a list for computation

This led to this one line of code. Python rocks :D

The code has some nifty tricks I picked up, please go through it. I'm not a pro, there will be mistakes, I'd love to hear from you guys and gals here! Constructive criticism appreciated!

Top comments (0)