HyperBitBit implementation

175 views
Skip to first unread message

jom...@gmail.com

unread,
Mar 23, 2017, 6:14:53 PM3/23/17
to stream-lib-user
Hello, 

Last year, in the AofA’16 conference Robert Sedgewick proposed a new algorithm for cardinality estimation. 
Robert Sedgwick is a professor at Princeton with a long track of publications on combinatorial/randomized algorithms. He was a good friend of Philippe Flajolet (creator of Hyperloglog) and HyperBitBit it's based on the same ideas. However, it uses less memory than Hyperloglog and can provide the same results. On practical data, HyperBitBit, for N < 2^64 estimates  cardinality within 10% using only 128 + 6 bits.

There are some open implementations of HyperBitBit on Github (mainly in Go) but would be great having it in Java and one of the most used libraries for streaming. 

I would like to implement it if you think that it would be useful let me know.

If you want to know more about HyperBitBit, I let you some notes about it.

Thank you,

Jordi.



P.S: I am aware of the library internals because I implemented once other algorithm called Recordinality for it (It is pending to be merged for one year now) so it would not take me so much integrate HyperBitBit. 

Matt Abrams

unread,
Mar 24, 2017, 8:52:55 AM3/24/17
to stream-...@googlegroups.com
Jordi -

Thanks for the offer, that would be fantastic! Appreciate you sharing
the notes, one of the better decks I've seen on cardinality
estimation.

It would be great if you could implement HyperBitBit.

I'm sorry that we've let your other PR sit there for so long, it has
been a busy year for us and we must have missed it. I'll take a look
as soon as possible.

Matt
> --
> You received this message because you are subscribed to the Google Groups
> "stream-lib-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to stream-lib-us...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

jom...@gmail.com

unread,
Mar 27, 2017, 4:38:32 PM3/27/17
to stream-lib-user
Hello Matt,

I implemented HyperBitBit this weekend, and I already did a pull request. Otmar Ertl pointed some important aspects of the algorithm, and I could not agree more with him.

This is an experimental algorithm with no analysis in depth (as HyperLogLog or Recordinality have). I will add some more tests on it and, although I left some "hints" in the documentation, I will make it clear for people who do not know the algorithm which is the guarantees they can expect. 


About Recordinality, do not worry at all. I will check headers and update the pull request.

Thank you for your time,

Jordi.

Otis Gospodnetić

unread,
Apr 1, 2017, 11:07:53 PM4/1/17
to stream-...@googlegroups.com
Hi,

Any idea how this compares to LogLog Beta (issue, without PR, at https://github.com/addthis/stream-lib/issues/125) ?

Thanks,
Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


--
You received this message because you are subscribed to the Google Groups "stream-lib-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stream-lib-user+unsubscribe@googlegroups.com.

jom...@gmail.com

unread,
Apr 3, 2017, 3:19:16 PM4/3/17
to stream-lib-user
Hi Otis,

LogLog Beta is going to perform MUCH better in all the cases. I could try to plot some comparations, but given the instability of HyperBitBit, the difference would be ridiculous.

What is proposed from the AOL team is create datasets randomly with different cardinalities and then, parametrize HyperLogLog to "empower" even more the algorithm in the cases where its predictions are weaker. 

LogLog Beta is a refinement of HyperLogLog which already works pretty well. There is a discussion in the pull request talking exactly about this. To be sure that people do not get confused I will move it to a new "experimental" package.

Thanks,

Jordi.
To unsubscribe from this group and stop receiving emails from it, send an email to stream-lib-us...@googlegroups.com.

Otis Gospodnetić

unread,
Apr 3, 2017, 10:29:10 PM4/3/17
to stream-...@googlegroups.com
Aha, I see.  Good pointer, thanks.  Maybe LogLog Beta would then be a more useful thing to implement?

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


To unsubscribe from this group and stop receiving emails from it, send an email to stream-lib-user+unsubscribe@googlegroups.com.

jom...@gmail.com

unread,
May 8, 2017, 7:35:53 AM5/8/17
to stream-lib-user
The LogLog Beta algorithm need some experimentation in order to adjust the beta value in each case. I agree it will be useful but I am now writing a paper for extending the K-Minimum Values algorithm.

I am going to use stream-lib for my experiments and for this reason I will implement Recodinality in it. Eventually, If I see some traction on people using LogLog beta I can implement it but now it is not one of my priorities.
Reply all
Reply to author
Forward
0 new messages