HyperLogLog register index

19 views
Skip to first unread message

Aaron Pfeifer

unread,
May 6, 2019, 5:17:38 PM5/6/19
to algebird
Hi all!  I'm trying to line up a few different implementations of HyperLogLog in different libraries.  When evaluating algebird, the logic that determines the j value seems to be different than most other libraries.  Specifically I'm referring to this code: https://github.com/twitter/algebird/blob/07b185f69f5b3b1f5da1c23b84a8983ebbd921fa/algebird-core/src/main/scala/com/twitter/algebird/HyperLogLog.scala#L97-L113

It seems that algebird is expecting the individual bits within each byte generated from the hashing function to be in Little Endian order, but the bytes within each Long generated to be in Big Endian order (via pairLongs2Bytes).  Am I misreading this?

I'm finding in many libraries (e.g. https://github.com/aggregateknowledge/js-hll/blob/cf1f6ac9c0b37abceae070bc4edd312ed2f67dc3/src/hll.js#L265), that the implementation is to simply do a bit AND operation on the Long.  I'm finding the register index being chosen for a specific input to be different between many libraries and Algebird even when the hashing algorithm, # of registers, and register width are identical.

I feel like I'm missing something here but it's not clear to me.  Could anyone provide some insight into the differences between how the algorithm is implemented?

Thanks!

-Aaron

Aaron Pfeifer

unread,
May 7, 2019, 2:33:04 PM5/7/19
to algebird
After reviewing several other libraries, it seems that many of them implement the algorithm for determining the bucket index differently.  I haven't found any that use the same algorithm as Algebird though.

P. Oscar Boykin

unread,
May 7, 2019, 11:06:11 PM5/7/19
to Aaron Pfeifer, algebird
Byte order doesn’t matter. We assume the hash function is uniform over all bits so any permutation would be fine.

Thus we just write the code unpacking the bits in a way that is convenient.

The hashes don’t store any meaningful integer, so endianism shouldn’t matter.

If you wanted to have two implementations share serialized HLLs then you would care. If you only want to be consistent with yourself it doesn’t matter.

Hope this is useful.

--
You received this message because you are subscribed to the Google Groups "algebird" group.
To unsubscribe from this group and stop receiving emails from it, send an email to algebird+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/algebird/ce7e6137-808e-4749-8568-2e12923ae9cf%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--

Aaron Pfeifer

unread,
May 8, 2019, 6:50:55 AM5/8/19
to algebird
Thanks! Yeah serialization between different languages and frameworks was exactly the problem I was trying to solve :)

Since Algebird was the only framework with this implementation, I decided to move forward with a different library that was more compatible across different language frameworks. I could have built an implementation following Algebird's algorithm, but didn't want to manage my own fork of other libraries.

Thanks for following up!

Reply all
Reply to author
Forward
0 new messages