HLL++ p and sp parameters

150 views
Skip to first unread message

Pavel Martynov

unread,
Jul 8, 2015, 11:30:37 AM7/8/15
to stream-...@googlegroups.com
Hi, everyone!

I read HLL++ whitepaper a few times but still doesnt understand how exactly `p` and `sp` parameters affect:
* threshold of Sparse -> Normal convertation
* maximum cardinality support
* std error
* ... anything else?

Some formulas or tables in docs could be very helpful.
For example `LinearCounting.Builder.withError(double eps, int maxCardinality)` very clear for me (as user without deep understanding of mathematics behind LinearCounting).

But how should I choose `p` and `sp` for HLL++ ?

Ian Barfield

unread,
Jul 8, 2015, 1:49:13 PM7/8/15
to stream-...@googlegroups.com
Yeah, we should really have a default constructor or something at least. P is your go-to for maximum cardinality and error bounds. It is basically the same as it is for HLL. In practice, it is also "better" in a "measured, but not well understood" sense, but can't give much guidance there.

The HLL constructors also have support for automatic error bound as a double -> p conversion, so we could add something like that here.

As for "sp", it will affect how the sparse set behaves in a lot of subtle ways. A decent rule of thumb might be that higher sp gives higher accuracy in exchange for more space at low cardinalities, but even that might be non-linear. More "sp" means less "collisions", and therefore a more saturated sparse set for a given "p". The space consumption is also amplified in serial form relative to in-memory because we use fixed-width integers (int[]) in memory and variable length integers for serialization.

Conversion between the two is defined strictly by "p" and sparse set saturation, so again, higher "sp" will delay conversion to normal mode, but with an interesting cost. It is clearly more accurate than a lower "sp" HLL++ when they are both in sparse mode, but since sparse mode is guaranteed to be at least as accurate as the corresponding normal mode (for less space), and the higher "sp" HLL++ will convert first, there may be an interval for which the higher "sp" HLL++ would have benefited more from degrading its "sp" value rather than doing a full -> normal conversion. However, it is likely to be a needless expense for most cases (as cardinalities will not naturally lie closer to exactly the conversion point than other places).

tl;dr: 14,25 is pretty okay

--
You received this message because you are subscribed to the Google Groups "stream-lib-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stream-lib-us...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Matt Abrams

unread,
Jul 8, 2015, 10:59:48 PM7/8/15
to stream-...@googlegroups.com

Ha,  love the tl;dr

Pavel Martynov

unread,
Jul 10, 2015, 6:14:36 AM7/10/15
to stream-...@googlegroups.com
Ian, thanks for explanation! Will use tl;dr in my project :).
Nevertheless I think public API should be a little bit more userfriendly: some default constructor or something else.
with best regards, Pavel Martynov
Reply all
Reply to author
Forward
0 new messages