PFCOUNT


Ghais Issa

Apr 17, 2014, 10:31:58 AM
to redi...@googlegroups.com
Hi guys,

We use HLL at work a lot, and we have developed our own solution for persisting HyperLogLogs. Our access pattern is mostly to query for the estimate resulting from the merge of N HyperLogLogs (e.g. count hll_day1, hll_day2, hll_day3).

I was looking at migrating some of that code over to Redis 2.8.9; however, the only way to achieve that would be to do PFMERGE dest key1 key2 key3 and then PFCOUNT dest.
PFMERGE ends up creating a new HLL, which in our case is not really needed.

Looking at the code, it seems that changing PFCOUNT key to PFCOUNT key [key ...] wouldn't add any performance penalty, and it would allow for more flexible use.
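For what it's worth, the reason this shouldn't cost anything extra: an HLL union is just the register-wise maximum of its inputs, so a multi-key count can operate on a temporary array and never needs to persist a destination key. A minimal Python sketch of that merge rule (register count and values here are illustrative; Redis uses 16384 registers):

```python
def hll_merge(*hlls):
    """Merge HyperLogLog register arrays by taking the per-register maximum.

    This is the whole merge rule: the max of each register over all inputs
    yields the registers of the union, so a multi-key count can work on a
    temporary array without writing anything back.
    """
    return [max(regs) for regs in zip(*hlls)]

# Illustrative 16-register example (Redis uses 16384 registers of 6 bits).
hll_day1 = [0] * 16
hll_day1[3], hll_day1[7] = 5, 2
hll_day2 = [0] * 16
hll_day2[3], hll_day2[9] = 4, 6

merged = hll_merge(hll_day1, hll_day2)
print(merged[3], merged[7], merged[9])  # -> 5 2 6
```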

Is there a reason PFCOUNT only accepts a single key?

Thanks a lot.

Salvatore Sanfilippo

Apr 17, 2014, 10:56:56 AM
to Redis DB
Hello,

this seems like a very legitimate use case. I'm looking into the
implementation right now to check if I can support it before the merge
into 2.8.
Please report ASAP anything else that is preventing you from porting
your existing solution to Redis. Thank you, more news
about PFCOUNT in a few.

Salvatore



--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

To "attack a straw man" is to create the illusion of having refuted a
proposition by replacing it with a superficially similar yet
unequivalent proposition (the "straw man"), and to refute it
— Wikipedia (Straw man page)

Ghais Issa

Apr 17, 2014, 11:15:46 AM
to redi...@googlegroups.com
Thanks Salvatore, I will prototype an implementation on top of PFMERGE for now until PFCOUNT key [key ...] is supported and report back on any issues I come across.

-Ghais

Salvatore Sanfilippo

Apr 17, 2014, 11:17:27 AM
to Redis DB
On Thu, Apr 17, 2014 at 5:15 PM, Ghais Issa <ghais...@gmail.com> wrote:
> Thanks Salvatore, I will prototype an implementation on top of PFMERGE for
> now until PFCOUNT key [key ...] is supported and report back on any issues I
> come across.

Great, I have almost finished the implementation btw; it will likely be
committed in a few, so maybe it's worth waiting a bit. More info ASAP.

Salvatore

Salvatore Sanfilippo

Apr 17, 2014, 11:36:14 AM
to Redis DB
Ok, just pushed an implementation of PFCOUNT working with multiple
keys in the unstable branch.

I don't think it is the fastest we can get, especially when merging
different sparse HLLs, where it is possible to skip many zero registers
without any computation.
However, the implementation refactors PFMERGE into an internal API and
uses it, together with a special internal-only RAW encoding, to at
least avoid 6-bit integer computations while performing the merge.
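To illustrate what a RAW-style conversion buys here, this is a hedged sketch: the dense encoding packs 6-bit registers little-endian, and widening them to one byte each lets the merge loop do plain per-byte work instead of bit surgery for every register. The exact packing below is an assumption for illustration, not the authoritative Redis layout:

```python
def unpack6(data, nregs):
    """Widen little-endian packed 6-bit registers into one byte each
    (a sketch of a RAW-style encoding; the layout is assumed)."""
    out = []
    for i in range(nregs):
        bit = i * 6
        byte, shift = bit // 8, bit % 8
        val = data[byte] >> shift
        if shift > 2:                          # the register spans two bytes
            val |= data[byte + 1] << (8 - shift)
        out.append(val & 0x3F)
    return out


def pack6(regs):
    """Inverse helper: pack 6-bit values little-endian into bytes."""
    acc = 0
    for i, r in enumerate(regs):
        acc |= (r & 0x3F) << (i * 6)
    nbytes = (len(regs) * 6 + 7) // 8
    return acc.to_bytes(nbytes, "little")


print(unpack6(pack6([5, 10, 33, 63]), 4))  # -> [5, 10, 33, 63]
```

Once both operands are in the widened form, the merge is a straight byte-wise max over two flat arrays.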

One problem with this approach is that when the union is performed,
PFCOUNT can't cache the value of the approximated cardinality (since it
is computed on a temporary HLL resulting from the merge), and that
cache was the biggest speedup in the Redis implementation.
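The caching trade-off can be modeled in a few lines; the structure below is illustrative, not the actual Redis internals, though the idea matches what is described above (the cached cardinality lives with the HLL and is invalidated when a register changes):

```python
class CachedHLL:
    """Illustrative model of cardinality caching: the estimate is recomputed
    only when a register has changed since the last count."""

    def __init__(self, nregs=16):
        self.regs = [0] * nregs
        self.cached = None              # None means the cache is invalid

    def set_register(self, idx, val):
        if val > self.regs[idx]:
            self.regs[idx] = val
            self.cached = None          # any update invalidates the cache

    def count(self, estimate):
        if self.cached is None:         # expensive path: full estimation
            self.cached = estimate(self.regs)
        return self.cached              # cheap path on repeated counts
```

A temporary merge result has nowhere persistent to store that cached value, which is why a multi-key PFCOUNT pays the expensive path every time.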

Cheers,
Salvatore

Ghais Issa

Apr 17, 2014, 11:53:48 AM
to redi...@googlegroups.com
Thanks Salvatore

We currently have 3065251 HLLs in production, stored in approximately 35GB. We are less sensitive to count/estimation latency when counting multiple keys; however, we would certainly benefit from the speedup when looking up a single HLL (we do that in our real-time analytics views).

I plan to start persisting new data in Redis 2.8.9 in the next couple of hours, and to load about 60 days' worth of historic data soon after. I hope to have something measurable in the next day or two. If I run into any issues I will certainly report them back.

Again, thanks a lot.


Salvatore Sanfilippo

Apr 17, 2014, 12:13:40 PM
to Redis DB
On Thu, Apr 17, 2014 at 5:53 PM, Ghais Issa <ghais...@gmail.com> wrote:
> Thanks Salvatore
>
> We currently have 3065251 HLLs in production, stored in approximately 35GB.
> We are less sensitive to count/estimation latency when counting multiple
> keys, however we would certainly benefit from the speedup when looking up a
> single HLL (we do that in our real time analytics views)

Yes, this is exactly how it works: PFCOUNT is very fast with a single
key, and a lot slower with N keys when merging is performed.
However, the good news is that for sparsely populated HLLs I found (and
just committed) a very big speedup.

As far as raw speed is concerned, we have:

PFCOUNT at 800k ops/sec when it can use the cached value (almost
always possible in practice).
PFCOUNT with multiple *very* sparsely populated keys at ~100k-200k ops/sec.
PFCOUNT with multiple very populated HLLs is very slow at ~2000 ops/sec.
It is terrible since it requires accessing 16k registers in
different HLLs while working with floating point numbers...
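To make the dense-path cost concrete: the estimator has to visit every one of the 16384 merged registers and combine them in floating point. A sketch of the raw HyperLogLog estimate (bias and small/large-range corrections omitted; the constant is the standard alpha_m approximation):

```python
def raw_estimate(regs):
    """Raw HyperLogLog estimate: E = alpha_m * m^2 / sum(2^-M[j]).

    Every register contributes a floating-point term, which is why
    estimating over a freshly merged 16384-register HLL is slow
    compared to returning a cached cardinality.
    """
    m = len(regs)
    alpha = 0.7213 / (1.0 + 1.079 / m)   # standard approximation for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in regs)
```

For an all-zero register array the sum is simply m, so the raw estimate degenerates to alpha_m * m; real implementations switch to linear counting in that range.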

Given these numbers, is it worth working with this implementation, in
your opinion / for your use case?

> I plan to start persisting new data in Redis 2.8.9 in the next couple of
> hours as well as load about 60 days worth of historic data soon after. I
> hope to get something measurable in the next day or 2. If I run into any
> issues I will certainly report them back.
>
> Again, thanks a lot.

You are welcome.

WARNING: all this new stuff we are talking about is only in the
*unstable* branch currently! :-)
I'll apply it to 2.8 after some more testing tomorrow.

The commits will cherry-pick into 2.8 without issues if you want them right now.

Ghais Issa

Apr 17, 2014, 12:40:05 PM
to redi...@googlegroups.com
These benchmarks look great: at least a 3X performance improvement over what we have for multiple keys, and 20X+ for single keys. That most definitely supports our use case and is a great reason to migrate.

We are currently working with the unstable branch, and we are in no hurry to get it into production. The goal of the migration is to reduce the amount of code and the number of servers we maintain, and we already have multiple Redis servers in production for different use cases.

Salvatore Sanfilippo

Apr 17, 2014, 12:49:19 PM
to Redis DB
Great news! Thanks.