Counting using HyperLogLog

90 views
Skip to first unread message

pi...@opinary.com

unread,
Jul 31, 2017, 9:01:55 AM7/31/17
to Redis DB
Hi,

I was hoping to use redis' HyperLogLog implementation, to count user actions. My users are identified by UUID.

I did basic tests and I am confused by the results:

    $ for i in `seq 1 10000`; do redis-cli pfadd uuid_hll `uuidgen` ; done | grep 0 | wc -l
    1768

This means that for 10k of uuids counted, error rate is 17%, which seems to be really high.

I get similar error rate when adding number from  1 to 10k:

    $ for i in `seq 1 10000`; do redis-cli pfadd uuid_hll_3 $i ; done | grep 0 | wc -l
    1761


I am running my tests using redis 3.2.1 and 4.0.1

Could someone please explain me if this is expected behavior and point me in the right direction where could I learn more, how to properly use this data structure? 

Itamar Haber

unread,
Jul 31, 2017, 11:07:22 AM7/31/17
to Redis DB
Hllo Piotr,

I repeated the exercise locally w/ similar results. However, note that:

$ redis-cli pfcount uuid_hll
(integer) 10064
$ redis-cli pfcount uuid_hll_3
(integer) 10000

HLL is a probabilistic data structure with 0.81% standard error in the Redis implementation. The error rate applies to the entire structure, and the responses from individual PFADD instructions should not be used to calculate it. 

Salvatore Sanfilippo

unread,
Jul 31, 2017, 11:11:06 AM7/31/17
to redi...@googlegroups.com
Hello,

the answer to your question lies in the documentation for the return
value of PFADD:

"1 if at least 1 HyperLogLog internal register was altered. 0 otherwise."

The fact that the register was altered means that, if you call PFCOUNT
again, the result MAY be different compared to the last time, however
if it's really different or not, and in which way, is not specified.
So the number of 1s returned has nothing to do with the final count.

Cheers,
Salvatore
> --
> You received this message because you are subscribed to the Google Groups
> "Redis DB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to redis-db+u...@googlegroups.com.
> To post to this group, send email to redi...@googlegroups.com.
> Visit this group at https://groups.google.com/group/redis-db.
> For more options, visit https://groups.google.com/d/optout.



--
Salvatore 'antirez' Sanfilippo
open source developer - Redis Labs https://redislabs.com

"If a system is to have conceptual integrity, someone must control the
concepts."
— Fred Brooks, "The Mythical Man-Month", 1975.

pi...@opinary.com

unread,
Aug 1, 2017, 2:39:38 AM8/1/17
to Redis DB
Thank you for the explanation. I was indeed confused by PFADD return value meaning.

As I understand, the correct way to implement this would be to run:

    MULTI
    PFCOUNT myuuids
    PFADD myuuids xxx-yyy-zzz
    PFCOUNT myuuids
    EXEC

and then check both PFCOUNT results. If first PFCOUNT value is the same as the second one, then xxx-yyy-zzz was already present in myuuids. Otherwise, when first PFCOUNT returns value different than second call, xxx-yyy-zzz was added for the first time.

Alex Gusev

unread,
Aug 23, 2017, 7:16:17 AM8/23/17
to Redis DB
I don't think it will help. 0.81% is the error of PFCOUNT's scalar result. But it doesn't mean that PFCOUNT result will be displayed correctly for 99.79% of invocations after single PFADD.
Reply all
Reply to author
Forward
0 new messages