memory footprint of the HyperLogLog register

168 views
Skip to first unread message

cb

unread,
Jun 27, 2012, 9:35:23 AM6/27/12
to stream-...@googlegroups.com
hi,

OK - I'm trying to use the HyperLogLog class for cardinality estimation of a set that contains 1.5B unique items.

From my tests it seems that in order to achieve good accuracy - I have to set log2m param to 30. anything less will fail my accuracy goals (roughly below the 4% error)

For this settings - the memory footprint is 683MB ! (I used hyperLogLog.getBytes().length)

Is this reasonable? Is there any way to reduce the footprint?

I need to hold the serialized values in a key/value store and just thinking about transferring 683MB over the wire... not good for my online analytics :)

Note mentioning that I would like to store a hyperLogLog counter per daily hour and do merges in realtime per users request. It means that I will need to pull many of those in realtime. Size of each counter becomes a real bottleneck.

Any thoughts/advices on how to solve this?

Regards,
Chen.

Matt Abrams

unread,
Jun 27, 2012, 10:59:27 AM6/27/12
to stream-...@googlegroups.com
CB -

One of the things that makes HLL efficient is the use of 'small
bytes'. This helps us count to cardinalities of up to about 1B using
5 bits. For cardinalities over 1B this becomes a problem because the
number of hash collisions increases and this reduces the accuracy of
the count.

I think the solution in this case is to modify the RegisterSet
implementation to take an expected max cardinality and use a larger
value to avoid these hash collisions when cardinality >> 1B. This is
relevant to our use case at AddThis so I plan on working on a fix
soon. However, if you are interested in contributing a fix that would
be welcome. Using 20 bits I took the cardinality estimate every 10M
records from 0 to 1.5B and recorded the error. The results are below.
You can see we achieve 1% error up to about 1B, 1.5% error at 1.2 B,
and around 2% error at 1.5B. I did not see the 4% error that you
mentioned. Can you share your code example that you used?

Matt

Expect estimate: 10001221 is between 9969531.25 and 1.003046875E7
error %= -1.221E-4
Expect estimate: 19972233 is between 1.99390625E7 and 2.00609375E7
error %= 0.00138835
Expect estimate: 30001405 is between 2.990859375E7 and 3.009140625E7
error %= -4.683333333333333E-5
Expect estimate: 39976218 is between 3.9878125E7 and 4.0121875E7 error
%= 5.9455E-4
Expect estimate: 49973736 is between 4.984765625E7 and 5.015234375E7
error %= 5.2528E-4
Expect estimate: 59953664 is between 5.98171875E7 and 6.01828125E7
error %= 7.722666666666667E-4
Expect estimate: 69962678 is between 6.978671875E7 and 7.021328125E7
error %= 5.331714285714286E-4
Expect estimate: 79972560 is between 7.975625E7 and 8.024375E7 error %= 3.43E-4
Expect estimate: 89928792 is between 8.972578125E7 and 9.027421875E7
error %= 7.912E-4
Expect estimate: 99906814 is between 9.96953125E7 and 1.003046875E8
error %= 9.3186E-4
Expect estimate: 109919102 is between 1.0966484375E8 and
1.1033515625E8 error %= 7.354363636363637E-4
Expect estimate: 119890875 is between 1.19634375E8 and 1.20365625E8
error %= 9.09375E-4
Expect estimate: 129849677 is between 1.2960390625E8 and
1.3039609375E8 error %= 0.0011563307692307693
Expect estimate: 139894483 is between 1.395734375E8 and 1.404265625E8
error %= 7.536928571428572E-4
Expect estimate: 149881248 is between 1.4954296875E8 and
1.5045703125E8 error %= 7.9168E-4
Expect estimate: 159912794 is between 1.595125E8 and 1.604875E8 error
%= 5.450375E-4
Expect estimate: 169846585 is between 1.6948203125E8 and
1.7051796875E8 error %= 9.024411764705883E-4
Expect estimate: 179861731 is between 1.794515625E8 and 1.805484375E8
error %= 7.681611111111111E-4
Expect estimate: 189820093 is between 1.8942109375E8 and
1.9057890625E8 error %= 9.468789473684211E-4
Expect estimate: 199854914 is between 1.99390625E8 and 2.00609375E8
error %= 7.2543E-4
Expect estimate: 209922111 is between 2.0936015625E8 and
2.1063984375E8 error %= 3.709E-4
Expect estimate: 219889335 is between 2.193296875E8 and 2.206703125E8
error %= 5.030227272727273E-4
Expect estimate: 229942513 is between 2.2929921875E8 and
2.3070078125E8 error %= 2.499434782608696E-4
Expect estimate: 239829472 is between 2.3926875E8 and 2.4073125E8
error %= 7.105333333333333E-4
Expect estimate: 249789610 is between 2.4923828125E8 and
2.5076171875E8 error %= 8.4156E-4
Expect estimate: 259641311 is between 2.592078125E8 and 2.607921875E8
error %= 0.0013795730769230769
Expect estimate: 269508935 is between 2.6917734375E8 and
2.7082265625E8 error %= 0.0018187592592592593
Expect estimate: 279509573 is between 2.79146875E8 and 2.80853125E8
error %= 0.001751525
Expect estimate: 289401159 is between 2.8911640625E8 and
2.9088359375E8 error %= 0.002064968965517241
Expect estimate: 299477638 is between 2.990859375E8 and 3.009140625E8
error %= 0.0017412066666666667
Expect estimate: 309450373 is between 3.0905546875E8 and
3.1094453125E8 error %= 0.0017729903225806453
Expect estimate: 319477626 is between 3.19025E8 and 3.20975E8 error %=
0.00163241875
Expect estimate: 329489516 is between 3.2899453125E8 and
3.3100546875E8 error %= 0.001546921212121212
Expect estimate: 339503153 is between 3.389640625E8 and 3.410359375E8
error %= 0.001461314705882353
Expect estimate: 349586200 is between 3.4893359375E8 and
3.5106640625E8 error %= 0.0011822857142857143
Expect estimate: 359617742 is between 3.58903125E8 and 3.61096875E8
error %= 0.0010618277777777777
Expect estimate: 369765279 is between 3.6887265625E8 and
3.7112734375E8 error %= 6.34381081081081E-4
Expect estimate: 379664146 is between 3.788421875E8 and 3.811578125E8
error %= 8.838263157894737E-4
Expect estimate: 389607121 is between 3.8881171875E8 and
3.9118828125E8 error %= 0.0010073820512820513
Expect estimate: 399674063 is between 3.9878125E8 and 4.0121875E8
error %= 8.148425E-4
Expect estimate: 409591205 is between 4.0875078125E8 and
4.1124921875E8 error %= 9.97060975609756E-4
Expect estimate: 419520957 is between 4.187203125E8 and 4.212796875E8
error %= 0.0011405785714285715
Expect estimate: 429328929 is between 4.2868984375E8 and
4.3131015625E8 error %= 0.0015606302325581395
Expect estimate: 439383293 is between 4.38659375E8 and 4.41340625E8
error %= 0.0014016068181818182
Expect estimate: 449207149 is between 4.4862890625E8 and
4.5137109375E8 error %= 0.0017618911111111112
Expect estimate: 459035934 is between 4.585984375E8 and 4.614015625E8
error %= 0.002095795652173913
Expect estimate: 468980083 is between 4.6856796875E8 and
4.7143203125E8 error %= 0.002170036170212766
Expect estimate: 479171646 is between 4.785375E8 and 4.814625E8 error
%= 0.0017257375
Expect estimate: 489056787 is between 4.8850703125E8 and
4.9149296875E8 error %= 0.0019249244897959184
Expect estimate: 498878685 is between 4.984765625E8 and 5.015234375E8
error %= 0.00224263
Expect estimate: 508795307 is between 5.0844609375E8 and
5.1155390625E8 error %= 0.002362143137254902
Expect estimate: 518674952 is between 5.18415625E8 and 5.21584375E8
error %= 0.002548169230769231
Expect estimate: 528710461 is between 5.2838515625E8 and
5.3161484375E8 error %= 0.0024330924528301887
Expect estimate: 538518565 is between 5.383546875E8 and 5.416453125E8
error %= 0.002743398148148148
Expect estimate: 548583612 is between 5.4832421875E8 and
5.5167578125E8 error %= 0.002575250909090909
Expect estimate: 558597312 is between 5.5829375E8 and 5.6170625E8
error %= 0.0025048
Expect estimate: 568512330 is between 5.6826328125E8 and
5.7173671875E8 error %= 0.002609947368421053
Expect estimate: 578229611 is between 5.782328125E8 and 5.817671875E8
error %= 0.003052394827586207
Expect estimate: 587999632 is between 5.8820234375E8 and
5.9179765625E8 error %= 0.0033904542372881355
Expect estimate: 598026812 is between 5.98171875E8 and 6.01828125E8
error %= 0.003288646666666667
Expect estimate: 607964098 is between 6.0814140625E8 and
6.1185859375E8 error %= 0.003337544262295082
Expect estimate: 617897068 is between 6.181109375E8 and 6.218890625E8
error %= 0.0033918258064516127
Expect estimate: 627653883 is between 6.2808046875E8 and
6.3191953125E8 error %= 0.003723995238095238
Expect estimate: 637634015 is between 6.3805E8 and 6.4195E8 error %=
0.0036968515625
Expect estimate: 647586741 is between 6.4801953125E8 and
6.5198046875E8 error %= 0.0037127061538461538
Expect estimate: 657314115 is between 6.579890625E8 and 6.620109375E8
error %= 0.004069522727272727
Expect estimate: 667312744 is between 6.6795859375E8 and
6.7204140625E8 error %= 0.004010829850746269
Expect estimate: 677261628 is between 6.77928125E8 and 6.82071875E8
error %= 0.004027017647058823
Expect estimate: 687182675 is between 6.8789765625E8 and
6.9210234375E8 error %= 0.004083079710144927
Expect estimate: 697152125 is between 6.978671875E8 and 7.021328125E8
error %= 0.004068392857142857
Expect estimate: 706914779 is between 7.0783671875E8 and
7.1216328125E8 error %= 0.004345381690140845
Expect estimate: 716607579 is between 7.1780625E8 and 7.2219375E8
error %= 0.004711695833333333
Expect estimate: 726395954 is between 7.2777578125E8 and
7.3222421875E8 error %= 0.004937049315068493
Expect estimate: 736471549 is between 7.377453125E8 and 7.422546875E8
error %= 0.004768177027027027
Expect estimate: 746408272 is between 7.4771484375E8 and
7.5228515625E8 error %= 0.004788970666666666
Expect estimate: 756069744 is between 7.57684375E8 and 7.62315625E8
error %= 0.00517138947368421
Expect estimate: 765765852 is between 7.6765390625E8 and
7.7234609375E8 error %= 0.005498893506493506
Expect estimate: 775640259 is between 7.776234375E8 and 7.823765625E8
error %= 0.005589411538461538
Expect estimate: 785540261 is between 7.8759296875E8 and
7.9240703125E8 error %= 0.005645239240506329
Expect estimate: 795506264 is between 7.975625E8 and 8.024375E8 error
%= 0.00561717
Expect estimate: 805277185 is between 8.0753203125E8 and
8.1246796875E8 error %= 0.005830635802469136
Expect estimate: 814895060 is between 8.175015625E8 and 8.224984375E8
error %= 0.006225536585365853
Expect estimate: 824480029 is between 8.2747109375E8 and
8.3252890625E8 error %= 0.006650567469879518
Expect estimate: 834308804 is between 8.37440625E8 and 8.42559375E8
error %= 0.0067752333333333335
Expect estimate: 844103887 is between 8.4741015625E8 and
8.5258984375E8 error %= 0.006936603529411765
Expect estimate: 853779474 is between 8.573796875E8 and 8.626203125E8
error %= 0.00723316976744186
Expect estimate: 863600921 is between 8.6734921875E8 and
8.7265078125E8 error %= 0.0073552632183908045
Expect estimate: 873309338 is between 8.7731875E8 and 8.8268125E8
error %= 0.007603025
Expect estimate: 883137786 is between 8.8728828125E8 and
8.9271171875E8 error %= 0.007710352808988764
Expect estimate: 892798062 is between 8.972578125E8 and 9.027421875E8
error %= 0.008002153333333333
Expect estimate: 902012930 is between 9.0722734375E8 and
9.1277265625E8 error %= 0.008777
Expect estimate: 911920590 is between 9.17196875E8 and 9.22803125E8
error %= 0.008781967391304348
Expect estimate: 921498515 is between 9.2716640625E8 and
9.3283359375E8 error %= 0.009141381720430107
Expect estimate: 931230782 is between 9.371359375E8 and 9.428640625E8
error %= 0.009328955319148936
Expect estimate: 940993876 is between 9.4710546875E8 and
9.5289453125E8 error %= 0.00948013052631579
Expect estimate: 950603057 is between 9.57075E8 and 9.62925E8 error %=
0.009788482291666666
Expect estimate: 960429399 is between 9.6704453125E8 and
9.7295546875E8 error %= 0.009866598969072165
Expect estimate: 970356361 is between 9.770140625E8 and 9.829859375E8
error %= 0.009840447959183674
Expect estimate: 979977332 is between 9.8698359375E8 and
9.9301640625E8 error %= 0.01012390707070707
Expect estimate: 989620929 is between 9.96953125E8 and 1.003046875E9
error %= 0.010379071
Expect estimate: 999267438 is between 1.00692265625E9 and
1.01307734375E9 error %= 0.01062629900990099
Expect estimate: 1008924430 is between 1.0168921875E9 and
1.0231078125E9 error %= 0.010858401960784313
Expect estimate: 1018744656 is between 1.02686171875E9 and
1.03313828125E9 error %= 0.010927518446601942
Expect estimate: 1028345305 is between 1.03683125E9 and 1.04316875E9
error %= 0.0112064375
Expect estimate: 1038161806 is between 1.04680078125E9 and
1.05319921875E9 error %= 0.011274470476190476
Expect estimate: 1047698134 is between 1.0567703125E9 and
1.0632296875E9 error %= 0.011605533962264152
Expect estimate: 1057451040 is between 1.06673984375E9 and
1.07326015625E9 error %= 0.011728
Expect estimate: 1067284127 is between 1.076709375E9 and 1.083290625E9
error %= 0.011773956481481482
Expect estimate: 1077084804 is between 1.08667890625E9 and
1.09332109375E9 error %= 0.01184880366972477
Expect estimate: 1086672224 is between 1.0966484375E9 and
1.1033515625E9 error %= 0.01211616
Expect estimate: 1096450731 is between 1.10661796875E9 and
1.11338203125E9 error %= 0.012206548648648648
Expect estimate: 1106192631 is between 1.1165875E9 and 1.1234125E9
error %= 0.012328008035714285
Expect estimate: 1115889699 is between 1.12655703125E9 and
1.13344296875E9 error %= 0.01248699203539823
Expect estimate: 1125148831 is between 1.1365265625E9 and
1.1434734375E9 error %= 0.013027341228070175
Expect estimate: 1134431956 is between 1.14649609375E9 and
1.15350390625E9 error %= 0.013537429565217392
Expect estimate: 1143967826 is between 1.156465625E9 and 1.163534375E9
error %= 0.013820839655172414
Expect estimate: 1153535298 is between 1.16643515625E9 and
1.17356484375E9 error %= 0.014072394871794873
Expect estimate: 1163004276 is between 1.1764046875E9 and
1.1835953125E9 error %= 0.01440315593220339
Expect estimate: 1172400825 is between 1.18637421875E9 and
1.19362578125E9 error %= 0.014789222689075631
Expect estimate: 1182084427 is between 1.19634375E9 and 1.20365625E9
error %= 0.014929644166666667
Expect estimate: 1191642935 is between 1.20631328125E9 and
1.21368671875E9 error %= 0.015171128099173554
Expect estimate: 1200991300 is between 1.2162828125E9 and
1.2237171875E9 error %= 0.015580901639344263
Expect estimate: 1210384212 is between 1.22625234375E9 and
1.23374765625E9 error %= 0.01594779512195122
Expect estimate: 1220089240 is between 1.236221875E9 and 1.243778125E9
error %= 0.01605706451612903
Expect estimate: 1229495646 is between 1.24619140625E9 and
1.25380859375E9 error %= 0.0164034832
Expect estimate: 1239070667 is between 1.2561609375E9 and
1.2638390625E9 error %= 0.016610581746031746
Expect estimate: 1248438682 is between 1.26613046875E9 and
1.27386953125E9 error %= 0.016977415748031497
Expect estimate: 1258158447 is between 1.2761E9 and 1.2839E9 error %=
0.01706371328125
Expect estimate: 1267434086 is between 1.28606953125E9 and
1.29393046875E9 error %= 0.017492956589147287
Expect estimate: 1276986796 is between 1.2960390625E9 and
1.3039609375E9 error %= 0.017702464615384616
Expect estimate: 1286398750 is between 1.30600859375E9 and
1.31399140625E9 error %= 0.0180162213740458
Expect estimate: 1295765152 is between 1.315978125E9 and 1.324021875E9
error %= 0.018359733333333333
Expect estimate: 1305182418 is between 1.32594765625E9 and
1.33405234375E9 error %= 0.018659836090225562
Expect estimate: 1314791315 is between 1.3359171875E9 and
1.3440828125E9 error %= 0.018812451492537314
Expect estimate: 1324472268 is between 1.34588671875E9 and
1.35411328125E9 error %= 0.01890943111111111
Expect estimate: 1333891674 is between 1.35585625E9 and 1.36414375E9
error %= 0.019197298529411766
Expect estimate: 1342935335 is between 1.36582578125E9 and
1.37417421875E9 error %= 0.0197552299270073
Expect estimate: 1352183509 is between 1.3757953125E9 and
1.3842046875E9 error %= 0.020156877536231885
Expect estimate: 1361592187 is between 1.38576484375E9 and
1.39423515625E9 error %= 0.020437275539568346
Expect estimate: 1370972134 is between 1.395734375E9 and 1.404265625E9
error %= 0.02073419
Expect estimate: 1380356094 is between 1.40570390625E9 and
1.41429609375E9 error %= 0.021024046808510638
Expect estimate: 1389475546 is between 1.4156734375E9 and
1.4243265625E9 error %= 0.021496094366197184
Expect estimate: 1398845192 is between 1.42564296875E9 and
1.43435703125E9 error %= 0.02178657902097902
Expect estimate: 1408334734 is between 1.4356125E9 and 1.4443875E9
error %= 0.021989768055555556
Expect estimate: 1417412540 is between 1.44558203125E9 and
1.45441796875E9 error %= 0.022474110344827585
Expect estimate: 1426978815 is between 1.4555515625E9 and
1.4644484375E9 error %= 0.02261725
Expect estimate: 1436270463 is between 1.46552109375E9 and
1.47447890625E9 error %= 0.02294526326530612
Expect estimate: 1445727982 is between 1.475490625E9 and 1.484509375E9
error %= 0.023156768918918918
Expect estimate: 1455016892 is between 1.48546015625E9 and
1.49453984375E9 error %= 0.023478595973154364

cb

unread,
Jun 27, 2012, 2:50:39 PM6/27/12
to stream-...@googlegroups.com
hi Matt,

I'm sorry if I wasn't clear - the 4% is our internal threshold, we cannot overrun this value.

It is not clear to me how can you count 1B unique items using 5bit instances.

From my understanding - you must use a register which is big enough to hold 'large IDs' - since we are dealing with 1B unique items, the counter must be big enough since the IDs will be pretty big.

It would be really helpful if you can post an example - how to count 1B uniques using 5bit counters with a proven low estimation error.

Also - here is the code that I'm using for my test. Running it using 20 bits will fail. Please note that I'm not using the random population like you are doing in some of the test methods. I'm using a more direct approach for the population:

I changed the following methods in TestCardinality.java:

//Since I think there is a bug in this method I fix it to return the Hex representation of the input argument i instead of using a random representation

    protected static Object streamElement(long i)

    {                

        //return Long.toHexString(prng.nextLong());

    return Long.toHexString(i);

        //return se++;

    }



    @Test

    public void testHighCardinality()

    {

        long start = System.currentTimeMillis();

        int bits = 30;

        HyperLogLog hyperLogLog = new HyperLogLog(bits);

        long size = 1500000000;

        for (long i = 0; i < size; i++)

        {

            hyperLogLog.offer(TestICardinality.streamElement(i));

        }


        System.out.println("build time: " + ((double)(System.currentTimeMillis() - start))/(double)60000 + "minutes.");

        long estimate = hyperLogLog.cardinality();

        System.out.println("The estimated cardinality is: " + estimate);

        double err = Math.abs(estimate - size) / (double) size;

        System.out.println("The estimation error is: " + err * 100 + "%");

        assertTrue(err < .1);

Matt Abrams

unread,
Jun 27, 2012, 2:56:33 PM6/27/12
to stream-...@googlegroups.com
CB -

Can you please change:

long estimate = hyperLogLog.cardinality();

to:

long estimate = hyperLogLog.cardinality(false);

Like I discussed in my previous response. Let me know what the results are.

Matt

Matt Abrams

unread,
Jun 27, 2012, 2:56:55 PM6/27/12
to stream-...@googlegroups.com
Also, change bits to 20.

Matt

cb

unread,
Jun 27, 2012, 3:51:49 PM6/27/12
to stream-...@googlegroups.com
OK - I've fixed the code per your request. I will let you know once the processing is done.

Meanwhile - Can you please elaborate on how 5bit HLL can be used to count 1B uniques?

A code example will be great.

Final note - when using HLL with log2m = 30, not only do I have a memory footprint problem but also from a merge perspective - the performance degrade significantly.

Matt Abrams

unread,
Jun 27, 2012, 4:52:55 PM6/27/12
to stream-...@googlegroups.com
CB -

Yes, I've mentioned that 30 is not an a appropriate size. If you use
it things will be slow and take lots of memory. I recommend that you
not use it. Please let us know what the results of the test I asked
you to run are.

If you are interested in the fundamentals of how HLL works you should
read this paper:

http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

When I have time I will work on enhancing our implementation to count
values about 1.5B with less collisions. If you need something quicker
then I am able to get to it we love contributions!

Matt

cb

unread,
Jun 28, 2012, 1:29:33 AM6/28/12
to stream-...@googlegroups.com
hi Matt,

I can confirm that your suggestion did work out - the estimation error for counting 1.5B uniques was 

2.73%.

Yet, log2m = 20  results in a 683KB blob size  which could become a bottleneck for retrieval purposes.

I will definitely read the article more thoroughly. My aim is trying to achieve a 5bit based counting system under the same accuracy restrictions.

When you have time - please share with us a sample code for doing that. If I manage to do it - I will share the code here.

In regards to contributing back - I would be happy to do that but I will need specific instructions ;) we can take it offline if you want.

Thanks for all the help!!

Chen.

Reply all
Reply to author
Forward
0 new messages