CB -
One of the things that makes HLL efficient is the use of 'small
bytes'. This helps us count to cardinalities of up to about 1B using
5 bits. For cardinalities over 1B this becomes a problem because the
number of hash collisions increases and this reduces the accuracy of
the count.
I think the solution in this case is to modify the RegisterSet
implementation to take an expected max cardinality and use a larger
value to avoid these hash collisions when cardinality >> 1B. This is
relevant to our use case at AddThis so I plan on working on a fix
soon. However, if you are interested in contributing a fix that would
be welcome. Using 20 bits I took the cardinality estimate every 10M
records from 0 to 1.5B and recorded the error. The results are below.
You can see we achieve 1% error up to about 1B, 1.5% error at 1.2 B,
and around 2% error at 1.5B. I did not see the 4% error that you
mentioned. Can you share your code example that you used?
Matt
Expect estimate: 10001221 is between 9969531.25 and 1.003046875E7
error %= -1.221E-4
Expect estimate: 19972233 is between 1.99390625E7 and 2.00609375E7
error %= 0.00138835
Expect estimate: 30001405 is between 2.990859375E7 and 3.009140625E7
error %= -4.683333333333333E-5
Expect estimate: 39976218 is between 3.9878125E7 and 4.0121875E7 error
%= 5.9455E-4
Expect estimate: 49973736 is between 4.984765625E7 and 5.015234375E7
error %= 5.2528E-4
Expect estimate: 59953664 is between 5.98171875E7 and 6.01828125E7
error %= 7.722666666666667E-4
Expect estimate: 69962678 is between 6.978671875E7 and 7.021328125E7
error %= 5.331714285714286E-4
Expect estimate: 79972560 is between 7.975625E7 and 8.024375E7 error %= 3.43E-4
Expect estimate: 89928792 is between 8.972578125E7 and 9.027421875E7
error %= 7.912E-4
Expect estimate: 99906814 is between 9.96953125E7 and 1.003046875E8
error %= 9.3186E-4
Expect estimate: 109919102 is between 1.0966484375E8 and
1.1033515625E8 error %= 7.354363636363637E-4
Expect estimate: 119890875 is between 1.19634375E8 and 1.20365625E8
error %= 9.09375E-4
Expect estimate: 129849677 is between 1.2960390625E8 and
1.3039609375E8 error %= 0.0011563307692307693
Expect estimate: 139894483 is between 1.395734375E8 and 1.404265625E8
error %= 7.536928571428572E-4
Expect estimate: 149881248 is between 1.4954296875E8 and
1.5045703125E8 error %= 7.9168E-4
Expect estimate: 159912794 is between 1.595125E8 and 1.604875E8 error
%= 5.450375E-4
Expect estimate: 169846585 is between 1.6948203125E8 and
1.7051796875E8 error %= 9.024411764705883E-4
Expect estimate: 179861731 is between 1.794515625E8 and 1.805484375E8
error %= 7.681611111111111E-4
Expect estimate: 189820093 is between 1.8942109375E8 and
1.9057890625E8 error %= 9.468789473684211E-4
Expect estimate: 199854914 is between 1.99390625E8 and 2.00609375E8
error %= 7.2543E-4
Expect estimate: 209922111 is between 2.0936015625E8 and
2.1063984375E8 error %= 3.709E-4
Expect estimate: 219889335 is between 2.193296875E8 and 2.206703125E8
error %= 5.030227272727273E-4
Expect estimate: 229942513 is between 2.2929921875E8 and
2.3070078125E8 error %= 2.499434782608696E-4
Expect estimate: 239829472 is between 2.3926875E8 and 2.4073125E8
error %= 7.105333333333333E-4
Expect estimate: 249789610 is between 2.4923828125E8 and
2.5076171875E8 error %= 8.4156E-4
Expect estimate: 259641311 is between 2.592078125E8 and 2.607921875E8
error %= 0.0013795730769230769
Expect estimate: 269508935 is between 2.6917734375E8 and
2.7082265625E8 error %= 0.0018187592592592593
Expect estimate: 279509573 is between 2.79146875E8 and 2.80853125E8
error %= 0.001751525
Expect estimate: 289401159 is between 2.8911640625E8 and
2.9088359375E8 error %= 0.002064968965517241
Expect estimate: 299477638 is between 2.990859375E8 and 3.009140625E8
error %= 0.0017412066666666667
Expect estimate: 309450373 is between 3.0905546875E8 and
3.1094453125E8 error %= 0.0017729903225806453
Expect estimate: 319477626 is between 3.19025E8 and 3.20975E8 error %=
0.00163241875
Expect estimate: 329489516 is between 3.2899453125E8 and
3.3100546875E8 error %= 0.001546921212121212
Expect estimate: 339503153 is between 3.389640625E8 and 3.410359375E8
error %= 0.001461314705882353
Expect estimate: 349586200 is between 3.4893359375E8 and
3.5106640625E8 error %= 0.0011822857142857143
Expect estimate: 359617742 is between 3.58903125E8 and 3.61096875E8
error %= 0.0010618277777777777
Expect estimate: 369765279 is between 3.6887265625E8 and
3.7112734375E8 error %= 6.34381081081081E-4
Expect estimate: 379664146 is between 3.788421875E8 and 3.811578125E8
error %= 8.838263157894737E-4
Expect estimate: 389607121 is between 3.8881171875E8 and
3.9118828125E8 error %= 0.0010073820512820513
Expect estimate: 399674063 is between 3.9878125E8 and 4.0121875E8
error %= 8.148425E-4
Expect estimate: 409591205 is between 4.0875078125E8 and
4.1124921875E8 error %= 9.97060975609756E-4
Expect estimate: 419520957 is between 4.187203125E8 and 4.212796875E8
error %= 0.0011405785714285715
Expect estimate: 429328929 is between 4.2868984375E8 and
4.3131015625E8 error %= 0.0015606302325581395
Expect estimate: 439383293 is between 4.38659375E8 and 4.41340625E8
error %= 0.0014016068181818182
Expect estimate: 449207149 is between 4.4862890625E8 and
4.5137109375E8 error %= 0.0017618911111111112
Expect estimate: 459035934 is between 4.585984375E8 and 4.614015625E8
error %= 0.002095795652173913
Expect estimate: 468980083 is between 4.6856796875E8 and
4.7143203125E8 error %= 0.002170036170212766
Expect estimate: 479171646 is between 4.785375E8 and 4.814625E8 error
%= 0.0017257375
Expect estimate: 489056787 is between 4.8850703125E8 and
4.9149296875E8 error %= 0.0019249244897959184
Expect estimate: 498878685 is between 4.984765625E8 and 5.015234375E8
error %= 0.00224263
Expect estimate: 508795307 is between 5.0844609375E8 and
5.1155390625E8 error %= 0.002362143137254902
Expect estimate: 518674952 is between 5.18415625E8 and 5.21584375E8
error %= 0.002548169230769231
Expect estimate: 528710461 is between 5.2838515625E8 and
5.3161484375E8 error %= 0.0024330924528301887
Expect estimate: 538518565 is between 5.383546875E8 and 5.416453125E8
error %= 0.002743398148148148
Expect estimate: 548583612 is between 5.4832421875E8 and
5.5167578125E8 error %= 0.002575250909090909
Expect estimate: 558597312 is between 5.5829375E8 and 5.6170625E8
error %= 0.0025048
Expect estimate: 568512330 is between 5.6826328125E8 and
5.7173671875E8 error %= 0.002609947368421053
Expect estimate: 578229611 is between 5.782328125E8 and 5.817671875E8
error %= 0.003052394827586207
Expect estimate: 587999632 is between 5.8820234375E8 and
5.9179765625E8 error %= 0.0033904542372881355
Expect estimate: 598026812 is between 5.98171875E8 and 6.01828125E8
error %= 0.003288646666666667
Expect estimate: 607964098 is between 6.0814140625E8 and
6.1185859375E8 error %= 0.003337544262295082
Expect estimate: 617897068 is between 6.181109375E8 and 6.218890625E8
error %= 0.0033918258064516127
Expect estimate: 627653883 is between 6.2808046875E8 and
6.3191953125E8 error %= 0.003723995238095238
Expect estimate: 637634015 is between 6.3805E8 and 6.4195E8 error %=
0.0036968515625
Expect estimate: 647586741 is between 6.4801953125E8 and
6.5198046875E8 error %= 0.0037127061538461538
Expect estimate: 657314115 is between 6.579890625E8 and 6.620109375E8
error %= 0.004069522727272727
Expect estimate: 667312744 is between 6.6795859375E8 and
6.7204140625E8 error %= 0.004010829850746269
Expect estimate: 677261628 is between 6.77928125E8 and 6.82071875E8
error %= 0.004027017647058823
Expect estimate: 687182675 is between 6.8789765625E8 and
6.9210234375E8 error %= 0.004083079710144927
Expect estimate: 697152125 is between 6.978671875E8 and 7.021328125E8
error %= 0.004068392857142857
Expect estimate: 706914779 is between 7.0783671875E8 and
7.1216328125E8 error %= 0.004345381690140845
Expect estimate: 716607579 is between 7.1780625E8 and 7.2219375E8
error %= 0.004711695833333333
Expect estimate: 726395954 is between 7.2777578125E8 and
7.3222421875E8 error %= 0.004937049315068493
Expect estimate: 736471549 is between 7.377453125E8 and 7.422546875E8
error %= 0.004768177027027027
Expect estimate: 746408272 is between 7.4771484375E8 and
7.5228515625E8 error %= 0.004788970666666666
Expect estimate: 756069744 is between 7.57684375E8 and 7.62315625E8
error %= 0.00517138947368421
Expect estimate: 765765852 is between 7.6765390625E8 and
7.7234609375E8 error %= 0.005498893506493506
Expect estimate: 775640259 is between 7.776234375E8 and 7.823765625E8
error %= 0.005589411538461538
Expect estimate: 785540261 is between 7.8759296875E8 and
7.9240703125E8 error %= 0.005645239240506329
Expect estimate: 795506264 is between 7.975625E8 and 8.024375E8 error
%= 0.00561717
Expect estimate: 805277185 is between 8.0753203125E8 and
8.1246796875E8 error %= 0.005830635802469136
Expect estimate: 814895060 is between 8.175015625E8 and 8.224984375E8
error %= 0.006225536585365853
Expect estimate: 824480029 is between 8.2747109375E8 and
8.3252890625E8 error %= 0.006650567469879518
Expect estimate: 834308804 is between 8.37440625E8 and 8.42559375E8
error %= 0.0067752333333333335
Expect estimate: 844103887 is between 8.4741015625E8 and
8.5258984375E8 error %= 0.006936603529411765
Expect estimate: 853779474 is between 8.573796875E8 and 8.626203125E8
error %= 0.00723316976744186
Expect estimate: 863600921 is between 8.6734921875E8 and
8.7265078125E8 error %= 0.0073552632183908045
Expect estimate: 873309338 is between 8.7731875E8 and 8.8268125E8
error %= 0.007603025
Expect estimate: 883137786 is between 8.8728828125E8 and
8.9271171875E8 error %= 0.007710352808988764
Expect estimate: 892798062 is between 8.972578125E8 and 9.027421875E8
error %= 0.008002153333333333
Expect estimate: 902012930 is between 9.0722734375E8 and
9.1277265625E8 error %= 0.008777
Expect estimate: 911920590 is between 9.17196875E8 and 9.22803125E8
error %= 0.008781967391304348
Expect estimate: 921498515 is between 9.2716640625E8 and
9.3283359375E8 error %= 0.009141381720430107
Expect estimate: 931230782 is between 9.371359375E8 and 9.428640625E8
error %= 0.009328955319148936
Expect estimate: 940993876 is between 9.4710546875E8 and
9.5289453125E8 error %= 0.00948013052631579
Expect estimate: 950603057 is between 9.57075E8 and 9.62925E8 error %=
0.009788482291666666
Expect estimate: 960429399 is between 9.6704453125E8 and
9.7295546875E8 error %= 0.009866598969072165
Expect estimate: 970356361 is between 9.770140625E8 and 9.829859375E8
error %= 0.009840447959183674
Expect estimate: 979977332 is between 9.8698359375E8 and
9.9301640625E8 error %= 0.01012390707070707
Expect estimate: 989620929 is between 9.96953125E8 and 1.003046875E9
error %= 0.010379071
Expect estimate: 999267438 is between 1.00692265625E9 and
1.01307734375E9 error %= 0.01062629900990099
Expect estimate: 1008924430 is between 1.0168921875E9 and
1.0231078125E9 error %= 0.010858401960784313
Expect estimate: 1018744656 is between 1.02686171875E9 and
1.03313828125E9 error %= 0.010927518446601942
Expect estimate: 1028345305 is between 1.03683125E9 and 1.04316875E9
error %= 0.0112064375
Expect estimate: 1038161806 is between 1.04680078125E9 and
1.05319921875E9 error %= 0.011274470476190476
Expect estimate: 1047698134 is between 1.0567703125E9 and
1.0632296875E9 error %= 0.011605533962264152
Expect estimate: 1057451040 is between 1.06673984375E9 and
1.07326015625E9 error %= 0.011728
Expect estimate: 1067284127 is between 1.076709375E9 and 1.083290625E9
error %= 0.011773956481481482
Expect estimate: 1077084804 is between 1.08667890625E9 and
1.09332109375E9 error %= 0.01184880366972477
Expect estimate: 1086672224 is between 1.0966484375E9 and
1.1033515625E9 error %= 0.01211616
Expect estimate: 1096450731 is between 1.10661796875E9 and
1.11338203125E9 error %= 0.012206548648648648
Expect estimate: 1106192631 is between 1.1165875E9 and 1.1234125E9
error %= 0.012328008035714285
Expect estimate: 1115889699 is between 1.12655703125E9 and
1.13344296875E9 error %= 0.01248699203539823
Expect estimate: 1125148831 is between 1.1365265625E9 and
1.1434734375E9 error %= 0.013027341228070175
Expect estimate: 1134431956 is between 1.14649609375E9 and
1.15350390625E9 error %= 0.013537429565217392
Expect estimate: 1143967826 is between 1.156465625E9 and 1.163534375E9
error %= 0.013820839655172414
Expect estimate: 1153535298 is between 1.16643515625E9 and
1.17356484375E9 error %= 0.014072394871794873
Expect estimate: 1163004276 is between 1.1764046875E9 and
1.1835953125E9 error %= 0.01440315593220339
Expect estimate: 1172400825 is between 1.18637421875E9 and
1.19362578125E9 error %= 0.014789222689075631
Expect estimate: 1182084427 is between 1.19634375E9 and 1.20365625E9
error %= 0.014929644166666667
Expect estimate: 1191642935 is between 1.20631328125E9 and
1.21368671875E9 error %= 0.015171128099173554
Expect estimate: 1200991300 is between 1.2162828125E9 and
1.2237171875E9 error %= 0.015580901639344263
Expect estimate: 1210384212 is between 1.22625234375E9 and
1.23374765625E9 error %= 0.01594779512195122
Expect estimate: 1220089240 is between 1.236221875E9 and 1.243778125E9
error %= 0.01605706451612903
Expect estimate: 1229495646 is between 1.24619140625E9 and
1.25380859375E9 error %= 0.0164034832
Expect estimate: 1239070667 is between 1.2561609375E9 and
1.2638390625E9 error %= 0.016610581746031746
Expect estimate: 1248438682 is between 1.26613046875E9 and
1.27386953125E9 error %= 0.016977415748031497
Expect estimate: 1258158447 is between 1.2761E9 and 1.2839E9 error %=
0.01706371328125
Expect estimate: 1267434086 is between 1.28606953125E9 and
1.29393046875E9 error %= 0.017492956589147287
Expect estimate: 1276986796 is between 1.2960390625E9 and
1.3039609375E9 error %= 0.017702464615384616
Expect estimate: 1286398750 is between 1.30600859375E9 and
1.31399140625E9 error %= 0.0180162213740458
Expect estimate: 1295765152 is between 1.315978125E9 and 1.324021875E9
error %= 0.018359733333333333
Expect estimate: 1305182418 is between 1.32594765625E9 and
1.33405234375E9 error %= 0.018659836090225562
Expect estimate: 1314791315 is between 1.3359171875E9 and
1.3440828125E9 error %= 0.018812451492537314
Expect estimate: 1324472268 is between 1.34588671875E9 and
1.35411328125E9 error %= 0.01890943111111111
Expect estimate: 1333891674 is between 1.35585625E9 and 1.36414375E9
error %= 0.019197298529411766
Expect estimate: 1342935335 is between 1.36582578125E9 and
1.37417421875E9 error %= 0.0197552299270073
Expect estimate: 1352183509 is between 1.3757953125E9 and
1.3842046875E9 error %= 0.020156877536231885
Expect estimate: 1361592187 is between 1.38576484375E9 and
1.39423515625E9 error %= 0.020437275539568346
Expect estimate: 1370972134 is between 1.395734375E9 and 1.404265625E9
error %= 0.02073419
Expect estimate: 1380356094 is between 1.40570390625E9 and
1.41429609375E9 error %= 0.021024046808510638
Expect estimate: 1389475546 is between 1.4156734375E9 and
1.4243265625E9 error %= 0.021496094366197184
Expect estimate: 1398845192 is between 1.42564296875E9 and
1.43435703125E9 error %= 0.02178657902097902
Expect estimate: 1408334734 is between 1.4356125E9 and 1.4443875E9
error %= 0.021989768055555556
Expect estimate: 1417412540 is between 1.44558203125E9 and
1.45441796875E9 error %= 0.022474110344827585
Expect estimate: 1426978815 is between 1.4555515625E9 and
1.4644484375E9 error %= 0.02261725
Expect estimate: 1436270463 is between 1.46552109375E9 and
1.47447890625E9 error %= 0.02294526326530612
Expect estimate: 1445727982 is between 1.475490625E9 and 1.484509375E9
error %= 0.023156768918918918
Expect estimate: 1455016892 is between 1.48546015625E9 and
1.49453984375E9 error %= 0.023478595973154364