Re: The best of all cardinality algorithms

175 views
Skip to first unread message

Matt Abrams

unread,
Apr 28, 2013, 7:04:02 PM4/28/13
to stream-...@googlegroups.com
Charles -

How are you constructing the adaptive counter? The AdaptiveCounter
extends from LogLog. For counts less than 4.25M the AdaptiveCounter
creates a LinearCounter with 1% error. When the estimated size is
greater than 4.25M it creates a LogLog counter with k=16. Based on
your test results it seems likely your AdaptiveCounter is actually a
LinearCounter.

LinearCounter's are deferent from the LogLog family of estimators in
that LinearCounters grow larger as the number of objects you need to
offer to it grows. LinearCounters do provide a higher precision at
low cardinalities than a typical LogLog based counter. The advantage
of LogLog counters is that they can count very large sets and their
size is a function of the required precision not the number of
elements offered to them (there are some upper bounds at which point
the error increases, either billions or trillions depending on the
number of bits in a small byte). LogLog based counters focus on small
memory footprints. Often there are hundreds, thousands, or even
millions of counter instances for a given use case so memory
efficiency is key.

GIven the assumption you were counting relatively small sets your HLL+
instance was very likely using he 'sp' value before flipping to normal
mode. The recommended default value of sp is 25. Can you try your
test again with 25 to see what the error rate is? My guess is it will
improve a bit.

So bottom line is that each counter offers different properties. As a
general rule LinearCounters are optimal for smaller sets and cases
where you need a defined level of accuracy. In cases where the sets
may grow large HLL+ with (14,25) is likely your best starting point.
The reason we have LogLog, HyperLogLog, and HyperLogLog+ is that the
state of the art has evolved over time and we want to maintain
backward compatibility with implementations that have used LogLog or
HLL even though HLL+ is recommended for new implementations.

Matt


On Sun, Apr 28, 2013 at 1:29 PM, <overcl...@gmail.com> wrote:
> Hey,
>
> Many thanks for implementing all this cool stuff. I have been working with
> library for a while now; trying to tweak various parameters to find the best
> suitable cardinality mechanism to use.
> So I have got this is:
>
> TYPE ErrorRate (Size)
> Adaptive 0.1% (65kB)
> HLL+ 0.4% (~11 kB p=14; sp=20)
> LogLog 1.8% (65 kB)
>
> So is Adaptive counting expected to outperform HLL+ everytime?
> because I have tried HLL+ for various values {(12, 18), (15, 21), (16,22)}
> but it is {14,20} that deviates least from actual results.
>
> Thanks
> Charles Adam
>
> --
> You received this message because you are subscribed to the Google Groups
> "stream-lib-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to stream-lib-us...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Matt Abrams

unread,
Apr 29, 2013, 8:16:35 AM4/29/13
to stream-...@googlegroups.com, overcl...@gmail.com
Can you share a gist of your test code?

Thanks

--
Matt Abrams

On Monday, April 29, 2013 at 2:31 AM, overcl...@gmail.com wrote:

Hi,

Thanks for the elaborate reply. It did help clearing a lot of distinctions among these methods.
The dataset this code will address will have > 4 million unique values as well as < 100 values; (since unique values will have to be stored in multiple dimensions (grouped by each type) ) hence it makes sense to use HLL+ as it will adjust the memory accordingly; and at the same time sticking to the accuracy limits. However, the fact that I am not getting better than 0.4 % error rate is a bit troubling.

I am using the following constructor to declare an object: 
                    AdaptiveCounting.Builder.obyCount(Integer.MAX_VALUE).build();

I checked; it is going into the loglog constructor.
I am calculating this cardinality for a bunch of files; storing them in an array of type ICardinality; and then merging them all into one big ICardinality type.
I do this for LogLog and HLL+ also.

What I do not get is why I am getting different values when I am constructing my cardinality object as new LogLog(16) instead of AdaptiveCounting.Builder.obyCount(Integer.MAX_VALUE).build();
although Adaptive should use same mechanism for offer, merge and cardinality() functions (am I right making this assumption?).

Also I tried your suggestion of using (14,25). The results are the same.

Thanks for looking into this.


On Sunday, April 28, 2013 10:29:37 AM UTC-7, overcl...@gmail.com wrote:
Hey,

Many thanks for implementing all this cool stuff. I have been working with library for a while now; trying to tweak various parameters to find the best suitable cardinality mechanism to use.
So I have got this is:

TYPE            ErrorRate (Size)
Adaptive        0.1% (65kB)
HLL+             0.4% (~11 kB p=14; sp=20)
LogLog          1.8% (65 kB)

So is Adaptive counting expected to outperform HLL+ everytime?
because I have tried HLL+ for various values {(12, 18), (15, 21), (16,22)} but it is {14,20} that deviates least from actual results.

Thanks
Charles Adam

Matt Abrams

unread,
Apr 29, 2013, 3:15:10 PM4/29/13
to stream-...@googlegroups.com, overcl...@gmail.com
Thanks. Without your files to test with I cannot repeat the same
scenario. Here is a small gist that has a test for different P
values.

https://gist.github.com/abramsm/5483928

The results for a sample run are here:

Expect estimate for p=10 is 4318199 with error: -0.028142619047619048
is between 3790500.0 and 4609500.0 mem:700
Expect estimate for p=11 is 4200925 with error: -2.2023809523809523E-4
is between 3910439.773104109 and 4489560.226895891 mem:1384
Expect estimate for p=12 is 4244290 with error: -0.010545238095238095
is between 3995250.0 and 4404750.0 mem:2748
Expect estimate for p=13 is 4198910 with error: 2.5952380952380953E-4
is between 4055219.8865520544 and 4344780.113447946 mem:5480
Expect estimate for p=14 is 4204352 with error: -0.0010361904761904761
is between 4097625.0 and 4302375.0 mem:10940
Expect estimate for p=15 is 4208199 with error: -0.0019521428571428572
is between 4127609.943276027 and 4272390.056723973 mem:21864
Expect estimate for p=16 is 4206409 with error: -0.0015259523809523809
is between 4148812.5 and 4251187.5 mem:43708
Expect estimate for p=17 is 4209883 with error: -0.002353095238095238
is between 4163804.9716380136 and 4236195.028361986 mem:87400

What version of stream-lib are you using?

Thanks,
Matt

On Mon, Apr 29, 2013 at 12:06 PM, <rgu...@tunein.com> wrote:
> Here you go...I am iteratively reading 31 files;
> offering keys in each of the file to a cardinality object.
> And then I merge all the objects..
>
> public class HyperLogLogPlusRunner
> {
> ICardinality total_cardinality;
> ICardinality[] all_cardinalities;
> int p = 14;
> int sp = 25;
>
> public HyperLogLogPlusRunner(int size)
> {
> total_cardinality = new HyperLogLogPlus(p, sp);
> all_cardinalities = new ICardinality[size];
> }
>
>
> void readAndSet(Path pathToFile, int index)
> {
> Configuration conf = new Configuration();
> FileSystem fs;
> all_cardinalities[index] = new HyperLogLogPlus(p, sp);
> try
> {
> /* Some hadoop IO stuff to get the keys... */
> fs = FileSystem.get(conf);
> SequenceFile.Reader reader = new SequenceFile.Reader(fs,
> pathToFile, conf);
> Pair key = new Pair();
> NullWritable val = NullWritable.get();
>
> while (reader.next(key, val))
> {
> all_cardinalities[index].offer(key.getUnique().toString());
> }
> System.out.println("Count: " +
> all_cardinalities[index].cardinality());
> System.out.println("Size: " +
> all_cardinalities[index].getBytes().length);
>
> }
> catch (IOException e)
> {
> e.printStackTrace();
> }
> }
>
> void mergeAll()
> {
> try
> {
> ICardinality temp = total_cardinality.merge(all_cardinalities);
> System.out.println("Final Count");
> System.out.println( temp.getBytes().length );
> System.out.println( temp.cardinality());
> }
> catch (Exception e)
> {
> e.printStackTrace();
> }
> }
>
> public static void main(String args[])
> {
> int start = 1; int end = 31;
> HyperLogLogPlusRunner tst = new HyperLogLogPlusRunner(end - start +
> 1);
>
> for(int i = start; i <= end; i++)
> {
> /* get path to the ith .seq file */
> Path file_path = new Path("data/temp"+i+".seq");
> System.out.println("Processing: "+ "temp"+i+".seq");
>
> /* offer all key values to ith cardinality object */
> tst.readAndSet(file_path, i-1);
> }
> /* merge cardinalities into a single object */
> tst.mergeAll();

Matt Abrams

unread,
Apr 29, 2013, 4:00:51 PM4/29/13
to stream-...@googlegroups.com, overcl...@gmail.com
Thanks for investigating. Here is a modified gist using a merge:

https://gist.github.com/abramsm/5484296

the results are not as good as single HLL+ but not terrible. One
oddity which may be explained by variance is that lower p values
sometimes out perform higher p values.

Expect estimate for p=10 is 466918 with error: -0.037595555555555556
is between 406125.0 and 493875.0 mem:700
Expect estimate for p=11 is 446037 with error: 0.008806666666666668 is
between 418975.6899754402 and 481024.3100245598 mem:1384
Expect estimate for p=12 is 451829 with error: -0.004064444444444444
is between 428062.5 and 471937.5 mem:2748
Expect estimate for p=13 is 449006 with error: 0.0022088888888888887
is between 434487.84498772013 and 465512.15501227987 mem:5480
Expect estimate for p=14 is 455098 with error: -0.011328888888888888
is between 439031.25 and 460968.75 mem:10940
Expect estimate for p=15 is 447972 with error: 0.004506666666666667 is
between 442243.92249386007 and 457756.07750613993 mem:21864
Expect estimate for p=16 is 450017 with error: -3.777777777777778E-5
is between 444515.625 and 455484.375 mem:43708
Expect estimate for p=17 is 450524 with error: -0.0011644444444444445
is between 446121.96124693 and 453878.03875307 mem:87400

On Mon, Apr 29, 2013 at 3:38 PM, <rgu...@tunein.com> wrote:
> Hey,
>
> I don't know about the version; I just cloned the repo from
> https://github.com/clearspring/stream-lib.git a week ago.
> btw; thanks for the test Matt. Looks like HLL+ should perform better. I'll
> dig deeper into the stream-lib and my data. I'll let you know if results
> improve.
>
> Thanks a lot!
> Charles
Reply all
Reply to author
Forward
0 new messages