HyperLogLog - Intersection Arithmetic

2,651 views
Skip to first unread message

cb

unread,
Jul 18, 2012, 9:06:47 AM7/18/12
to stream-...@googlegroups.com
hi,

Is it possible to extend HLL to provide intersection between two (or more) HLLs ?

HLL1 Intersect HLL2 = cardinality of items that where represented in both HLL1 and HLL2

If anyone can point me to an algorithm - I'll be happy to contribute and extend the existing implementation.

Regards,
Chen.

Matt Abrams

unread,
Jul 18, 2012, 9:18:36 AM7/18/12
to stream-...@googlegroups.com
Chen -

There was another thread on this topic a few months ago. I will
forward the thread to you.

Matt

Matt Abrams

unread,
Sep 25, 2012, 7:37:35 AM9/25/12
to stream-...@googlegroups.com
I've pasted the thread from our pre stream-lib-user email chain so it
is here for posterity. You might also look at the code twitter just
open sourced that has an estimateIntersectionSize method. I'll likely
add that to our implementation unless someone beats me to it.

https://github.com/echen/algebird/blob/master/src/main/scala/com/twitter/algebird/HyperLogLog.scala

hi Matt,

I can confirm that the |A INTERSECT B| = |A| + |B| - |A UNION B| works
really good. I confirmed this by testing a 2B uniques hierarchical
data : Uniques per GEO and breakdown by Gender.

Two quick questions -

It seems that the HLL accuracy is not consistent and depends on UID
distribution.

In my test app, I'm building sequential IDs (starting from 1 up to 2B)
and as a result, the distribution is bulked in each segment. This is
obviously not a natural distribution as expected in real life. Do you
think that this is the reason why for example the HLL of GEO1->Female
is 3% error and GEO2->Female is only 0.5% error although both contain
the same order of cardinality?

The second question is:

I saw in one of the HLL talks that in addition to Intersection, there
is another important feature which is DIFF between two HLLs.
Do you agree and if so - can you provide an example use case where
DIFF operations might become handy?

Many Thanks!


Chen.


On Wed, Jul 18, 2012 at 4:19 PM, Matt Abrams <abr...@gmail.com> wrote:
---------- Forwarded message ----------
From: Josh Ferguson <jfer...@yammer-inc.com>
Date: Tue, Apr 17, 2012 at 5:51 PM
Subject: Re: Intersection of HyperLogLog register sets
To: Matt Abrams <abr...@gmail.com>
Cc: Chris Burroughs <chris.b...@gmail.com>


The intersection approach won't work exactly due to some double
counting problems. The best approach to thinking about these things I
guess is as compressed inverted indices. Our approach now will be to
build sets of data that include common dimensions we want to slice and
dice together. You can think about this as coming up with sets of
indices for database tables that you want to put into predicates
together.

For instance one of our inverted indices for our "click stream" events
dataset might looks like:

day, controller, action, client_id, user_agent_name,
user_agent_version, user_agent_platform, continent, country, users,
event_count

The users field would hold the probabilistic registers. For every
distinct combination of every value of these fields we'd have one
bitset for the users that were contained in that set. From this we
could simply do filters and groupings with the estimation of union'd
registers as the aggregate function. We could do this on any subset of
the fields contained in this table. The hardest part at this point is
figuring out what our front-end use cases are such that we can build
the best set dimension groupings possible. It might also be nice to
write some custom postgres aggregate functions in order to do a lot of
the front-end reporting calculations in a database instead of
rewriting query logic programmatically.

Josh Ferguson


On Apr 17, 2012, at 11:00 AM, Matt Abrams wrote:

> Josh -
>
> That sounds like a interesting approach. Please let us know how it works out.
>
> I created a java version of HLL using your JavaScript implementation as a model:
>
> https://github.com/clearspring/stream-lib/blob/hyperloglog/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLog.java
>
> Re the storage size issue I mentioned in the other email. I think if
> you use this:
>
> Math.floor(count / 6.0)
>
> it will size the array correctly.
>
> Regards,
> Matt
>
> On Mon, Apr 16, 2012 at 2:24 PM, Josh Ferguson <jfer...@yammer-inc.com> wrote:
>> There could certainly be bugs. I just started implementing it and testing it last week and all getting all the crazy math to match up exactly is hard for my soft little brain. I'll look into it some more and see if I can write a test case that reproduces it.
>>
>> What we're going to do is essentially try and create partitioned datasets around certain dimensions of our data. So for instance have event logging where we log all kinds of information about users and actions they take. We have a few general sets of dimensions around these as follows:
>>
>> Time
>> month, day, hour
>>
>> Events
>> event name, controller, action
>>
>> Client / User Agent
>> client, user agent name, user agent version, user agent platform
>>
>> Location
>> continent, country, region, closest large city (pop > 500,000)
>>
>> For every distinct combination of values in these dimension sets we're going to keep a set of registers for user_id and event_id. This will allow us to count the number of users and number of events that occurred. We can then use unions and intersections to calculate various things.
>>
>> For instance say I want to know the estimate of # of users in the past 28 days who posted a message grouped by country.
>>
>> First we'd union estimators within each of the dimension sets
>>
>> Union the sets of users over the past 28 days from 'Time'
>> Union the sets of users for grouped by country from 'Location'
>> Union the sets of users with event name 'message_creation' from 'Events'
>>
>> We then combine them in a grouped fashion where the aggregate is intersection to get the final grouped estimated dataset.
>>
>> Our approach is basically like building a probabilistic database but I think the benefit here is that you can query *very* large datasets in near real-time and build out some really great dashboard and analytics functionality. I think we're going to be using CrossFIlter (formerly Tesseract) to help us achieve a lot of the more database type functionality within our datasets and just pass in our own aggregate functions to do the unions and intersections.
>>
>> What is your thought on this?
>>
>> Josh
>>
>>
>> On Apr 16, 2012, at 6:30 AM, Matt Abrams wrote:
>>
>>> Josh -
>>>
>>> Thanks, this is really helpful. I was running into the same
>>> overestimation problem this weekend but hadn't found a solution yet.
>>> I will try to implment your approach today.
>>>
>>> Also, I was looking at your implementation and I'm not sure I
>>> understand it completely. In register_set.js you set the storage size
>>> for your buckets based on the function:
>>>
>>> var storage = Math.floor(5 * this.count / this.bitCount)
>>>
>>> if(storage == 0)
>>> this.buckets = new Array(1)
>>> else if(storage % this.bitCount == 0)
>>> this.buckets = new Array(storage)
>>> else
>>> this.buckets = new Array(storage + 1)
>>>
>>> so for count = 256 you will end up with storage = 40 and 41 buckets.
>>>
>>> Later the estimate method of estimator.js runs:
>>>
>>> for(var j = 0; j < registers.count; j++) {
>>> r_sum += Math.pow(2, (-1 * registers.get(j)))
>>> }
>>>
>>> So we would pass a number like 255 to registers.get(j) and that method
>>> calculates the bucket position based on j, so it would be:
>>>
>>> var bucketPos = Math.floor(position / this.registersPerBucket)
>>>
>>> and we'd get bucketPos = 42. JS transparently adds the extra buckets
>>> but I don' think that is intended.
>>>
>>> Did I interpret this correctly?
>>>
>>> Thanks,
>>> Matt
>>>
>>>
>>> On Sun, Apr 15, 2012 at 9:45 PM, Josh Ferguson <jfer...@yammer-inc.com> wrote:
>>>> I emailed Eric Fusy one of the authors. The answer is of course to remember from set theory that
>>>>
>>>> |A INTERSECT B| = |A| + |B| - |A UNION B|
>>>>
>>>> So given that we can estimate individual registers and merge registers to produce the union we can also get good results for the intersection. Eric said they had seen very good results with this method. I'm writing some unit tests around my js implementation to see what kind of error rates we see at different register sizes.
>>>>
>>>> Josh Ferguson
>>>>
>>>> On Apr 15, 2012, at 3:46 PM, Chris Burroughs wrote:
>>>>
>>>>> I can't recall if the intersection is possibly from the literature.
>>>>> I'll try to look it up. Matt Abrams has been working on the HyperLogLog
>>>>> integration and may be able to elaborate.
>>>>>
>>>>> On 04/12/2012 09:06 PM, Josh Ferguson wrote:
>>>>>> Have you done any work on taking the intersection of two loglog register sets? The union is represented by your merge function wherein you take the max of each register position but taking minimum of each one doesn't produce an intersection. It actually tends to over-estimate the size of the intersection systematically which leads me to believe there is some kind of probabilistic way to quantify or re-adjust the results. It man be the MIN{r_1, r_2, r_3} is not the right function to apply. Any ideas?
>>>>>>
>>>>>> I did a test as follows:
>>>>>>
>>>>>> I created 2 datasets, each of 1000 distinct items that have a 500 item overlap.
>>>>>> I created 4 sets of registers using the HyperLogLog formula from Figure 3 of the paper.
>>>>>>
>>>>>> e1 and e2 are the registers for the first and second 1000 item datasets respectively.
>>>>>> e3 combines e1 and e2 by taking the minimum value of the registers across every position
>>>>>>
>>>>>> e4 is the set of registers created by first taking the intersection of the set and then running the algorithm normally.
>>>>>>
>>>>>> e1: [ 5, 8, 6, 10, 10, 6, 7, 7, 4, 12, 6, 7, 9, 8, 7, 8 ]
>>>>>> e2: [ 10, 8, 6, 10, 5, 6, 7, 6, 6, 5, 5, 7, 6, 9, 7, 8 ]
>>>>>> -----------------------------
>>>>>> e3: [ 5, 8, 6, 10, 5, 6, 7, 6, 4, 5, 5, 7, 6, 8, 7, 8 ]
>>>>>> e4: [ 4, 8, 6, 10, 4, 5, 7, 4, 4, 5, 4, 7, 6, 5, 7, 8 ]
>>>>>>
>>>>>> The code of my implemention is located at https://github.com/yammer/probablyjs/blob/master/lib/cardinality/log_log.js
>>>>>> The test case that generates this dataset is located from line 63 to line 87 at https://github.com/yammer/probablyjs/blob/master/spec/cardinality/log_log_spec.rb
>>>>>>
>>>>>>
>>>>>
>>>>
>>


On Tue, Sep 25, 2012 at 2:07 AM, <gar...@gmail.com> wrote:
> Also looking for doing intersections ... Anybody have any info on howo to do
> that?
>
> Davy

sto...@gmail.com

unread,
Apr 4, 2014, 9:33:50 AM4/4/14
to stream-...@googlegroups.com, timon.k...@gmail.com
Hi

Was intersection (for arbitrary number of HyperLogLogs) ever implemented in Java? I saw the Algebird implementation but would like to avoid the overhead of having a whole new different language dependency. 

Cheers,
-Kristoffer

On Monday, December 17, 2012 11:08:09 PM UTC+1, timon.k...@gmail.com wrote:
Hey Matt,

I've written up a little post on using HyperLogLog for intersections, where I go through what the error looks like and the effect of register count on the accuracy. Hope this helps address some of the "how"s and "how well"s raised in that thread.

Matt Abrams

unread,
Apr 4, 2014, 10:09:29 AM4/4/14
to stream-...@googlegroups.com, Timon Karnezos
Yes, in general HLL intersection in StreamLib works. |A INTERSECT B|
= |A| + |B| - |A UNION B|. Timon's article on intersection is
important to read though. The usefulness of HLL intersection depends
on the features of the HLLs you are intersecting. Quoting from his
article:

"""
The intuition behind this is very simple: if the error of any one of
the terms in your calculation is roughly as large as the true value of
the result then you're not going to estimate that result well.
"""

I do recommend using HLLPlus over HLL.

https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java

It performs significantly better than HLL in terms of size, speed, and
accuracy for both large and small cardinality sets.

Matt
> --
> You received this message because you are subscribed to the Google Groups
> "stream-lib-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to stream-lib-us...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Kristoffer Sjögren

unread,
Apr 4, 2014, 10:39:38 AM4/4/14
to stream-...@googlegroups.com, Timon Karnezos
Thanks Matt. 

Maybe i'm overlooking something obvious, but I cant find how to intersect HyperLogLogPlus instances? Is this the merge operation you're referring to?


You received this message because you are subscribed to a topic in the Google Groups "stream-lib-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stream-lib-user/5E8Ejd2fBCU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stream-lib-us...@googlegroups.com.

Matt Abrams

unread,
Apr 4, 2014, 10:48:43 AM4/4/14
to stream-...@googlegroups.com, Timon Karnezos
You take the cardinality of A plus cardinality of B and then subtract
the cardinality of A merged with B.

So, in psuedo code:

aCard = a.cardinality();
bCard = b.cardinality();
Merged m = new HLLPlus()
m.addAll(a);
m.addAll(b);
mCard = m.cardinality();
intersect = aCard + bCard - mCard

Matt

Kristoffer Sjögren

unread,
Apr 5, 2014, 5:35:55 AM4/5/14
to stream-...@googlegroups.com
Is it possible to calculate intersection for multiple cardinalities as well?

The numbers get way off by simply add another cardinality c.

aCard = a.cardinality();
bCard = b.cardinality();
cCard = c.cardinality();

Merged m = new HLLPlus()
m.addAll(a);
m.addAll(b);
m.addAll(c);
mCard = m.cardinality();
intersect = aCard + bCard + cCard - mCard

Matt Abrams

unread,
Apr 5, 2014, 5:47:58 AM4/5/14
to stream-...@googlegroups.com
Without knowing the characteristics of your dataset I cannot answer
this question. Have you reviewed Timon's blog on intersection to
ensure you are using HLLs with the proper characteristics for
intersection?

http://blog.aggregateknowledge.com/2012/12/17/hll-intersections-2/

Matt

Kristoffer Sjögren

unread,
Apr 5, 2014, 6:20:42 AM4/5/14
to stream-...@googlegroups.com
Yes, I have read the blog on intersection and the characteristics of the sets are roughly as recommended in that post.

I have picked sets of 1.000.000 random murmur hashed numbers with a overlap of 100.000 numbers. 

It works great for 2 sets, around 1-2% error. With 3 sets I get an error over 100%. 



public class Intersection {

  public static int intersectionNum = 500_000;

  public static int aNum = 1_000_000;
  public static int bNum = 1_000_000;
  public static int cNum = 1_000_000;

  private static DecimalFormat format = new DecimalFormat();
  static {
    format.setMaximumFractionDigits(2);
  }

  public static void main(String[] args) throws IOException, CardinalityMergeException {
    for (int i = 0; i < 10; i++) {
      List<Long> intersection = random(intersectionNum);

      HyperLogLogPlus aLog = createLogLog("a", aNum, intersection);
      HyperLogLogPlus bLog = createLogLog("b", bNum, intersection);
      HyperLogLogPlus cLog = createLogLog("c", cNum, intersection);

      HyperLogLogPlus intersectionLog = new HyperLogLogPlus(16);
      for (long id : intersection) {
        aLog.offerHashed(id);
        bLog.offerHashed(id);
        cLog.offerHashed(id);
      }

      intersectionLog.addAll(aLog);
      intersectionLog.addAll(bLog);
      intersectionLog.addAll(cLog);
      System.out.println("a&b&c: " + intersectionLog.cardinality() + " " + getErr(aNum + bNum + cNum +  intersectionNum, intersectionLog.cardinality()));

      long intersectEstimate = (aLog.cardinality() + bLog.cardinality() + cLog.cardinality()) - intersectionLog.cardinality();

      System.out.println("a|b|c: " + intersectEstimate + " " + getErr(intersectionNum, intersectEstimate));
      System.out.println("----");
    }
  }

  public static HyperLogLogPlus createLogLog(String label, int num, List<Long> intersection) {
    HyperLogLogPlus log = new HyperLogLogPlus(16);
    List<Long> ids = random(num);
    for (Long id : Iterables.concat(ids, intersection)) {
      log.offerHashed(id);
    }
    System.out.println(label + ":   " + log.cardinality() + " " + getErr(num + intersectionNum, log.cardinality()));
    return log;
  }

  public static String getErr(long expected, long actual) {
    long diff = expected > actual ? expected - actual : actual - expected;
    return format.format( ((double) diff / expected) * 100) + " %";
  }


  public static List<Long> random(int num) {
    List<Long> list = new ArrayList<>();
    for (int i = 0; i < num; i++) {
      list.add(hash());
    }
    return list;
  }

  public static long hash() {
    byte[] uid = new byte[8];
    new Random().nextBytes(uid);
    return MurmurHash.hash64(uid, 8);
  }
}


Matt Abrams

unread,
Apr 5, 2014, 6:29:27 AM4/5/14
to stream-...@googlegroups.com
You have a small error in the example. Rather than using

a.card + b.card + c.card - intersection.card

You should use either:

ab.card + c.card - intersection.card

or

a.card + bc.card - intersection.card

a.card + b.card + c.card overcounts some elements. Using the
corrected version I get good accuracy with your example. e.g.

a: 1489854 0.68 %
b: 1498156 0.12 %
c: 1491990 0.53 %
a&b&c: 3487997 0.34 %
a|b|c: 500075 0.01 %

Matt

Kristoffer Sjögren

unread,
Apr 5, 2014, 6:46:00 AM4/5/14
to stream-...@googlegroups.com
Ah! That's the trick!

Thanks a bunch for your help!

Kristoffer Sjögren

unread,
Apr 5, 2014, 8:20:43 AM4/5/14
to stream-...@googlegroups.com
BTW, here's an example of a general solution for any number of instances, in case it may be useful for somebody else.


  public static long intersection(LinkedList<HyperLogLogPlus> logs) throws Exception {
    if (logs.size() < 2) {
      throw new IllegalArgumentException("Require at least two instances.");
    }
    HyperLogLogPlus intersectionLog = new HyperLogLogPlus(16);
    for (HyperLogLogPlus log : logs) {
      intersectionLog.addAll(log);
    }
    return intersection(logs.poll(), logs) - intersectionLog.cardinality();
  }

  private static long intersection(HyperLogLogPlus log, LinkedList<HyperLogLogPlus> tail) throws Exception {
    if (tail.size() == 0) {
      return log.cardinality();
    } else if (tail.size() == 1) {
      return log.cardinality() + tail.poll().cardinality();
    }
    log.addAll(tail.poll());
    return intersection(log, tail);
  }

Kristoffer Sjögren

unread,
Apr 7, 2014, 8:16:16 AM4/7/14
to stream-...@googlegroups.com
Hmm, after some tests and second thoughts I doubt the reasoning around the general solution since the order and cardinality of hyperloglog instances affect the intersection calculation. 

Here are some numbers on real data where there are 3 sets with around 40k elements each (the number in parentheses is the estimation).

A n B = 9k (9k)
A n C = 9k (9k)
B n C = 38k (38k)
(A n B) n C = 9k (38k)
(C n A) n B = 9k (38k)
(B n C) n A = 9k (9k)

The fact that there is almost a full overlap between B and C will only "even-out" if we take the bc.card + a.card - intersection.card.

Am I missing something obvious here or is this the "expected error" as mentioned in the blog post mentioned earlier?


>> >> >> >&
...
Message has been deleted

nghiap...@gmail.com

unread,
Jun 26, 2014, 1:42:28 PM6/26/14
to stream-...@googlegroups.com
I think the Inclusion-Exclusion principle can apply here. And I got 20% error when using it.

coo...@gmail.com

unread,
Mar 20, 2015, 8:07:57 AM3/20/15
to stream-...@googlegroups.com, nghiap...@gmail.com
do you mind giving an example for Exclusion using HLL?
...
Reply all
Reply to author
Forward
0 new messages