I've pasted the thread from our pre-stream-lib-user email chain here
for posterity. You might also look at the code Twitter just open
sourced, which has an estimateIntersectionSize method. I'll likely
add that to our implementation unless someone beats me to it.
https://github.com/echen/algebird/blob/master/src/main/scala/com/twitter/algebird/HyperLogLog.scala
Hi Matt,
I can confirm that |A INTERSECT B| = |A| + |B| - |A UNION B| works
really well. I confirmed this by testing hierarchical data with 2B
uniques: uniques per GEO with a breakdown by gender.
Two quick questions:
It seems that the HLL accuracy is not consistent and depends on the
UID distribution. In my test app, I'm generating sequential IDs
(starting from 1 up to 2B), and as a result the distribution is
clustered within each segment. This is obviously not the natural
distribution you'd expect in real life. Do you think that is why, for
example, the HLL for GEO1->Female has a 3% error while GEO2->Female
has only a 0.5% error, although both contain cardinalities of the
same order?
The second question: I saw in one of the HLL talks that, in addition
to intersection, there is another important feature, a DIFF between
two HLLs. Do you agree, and if so, can you provide an example use
case where DIFF operations might come in handy?
Many Thanks!
Chen.
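[On the DIFF question, for what it's worth: set difference follows from the same identities, since |A DIFF B| = |A| - |A INTERSECT B| = |A UNION B| - |B|, so no new sketch machinery is needed beyond unions. A typical use case would be estimating uniques seen in GEO1 but not in GEO2, or users active last month but not this month.]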
On Wed, Jul 18, 2012 at 4:19 PM, Matt Abrams <abr...@gmail.com> wrote:
---------- Forwarded message ----------
From: Josh Ferguson <jfer...@yammer-inc.com>
Date: Tue, Apr 17, 2012 at 5:51 PM
Subject: Re: Intersection of HyperLogLog register sets
To: Matt Abrams <abr...@gmail.com>
Cc: Chris Burroughs <chris.b...@gmail.com>
The intersection approach won't work exactly, due to some
double-counting problems. The best way to think about these things, I
guess, is as compressed inverted indices. Our approach now will be to
build sets of data that include common dimensions we want to slice and
dice together. You can think of this as coming up with sets of indices
for database tables that you want to use in predicates together.
For instance, one of our inverted indices for our "click stream"
events dataset might look like:
day, controller, action, client_id, user_agent_name,
user_agent_version, user_agent_platform, continent, country, users,
event_count
The users field would hold the probabilistic registers. For every
distinct combination of values of these fields we'd have one bitset
for the users that were contained in that set. From this we could
simply do filters and groupings, with estimation over unioned
registers as the aggregate function. We could do this on any subset of
the fields contained in this table. The hardest part at this point is
figuring out what our front-end use cases are, so that we can build
the best set of dimension groupings possible. It might also be nice to
write some custom Postgres aggregate functions in order to do a lot of
the front-end reporting calculations in the database instead of
rewriting query logic programmatically.
Josh Ferguson
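[A rough sketch of the filter/group flow Josh describes above, assuming a HyperLogLog object with clone(), merge() (register-wise max), and cardinality() methods; these names are illustrative, not any particular library's API:

// rows from the inverted index: one register set per distinct
// combination of dimension values
function uniqueUsersByCountry(rows) {
  var byCountry = {}
  rows.forEach(function(row) {
    if (byCountry[row.country]) {
      byCountry[row.country].merge(row.users)   // union = register-wise max
    } else {
      byCountry[row.country] = row.users.clone()
    }
  })
  var result = {}
  for (var country in byCountry) {
    result[country] = byCountry[country].cardinality()
  }
  return result
}

The same shape would work as the custom Postgres aggregate Josh mentions: the accumulator state is the merged register set, and the final function is the cardinality estimate.]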
On Apr 17, 2012, at 11:00 AM, Matt Abrams wrote:
> Josh -
>
> That sounds like an interesting approach. Please let us know how it works out.
>
> I created a Java version of HLL using your JavaScript implementation as a model:
>
>
> https://github.com/clearspring/stream-lib/blob/hyperloglog/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLog.java
>
> Re the storage size issue I mentioned in the other email: I think if
> you use this:
>
> Math.floor(count / 6.0)
>
> it will size the array correctly.
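[For reference, a sketch of the sizing arithmetic as I read the code quoted below, assuming 5-bit registers packed into 32-bit buckets (six registers per bucket, two bits unused):

var REGISTER_WIDTH = 5
var BUCKET_WIDTH = 32
var registersPerBucket = Math.floor(BUCKET_WIDTH / REGISTER_WIDTH)   // 6

// one bucket for every started group of six registers
function bucketsNeeded(count) {
  return Math.ceil(count / registersPerBucket)
}
// bucketsNeeded(256) === 43, covering bucket positions 0..42
]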
>
> Regards,
> Matt
>
> On Mon, Apr 16, 2012 at 2:24 PM, Josh Ferguson <jfer...@yammer-inc.com> wrote:
>> There could certainly be bugs. I just started implementing and testing it last week, and getting all the crazy math to match up exactly is hard for my soft little brain. I'll look into it some more and see if I can write a test case that reproduces it.
>>
>> What we're going to do is essentially try to create partitioned datasets around certain dimensions of our data. So, for instance, we have event logging where we log all kinds of information about users and the actions they take. We have a few general sets of dimensions around these, as follows:
>>
>> Time
>> month, day, hour
>>
>> Events
>> event name, controller, action
>>
>> Client / User Agent
>> client, user agent name, user agent version, user agent platform
>>
>> Location
>> continent, country, region, closest large city (pop > 500,000)
>>
>> For every distinct combination of values in these dimension sets we're going to keep a set of registers for user_id and event_id. This will allow us to count the number of users and number of events that occurred. We can then use unions and intersections to calculate various things.
>>
>> For instance, say I want to know the estimated # of users in the past 28 days who posted a message, grouped by country.
>>
>> First, we'd union estimators within each of the dimension sets:
>>
>> Union the sets of users over the past 28 days from 'Time'
>> Union the sets of users grouped by country from 'Location'
>> Union the sets of users with event name 'message_creation' from 'Events'
>>
>> We then combine them in a grouped fashion where the aggregate is intersection to get the final grouped estimated dataset.
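[A sketch of that final combination step: the pairwise identity quoted further down in this thread extends to three sets by inclusion-exclusion. Assuming the same illustrative clone()/merge()/cardinality() interface as above:

// |A n B n C| = |A| + |B| + |C|
//             - |A u B| - |A u C| - |B u C|
//             + |A u B u C|
function estimateIntersection3(a, b, c) {
  function unionCount() {
    var acc = arguments[0].clone()
    for (var i = 1; i < arguments.length; i++) acc.merge(arguments[i])
    return acc.cardinality()
  }
  return a.cardinality() + b.cardinality() + c.cardinality()
    - unionCount(a, b) - unionCount(a, c) - unionCount(b, c)
    + unionCount(a, b, c)
}

One caveat: each term carries its own estimation error, so the relative error on a small intersection of large sets can grow quickly.]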
>>
>> Our approach is basically like building a probabilistic database, but I think the benefit here is that you can query *very* large datasets in near real-time and build out some really great dashboard and analytics functionality. I think we're going to be using CrossFilter (formerly Tesseract) to help us achieve a lot of the more database-type functionality within our datasets and just pass in our own aggregate functions to do the unions and intersections.
>>
>> What is your thought on this?
>>
>> Josh
>>
>>
>> On Apr 16, 2012, at 6:30 AM, Matt Abrams wrote:
>>
>>> Josh -
>>>
>>> Thanks, this is really helpful. I was running into the same
>>> overestimation problem this weekend but hadn't found a solution yet.
>>> I will try to implement your approach today.
>>>
>>> Also, I was looking at your implementation and I'm not sure I
>>> understand it completely. In register_set.js you set the storage size
>>> for your buckets based on the function:
>>>
>>> var storage = Math.floor(5 * this.count / this.bitCount)
>>>
>>> if(storage == 0)
>>>   this.buckets = new Array(1)
>>> else if(storage % this.bitCount == 0)
>>>   this.buckets = new Array(storage)
>>> else
>>>   this.buckets = new Array(storage + 1)
>>>
>>> so for count = 256 you will end up with storage = 40 and 41 buckets.
>>>
>>> Later the estimate method of estimator.js runs:
>>>
>>> for(var j = 0; j < registers.count; j++) {
>>>   r_sum += Math.pow(2, (-1 * registers.get(j)))
>>> }
>>>
>>> So we would pass a number like 255 to registers.get(j) and that method
>>> calculates the bucket position based on j, so it would be:
>>>
>>> var bucketPos = Math.floor(position / this.registersPerBucket)
>>>
>>> and we'd get bucketPos = 42. JS transparently adds the extra buckets,
>>> but I don't think that is intended.
>>>
>>> Did I interpret this correctly?
>>>
>>> Thanks,
>>> Matt
>>>
>>>
>>> On Sun, Apr 15, 2012 at 9:45 PM, Josh Ferguson <jfer...@yammer-inc.com> wrote:
>>>> I emailed Eric Fusy, one of the authors. The answer is, of course, to remember from set theory that
>>>>
>>>> |A INTERSECT B| = |A| + |B| - |A UNION B|
>>>>
>>>> So, given that we can estimate individual register sets and merge registers to produce the union, we can also get good results for the intersection. Eric said they had seen very good results with this method. I'm writing some unit tests around my JS implementation to see what kind of error rates we see at different register sizes.
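[A minimal sketch of that identity, again with the illustrative clone()/merge()/cardinality() names used in the notes above:

function estimateIntersection(a, b) {
  var union = a.clone()
  union.merge(b)   // union = register-wise max
  return a.cardinality() + b.cardinality() - union.cardinality()
}
// With Josh's test data below (two 1000-item sets sharing 500 items),
// a good estimator should come out near 1000 + 1000 - 1500 = 500
]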
>>>>
>>>> Josh Ferguson
>>>>
>>>> On Apr 15, 2012, at 3:46 PM, Chris Burroughs wrote:
>>>>
>>>>> I can't recall if the intersection is possible per the literature.
>>>>> I'll try to look it up. Matt Abrams has been working on the HyperLogLog
>>>>> integration and may be able to elaborate.
>>>>>
>>>>> On 04/12/2012 09:06 PM, Josh Ferguson wrote:
>>>>>> Have you done any work on taking the intersection of two loglog register sets? The union is represented by your merge function, wherein you take the max of each register position, but taking the minimum of each one doesn't produce an intersection. It actually tends to systematically over-estimate the size of the intersection, which leads me to believe there is some kind of probabilistic way to quantify or re-adjust the results. It may be that MIN{r_1, r_2, r_3} is not the right function to apply. Any ideas?
>>>>>>
>>>>>> I did a test as follows:
>>>>>>
>>>>>> I created 2 datasets, each of 1000 distinct items that have a 500 item overlap.
>>>>>> I created 4 sets of registers using the HyperLogLog formula from Figure 3 of the paper.
>>>>>>
>>>>>> e1 and e2 are the registers for the first and second 1000 item datasets respectively.
>>>>>> e3 combines e1 and e2 by taking the minimum value of the registers across every position
>>>>>>
>>>>>> e4 is the set of registers created by first taking the intersection of the sets and then running the algorithm normally.
>>>>>>
>>>>>> e1: [ 5, 8, 6, 10, 10, 6, 7, 7, 4, 12, 6, 7, 9, 8, 7, 8 ]
>>>>>> e2: [ 10, 8, 6, 10, 5, 6, 7, 6, 6, 5, 5, 7, 6, 9, 7, 8 ]
>>>>>> -----------------------------
>>>>>> e3: [ 5, 8, 6, 10, 5, 6, 7, 6, 4, 5, 5, 7, 6, 8, 7, 8 ]
>>>>>> e4: [ 4, 8, 6, 10, 4, 5, 7, 4, 4, 5, 4, 7, 6, 5, 7, 8 ]
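[Worth noting in these numbers: e4 is less than or equal to e3 at every position (compare positions 0, 4, 7, and 13). That is the systematic overestimate made concrete: the true intersection contains a subset of either set's elements, so each of its registers can only be less than or equal to the register-wise minimum of the two sketches, and uniformly larger registers mean a larger estimate.]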
>>>>>>
>>>>>> The code of my implementation is located at
>>>>>> https://github.com/yammer/probablyjs/blob/master/lib/cardinality/log_log.js
>>>>>> The test case that generates this dataset is located from line 63 to line 87 at
>>>>>> https://github.com/yammer/probablyjs/blob/master/spec/cardinality/log_log_spec.rb
>>>>>>
>>>>>>
>>>>>
>>>>
>>
On Tue, Sep 25, 2012 at 2:07 AM, <gar...@gmail.com> wrote:
> Also looking to do intersections ... Anybody have any info on how to do
> that?
>
> Davy