Merging StreamSummary objects


Jerome Banks

Sep 24, 2013, 8:20:38 PM
to stream-...@googlegroups.com

 Hey,
   We really like the stream-lib project, but have a few questions.

  We're trying to write a Hive UDAF to calculate an estimated top-k for heavy hitters, but we're running into an issue.
 We're currently using the StreamSummary class, but there is no way to merge two different StreamSummary objects.
   Is it safe to just merge the internal data structures? Will this cause a loss of precision?

 thx... 
--- jerome

Eugene Kirpichov

Sep 24, 2013, 10:29:33 PM
to stream-...@googlegroups.com
Hi,

A StreamSummary is essentially a table mapping element -> counter, kept sorted by counter. You should be able to merge one StreamSummary into another by simply traversing one and inserting all of its elements with their counts into the other, without loss of precision.
I agree that this should be part of the public API. Pull request welcome, this should be a very easy change :)
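
In code, that traversal could look something like the following (a minimal sketch against the topK/offer methods in the public API; untested, and note that it does not carry per-entry error over, and the int cast assumes individual counts fit in an int):

    import com.clearspring.analytics.stream.Counter;
    import com.clearspring.analytics.stream.StreamSummary;

    public class StreamSummaryMerge {
        // Merge `other` into `target` by re-offering every tracked element
        // with its observed count, using only the public API.
        public static <T> void merge(StreamSummary<T> target, StreamSummary<T> other) {
            for (Counter<T> counter : other.topK(other.size())) {
                target.offer(counter.getItem(), (int) counter.getCount());
            }
        }
    }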


--
Eugene Kirpichov
Google, distributed computing frameworks
http://www.linkedin.com/in/eugenekirpichov
http://jkff.info/software/timeplotters - my performance visualization tools

Jerome Banks

Sep 25, 2013, 1:12:29 PM
to stream-...@googlegroups.com

Thanks for the response!
  I'll implement it in my fork and send a pull request.

 thx ...
 --- jerome


nvol...@llnw.com

Apr 8, 2015, 11:02:53 AM
to stream-...@googlegroups.com
Hi,

As I understand it, this issue (merging one StreamSummary into another) has been open for half a year now. The latest version of stream-lib (2.8.0) doesn't expose a method in its API to merge two structures. Please correct me if I missed something.

Q: Is there a way to merge one StreamSummary into another without loss of precision?

Looking forward to your response.


Ian Barfield

unread,
Apr 8, 2015, 6:45:53 PM4/8/15
to stream-...@googlegroups.com
The pull request you see mentioned in the previous email was not forthcoming. I think the previously described strategy of summing counts should more or less work, but I think you would also need to increment the "error" value on the counter node -- if your use case ever looks at that anyway. If not, then you could do something like getK(size()) and then iterate over it, adding values. Just a few lines of code. It wouldn't be terribly efficient, but that doesn't appear to be a new trait for that class anyway.
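
To sketch that out (with the caveat that the error part would have to live inside a fork of StreamSummary itself, since neither the counter map nor a counter's error field is writable from outside -- the internal names below are from my reading of the source, so treat this as pseudocode rather than anything the 2.8.0 API supports):

    // Hypothetical merge method added inside a forked StreamSummary<T>;
    // counterMap and Counter.error are internals, not public API.
    public void merge(StreamSummary<T> other) {
        for (Counter<T> c : other.topK(other.size())) {
            // Re-offer the element with its merged count (assumes it fits in an int).
            offer(c.getItem(), (int) c.getCount());
            // Fold the source counter's error into the target counter's error.
            ListNode2<Counter<T>> node = counterMap.get(c.getItem());
            if (node != null) {
                node.getValue().error += c.getError();
            }
        }
    }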

We would still accept a PR adding a merge function directly to stream-lib if you are interested in making one.

with respect to precision: my intuition is that I would describe such a merge as lossy. If you disabled the size limits for the resultant object, the actual merge operation would not cause precision loss per se, but I think the implicit initial division of the data stream introduces loss for this structure. Therefore, you should probably not heedlessly subdivide and merge this structure on the premise that a merge could be done "without lack of precision".
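
If you want to gauge that loss on your own data rather than reason about it, a quick (unscientific) harness is to feed one summary the whole stream and two summaries each half, merge the halves, and compare the top-k lists. A sketch, using a made-up skewed stream:

    import com.clearspring.analytics.stream.Counter;
    import com.clearspring.analytics.stream.StreamSummary;
    import java.util.Random;

    public class MergeLossCheck {
        public static void main(String[] args) {
            int capacity = 100;
            StreamSummary<String> whole = new StreamSummary<>(capacity);
            StreamSummary<String> left = new StreamSummary<>(capacity);
            StreamSummary<String> right = new StreamSummary<>(capacity);
            Random rnd = new Random(42);
            for (int i = 0; i < 1_000_000; i++) {
                // Synthetic skewed stream: low ids are much more frequent.
                String item = "item-" + (int) (1000 * Math.pow(rnd.nextDouble(), 3));
                whole.offer(item);
                (i % 2 == 0 ? left : right).offer(item);
            }
            // Merge `right` into `left` by re-offering counts (error not carried over).
            for (Counter<String> c : right.topK(right.size())) {
                left.offer(c.getItem(), (int) c.getCount());
            }
            System.out.println("whole stream:  " + whole.topK(10));
            System.out.println("merged halves: " + left.topK(10));
        }
    }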


nvol...@llnw.com

Apr 9, 2015, 5:06:22 AM
to stream-...@googlegroups.com
Hi Ian,

Thanks for the nice response.

Let me briefly describe my use case:
* the input load is huge, so a distributed architecture is used
* as a result, the input data flows through multiple instances of StreamSummary
* at the end, all the StreamSummary instances should be merged into one "without loss of precision"

If it is possible to cover this use case without reducing the number of StreamSummary instances to one (in my case that would mean routing all input data to a single node and disabling any pre-batching), that would be ideal!

From your response I understand that I should minimize the number of StreamSummary instances.

Based on the public stream-lib API, there is no way to increment a counter's "error" value.

As I understand it, StreamSummary is a self-adjusting structure, and merging such structures "without loss of precision" is a challenge.

I don't have enough knowledge of probabilistic structures yet, so I have one more question:
Q: Is it possible to merge multiple instances of StreamSummary (with the same capacity) "without loss of precision"?

If yes, I will study probabilistic structures more deeply, implement this, and send you a PR. This feature is quite important for my use case.

Thanks in advance,
Nazarii Volynets

Ian Barfield

Apr 9, 2015, 4:40:27 PM
to stream-...@googlegroups.com
I can tell you that in practice, doing something like keeping the top 1000 events and then simply merging the hits of each summary's, say, top 100k gives reasonable results. FWIW, I glanced over the paper that is referenced in the source before replying and found nothing about merging. I think it is probably a very safe assumption that merging that includes summing "error" counts will be accurate, but I am still a little suspicious that it might result in higher possible error counts than using a single base object. It is fairly easy to imagine cases that go both ways, so it could just be one of those tricky properties, I suppose.

Another consideration is whether your distribution of processing is sharded on the field you are summarizing -- or if there is even a correlation bias. I expect in most cases it hardly matters anyway though.

harsha...@gmail.com

May 5, 2015, 11:51:35 AM
to stream-...@googlegroups.com

This paper has a solution for merging MG and SpaceSaving sketches while preserving accuracy.

Hope this is relevant.

Thanks,
Sriharsha

Ian Barfield

May 18, 2015, 11:59:30 AM
to stream-...@googlegroups.com
hey, sorry for the slow reply. I was on vacation actually.

I read (most of) the paper, or at least as it relates to merging "space saving" sketches. A couple of things to note: they assume perfectly sharded data, which I would have thought would make the merging extremely easy. The deterministic algorithm described in the paper seems to be roughly:

- merge all the shards as if K was unlimited
- prune bottom entries by subtracting from higher/remaining ones

Possibly they are assuming that error counts are not tracked per entry. In any event, this is what I might call an "aggressive" algorithm, in that it aggressively moves towards its higher error bound. There are a number of randomization-based strategies that follow, which I assume attempt to address the more egregious cases. The end result is a sketch whose error bounds are based on the aggregate count of objects, so from the perspective of one of the original sketches there will still be "precision loss" (the error is basically added together, as you'd expect).
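
For concreteness, here is my rough reading of that deterministic step in plain-map form (a paraphrase of the paper, not stream-lib code, so the details are mine):

    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class DeterministicMerge {
        // Sum counts across shards as if k were unlimited, then subtract the
        // (k+1)-th largest combined count from every entry and drop anything
        // that falls to zero or below (Misra-Gries-style pruning).
        public static <T> Map<T, Long> merge(List<Map<T, Long>> shards, int k) {
            Map<T, Long> combined = new HashMap<>();
            for (Map<T, Long> shard : shards) {
                shard.forEach((item, n) -> combined.merge(item, n, Long::sum));
            }
            if (combined.size() <= k) {
                return combined;
            }
            List<Long> counts = combined.values().stream()
                    .sorted(Comparator.reverseOrder())
                    .collect(Collectors.toList());
            long threshold = counts.get(k); // the (k+1)-th largest, 0-indexed
            Map<T, Long> pruned = new HashMap<>();
            combined.forEach((item, n) -> {
                if (n > threshold) {
                    pruned.put(item, n - threshold);
                }
            });
            return pruned;
        }
    }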

Many probabilistic algorithms suffer from a large disparity between the sometimes-known "provable error bounds" and the much more commonly experienced "things look pretty okay to me". In many cases, if an implementation spit out the worst possible allowed value, the user would be extremely shocked and unhappy.

In general, if you can be assured that your data is sharded on the input key, then it is pretty safe, if not downright beneficial, to shard and then merge using methods either from the paper or as described earlier. If the data is sharded on some other value, then there is more potential for trouble if each shard had a high error count... but that information would at least be available, and there could be some error-vs-weight valuation during the merging if you wanted. I don't think the paper attempted to address that case, but maybe someone else has.