Summable top-k Sketch

386 views
Skip to first unread message

harsha...@gmail.com

unread,
May 12, 2014, 12:29:53 PM5/12/14
to stream-...@googlegroups.com
Hi,

Can you suggest any stream sketches for maintaining top-k while still being summable?

- I have looked at StreamSummary and Stochastic topper sketches (implementing ITopK) for maintaining top-k elements but they are not summable (mergable).

- Another alternative is to maintain the top-k elements along with a count-min sketch. While the count-min sketch is summable, the merge operation of two top-k sets does not seem to be trivial. For instance, we can define the merge operation on two top-k sets by estimating the counts of each of the elements from both the top-k sets (via the merged count-min sketch). Based on these estimates, we can then select the top-k among all the 2k set of elements. This merge operation may miss some of elements that are not present in either of the two top-k sets. Could there be a better solution?

Thanks,
SG

Travis Brady

unread,
May 12, 2014, 12:39:05 PM5/12/14
to stream-...@googlegroups.com
Couldn't you merge two StreamSummary's by creating a third StreamSummary object c and then calling topK on a and b and then c.offer(item, item.count)?


--
You received this message because you are subscribed to the Google Groups "stream-lib-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stream-lib-us...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

harsha...@gmail.com

unread,
May 13, 2014, 1:48:23 PM5/13/14
to stream-...@googlegroups.com
Thank you Travis.

For StreamSummary, if an item is not present in the topk of 'a' and topk of 'b', it cannot be the top-k in c (even if its real count suggests that it should belong to topk in c). With count-min sketch we have the same problem. However, in case count-min sketches, an item with small counts in individual sketches can eventually rise up the ranks and belong to the top-k in the merged sketch. (This is because the counters of the buckets corresponding to that item get incremented for every merge). I am not sure if this property holds for StreamSummary as well.

Thanks,
SG

Travis Brady

unread,
May 13, 2014, 2:03:28 PM5/13/14
to stream-...@googlegroups.com

But that doesn't mean StreamSummary isn't mergeable.

That's an inherent limitation of any solution to top k, no?
Recommendations are typically to set capacity to 2x your expected k.  In my case I use one StreamSummary obj per 5 minute window of time to track top hashtags, images, phrases, etc on twitter. To avoid the problem you mention I set capacity to 25 even though I only need the top 5 or 10.

Travis

harsha...@gmail.com

unread,
May 14, 2014, 1:28:14 PM5/14/14
to stream-...@googlegroups.com
Thank you Travis.

I agree that is an inherent problem for any top-k solution.  Indeed, StreamSummary is the recommended Sketch for top-k as per (http://www.vldb.org/pvldb/1/1454225.pdf)However, the degradation in accuracy due to such merges has probably not been studied before or compared across different sketches. After going though the options again, StreamSummary does seem more appealing.

remzic...@gmail.com

unread,
Jul 10, 2015, 10:45:41 AM7/10/15
to stream-...@googlegroups.com, harsha...@gmail.com
I will implement a distributed solution for the top-k problem so need a mergeable structure. The first idea was using count min sketch but then I realized that I can use SpaceSaving algorithm like Travis suggested. There are a lot of papers about the comparison between these two alternatives in terms of speed, space and precision but they are mostly related with local implementations. SpaceSaving seems better than the count min sketch but I dont know the case when more than one summaries are merged. Probably, speed and space will be in favor of SpaceSaving again but can you suggest any resource for the comparison between the alternatives in terms of the precision or do you have any other suggestions?

Thanks

14 Mayıs 2014 Çarşamba 20:28:14 UTC+3 tarihinde harsha...@gmail.com yazdı:

sriharsh...@gmail.com

unread,
Jul 10, 2015, 3:40:55 PM7/10/15
to stream-...@googlegroups.com, remzic...@gmail.com
  There is a recent paper on merging space saving summaries (and MG sketches). http://www.cs.utah.edu/~jeffp/papers/merge-summ.pdf . I also made a short post that about it that may be helpful: http://virtualtangent.blogspot.com/2014/06/merging-stream-summary-summaries.html .

Thanks!

remzic...@gmail.com

unread,
Jul 13, 2015, 4:11:53 AM7/13/15
to stream-...@googlegroups.com, sriharsh...@gmail.com
Thank you. They will be very useful for me.

10 Temmuz 2015 Cuma 22:40:55 UTC+3 tarihinde sriharsh...@gmail.com yazdı:
Reply all
Reply to author
Forward
0 new messages