TopK on a weighted list

55 views
Skip to first unread message

r...@algolia.com

unread,
Oct 19, 2016, 8:03:00 AM10/19/16
to algebird
Hi,

I'm trying to compute a TopK of a List of objects. Each object is weighted by a value. As an exemple let's take a list of word and the number of occurence of it. In my list a word could appear multiple times. Ex:

val list = List(
 
("a", 1),
 
("the", 2),
 
("dragon", 1),
 
("the", 1),
 
("a", 3),
 
("word", 2)
)

And I would like to get a TopK, let's take K=2 for the exemple:

someMagicTopK(2, list) => List(("a", 4), ("the", 3))


For various reasons we need to use an Aggregator, and we don't care about exact values. How can we achieve that?

We've tried TopNCMSAggregator, but we do not see how to use this weight. We also tried SketchMap but the result of the aggregator doesn't give the count for each occurrences.

Thanks for your help.

P. Oscar Boykin

unread,
Oct 21, 2016, 2:06:22 AM10/21/16
to r...@algolia.com, algebird
You can just add the item N times. There is a +(k: K, cnt: Long) method. Try that.

Does that work?
--
You received this message because you are subscribed to the Google Groups "algebird" group.
To unsubscribe from this group and stop receiving emails from it, send an email to algebird+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

r...@algolia.com

unread,
Oct 21, 2016, 6:13:02 AM10/21/16
to algebird, r...@algolia.com
If you are talking about the method in TopCMSInstance, I've already tried but stopped. 
I needed to create a custom TopNCMSMonoidTopCMSZeroTopCMSItemTopCMSInstance, and link them all together, which resulted to a lot of copy/pasting.

And it felt weird. Do you concur that I'm on the right path? Did I miss something?

Thanks,

Koert Kuipers

unread,
Oct 22, 2016, 1:22:00 PM10/22/16
to r...@algolia.com, algebird
scala> val list = List(

     |   ("a", 1),
     |   ("the", 2),
     |   ("dragon", 1),
     |   ("the", 1),
     |   ("a", 3),
     |   ("word", 2)
     | )

scala> val k = 2
k: Int = 2

scala> val agg = Aggregator.fromSemigroup(SpaceSaver.spaceSaverSemiGroup[String]).composePrepare{ (x: (String, Int)) => SpaceSaver(k * 5, x._1, x._2) }.andThenPresent{ ss => ss.topK(k).map{ case (string, approx, _) => (string, approx.estimate) }}
agg: com.twitter.algebird.Aggregator[(String, Int),com.twitter.algebird.SpaceSaver[String],Seq[(String, Long)]] = com.twitter.algebird.Aggregator$$anon$15@36388414

scala> agg(list)
res6: Seq[(String, Long)] = List((a,4), (the,3))


To unsubscribe from this group and stop receiving emails from it, send an email to algebird+unsubscribe@googlegroups.com.

r...@algolia.com

unread,
Oct 24, 2016, 11:02:16 AM10/24/16
to algebird, r...@algolia.com
Thanks a lot! It works nicely.
Reply all
Reply to author
Forward
0 new messages