Word count sort by value.


Dennis Kluge

unread,
Jun 13, 2013, 9:02:25 AM6/13/13
to spark...@googlegroups.com
Hello Everybody, 

I'm pretty new to Spark and Map Reduce. While playing around with some data and the word count example I wondered how I could sort the result by values. The Hadoop implementation seems to be a little bit tricky and unintuitive: http://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/ .

But I think there must be a much easier way with Spark. 

Thank you and regards from Berlin. 

Bhaskar Karambelkar

unread,
Jun 13, 2013, 12:02:52 PM6/13/13
to spark...@googlegroups.com
Hi, I'm very new to Spark as well, so if someone with more experience wants to override my answer, please do.
But from what I've figured out, it has to be done on the driver side, i.e. the sorting happens after you've collected the RDD into the driver process.

for e.g.

rdd.collect().toSeq.sortBy(_._2)

This worked for me for an RDD of key/value tuples.
The thing to remember is that this runs on the driver, so it is not distributed; the RDD you collect needs to be fairly small, which should be the case for most MR jobs anyway.
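Putting the whole thing together, a driver-side sort might look like this (a sketch; `sc` is assumed to be an existing SparkContext and `input.txt` a hypothetical input file):

```scala
// Word count, then sort by count on the driver after collect().
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Driver side: collect() pulls the whole result into local memory,
// so this only works when the result set is small.
val sorted = counts.collect().toSeq.sortBy(_._2)
sorted.foreach(println)
```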

best regards
Bhaskar

Ian O'Connell

unread,
Jun 13, 2013, 12:23:00 PM6/13/13
to spark...@googlegroups.com
I think the discussion at

is what you want; there is some example code there too.

Bhaskar Karambelkar

unread,
Jun 13, 2013, 3:32:02 PM6/13/13
to spark...@googlegroups.com
Rereading your original question, I am a bit confused.

Do you want to do a 'secondary sort', i.e. sort the values arriving in the reduce phase? Or do you want the final output sorted by word count?

A secondary sort doesn't make sense for word count: if all you want to do is count, why would the order of values coming into the reducer matter?

If you do want the final output sorted on values, then as I said in my first mail, you'll have to do it after you 'collect' the output.

thanks
Bhaskar

Dennis Kluge

unread,
Jun 13, 2013, 4:20:01 PM6/13/13
to spark...@googlegroups.com
Yes, I'd like to have the output sorted. This would be useful, e.g., for finding the most used words in a text. I've found a solution with a second map which swaps the tuple and then sorts by key:

foobar.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .map(item => item.swap)
      .sortByKey()

This works without any collect.
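For "most used words" you'd probably want the largest counts first; sortByKey takes an ascending flag for that (a sketch, assuming `foobar` is an RDD of text lines):

```scala
// Swap to (count, word), then sort descending so the most
// frequent words come first; take(10) grabs the top ten on the driver.
val topWords = foobar.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .map(_.swap)
  .sortByKey(ascending = false)
  .take(10)
```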

Bhaskar Karambelkar

unread,
Jun 13, 2013, 5:56:55 PM6/13/13
to spark...@googlegroups.com
Provided no two words have the same count, otherwise you'll have problems when you swap, no?

Patrick Wendell

unread,
Jun 13, 2013, 6:01:41 PM6/13/13
to spark...@googlegroups.com
No - I think it's fine as is. sortByKey won't coalesce two words with the same count... they'll remain separate.
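A quick local illustration of the same point with plain Scala collections (not Spark, just to show that duplicate keys survive a sort):

```scala
// Two words share the count 2; sorting by count keeps both entries.
val counts = Seq((2, "spark"), (1, "hadoop"), (2, "scala"))
val sorted = counts.sortBy(_._1)
// sortBy is stable, so the two count-2 entries keep their relative order:
// List((1,hadoop), (2,spark), (2,scala))
println(sorted)
```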

- Patrick


--
You received this message because you are subscribed to the Google Groups "Spark Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-users...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 
