What are the possible ways to return KEY, VALUE pairs from map function in Spark?

Gaurav Dasgupta

Oct 14, 2012, 4:31:07 PM
to spark...@googlegroups.com
Hi,

I am trying to write MapReduce-style code in Spark where my map function takes a String (a row from an RDD) as input and should return a set of (K,V) pairs. For instance, here is an example input to my map function:

1 2:3,3:2,4:5|0|1

From this input I want to form (k,v) pairs like the following:

(1,2:3,3:2,4:5|0|2)
(2,|3|1)
(3,|2|1)
(4,|5|1)

What I currently do is build a String like this from my input row:

2, |3|1@3, |2|1@4, |5|1@1 2:3,3:2,4:5|0|1

In the main function, rdd.map(x => myMapFunction(x)) gives me the above output.
Then I use rdd.map(x => myMapFunction(x)).flatMap(x => x.split("@")) to transform it into the following:

2 |3|1
3 |2|1
4 |5|1
1 2:3,3:2,4:5|0|2

And finally, I use rdd.map(x => myMapFunction(x)).flatMap(x => x.split("@")).map(x => (x.split(" ")(0), x.split(" ")(1))) to form this:

(2,|3|1)
(3,|2|1)
(4,|5|1)
(1,2:3,3:2,4:5|0|2)

I am new to both Scala and Spark.
I want to know what the possible ways are to return multiple (k,v) pairs and use the map() function on the RDD directly to get the desired output.
I can return the multiple (k,v) pairs as an Array or List (please suggest if there is any other way). But then how can I use the map() function on the RDD to get the desired output?


Thanks,
Gaurav

Matei Zaharia

Oct 14, 2012, 4:40:46 PM
to spark...@googlegroups.com
Hi Gaurav,

flatMap is the version of map that can return multiple output values. You can just use that directly where you're using map() now, and return an array or list of key-value pairs from it. Try doing it for a normal Scala collection in the Scala shell -- the Scala collection API works the same way as the Spark one for these methods.
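
To make the difference concrete, here is a minimal sketch on a plain Scala List, with no Spark involved. The names FlatMapDemo and toPairs are made up for illustration and are not from the original message. It shows that map produces one (nested) result per input, while flatMap concatenates the returned lists into a single flat list of pairs.

```scala
// Plain Scala collections only; can be pasted into the Scala shell.
object FlatMapDemo {
  // Hypothetical helper: turns a line such as "a b" into a list of (word, 1) pairs.
  def toPairs(line: String): List[(String, Int)] =
    line.split(" ").toList.map(word => (word, 1))

  def main(args: Array[String]): Unit = {
    val lines = List("a b", "c")

    // map keeps exactly one output per input, so the result is a list of lists.
    val nested: List[List[(String, Int)]] = lines.map(toPairs)

    // flatMap concatenates the returned lists into one flat list of pairs.
    val flat: List[(String, Int)] = lines.flatMap(toPairs)

    println(nested)
    println(flat)
  }
}
```

On an RDD the call has the same shape, so a function that returns a List or Array of pairs can be passed to flatMap in place of map.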

Also, in case it's not clear, you can use a multi-line function in your calls to map and flatMap, or even a function defined outside as a def. There's no need to have everything fit into one small expression and chain calls together. Try writing your function from a record to the list you want in a single function, and calling that in a flatMap.
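
As a rough sketch of that suggestion, the function below is written as a standalone def and goes from one record straight to a list of (key, value) pairs. Everything in it is an assumption: expand is a hypothetical stand-in for myMapFunction, and the row layout ("id", a space, then "neighbor:weight" entries followed by two "|"-separated fields) and the rule of re-emitting the row with its last field set to 2 are inferred only from the example in the question, not from the real logic.

```scala
object ExpandRecord {
  // Hypothetical stand-in for myMapFunction: one input row in, a list of pairs out.
  // Assumed layout (guessed from the example): "<id> <n1:w1,n2:w2,...>|<field>|<field>".
  def expand(row: String): List[(String, String)] = {
    val parts = row.split(" ", 2)        // "1" and "2:3,3:2,4:5|0|1"
    val id = parts(0)
    val fields = parts(1).split("\\|")   // "2:3,3:2,4:5", "0", "1"
    val edges = fields(0)
    val middle = fields(1)

    // One pair per "neighbor:weight" entry, e.g. "2:3" becomes ("2", "|3|1").
    val neighborPairs = edges.split(",").toList.map { edge =>
      val nw = edge.split(":")
      (nw(0), "|" + nw(1) + "|1")
    }

    // The row itself, re-emitted with its last field set to 2 as in the example.
    (id, edges + "|" + middle + "|2") :: neighborPairs
  }

  def main(args: Array[String]): Unit = {
    // A plain List stands in for the RDD here.
    val rows = List("1 2:3,3:2,4:5|0|1")
    rows.flatMap(expand).foreach(println)
  }
}
```

With an RDD of rows, the equivalent call would be rdd.flatMap(expand), which replaces the "@"-joined String and the extra split and map steps.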

Matei