Is nested RDD possible?

Zhou Lizhi

May 22, 2013, 3:40:47 AM
to spark...@googlegroups.com
Is something like JavaPairRDD<String, JavaRDD<Integer>> possible in Spark, so that I can apply transformations and actions to the inner RDD?

PS.
I'm trying to port some Pig scripts to Spark jobs. There is a lot of code like the following, and I guess I need a nested RDD to express it.

B = GROUP A by a;
C = FOREACH B {
  C1 = A.a;
  C2 = A.b;
  C3 = DISTINCT C2;
  GENERATE SUM(C1) as c1, COUNT(C3) as c3;
}

Zhou Lizhi

May 23, 2013, 5:16:33 AM
to spark...@googlegroups.com
I went through the previous emails and got the answer. :-(

Matei Zaharia

May 24, 2013, 1:25:38 AM
to spark...@googlegroups.com
Hi Zhou,

While nested RDDs are not possible, you can certainly have collections within an RDD. For example, our groupBy returns an RDD of sequences of values for each key. I'm not sure whether Pig even parallelizes the work within each group here. If it does, you could probably also flatten out what it's doing into a single MapReduce job (whatever Pig compiles this into must be expressible in Spark).
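For instance, a groupBy-based version of the script above might look roughly like this in Scala (a minimal sketch; it assumes A is an RDD of (a, b) pairs and that each group's values fit comfortably in a single task):

    // Hypothetical input: an RDD of (a, b) pairs.
    val A = sc.parallelize(Seq((1, 10), (1, 10), (1, 20), (2, 30)))
    // groupBy yields an RDD of (key, collection-of-rows); each collection is
    // an ordinary local Scala collection, so the sum and the distinct run
    // inside one task rather than as a nested RDD.
    val C = A.groupBy(_._1).map { case (a, rows) =>
      val c1 = rows.map(_._1).sum         // SUM(C1)
      val c3 = rows.map(_._2).toSet.size  // COUNT(DISTINCT C2)
      (a, c1, c3)
    }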

Matei

王国栋

May 24, 2013, 3:48:12 AM
to spark...@googlegroups.com
Hi Matei,

What we want here is to process the collection in each group in parallel. I am not sure whether Pig does this either, but we do hope to do it in Spark.

How can we flatten the collection in each group? Can we create a new RDD inside the mapper function, like this?
grouped_rdd.map { element =>
  val newRdd = sc.parallelize(element._2, 10)
  newRdd.map(...) ...
}

In my opinion, this piece of code amounts to a nested RDD.

We cannot process the Seq inside the map function, because some Seqs are very large and finding their distinct values is not easy. That is why we want parallel computation.

Best. 

Guodong

Mridul Muralidharan

May 24, 2013, 4:35:09 AM
to spark...@googlegroups.com
As mentioned elsewhere, there is no nested SparkContext support, so it is not possible to create or parallelize RDDs from within map and similar functions.

Having said that:
a) Pig does not parallelize the code you have (the inner loop, that is).
b) It is trivial to parallelize what you are trying to do, though, in both Pig and Spark: simply combine the group key and C2, take the distinct on that, and then retrieve what you need from it. You will just need to think in terms of larger datasets and unroll a few loops, that is all.

Something like this should work:
B0 = GROUP A by (a, b);
B1 = FOREACH B0 GENERATE group.$0 as a, SUM($1.a) as partial_sum, group.$1 as b;
B = GROUP B1 by a;
C = FOREACH B GENERATE SUM(B1.partial_sum) as c1, COUNT(B1.b) as c3;

I am also doing a partial sum above to speed things up slightly - you can of course defer that for later.
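In Spark, the same unrolling could be sketched as follows (illustrative names; it assumes A is an RDD of (a, b) pairs, plus the pair-RDD import that older Spark versions need for reduceByKey):

    // One record per distinct (a, b) pair, carrying a partial sum of a.
    val B1 = A.map { case (a, b) => ((a, b), a) }
              .reduceByKey(_ + _)
              .map { case ((a, b), partialSum) => (a, (partialSum, 1L)) }
    // Regroup by a alone: summing the partial sums gives c1, and summing
    // the 1s counts the distinct b values, giving c3.
    val C = B1.reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }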



Regards,
Mridul

王国栋

May 24, 2013, 11:42:01 AM
to spark...@googlegroups.com
Thanks, Mridul, for the detailed explanation. It is quite useful to us, and we will try it later. The partial sum is a great idea. :)

Guodong