Is nested RDD possible?

1,950 views

Zhou Lizhi

May 22, 2013, 3:40:47 AM
To: spark...@googlegroups.com
Is something like JavaPairRDD&lt;String, JavaRDD&lt;Integer&gt;&gt; possible in Spark, so that I can apply transformations and actions to the inner RDD?

P.S.
I'm trying to port some Pig scripts to Spark jobs. There is a lot of code like the following, and I guess I need a nested RDD to express it.

B = GROUP A BY a;
C = FOREACH B {
  C1 = A.a;
  C2 = A.b;
  C3 = DISTINCT C2;
  GENERATE SUM(C1) AS c1, COUNT(C3) AS c3;
}

Zhou Lizhi

May 23, 2013, 5:16:33 AM
To: spark...@googlegroups.com
I went through the previous emails and got the answer. :-(

Matei Zaharia

May 24, 2013, 1:25:38 AM
To: spark...@googlegroups.com
Hi Zhou,

While nested RDDs are not possible, you can certainly have collections within an RDD. For example, our groupBy returns an RDD of sequences of values for each key. I'm not sure whether Pig even parallelizes the work within each group here. If it does, you could probably also flatten out what it's doing into a single MapReduce job (whatever Pig compiles this into must be expressible in Spark).
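To make this concrete, here is a minimal sketch of what Matei describes, written for the Spark shell (where sc and the pair-RDD implicits are already in scope); the names and data are illustrative, not from this thread:

// Hypothetical (key, value) pairs.
val pairs = sc.parallelize(Seq((1, 10), (1, 10), (1, 20), (2, 30)))

// groupByKey yields each key's values as an ordinary in-memory
// sequence, not a nested RDD:
val grouped = pairs.groupByKey()  // RDD[(Int, Seq[Int])]

// Per-group work is plain Scala inside the closure, so each group
// must fit in a single worker's memory:
val distinctCounts = grouped.map { case (k, vs) => (k, vs.toSeq.distinct.size) }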

Matei

王国栋

May 24, 2013, 3:48:12 AM
To: spark...@googlegroups.com
Hi Matei,

What we want here is parallel processing of the collection in each group. I am not sure whether Pig does this either, but we do hope to do it in Spark.

How can we flatten the collection in each group? Can we create a new RDD inside the mapper function, like this?
grouped_rdd.map { element => val newRdd = sc.parallelize(element._2, 10); newRdd.map(...) }

In my opinion, this piece of code amounts to a nested RDD.

We cannot process the Seq inside the map function, because some of the Seqs are very large and it is not feasible to find their distinct values there. That is why we want parallel computation.

Best. 

Guodong

Mridul Muralidharan

May 24, 2013, 4:35:09 AM
To: spark...@googlegroups.com
As mentioned elsewhere, there is no nested SparkContext support, so it is not possible to create or parallelize RDDs from within map/etc. functions.

Having said that:
a) Pig does not parallelize the code you have - the inner loop, that is.
b) It is trivial to parallelize what you are trying to do, though (in both Pig and Spark): simply combine the group key with C2, take the distinct of that, and then retrieve what you need from it. You will just need to think in terms of larger datasets and unroll a few loops, that is all.

Something like this should work:

B0 = GROUP A BY (a, b);
B1 = FOREACH B0 GENERATE group.$0 AS a, SUM(A.a) AS partial_sum, group.$1 AS b;
B = GROUP B1 BY a;
C = FOREACH B GENERATE SUM(B1.partial_sum) AS c1, COUNT(B1.b) AS c3;

I am also doing a partial sum above to speed things up slightly - you can defer that for later, of course.
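
For reference, a rough Spark (Scala) sketch of the same flattening idea, assuming relation A is represented as an RDD of (a, b) pairs and running in the Spark shell (names and sample data are illustrative assumptions, not code from this thread):

// Hypothetical stand-in for relation A: (a, b) pairs.
val A = sc.parallelize(Seq((1, 10), (1, 10), (1, 20), (2, 30)))

// Stage 1 (like B0/B1): one record per distinct (a, b) key,
// carrying a partial sum of a for that key.
val b1 = A.map { case (a, b) => ((a, b), a) }.reduceByKey(_ + _)

// Stage 2 (like B/C): regroup by a; sum the partial sums for c1,
// and count the rows (one per distinct b) for c3.
val c = b1.map { case ((a, b), partialSum) => (a, (partialSum, 1)) }
          .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
// c is an RDD of (a, (c1 = SUM(a), c3 = COUNT(DISTINCT b)))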



Regards,
Mridul

王国栋

May 24, 2013, 11:42:01 AM
To: spark...@googlegroups.com
Thanks, Mridul, for your detailed explanation. It is quite useful to us, and we will try it later. The partial sum is a great idea. :)

Guodong