As mentioned elsewhere, there is no support for nested SparkContexts - so
it is not possible to create RDDs or call parallelize from within
map/etc functions.
Having said that,
a) Pig does not parallelize the code you have either - the inner loop, that is.
b) it is trivial to parallelize what you are trying to do though (in
both Pig and Spark):
Simply combine the group key and C2, take the distinct of that, and then
retrieve what you need from the result.
You will just need to think in terms of larger datasets and unroll a
few loops, that is all.
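The combine-and-distinct idea can be sketched in plain Python (toy data
and names are mine, purely illustrative; in Spark this would be a
distinct() followed by a reduceByKey):

```python
# Rows of (group key a, value c2) -- toy data for illustration only.
A = [("k1", 10), ("k1", 10), ("k1", 20), ("k2", 30)]

distinct_pairs = set(A)            # distinct on the (a, c2) pair

counts = {}                        # distinct-c2 count per key a
for a, _c2 in distinct_pairs:
    counts[a] = counts.get(a, 0) + 1

# counts == {"k1": 2, "k2": 1}
```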
Something like this should work:
B0 = GROUP A by (a, b);
B1 = FOREACH B0 GENERATE group.$0 as a, SUM(A.a) as partial_sum, group.$1 as b;
B = GROUP B1 by a;
C = FOREACH B GENERATE group as a, SUM(B1.partial_sum) as c1, COUNT(B1) as c3;
I am also computing a partial sum above to speed things up slightly - you
can defer that for later of course.
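For clarity, here is the same two-stage aggregation (including the
partial-sum optimization) sketched in plain Python; the data and names
are made up, this just traces the dataflow of the Pig script above:

```python
from collections import defaultdict

# Toy rows of relation A as (a, b) tuples -- illustrative data only.
A = [(1, "p"), (1, "p"), (1, "q"), (2, "r")]

# Stage 1: group by (a, b), keeping a partial sum of a per group
# (mirrors B0/B1 above).
partial = defaultdict(int)
for a, b in A:
    partial[(a, b)] += a

# Stage 2: regroup by a alone; c1 = total sum of a, c3 = number of
# distinct b values per a (mirrors B/C above).
result = defaultdict(lambda: [0, 0])   # a -> [c1, c3]
for (a, _b), psum in partial.items():
    result[a][0] += psum
    result[a][1] += 1

# result == {1: [3, 2], 2: [2, 1]}
```

Since stage 1 already made each (a, b) pair unique, counting the stage-1
groups per key in stage 2 gives the distinct-b count directly.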
Regards,
Mridul