[scalding] Control no. of mappers after in-memory join

Hagai Attias

Apr 21, 2016, 9:40:32 AM
to cascading-user
Hi,
I have the following code, which does an in-memory join between two pipes:

val joined = hugePipe.leftJoinWithTiny('f1 -> 'f2, smallPipe).filter(...)

This step runs with 50K mappers and no reducers. At the end of the step each mapper writes one file, so I end up with 50K files.

What I see is that the next step, instead of launching one mapper per block (512 MB in my case), launches one mapper per file (50K mappers).

However, if I do

val joined = hugePipe.leftJoinWithTiny('f1 -> 'f2, smallPipe).groupRandomly(100) {identity}

which forces the data through the reducers, I get 22K mappers in the next step, which is more reasonable.
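
To make the difference concrete, here is a rough sketch of the arithmetic as I understand it. It is not Hadoop's actual split code (it ignores details such as the split slop factor). It relies on the fact that an input split never spans more than one file, so every non-empty file costs at least one mapper. The assumptions are mine: about 11 TB of data in total (inferred from 22K mappers at 512 MB each), evenly sized files, and groupRandomly(100) producing 100 output files. SplitEstimate and its methods are hypothetical names used only for this illustration.

```scala
object SplitEstimate {
  // A split never spans files, so each non-empty file yields at least one split.
  def splitsPerFile(fileBytes: Long, blockBytes: Long): Long =
    math.max(1L, (fileBytes + blockBytes - 1) / blockBytes) // ceiling division

  // Total mappers for `fileCount` files of equal size (simplifying assumption).
  def totalSplits(fileCount: Long, fileBytes: Long, blockBytes: Long): Long =
    fileCount * splitsPerFile(fileBytes, blockBytes)

  def main(args: Array[String]): Unit = {
    val block = 512L * 1024L * 1024L   // 512 MB block size, as in my setup
    val totalBytes = 22000L * block    // assumed total, inferred from 22K mappers

    // Map-only join: 50K files, each smaller than one block.
    println(totalSplits(50000L, totalBytes / 50000L, block)) // 50000

    // After groupRandomly(100): assumed 100 large files, 220 blocks each.
    println(totalSplits(100L, totalBytes / 100L, block))     // 22000
  }
}
```

Under these assumptions each of the 50K files is roughly 225 MB, i.e. smaller than a block, which would explain why the file count, not the block count, decides the number of mappers.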

My question is: is there a way to avoid forcing the data through reducers and still get one mapper per block, rather than one mapper per file, in the next step?

Thanks,
Hagai