[scalding] Control no. of mappers after in-memory join

Apr 21, 2016, 9:40:32 AM

to cascading-user

Hi,

I have the following code, which does an in-memory join between two pipes:

val joined = hugePipe.leftJoinWithTiny('f1 -> 'f2, smallPipe).filter(...)

This step runs with 50K mappers and no reducers. Each mapper writes one output file, so I end up with 50K files.

What I see is that the next step, instead of launching one mapper per block (512 MB in my case), launches one mapper per file (50K mappers).

However, if I do

val joined = hugePipe.leftJoinWithTiny('f1 -> 'f2, smallPipe).groupRandomly(100) {identity}

which forces the data through the reducers, I get 22K mappers in the next step, which is more reasonable.
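
To make the difference concrete, here is a rough sketch of the arithmetic in plain Scala (no Scalding dependency). It assumes a simplified model in which each file is split on its own into block-sized pieces, so a file smaller than one block still costs a whole mapper. The file sizes and the names `SplitEstimate`, `splitsForFile` and `totalSplits` are hypothetical and chosen only for illustration; they are not the real sizes from my job, and real Hadoop split calculation has extra details this ignores.

```scala
// Simplified model: splits are computed per file, never across files.
object SplitEstimate {
  // Number of block-sized splits for one file (at least one per file).
  def splitsForFile(fileBytes: Long, blockBytes: Long): Long =
    math.max(1L, (fileBytes + blockBytes - 1) / blockBytes)

  // Total mappers for a set of input files under this model.
  def totalSplits(fileSizes: Seq[Long], blockBytes: Long): Long =
    fileSizes.map(splitsForFile(_, blockBytes)).sum

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val block = 512L * mb

    // Hypothetical: 50,000 map-only output files of 200 MB each.
    val manySmall = Seq.fill(50000)(200L * mb)
    // Hypothetical: the same total bytes written as 100 reducer outputs.
    val fewLarge = Seq.fill(100)(100000L * mb)

    println(s"many small files: ${totalSplits(manySmall, block)} mappers")
    println(s"few large files:  ${totalSplits(fewLarge, block)} mappers")
  }
}
```

Under these made-up numbers the same amount of data needs 50,000 mappers as many small files but only 19,600 as a few large ones, which is the same kind of gap I am seeing.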

My question: is there a way to avoid forcing the data through reducers and still get one mapper per block in the next step, rather than one mapper per file?

Thanks,

Hagai
