val joined = hugePipe.leftJoinWithTiny('f1 -> 'f2, smallPipe).filter(...)
This step runs with 50K mappers and no reducers. At the end of the step each mapper writes one file, so I end up with 50K files.
What I see is that in the next step, instead of launching one mapper per block (512 MB in my case), one mapper is launched per file, i.e. 50K mappers.
However, if I do
val joined = hugePipe.leftJoinWithTiny('f1 -> 'f2, smallPipe).groupRandomly(100) {identity}
forcing the data through reducers, I get 22K mappers in the next step, which is more reasonable.
My question is: is there a way to avoid forcing the data through reducers and still get one mapper per block in the next step, rather than one mapper per file?
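For context, one direction I have been considering (but have not verified solves this) is Cascading's small-file combining, which is supposed to pack many small input files into fewer, larger splits instead of one split per file. A sketch of how I understand it would be enabled, assuming Scalding on Cascading 2.x; the property names and the job/class names here are illustrative:

```shell
# Assumption: Cascading 2.x Hfs taps honor these properties (they
# correspond to HfsProps' combined-input settings). MyJob and
# myjob.jar are placeholders for the actual job.
hadoop jar myjob.jar com.twitter.scalding.Tool MyJob --hdfs \
  -Dcascading.hadoop.hfs.combine.files=true \
  -Dcascading.hadoop.hfs.combine.max.size=536870912  # ~512 MB per split
```

If this works as I hope, the 50K small files from the map-only step would be grouped into roughly block-sized splits without paying for a reduce phase, but I would appreciate confirmation or a better approach.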
Thanks,
Hagai