Reducing the number of map output files

Kevin C

Oct 10, 2014, 7:53:23 PM

to scoobi...@googlegroups.com

I have about 40K files, each about 100MB. The first mapper stage writes about 40K output files of roughly 10KB each. What's interesting is that the next mapping stage takes a painful 15-20 seconds to process each 10KB file. I'm guessing this is due to the overhead of initializing a container. At any rate...

What's the best way to reduce the number of mapper output files, or to combine them? Is there a more elegant way than myDlist.groupByKey.mapFlatten(...)?

-Kevin
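
For readers following along, here is a minimal sketch of the idea behind the groupByKey.mapFlatten workaround mentioned above. It assumes the workaround assigns each record a key from a small fixed range, so that the grouped stage produces at most that many groups instead of one small output per input file. The sketch uses plain Scala collections in place of a DList and makes no Scoobi calls. The names BucketSketch, bucketOf, and consolidate are hypothetical and exist only for illustration.

```scala
object BucketSketch {
  // Hypothetical helper: map a record to one of `numBuckets` keys.
  // floorMod keeps the result non-negative even when hashCode is negative.
  def bucketOf(record: String, numBuckets: Int): Int =
    java.lang.Math.floorMod(record.hashCode, numBuckets)

  // Group records by bucket key. On a cluster, the analogous grouping step
  // forces a shuffle, so output is written per group rather than per input file
  // (an assumption about the workaround, not a statement about Scoobi internals).
  def consolidate(records: Seq[String], numBuckets: Int): Map[Int, Seq[String]] =
    records.groupBy(r => bucketOf(r, numBuckets))

  def main(args: Array[String]): Unit = {
    val records = (1 to 1000).map(i => s"record-$i")
    val groups = consolidate(records, numBuckets = 8)
    println(s"${groups.size} groups for ${records.size} records")
  }
}
```

The number of buckets is the tuning knob here: fewer buckets mean fewer, larger outputs for the next stage, at the cost of an extra shuffle.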

Patrick Grandjean

Dec 3, 2014, 6:47:00 PM

to scoobi...@googlegroups.com

+1. Same problem.
