Reducing the number of map output files

Kevin C

Oct 10, 2014, 7:53:23 PM
to scoobi...@googlegroups.com
I have about 40K files, each about 100MB. The first mapper stage outputs about 40K files of roughly 10KB each. What's interesting is that in my next mapping stage, it takes a painful 15-20 seconds to process each 10KB file. I'm guessing this is due to the overhead of initializing a container. At any rate...

What's the best way to reduce the number of mapper output files, or to combine them? Is there a more elegant way than myDlist.groupByKey.mapFlatten(...)?

-Kevin

Patrick Grandjean

Dec 3, 2014, 6:47:00 PM
to scoobi...@googlegroups.com
+1. I have the same problem.