I have about 40K input files of about 100MB each. The first mapper stage writes about 40K output files of roughly 10KB each. What's interesting is that the next mapping stage takes a painful 15-20 seconds to process each 10KB file. I'm guessing this is due to the overhead of initializing a container. At any rate...
What's the best way to reduce the number of mapper output files, or to combine them? Is there a more elegant way than myDlist.groupByKey.mapFlatten(...)?
-Kevin