Problem with NamedOutput + EMR

Juanjo Mostazo

Feb 6, 2014, 5:11:43 AM
to pangoo...@googlegroups.com
Hi,

I have realised that the following combination fails: Pangool + two or more named outputs + EMR (output path directly on S3) + more tasks than slots.

Example: imagine that we have 10 reduce tasks but only 2 execution slots.
In this case, the first 2 tasks run successfully, and their reduce output starts being copied to S3 as soon as they finish.
But when tasks 3 and 4 start, EMR has already copied the previous output to S3, so the paths already exist in the bucket, and Pangool raises the following error:

org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://gbd-kraken/results/20140204111904/job-label-output/event already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
    at com.datasalt.pangool.tuplemr.mapred.lib.output.PangoolMultipleOutputs.getRecordWriter(PangoolMultipleOutputs.java:476)
    at com.datasalt.pangool.tuplemr.MultipleOutputsCollector.getNamedOutput(MultipleOutputsCollector.java:41)
    at com.gbd.kraken.job.label.LabelReducer.collectEnrichedEvent(LabelReducer.java:116)
    at com.gbd.kraken.job.label.LabelReducer.reduce(LabelReducer.java:70)

This does not happen with the normal output, only with named outputs. And it only happens when sending the output directly to S3, not to HDFS.
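
To make the sequence of events easier to follow, here is a minimal sketch of what the stack trace seems to suggest: an "output directory must not exist" check that runs inside each task, when the named output's record writer is created, rather than once at job submission. This is not Pangool or Hadoop code. The class NamedOutputCheckSketch and its methods are hypothetical, a local temporary directory stands in for the S3 path, and the copy to S3 is assumed to be finished before the next wave of tasks starts.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class NamedOutputCheckSketch {

    // Hypothetical stand-in for a per-task output check:
    // it fails as soon as the named output directory exists.
    static void checkOutputDir(Path namedOutputDir) throws IOException {
        if (Files.exists(namedOutputDir)) {
            throw new FileAlreadyExistsException(
                "Output directory " + namedOutputDir + " already exists");
        }
    }

    // Simulates a finished task publishing its part file to the final location.
    static void commit(Path namedOutputDir, int taskId) throws IOException {
        Files.createDirectories(namedOutputDir);
        Files.createFile(namedOutputDir.resolve("part-r-" + taskId));
    }

    // Runs the tasks in waves of `slots` tasks each. Every task in a wave
    // performs the check first; the wave's output is published afterwards.
    // Returns the 0-based index of the first task whose check fails, or -1.
    static int firstFailingTask(int tasks, int slots) throws IOException {
        Path dir = Files.createTempDirectory("job-output").resolve("event");
        for (int start = 0; start < tasks; start += slots) {
            int end = Math.min(start + slots, tasks);
            for (int t = start; t < end; t++) {
                try {
                    checkOutputDir(dir);
                } catch (FileAlreadyExistsException e) {
                    return t;
                }
            }
            for (int t = start; t < end; t++) {
                commit(dir, t);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("10 tasks, 2 slots: first failing task index = "
            + firstFailingTask(10, 2));
        System.out.println("10 tasks, 10 slots: first failing task index = "
            + firstFailingTask(10, 10));
    }
}
```

Under these assumptions, the 10 tasks / 2 slots run fails at index 2 (the third task), while the run with as many slots as tasks returns -1, which matches the "more tasks than slots" condition described above.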

Juanjo.

Pere Ferrera

Feb 24, 2014, 11:12:51 AM
to pangoo...@googlegroups.com
I wonder why this happens only with MultipleOutputs but not with the normal output files. In Hadoop 1.0, every reducer commits its output file to the final destination as soon as it has finished. So if the existing directory were a problem, the normal output would fail as well... No idea.

I also wonder, since Hadoop 2.0 has a proper job commit, whether this issue has disappeared altogether from Hadoop 2.0 onwards.

Juanjo, if you can test the same thing some day against EMR with Hadoop 2.0 and report back, that would be great.