Hi,
I have realised that the combination [Pangool + two or more named outputs + EMR (output path pointing directly to S3) + more reduce tasks than slots] breaks.
Example: Imagine that we have 10 reduce tasks but only 2 execution slots.
In this case, the first 2 tasks run successfully, and EMR starts copying their output to S3 as soon as they finish.
But when tasks 3 and 4 start, EMR has already copied the previous output to S3, so the paths already exist in the bucket and Pangool raises the following error:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://gbd-kraken/results/20140204111904/job-label-output/event already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
    at com.datasalt.pangool.tuplemr.mapred.lib.output.PangoolMultipleOutputs.getRecordWriter(PangoolMultipleOutputs.java:476)
    at com.datasalt.pangool.tuplemr.MultipleOutputsCollector.getNamedOutput(MultipleOutputsCollector.java:41)
    at com.gbd.kraken.job.label.LabelReducer.collectEnrichedEvent(LabelReducer.java:116)
    at com.gbd.kraken.job.label.LabelReducer.reduce(LabelReducer.java:70)
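In case it helps, this is roughly the reduce-side code path that hits the check. It is a sketch, not our actual Kraken code: the class name, field names and the use of Text values are placeholders, and I am assuming Pangool's standard TupleReducer/Collector API here.

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import com.datasalt.pangool.io.ITuple;
    import com.datasalt.pangool.tuplemr.TupleMRException;
    import com.datasalt.pangool.tuplemr.TupleReducer;

    public class EventReducer extends TupleReducer<Text, NullWritable> {

      @Override
      public void reduce(ITuple group, Iterable<ITuple> tuples,
          TupleMRContext context, Collector collector)
          throws IOException, InterruptedException, TupleMRException {
        for (ITuple tuple : tuples) {
          // getNamedOutput() opens the named output lazily. With an S3 output
          // path, this is where FileOutputFormat.checkOutputSpecs() finds the
          // directory already copied by an earlier wave of reduce tasks and
          // throws FileAlreadyExistsException.
          collector.getNamedOutput("event")
              .write(new Text(tuple.toString()), NullWritable.get());
        }
      }
    }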
This does not happen with the normal output, only with named outputs. And it only happens when writing the output directly to S3, not to HDFS.
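The job is configured along these lines (again a sketch, assuming the usual TupleMRBuilder API from the Pangool examples; the schema, paths and field names are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import com.datasalt.pangool.io.Fields;
    import com.datasalt.pangool.io.Schema;
    import com.datasalt.pangool.tuplemr.TupleMRBuilder;
    import com.datasalt.pangool.tuplemr.mapred.lib.output.HadoopOutputFormat;

    public class LabelJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder schema standing in for the real event schema.
        Schema schema = new Schema("event",
            Fields.parse("id:string, payload:string"));

        TupleMRBuilder mr = new TupleMRBuilder(conf, "label-job");
        mr.addIntermediateSchema(schema);
        mr.setGroupByFields("id");
        mr.setTupleReducer(new EventReducer());
        // The normal output works fine even when pointed directly at S3.
        mr.setOutput(new Path("s3n://bucket/job-label-output"),
            new HadoopOutputFormat(TextOutputFormat.class),
            Text.class, NullWritable.class);
        // Only this named output hits FileAlreadyExistsException once EMR
        // has copied the first wave's files into the bucket.
        mr.addNamedOutput("event",
            new HadoopOutputFormat(TextOutputFormat.class),
            Text.class, NullWritable.class);

        Job job = mr.createJob();
        job.waitForCompletion(true);
      }
    }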
Juanjo.