Failed to delete earlier output of task


Stan Rosenberg

Jul 8, 2014, 1:47:07 PM
to pangoo...@googlegroups.com
Hi,

I am getting an error message of the form "Failed to delete earlier output of task: attempt_201407032220_0196_r_000000_1" with the following stack trace:

    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:199)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:214)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
    at com.datasalt.pangool.tuplemr.mapred.lib.output.ProxyOutputFormat$ProxyOutputCommitter.commitTask(ProxyOutputFormat.java:179)
    at org.apache.hadoop.mapred.Task.commit(Task.java:1014)
    at org.apache.hadoop.mapred.Task.done(Task.java:884)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:453)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)

I am running pangool-core-0.70 atop CDH 4.5.0.  A cursory look at ProxyOutputCommitter.commitTask revealed that FileOutputCommitter.commitTask is invoked twice; I don't believe commitTask is idempotent, hence the error message.  Also, if I bypass ProxyOutputFormat and set the output format directly, e.g., job.setOutputFormatClass(TextOutputFormat.class), everything works as expected.
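To make the non-idempotency concrete, here is a minimal plain-JDK sketch of my own (not Pangool or Hadoop code): a rename-based commit, in the spirit of FileOutputCommitter.moveTaskOutputs, promotes a task's temporary directory into the final output. Calling it a second time fails, because the first call already placed output at the destination. The class and method names here are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CommitTwice {
    // Promote a task's temporary output into the final output directory.
    // Not idempotent: a second invocation finds the destination occupied.
    static void commitTask(Path taskTmp, Path finalDir) throws IOException {
        Path dest = finalDir.resolve(taskTmp.getFileName());
        if (Files.exists(dest)) {
            // Analogous to the "Failed to delete earlier output of task" case:
            // an earlier commit already produced output at this destination.
            throw new IOException("earlier output already present: " + dest);
        }
        Files.move(taskTmp, dest); // promote task output in one rename
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("commit-demo");
        Path tmp = Files.createDirectory(base.resolve("attempt_0"));
        Path out = Files.createDirectory(base.resolve("out"));

        commitTask(tmp, out); // first commit succeeds
        try {
            commitTask(tmp, out); // second commit of the same task fails
        } catch (IOException expected) {
            System.out.println("second commit threw IOException");
        }
    }
}
```

Running this prints "second commit threw IOException", which matches the symptom: the error only appears when commitTask runs more than once for the same task output.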

Any suggestions?

Thanks,

stan

Pere Ferrera

Jul 9, 2014, 4:47:51 AM
to pangoo...@googlegroups.com
Hi Stan,

This is a known issue which seems to happen only on CDH4 with MR1; it is not a problem in Hadoop 1.0 or Hadoop 2.0.

I'm seeing the same thing on a CDH4 cluster, but my jobs don't fail. These failures only happen when two tasks finish at the same time; the tasks are retried and eventually succeed. Is your job failing, or are you just seeing some failed task attempts?

I don't know the exact reason yet, but it has to do with how Pangool implements arbitrary-OutputFormat support and how the FileOutputFormat code changed between Hadoop 1.0, CDH4, and 2.0.

Pere Ferrera

Jul 11, 2014, 4:21:39 AM
to pangoo...@googlegroups.com
Hello Stan,

I have pushed a workaround for the failed tasks problem. I have tested it on a CDH4 cluster and now I see 0 failed tasks.

To test whether this solves the problem in your case, you can use the latest snapshot. Note that the classifier "mr2" is no longer needed (the next version of Pangool will ship compiled for MR2 by default, with MR1 optional, contrary to what happens now):

 <dependency>
     <groupId>com.datasalt.pangool</groupId>
     <artifactId>pangool-core</artifactId>
     <version>0.71-SNAPSHOT</version>
 </dependency>
You will need to add the snapshot repository:

 <repository>
     <id>pangool-snapshots</id>
     <name>Pangool Snapshot Repository</name>
     <url>http://clinker.datasalt.com/nexus/content/groups/public-snapshots</url>
     <releases>
         <enabled>false</enabled>
     </releases>
     <snapshots>
         <enabled>true</enabled>
     </snapshots>
 </repository>
Cheers,

Stan Rosenberg

Jul 14, 2014, 12:43:35 AM
to pangoo...@googlegroups.com
Hi Pere,

Thanks for the quick response!  I have re-run the job with the 0.71 snapshot.  The job does succeed; however, there are still failed task attempts.  The call to FileOutputCommitter.commitJob inside ProxyOutputFormat.commitTask is invoked once per task instead of once per job.  I believe this is the cause, since it introduces a race condition with respect to _SUCCESS.  (While each _temporary directory has a unique part file per task attempt, there is a single _SUCCESS file per job.)
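To sketch what I mean about _SUCCESS (again plain JDK, my own illustration rather than the actual committer code): if every task runs the job-level commit, several committers touch the single _SUCCESS marker concurrently, and a delete-then-create sequence can fail when another task recreates the file in between. Treating "already exists" as success makes the marker creation idempotent, so it is safe even when invoked once per task. All names here are hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SuccessMarker {
    // Racy variant: delete-then-create can collide with another committer
    // doing the same sequence at the same time (createFile would throw).
    static void markSuccessRacy(Path outDir) throws IOException {
        Files.deleteIfExists(outDir.resolve("_SUCCESS"));
        Files.createFile(outDir.resolve("_SUCCESS"));
    }

    // Idempotent variant: "already exists" just means some other task
    // already committed the job, which is fine.
    static void markSuccessIdempotent(Path outDir) throws IOException {
        try {
            Files.createFile(outDir.resolve("_SUCCESS"));
        } catch (FileAlreadyExistsException alreadyCommitted) {
            // nothing to do: the job-level marker is already in place
        }
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempDirectory("job-output");
        // Simulate several tasks each (wrongly) running the job-level commit.
        for (int task = 0; task < 5; task++) {
            markSuccessIdempotent(out);
        }
        System.out.println("marker exists: " + Files.exists(out.resolve("_SUCCESS")));
    }
}
```

The idempotent version tolerates being called once per task; the racy version is only correct if commitJob truly runs once per job.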

Btw, I don't fully understand the use case for ProxyOutputFormat.  Is there a case where the committer (delegate) is different from FileOutputCommitter?

Thanks,

stan 

Pere Ferrera

Jul 14, 2014, 4:57:49 AM
to pangoo...@googlegroups.com
Hi Stan,

Pangool needs a proxy OutputFormat because it essentially re-implements OutputFormats. In Hadoop you specify the name of a class, but in Pangool you can pass an instance of the OutputFormat with some state in it (as in Storm). On top of that, Pangool's Multiple Outputs are also a bit different (they use subfolders and separate Configurations, which gives more freedom).
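To illustrate why a proxy is needed (this is my own minimal sketch, not Pangool's actual mechanism): Hadoop's job configuration only transports a class name, so a configured instance has to be serialized into the configuration on the client side and rebuilt on the task side by a fixed proxy class. The CsvOutputFormat class and its separator field below are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

public class InstanceShipping {
    // Hypothetical stateful output format: carries configuration that a
    // class-name-only API could not transport.
    static class CsvOutputFormat implements Serializable {
        final String separator;
        CsvOutputFormat(String separator) { this.separator = separator; }
    }

    // Client side: serialize the configured instance into a string that
    // can be stored in the job configuration.
    static String toConfString(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Task side: the proxy reads the string back and restores the instance.
    static Object fromConfString(String s) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(s);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String conf = toConfString(new CsvOutputFormat("\t"));
        CsvOutputFormat restored = (CsvOutputFormat) fromConfString(conf);
        System.out.println("separator survived: " + "\t".equals(restored.separator));
    }
}
```

The point is that the user-facing instance (with its state) round-trips through the configuration, while Hadoop itself only ever sees the one proxy class name.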

The code for ProxyOutputFormat was fairly trivial for Hadoop 1.0, but it needs to behave a bit differently for Hadoop 2.0, and there are some inconsistencies across versions (I believe CDH4-mr1 behaves differently). We're trying to maintain a single codebase for all versions, but we might need to split it, or deprecate Hadoop 1, to implement things more cleanly.

Did you at least observe far fewer failed tasks? Also, can you post the trace you get now?

Stan Rosenberg

Jul 14, 2014, 1:32:45 PM
to pangoo...@googlegroups.com
Hi Pere,

The number of failed attempts is much lower.  Because of the non-determinism it varies from run to run: over the previous few runs I've seen at most 2 attempt failures out of 357 reduce tasks, and on the most recent run there were none.  Before the retry logic, I saw as many as 119 attempt failures (out of 357).  So, the retry logic definitely helped, but it _is_ non-deterministic...

Thanks,

stan