FailedToCheckpointException seen running MapReduce with Yarn on top of Tachyon

Lei Fan

Aug 6, 2015, 3:28:35 AM
to Tachyon Users
Hello,

I'm trying a new setup in my cluster:

Hadoop 2.7.1, running YARN as resource manager
Tachyon 0.6.4 (Not using HDFS as UnderFS; inputs manually loaded into Tachyon to start with using copyFromLocal)

I run Terasort using MapReduce as follows:

hadoop jar /root/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar terasort -libjars /root/tachyon/core/target/tachyon-0.6.4-jar-with-dependencies.jar -D mapred.map.tasks=180 -D mapred.reduce.tasks=30 tachyon://n1:19998/teragen-10G tachyon://n1:19998/terasort-10G

The map phase works fine; however, during the reduce phase, I get a ton of FailedToCheckpointException errors:

15/08/06 00:15:32 INFO mapreduce.Job: Task Id : attempt_1438817937737_0008_r_000026_2, Status : FAILED
Error: java.io.IOException: FailedToCheckpointException(message:Failed to rename /root/tachyon/libexec/../underfs/tmp/tachyon/workers/1438842000004/668/882 to /root/tachyon/libexec/../underfs/tmp/tachyon/data/882)
        at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:116)
        at tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:183)
        at tachyon.client.FileOutStream.close(FileOutStream.java:104)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at org.apache.hadoop.examples.terasort.TeraOutputFormat$TeraRecordWriter.close(TeraOutputFormat.java:80)
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: FailedToCheckpointException(message:Failed to rename /root/tachyon/libexec/../underfs/tmp/tachyon/workers/1438842000004/668/882 to /root/tachyon/libexec/../underfs/tmp/tachyon/data/882)
        at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3513)
        at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3481)
        at tachyon.thrift.WorkerService$addCheckpoint_result.read(WorkerService.java:3407)
        at tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at tachyon.thrift.WorkerService$Client.recv_addCheckpoint(WorkerService.java:219)
        at tachyon.thrift.WorkerService$Client.addCheckpoint(WorkerService.java:205)
        at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:110)
        ... 13 more

And the job eventually fails.
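
A rename failure like the one in the trace can usually be narrowed down by checking whether the source file and the destination's parent directory exist on the node where the worker runs. The sketch below assumes a local UnderFS, as in the paths above; `check_rename` is a hypothetical helper, not part of Tachyon.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tachyon): report which side of a failed
# rename is missing on the local filesystem.
check_rename() {
  src="$1"
  dst="$2"
  dst_parent=$(dirname "$dst")
  rc=0
  # The file being checkpointed should still be in the worker's directory.
  if [ ! -e "$src" ]; then
    echo "missing source: $src"
    rc=1
  fi
  # A rename cannot succeed if the target directory does not exist.
  if [ ! -d "$dst_parent" ]; then
    echo "missing destination directory: $dst_parent"
    rc=1
  fi
  return "$rc"
}

# Example with the paths from the trace (run on the worker node):
# check_rename \
#   /root/tachyon/libexec/../underfs/tmp/tachyon/workers/1438842000004/668/882 \
#   /root/tachyon/libexec/../underfs/tmp/tachyon/data/882
```

The function prints one line per missing path and returns a non-zero status if anything is missing, so it can be used in a script before resubmitting the job.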

I have tried to set HDFS as the UnderFS, and I get the same error.

I have formatted Tachyon multiple times, which was the previous fix; I get the same issue.

If the input is from Tachyon, but the output is in HDFS, I have no errors. I only see errors if the input is Tachyon and output is also Tachyon.

Any pointers?

Also, I vaguely remember that somewhere on Tachyon's official website, there's a mention of some configuration parameters that I need to set for YARN to work with Tachyon properly... but I can't find these configurations. Can someone point me to them?

Thanks!

Lei

Lei Fan

Aug 11, 2015, 2:17:44 PM
to Tachyon Users
I figured out the issue.

Resolution:

I had to manually create the "/root/tachyon/libexec/../underfs/tmp/tachyon/data/" folder with:

hadoop fs -mkdir /tmp/tachyon/data

"tachyon format" only creates the /tmp/tachyon/workers folder in HDFS, but not the /tmp/tachyon/data folder.

Is this a bug in Tachyon? (0.6.4)
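
For a local UnderFS like the one in the stack trace, the same workaround might be scripted roughly as follows. This is only a sketch: `ensure_data_dir` is a hypothetical helper, and the `tmp/tachyon/data` layout is taken from the error message rather than from any Tachyon documentation.

```shell
#!/bin/sh
# Hypothetical helper: create the checkpoint target directory under a local
# UnderFS root if it is missing. Safe to run more than once.
ensure_data_dir() {
  underfs_root="$1"
  data_dir="$underfs_root/tmp/tachyon/data"
  # -p creates missing parents and does not fail if the directory exists.
  mkdir -p "$data_dir" && echo "$data_dir"
}

# Example for the layout in the trace (assumed root, adjust as needed):
# ensure_data_dir /root/tachyon/underfs
#
# With HDFS as the UnderFS, the equivalent on Hadoop 2.x would be:
# hadoop fs -mkdir -p /tmp/tachyon/data
```

Running it after each "tachyon format" and before submitting the job avoids having to remember the manual step.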

Thanks,

Lei

Gene Pang

Aug 28, 2015, 1:18:39 PM
to Tachyon Users
Thanks for the update on the fix!

I don't think "tachyon format" has a bug. I think "tachyon format" should reset the file system state of Tachyon, so the workers directory would not be in the Tachyon file system space.

Thanks,
Gene