FailedToCheckpointException seen when running MapReduce with YARN on top of Tachyon

Lei Fan

Aug 6, 2015, 3:28:35 AM

to Tachyon Users

Hello,

I'm trying a new setup in my cluster:

Hadoop 2.7.1, running YARN as resource manager

Tachyon 0.6.4 (Not using HDFS as UnderFS; inputs manually loaded into Tachyon to start with using copyFromLocal)

I run Terasort using MapReduce as follows:

hadoop jar /root/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar terasort -libjars /root/tachyon/core/target/tachyon-0.6.4-jar-with-dependencies.jar -D mapred.map.tasks=180 -D mapred.reduce.tasks=30 tachyon://n1:19998/teragen-10G tachyon://n1:19998/terasort-10G

The map phase works fine; however, during the reduce phase, I get a ton of FailedToCheckpointException errors:

15/08/06 00:15:32 INFO mapreduce.Job: Task Id : attempt_1438817937737_0008_r_000026_2, Status : FAILED

Error: java.io.IOException: FailedToCheckpointException(message:Failed to rename /root/tachyon/libexec/../underfs/tmp/tachyon/workers/1438842000004/668/882 to /root/tachyon/libexec/../underfs/tmp/tachyon/data/882)

at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:116)

at tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:183)

at tachyon.client.FileOutStream.close(FileOutStream.java:104)

at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)

at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)

at org.apache.hadoop.examples.terasort.TeraOutputFormat$TeraRecordWriter.close(TeraOutputFormat.java:80)

at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)

at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Caused by: FailedToCheckpointException(message:Failed to rename /root/tachyon/libexec/../underfs/tmp/tachyon/workers/1438842000004/668/882 to /root/tachyon/libexec/../underfs/tmp/tachyon/data/882)

at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3513)

at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3481)

at tachyon.thrift.WorkerService$addCheckpoint_result.read(WorkerService.java:3407)

at tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)

at tachyon.thrift.WorkerService$Client.recv_addCheckpoint(WorkerService.java:219)

at tachyon.thrift.WorkerService$Client.addCheckpoint(WorkerService.java:205)

at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:110)

... 13 more

And the job eventually fails.

I have tried to set HDFS as the UnderFS, and I get the same error.

I have also reformatted Tachyon several times, which fixed a similar problem before, but I still get the same error.

If the input is from Tachyon, but the output is in HDFS, I have no errors. I only see errors if the input is Tachyon and output is also Tachyon.

Any pointers?

Also, I vaguely remember that somewhere on Tachyon's official website, there's a mention of some configuration parameters that I need to set for YARN to work with Tachyon properly... but I can't find these configurations. Can someone point me to them?

Thanks!

Lei

Lei Fan

Aug 11, 2015, 2:17:44 PM

to Tachyon Users

I figured out the issue.

Resolution:

I had to manually create the "/root/tachyon/libexec/../underfs/tmp/tachyon/data/" folder with:

hadoop fs -mkdir /tmp/tachyon/data

"tachyon format" only creates the /tmp/tachyon/workers folder in HDFS, but not the /tmp/tachyon/data folder.
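
For anyone hitting the same error, here is a minimal sketch of that workaround as a shell function. It assumes HDFS is the UnderFS and that the under-filesystem root is /tmp/tachyon, as in the paths above. The function name ensure_underfs_dirs and the variables HADOOP_BIN and UNDERFS_ROOT are hypothetical names introduced for illustration, not Tachyon settings. It only relies on the standard "hadoop fs -test -d" and "hadoop fs -mkdir -p" commands.

```shell
# Illustrative defaults; override them to match your cluster.
HADOOP_BIN="${HADOOP_BIN:-hadoop}"
UNDERFS_ROOT="${UNDERFS_ROOT:-/tmp/tachyon}"

# Create the data and workers directories in the under filesystem
# if they are missing. Returns non-zero if a directory cannot be created.
ensure_underfs_dirs() {
  for dir in "$UNDERFS_ROOT/data" "$UNDERFS_ROOT/workers"; do
    # "fs -test -d" exits 0 only when the path exists and is a directory.
    if ! "$HADOOP_BIN" fs -test -d "$dir"; then
      "$HADOOP_BIN" fs -mkdir -p "$dir" || return 1
    fi
  done
}
```

Calling ensure_underfs_dirs after "tachyon format" and before submitting the job should leave both folders in place. If Tachyon uses a local UnderFS rather than HDFS, a plain "mkdir -p" on the local path would be the equivalent step.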

Is this a bug in Tachyon? (0.6.4)

Thanks,

Lei

Gene Pang

Aug 28, 2015, 1:18:39 PM

to Tachyon Users

Thanks for the update on the fix!

I don't think "tachyon format" has a bug. I think "tachyon format" should reset the file system state of Tachyon, so the workers directory would not be in the Tachyon file system space.