Hello,
I'm trying a new setup in my cluster:
Hadoop 2.7.1 (the version of the examples jar used below), running YARN as the resource manager
Tachyon 0.6.4 (not using HDFS as the UnderFS; the inputs are loaded into Tachyon manually with copyFromLocal)
I run Terasort using MapReduce as follows:
hadoop jar /root/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar terasort -libjars /root/tachyon/core/target/tachyon-0.6.4-jar-with-dependencies.jar -D mapred.map.tasks=180 -D mapred.reduce.tasks=30 tachyon://n1:19998/teragen-10G tachyon://n1:19998/terasort-10G
The map phase works fine; however, during the reduce phase, I get a ton of FailedToCheckpointException errors:
15/08/06 00:15:32 INFO mapreduce.Job: Task Id : attempt_1438817937737_0008_r_000026_2, Status : FAILED
Error: java.io.IOException: FailedToCheckpointException(message:Failed to rename /root/tachyon/libexec/../underfs/tmp/tachyon/workers/1438842000004/668/882 to /root/tachyon/libexec/../underfs/tmp/tachyon/data/882)
    at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:116)
    at tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:183)
    at tachyon.client.FileOutStream.close(FileOutStream.java:104)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    at org.apache.hadoop.examples.terasort.TeraOutputFormat$TeraRecordWriter.close(TeraOutputFormat.java:80)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: FailedToCheckpointException(message:Failed to rename /root/tachyon/libexec/../underfs/tmp/tachyon/workers/1438842000004/668/882 to /root/tachyon/libexec/../underfs/tmp/tachyon/data/882)
    at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3513)
    at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3481)
    at tachyon.thrift.WorkerService$addCheckpoint_result.read(WorkerService.java:3407)
    at tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
    at tachyon.thrift.WorkerService$Client.recv_addCheckpoint(WorkerService.java:219)
    at tachyon.thrift.WorkerService$Client.addCheckpoint(WorkerService.java:205)
    at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:110)
    ... 13 more
And the job eventually fails.
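To narrow this down, here is a minimal sketch I can run on the affected worker node. It pulls the two paths out of the "Failed to rename" message and reports whether the source file and the destination directory exist locally. The helper names (parse_rename_failure, describe) are my own and not part of Tachyon or Hadoop, and the existence check only makes sense while the UnderFS is the local filesystem, as in the trace above; with HDFS as the UnderFS the paths would have to be checked with the HDFS tools instead.

```python
import os
import posixpath
import re

# Matches the message text seen in the trace above. The helper names in this
# file are hypothetical and not part of Tachyon or Hadoop.
_RENAME_RE = re.compile(r"Failed to rename (?P<src>\S+) to (?P<dst>[^\s)]+)")


def parse_rename_failure(message):
    """Return (src, dst) from a 'Failed to rename X to Y' message, or None."""
    match = _RENAME_RE.search(message)
    if match is None:
        return None
    # Collapse the 'libexec/..' segment so the two paths are easier to compare.
    return (posixpath.normpath(match.group("src")),
            posixpath.normpath(match.group("dst")))


def describe(path):
    """Report whether a path and its parent directory exist on this node."""
    parent = os.path.dirname(path)
    return {
        "path": path,
        "exists": os.path.exists(path),
        "parent_exists": os.path.isdir(parent),
        "parent_writable": os.access(parent, os.W_OK),
    }


if __name__ == "__main__":
    # Message copied from the failed reduce attempt above.
    msg = ("FailedToCheckpointException(message:Failed to rename "
           "/root/tachyon/libexec/../underfs/tmp/tachyon/workers/1438842000004/668/882 "
           "to /root/tachyon/libexec/../underfs/tmp/tachyon/data/882)")
    for p in parse_rename_failure(msg):
        print(describe(p))
```

If the source file is missing on the node where the reducer ran, or the data directory is missing or not writable there, that would at least show which side of the rename is failing.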
I have also tried setting HDFS as the UnderFS, and I get the same error.
I have reformatted Tachyon multiple times, which worked as a fix before, but this time I still get the same error.
If the input is read from Tachyon but the output is written to HDFS, I get no errors. The errors only appear when both the input and the output are in Tachyon.
Any pointers?
Also, I vaguely remember that Tachyon's official website mentions some configuration parameters that need to be set for YARN to work properly with Tachyon, but I can't find them. Can someone point me to them?
Thanks!
Lei