java.io.IOException: FailedToCheckpointException


shl...@163.com

Oct 31, 2014, 8:44:12 AM
to tachyo...@googlegroups.com
Hi,
   I am running Spark on Tachyon. My versions are Spark 1.0.2 and Tachyon 0.4.1, without ZooKeeper. I followed the configuration on the website and tested in spark-shell:

val s = sc.textFile("tachyon://Master:19998/X")
s.count()

When the statements above are executed, the console prints the following:
Disconnecting from the master Master/192.168.24.141:19998

I then continue with the following statement:

s.saveAsTextFile("tachyon://Master:19998/Y")

and the console prints the following:

s.saveAsTextFile("tachyon://Master:19998/test1")
14/10/31 15:34:35 INFO : getFileStatus(tachyon://Master:19998/test1): HDFS Path: hdfs://Master:9000/test1 TPath: tachyon://Master:19998/test1
14/10/31 15:34:35 INFO : FileDoesNotExistException(message:/test1)/test1
14/10/31 15:34:35 INFO : File does not exist: tachyon://Master:19998/test1
14/10/31 15:34:35 INFO : mkdirs(tachyon://Master:19998/test1/_temporary, rwxrwxrwx)
14/10/31 15:34:35 INFO spark.SparkContext: Starting job: saveAsTextFile at <console>:15
14/10/31 15:34:35 INFO scheduler.DAGScheduler: Got job 1 (saveAsTextFile at <console>:15) with 2 output partitions (allowLocal=false)
14/10/31 15:34:35 INFO scheduler.DAGScheduler: Final stage: Stage 1(saveAsTextFile at <console>:15)
14/10/31 15:34:35 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/10/31 15:34:35 INFO scheduler.DAGScheduler: Missing parents: List()
14/10/31 15:34:35 INFO scheduler.DAGScheduler: Submitting Stage 1 (MappedRDD[3] at saveAsTextFile at <console>:15), which has no missing parents
14/10/31 15:34:35 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 1 (MappedRDD[3] at saveAsTextFile at <console>:15)
14/10/31 15:34:35 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/10/31 15:34:35 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/10/31 15:34:35 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 5177 bytes in 1 ms
14/10/31 15:34:35 INFO scheduler.TaskSetManager: Starting task 1.0:1 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
14/10/31 15:34:35 INFO scheduler.TaskSetManager: Serialized task 1.0:1 as 5177 bytes in 1 ms
14/10/31 15:34:35 INFO executor.Executor: Running task ID 2
14/10/31 15:34:35 INFO executor.Executor: Running task ID 3
14/10/31 15:34:35 INFO storage.BlockManager: Found block broadcast_0 locally
14/10/31 15:34:35 INFO storage.BlockManager: Found block broadcast_0 locally
14/10/31 15:34:35 INFO rdd.HadoopRDD: Input split: tachyon://Master:19998/user/root/README.md:2405+2406
14/10/31 15:34:35 INFO rdd.HadoopRDD: Input split: tachyon://Master:19998/user/root/README.md:0+2405
14/10/31 15:34:35 INFO : open(tachyon://Master:19998/user/root/README.md, 65536)
14/10/31 15:34:35 INFO : open(tachyon://Master:19998/user/root/README.md, 65536)
14/10/31 15:34:35 ERROR : The machine does not have any local worker.
14/10/31 15:34:35 ERROR : Reading from HDFS directly
14/10/31 15:34:35 INFO : getFileStatus(tachyon://Master:19998/test1/_temporary): HDFS Path: hdfs://Master:9000/test1/_temporary TPath: tachyon://Master:19998/test1/_temporary
14/10/31 15:34:35 ERROR : The machine does not have any local worker.
14/10/31 15:34:35 INFO : mkdirs(tachyon://Master:19998/test1/_temporary/_attempt_201410311534_0000_m_000001_3, rwxrwxrwx)
14/10/31 15:34:35 INFO : getFileStatus(tachyon://Master:19998/test1/_temporary): HDFS Path: hdfs://Master:9000/test1/_temporary TPath: tachyon://Master:19998/test1/_temporary
14/10/31 15:34:35 INFO : mkdirs(tachyon://Master:19998/test1/_temporary/_attempt_201410311534_0000_m_000000_2, rwxrwxrwx)
14/10/31 15:34:35 INFO : create(tachyon://Master:19998/test1/_temporary/_attempt_201410311534_0000_m_000001_3/part-00001, rwxrwxrwx, true, 65536, 1, 33554432, org.apache.hadoop.mapred.Reporter$1@39a58e00)
14/10/31 15:34:35 WARN : tachyon.home is not set. Using /mnt/tachyon_default_home as the default value.
14/10/31 15:34:35 INFO : create(tachyon://Master:19998/test1/_temporary/_attempt_201410311534_0000_m_000000_2/part-00000, rwxrwxrwx, true, 65536, 1, 33554432, org.apache.hadoop.mapred.Reporter$1@39a58e00)
14/10/31 15:34:35 WARN : Fail to cache for: The machine does not have any local worker.
14/10/31 15:34:35 WARN : Fail to cache for: mCurrentBlockLeftByte 33554432 null
14/10/31 15:34:35 WARN : Fail to cache for: mCurrentBlockLeftByte 33554432 null
14/10/31 15:34:35 WARN : Fail to cache for: mCurrentBlockLeftByte 33554432 null
14/10/31 15:34:35 ERROR : Reading from HDFS directly
14/10/31 15:34:35 WARN : Fail to cache for: mCurrentBlockLeftByte 33554432 null
14/10/31 15:34:35 WARN : Fail to cache for: The machine does not have any local worker.
14/10/31 15:34:35 WARN : Fail to cache for: mCurrentBlockLeftByte 33554432 null
14/10/31 15:34:36 ERROR executor.Executor: Exception in task ID 3
java.io.IOException: FailedToCheckpointException(message:Failed to rename hdfs://Master:9000/tmp/tachyon/workers/1414740000001/1/22 to hdfs://Master:9000/tmp/tachyon/data/22)
        at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:83)
        at tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:156)
        at tachyon.client.FileOutStream.close(FileOutStream.java:205)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:103)
        at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:101)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:784)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:769)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: FailedToCheckpointException(message:Failed to rename hdfs://Master:9000/tmp/tachyon/workers/1414740000001/1/22 to hdfs://Master:9000/tmp/tachyon/data/22)
        at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:2687)
        at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:2655)
        at tachyon.thrift.WorkerService$addCheckpoint_result.read(WorkerService.java:2581)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at tachyon.thrift.WorkerService$Client.recv_addCheckpoint(WorkerService.java:148)
        at tachyon.thrift.WorkerService$Client.addCheckpoint(WorkerService.java:134)
        at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:77)
        ... 14 more


Best regards,
  Honglei Song

Haoyuan Li

Nov 19, 2014, 2:32:59 AM
to shl...@163.com, tachyo...@googlegroups.com
Honglei,

This issue should have been fixed by this PR (https://github.com/amplab/tachyon/pull/477) in the master branch.

Best,

Haoyuan




--
Haoyuan Li
AMPLab, EECS, UC Berkeley

Lei Fan

Jul 29, 2015, 8:21:32 PM
to Tachyon Users, shl...@163.com, haoyu...@gmail.com
Hi Haoyuan/Honglei,

Did this issue get resolved? I'm running into the same issue, though my setup is a little bit different:

Hadoop 2.7.1
Spark 1.4.1
Tachyon 0.6.4 (using HDFS as the UnderFS)

When I open spark-shell, I have no problem loading a file from HDFS via Tachyon (val s = sc.textFile("tachyon://<ip>:<port>/filepath"), where filepath is the path in HDFS; I did not use loadufs or anything like that to load HDFS into Tachyon).
However, when I'm done and want to save something back to Tachyon, I can't, and I run into exactly the same issue as outlined above:

15/07/29 17:16:15 INFO : create(tachyon://n1:19998/out/_temporary/0/_temporary/attempt_201507291716_0001_m_000001_3/part-00001, rw-r--r--, true, 65536, 1, 536870912, org.apache.hadoop.mapred.Reporter$1@20a707d5)
15/07/29 17:16:15 INFO : create(tachyon://n1:19998/out/_temporary/0/_temporary/attempt_201507291716_0001_m_000000_2/part-00000, rw-r--r--, true, 65536, 1, 536870912, org.apache.hadoop.mapred.Reporter$1@20a707d5)
15/07/29 17:16:15 INFO : /mnt/ramdisk/tachyonworker/users/2/179314884608 was created!
15/07/29 17:16:15 INFO : /mnt/ramdisk/tachyonworker/users/2/181462368256 was created!
15/07/29 17:16:15 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
java.io.IOException: FailedToCheckpointException(message:Failed to rename hdfs://n1:9000//tmp/tachyon/workers/1438215000001/2/169 to hdfs://n1:9000//tmp/tachyon/data/169)
        at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:116)
        at tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:183)
        at tachyon.client.FileOutStream.close(FileOutStream.java:104)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:108)
        at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1117)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1294)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: FailedToCheckpointException(message:Failed to rename hdfs://n1:9000//tmp/tachyon/workers/1438215000001/2/169 to hdfs://n1:9000//tmp/tachyon/data/169)
        at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3513)
        at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3481)
        at tachyon.thrift.WorkerService$addCheckpoint_result.read(WorkerService.java:3407)
        at tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at tachyon.thrift.WorkerService$Client.recv_addCheckpoint(WorkerService.java:219)
        at tachyon.thrift.WorkerService$Client.addCheckpoint(WorkerService.java:205)
        at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:110)
        ... 16 more

Did you figure out a solution?

Thanks!

Lei

Lei Fan

Jul 30, 2015, 12:47:40 PM
to Tachyon Users, shl...@163.com, haoyu...@gmail.com, leif...@gmail.com
Sorry, I just figured it out.

I followed the discussion on this thread (https://groups.google.com/forum/?fromgroups#!searchin/tachyon-users/local$20worker/tachyon-users/bn39VN5M7P8/7c5OGmYeVV0J) and reformatted Tachyon; that fixed the problem.
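For anyone hitting the same thing, the steps I used were roughly the following. This is only a sketch: script names and arguments vary between Tachyon versions, it assumes a default install layout, and note that formatting wipes Tachyon's metadata and its directories in the under filesystem.

```shell
# Stop any running Tachyon master/worker processes first
./bin/tachyon-stop.sh

# Reformat the Tachyon master and workers.
# WARNING: this clears Tachyon's metadata (and its /tmp/tachyon
# directories in the under filesystem), so anything stored only
# in Tachyon is lost.
./bin/tachyon format

# Restart the cluster; "Mount" remounts the worker ramdisk
# (use "SudoMount" if mounting requires sudo on the workers)
./bin/tachyon-start.sh all Mount
```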

Thanks,

Lei

Haoyuan Li

Jul 30, 2015, 12:48:43 PM
to Lei Fan, Tachyon Users, shl...@163.com
Great! Thanks for the update.

Haoyuan