Cannot save files from Spark 1.5.0 into Tachyon 0.7.1


Ashwin Raaghav

Dec 24, 2015, 3:00:12 AM
to Tachyon Users
Hi everyone,

I'm new to Tachyon, so apologies for the newbie question.

I am using Spark 1.5.0 and Tachyon 0.7.1 on a cluster of 17 nodes, with HDFS 2.6. The Spark master and the Tachyon master are on the same machine. I'm able to read files from HDFS through Tachyon in Spark.

val rdd = sc.textFile("tachyon://<<ip>>:19998/<<hdfs path>>")

This reads the RDD successfully. But when I try to save an RDD as an object file,

rdd.saveAsObjectFile("tachyon://<<ip>>:19998/<<tachyon path>>")

it throws the following error.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 1.0 failed 4 times, most recent failure: Lost task 7.3 in stage 1.0 (TID 14, 10.10.5.23): java.io.IOException: FailedToCheckpointException(message:Failed to rename hdfs://10.10.3.16:8020/workers/1450942000003/77/139 to hdfs://10.10.3.16:8020/data/139)
    at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:130)
    at tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:228)
    at tachyon.client.FileOutStream.close(FileOutStream.java:105)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1280)
    at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:79)
    at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:103)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1117)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1215)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: FailedToCheckpointException(message:Failed to rename hdfs://10.10.3.16:8020/workers/1450942000003/77/139 to hdfs://10.10.3.16:8020/data/139)
    at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3509)
    at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3477)
    at tachyon.thrift.WorkerService$addCheckpoint_result.read(WorkerService.java:3403)
    at tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
    at tachyon.thrift.WorkerService$Client.recv_addCheckpoint(WorkerService.java:221)
    at tachyon.thrift.WorkerService$Client.addCheckpoint(WorkerService.java:207)
    at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:124)
    ... 17 more

I don't think permissions are the issue, because I changed the permissions on both the data and workers folders using chmod 777.
Tachyon was built with -Dhadoop.version set to 2.6.
I also formatted Tachyon again, but it did not change anything.

Any help would be much appreciated!

PS: I saw a ticket raised about the same issue, but it said the issue is fixed in version 0.9. Is there any way to fix it in this version?

Thank you.

Cheng Chang

Dec 24, 2015, 4:56:24 AM
to Tachyon Users, Ashwin Raaghav
Hey Ashwin,

Could you manually rename hdfs://10.10.3.16:8020/workers/1450942000003/77/139 to hdfs://10.10.3.16:8020/data/139 using Hadoop’s native command? Also, just curious, would calling saveAsTextFile on the same RDD work?

Best,
Cheng

On December 24, 2015 4:00:15 PM, Ashwin Raaghav (ashra...@gmail.com) wrote:

--
You received this message because you are subscribed to the Google Groups "Tachyon Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tachyon-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ashwin Raaghav

Dec 24, 2015, 5:35:13 AM
to Cheng Chang, Tachyon Users
Hi Cheng,

I am able to manually rename the files using the hdfs dfs -mv command.
I also tried saveAsTextFile on the RDD, but it didn't work either. Even if it had worked, I cannot save the data as a text file because I'm saving an RDD of MLlib models, which I need to reuse later.
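
For readers following along, here is a minimal sketch of the object-file round trip being attempted. It is not code from this thread: the `local[2]` master, the temporary local output directory, and the `(String, List[Double])` pairs standing in for models are all assumptions made so the example runs on its own. On the cluster, the output path would be the tachyon:// URI shown earlier.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object ObjectFileRoundTrip {
  // Saves a small RDD as an object file and reads it back.
  // The (name, weights) pairs are a hypothetical stand-in for real models.
  def run(): Map[String, List[Double]] = {
    val conf = new SparkConf().setAppName("object-file-round-trip").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val models = sc.parallelize(Seq(
        ("model-a", List(0.1, 0.2)),
        ("model-b", List(0.3, 0.4))))

      // The output directory must not exist yet. A local temp path is used
      // here (assumption); on the cluster this would be a tachyon:// URI.
      val out = Files.createTempDirectory("models").resolve("rdd").toUri.toString

      // saveAsObjectFile is a method on the RDD, not on the SparkContext.
      models.saveAsObjectFile(out)

      // objectFile is the SparkContext-side counterpart for reading it back.
      sc.objectFile[(String, List[Double])](out).collect().toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    run().foreach { case (name, weights) => println(s"$name -> $weights") }
}
```

Object files use Java serialization, so whatever is stored in the RDD has to be serializable.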

Gene Pang

Dec 24, 2015, 10:26:15 AM
to Tachyon Users, mya...@gmail.com
Hi Ashwin,

Do you know which ticket mentioned this was fixed in 0.9?

Is there any useful information in the Tachyon worker log or the Tachyon master log?

Thanks,
Gene

Calvin Jia

Dec 27, 2015, 12:08:21 AM
to Tachyon Users
Hi Ashwin,

One possible cause of this issue is that the format step (i.e., bin/tachyon format) was skipped before starting Tachyon.

Hope this helps,
Calvin

Ashwin Raaghav

Dec 28, 2015, 12:52:21 AM
to Calvin Jia, Tachyon Users, gene...@gmail.com, Cheng Chang
Hi Calvin, 

I followed all the steps specified in the documentation. I even formatted it again. But it did not help. 

And Gene, the ticket I found was this one: https://tachyon.atlassian.net/browse/TACHYON-1339
I checked the logs, and the master log says that the file does not exist.

2015-12-24 04:31:14,328 ERROR MASTER_LOGGER (MasterInfo.java:workerHeartbeat) - File 355 does not exist

I am not able to understand why it is not able to find the file.




Cheng Chang

Dec 28, 2015, 2:46:28 AM
to Calvin Jia, Ashwin Raaghav, Tachyon Users, gene...@gmail.com
Hey Ashwin,

When you say you can successfully read from Tachyon via sc.textFile, did you do any operation on that RDD? 

If you just run sc.textFile without any further RDD operation, Spark won’t actually read the file from Tachyon, because RDDs are evaluated lazily.

sc.textFile(xxxx).first() will force the read. 

Best,
Cheng
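
The lazy-read point above can be checked with a small sketch. It is an illustration under assumptions, not something posted in this thread: it uses a `local[2]` master and a deliberately nonexistent, hypothetical local path instead of a Tachyon URI, so that the difference between defining the RDD and running an action becomes visible.

```scala
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}

object LazyReadCheck {
  // Returns (did defining the RDD succeed?, did the action succeed?).
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("lazy-read-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical path that is assumed not to exist.
      val missing = "file:///hypothetical/missing/input"

      // textFile only records how to read the data; no I/O happens here.
      val defined = Try(sc.textFile(missing))

      // first() is an action, so the input is actually resolved and read now.
      val forced = defined.flatMap(rdd => Try(rdd.first()))

      (defined.isSuccess, forced.isSuccess)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (definedOk, forcedOk) = run()
    println(s"textFile succeeded: $definedOk, first() succeeded: $forcedOk")
  }
}
```

With a missing input, defining the RDD should succeed while the action fails, which is why a bare sc.textFile call says little about whether Tachyon is readable.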

On December 28, 2015 1:52:20 PM, Ashwin Raaghav (ashra...@gmail.com) wrote:

Ashwin Raaghav

Dec 28, 2015, 3:38:42 AM
to Cheng Chang, Calvin Jia, Tachyon Users, Gene Pang
Hi Cheng,

Yes, I read the RDD and ran operations on it without any problems.

Also, I checked again today after reformatting: it throws the same error, but the files were saved on Tachyon.
If it is able to save the files on Tachyon, why is it throwing the error?

Regards,
Ashwin.

Cheng Chang

Dec 28, 2015, 4:36:46 AM
to Ashwin Raaghav, Tachyon Users, Calvin Jia, Gene Pang
Could you provide Tachyon master and worker logs?

Best,
Cheng

On December 28, 2015 4:38:41 PM, Ashwin Raaghav (ashra...@gmail.com) wrote:

Ashwin Raaghav

Dec 28, 2015, 8:28:53 AM
to Cheng Chang, Tachyon Users, Calvin Jia, Gene Pang
Hi Cheng,

I have attached the worker log. I could not find anything in the master log; nothing was written to it while I was trying to save.

The worker log shows a permission issue: the write is done by one user, and Tachyon is run by another user.
I thought chmod would have taken care of this, but because the files were owned by the writing user, the other user was not able to move them to the /data folder. Silly mistake. Sorry for taking up your time.

Thank you. :)
Regards,
Ashwin Raaghav

Attachment: worker.log
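
For anyone who hits the same ownership mismatch, a quick way to compare owners is to print the owner, group, and permission bits of the paths involved. The sketch below uses the Hadoop FileSystem API; the `OwnerCheck` object and its argument layout are hypothetical names chosen for illustration, and it assumes the Hadoop client libraries are on the classpath. As I understand the HDFS permission model, a rename needs write access to the parent directories of both source and destination, so the per-job subdirectories under /workers matter as much as the top-level folders.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object OwnerCheck {
  // Formats owner, group and permission bits of one path for comparison.
  def describe(fs: FileSystem, p: Path): String = {
    val st = fs.getFileStatus(p)
    s"$p owner=${st.getOwner} group=${st.getGroup} perm=${st.getPermission}"
  }

  def main(args: Array[String]): Unit = {
    // args(0): filesystem URI, for example hdfs://10.10.3.16:8020
    // remaining args: paths to inspect, for example /workers and /data
    val fs = FileSystem.get(new URI(args(0)), new Configuration())
    args.drop(1).foreach(p => println(describe(fs, new Path(p))))
  }
}
```

Running it for both the workers folder and the data folder, and comparing the owners with the OS user that runs the Tachyon worker, should show whether that user can perform the rename.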

Cheng Chang

Dec 28, 2015, 8:58:19 AM
to Ashwin Raaghav, Tachyon Users, Calvin Jia, Gene Pang
Great to know you’ve worked it out! 

For further reference, I think the problem in this thread is similar to that in https://groups.google.com/forum/#!topic/tachyon-users/MEamF2hlStQ

Best,
Cheng

On December 28, 2015 9:28:52 PM, Ashwin Raaghav (ashra...@gmail.com) wrote:
