FailedToCheckpointException and data not being written into Tachyon


mallikarju...@gmail.com

Dec 3, 2015, 7:44:03 AM
to Tachyon Users

Hi All,

I am facing a FailedToCheckpointException while saving a Spark DataFrame into Tachyon.
I am using Spark 1.5.1 and Tachyon 0.7.1.

I am trying to connect from a Spark cluster to a Tachyon instance installed in local mode in order to save the DataFrame into Tachyon. Please find the error log below; the master and worker logs are also attached.

Can someone help in resolving this?

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 141.0 failed 4 times, most recent failure: Lost task 1.3 in stage 141.0 (TID 2081, 162.44.115.221): java.io.IOException: FailedToCheckpointException(message:Failed to rename /tmp/tmp/tachyon/workers/1449036000001/214/505 to /tmp/tmp/tachyon/data/505)
        at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:130)
        at tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:228)
        at tachyon.client.FileOutStream.close(FileOutStream.java:105)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
        at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:108)
        at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:103)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1117)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1215)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: FailedToCheckpointException(message:Failed to rename /tmp/tmp/tachyon/workers/1449036000001/214/505 to /tmp/tmp/tachyon/data/505)
        at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3509)
        at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3477)
        at tachyon.thrift.WorkerService$addCheckpoint_result.read(WorkerService.java:3403)
        at tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at tachyon.thrift.WorkerService$Client.recv_addCheckpoint(WorkerService.java:221)
        at tachyon.thrift.WorkerService$Client.addCheckpoint(WorkerService.java:207)
        at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:124)
        ... 16 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1912)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1124)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1065)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:989)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:965)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:897)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:896)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1426)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1405)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1405)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1405)
        at com.databricks.spark.csv.package$CsvSchemaRDD.saveAsCsvFile(package.scala:169)
        at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:165)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:170)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
        at com.imshealth.jobrunner.SparkJobDriver.applyBusinessRulesOnDebitCreditDataFrames(SparkJobDriver.java:1211)
        at com.imshealth.jobrunner.SparkJobDriver.main(SparkJobDriver.java:521)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: FailedToCheckpointException(message:Failed to rename /tmp/tmp/tachyon/workers/1449036000001/214/505 to /tmp/tmp/tachyon/data/505)
        at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:130)
        at tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:228)
        at tachyon.client.FileOutStream.close(FileOutStream.java:105)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
        at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:108)
        at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:103)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1117)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1215)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: FailedToCheckpointException(message:Failed to rename /tmp/tmp/tachyon/workers/1449036000001/214/505 to /tmp/tmp/tachyon/data/505)
        at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3509)
        at tachyon.thrift.WorkerService$addCheckpoint_result$addCheckpoint_resultStandardScheme.read(WorkerService.java:3477)
        at tachyon.thrift.WorkerService$addCheckpoint_result.read(WorkerService.java:3403)
        at tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at tachyon.thrift.WorkerService$Client.recv_addCheckpoint(WorkerService.java:221)
        at tachyon.thrift.WorkerService$Client.addCheckpoint(WorkerService.java:207)
        at tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:124)
        ... 16 more



Thanks,
Mallikarjun
Attachments: worker.log, master.log

Gene Pang

Dec 3, 2015, 10:59:53 AM
to Tachyon Users
Hi Mallikarjun,

It seems like the worker cannot rename the file /tmp/tmp/tachyon/workers/1449036000001/214/505 to /tmp/tmp/tachyon/data/505.

Do those files exist on the worker, and do you have permissions on them?
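
For example, running something like this on the Tachyon machine (paths copied from your stack trace) would show whether both sides of the rename exist and which user owns them:

        ls -ld /tmp/tmp/tachyon/workers/1449036000001/214/505
        ls -ld /tmp/tmp/tachyon/data

Both locations need to be writable by the user the Tachyon worker runs as.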

Also, could you describe your setup? How many nodes are you using? Why are you using localfs as the UFS?

Thanks,
Gene

mallikarju...@gmail.com

Dec 4, 2015, 4:00:29 AM
to Tachyon Users
Hi Gene,

The file /tmp/tmp/tachyon/workers/1449036000001/214/505 that the worker is trying to rename is not there on the worker node. Most of the time when we get this error, the files that the rename is executed against are absent at that location.

The machine on which Tachyon is set up doesn't have HDFS installed, so I configured the UFS as localfs under /tmp. Also, Tachyon is running on Ubuntu 14.0, while the Spark cluster from which the Tachyon worker gets called is on RHEL 6.6.
I have tried connecting from a Spark node to the Tachyon node over SSH and executing a few commands with tfs, and that works fine.
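
For reference, I set the UFS in conf/tachyon-env.sh roughly like this (from memory, so worth double-checking):

        export TACHYON_UNDERFS_ADDRESS=/tmp

I believe that is also why the paths in the log start with /tmp/tmp/tachyon: if I understand the defaults correctly, Tachyon places its data and workers folders under tmp/tachyon/ inside the UFS address.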

Regarding the setup: it is all open-source Apache software, running HDFS 2.4, Spark 1.5.1, and Tachyon 0.7.1.

We are trying to save the data from the Spark cluster to Tachyon (the Tachyon host is not part of the Spark cluster). I checked over SSH that the user has permissions to write.


Thanks,
Mallikarjun


Gene Pang

Dec 4, 2015, 10:04:31 AM
to Tachyon Users
Hi Mallikarjun,

I just want to clarify a few points.

You have one Tachyon master and one Tachyon worker, both running on the same machine? And that machine is neither a Spark machine nor an HDFS machine?

Also, you don't have to run Tachyon on an HDFS node; you can always use any remote storage system as the UFS for Tachyon.

And what commands are you using to save the file into Tachyon?

Thanks,
Gene

mallikarju...@gmail.com

Dec 7, 2015, 1:49:26 AM
to Tachyon Users
Hi Gene,

The Tachyon master and worker are running on the same machine (set up in local mode). That machine is not a Spark machine or an HDFS machine.

I don't have HDFS on the Tachyon machine, so I am using localfs as the UFS.

I am using the command below to save the Spark DataFrame into Tachyon:
dataframe.write().format("com.databricks.spark.csv").save(tachyon_path)
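
For context, the surrounding code looks roughly like the sketch below (simplified; the input source, host, port, and paths are placeholders, not the real ones):

        // Simplified sketch of the failing write, using the Spark 1.5 Java API
        // with the spark-csv 1.x data source. All paths here are placeholders.
        DataFrame dataframe = sqlContext.read()
                .format("com.databricks.spark.csv")
                .load("hdfs://<namenode>:9000/input/source.csv"); // hypothetical input
        String tachyon_path = "tachyon://<tachyon-host>:19998/output/result_csv";
        dataframe.write()
                .format("com.databricks.spark.csv")
                .save(tachyon_path);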


Thanks,
Mallikarjun Reddy

Gene Pang

Dec 7, 2015, 10:49:01 AM
to Tachyon Users
Hi Mallikarjun,

I was wondering if you could try an HDFS UFS instead of localfs. I want to know whether this issue happens even with an HDFS UFS, or whether it is an issue with localfs only.
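
Switching should just be a matter of pointing the UFS address in conf/tachyon-env.sh at your HDFS and restarting, something like the line below (host and port are placeholders for your cluster; you may also need to run ./bin/tachyon format after changing it):

        export TACHYON_UNDERFS_ADDRESS=hdfs://<namenode-host>:9000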

Also, when you stop Tachyon (to restart it), could you delete the logs, so that the master and worker logs are as small as possible when you run the smallest case that reproduces this issue?

Thanks,
Gene