I have been doing some fault-tolerance testing of Spark Streaming with Tachyon as the OFF_HEAP block store. As I mentioned in an earlier email, I was able to solve the BlockNotFound exception by using Tachyon's Hierarchical Storage, which is good.
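For context, the off-heap store is wired up roughly like this on my side. This is a minimal sketch against the Spark 1.3/1.4-style spark.tachyonStore.* settings; the master URL is my cluster's, the baseDir value is just an illustration, and the workers' Hierarchical Storage tiers are configured on the Tachyon side (conf/tachyon-env.sh), which I have not shown here:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class OffHeapSmokeTest {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("tachyon-offheap-smoke-test")
        // Point the OFF_HEAP block store at the Tachyon master
        // (tachyon-ft:// because the master runs in fault-tolerant mode).
        .set("spark.tachyonStore.url", "tachyon-ft://10.252.5.113:19998")
        .set("spark.tachyonStore.baseDir", "/spark_offheap"); // illustrative path

    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
    rdd.persist(StorageLevel.OFF_HEAP()); // blocks land in Tachyon, not the JVM heap
    System.out.println("count = " + rdd.count());
    sc.stop();
  }
}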
I have continued testing, this time also storing the Spark Streaming WAL and checkpoint files in Tachyon. Here are a few findings.
When I store the Spark Streaming checkpoint data in Tachyon, the throughput is much higher. I tested both the Driver and Receiver failure cases: on Driver failure, Spark Streaming recovers without any data loss. On Receiver failure, however, the relaunched tasks cannot read the receiver WAL back from Tachyon and the job fails with the exception below.
If I change the checkpoint location back to HDFS, Spark Streaming recovers from both Driver and Receiver failure. For reference, the checkpoint/WAL wiring in the two runs is sketched right after this paragraph, followed by the Receiver-failure trace from the Tachyon run.
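A minimal sketch (not my exact Consumer.java; the receiver here is a stand-in for mine, the HDFS URI is illustrative, and flipping the one checkpointDir string is the only difference between the two runs):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;

public class Consumer {
  public static void main(String[] args) {
    // Flip this single string between the Tachyon and HDFS runs.
    final String checkpointDir = "tachyon-ft://10.252.5.113:19998/tachyon/checkpoint";
    // final String checkpointDir = "hdfs://namenode:8020/spark/checkpoint"; // illustrative

    final SparkConf conf = new SparkConf()
        .setAppName("wal-recovery-test")
        // Write received blocks to a WAL under <checkpointDir>/receivedData/,
        // which is where the FileBasedWriteAheadLogSegment paths in the trace live.
        .set("spark.streaming.receiver.writeAheadLog.enable", "true");

    JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir,
        new JavaStreamingContextFactory() {
          @Override
          public JavaStreamingContext create() {
            JavaStreamingContext c = new JavaStreamingContext(conf, Durations.seconds(5));
            c.checkpoint(checkpointDir);
            // Stand-in for my actual receiver; blocks persisted OFF_HEAP in Tachyon.
            JavaReceiverInputDStream<String> lines =
                c.socketTextStream("localhost", 9999, StorageLevel.OFF_HEAP());
            lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
              @Override
              public Void call(JavaRDD<String> rdd) {
                System.out.println("batch count = " + rdd.count());
                return null;
              }
            });
            return c;
          }
        });

    jssc.start();
    jssc.awaitTermination();
  }
}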
INFO : org.apache.spark.scheduler.DAGScheduler - Executor lost: 2 (epoch 1)
INFO : org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove executor 2 from BlockManagerMaster.
INFO : org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block manager BlockManagerId(2, 10.252.5.54, 45789)
INFO : org.apache.spark.storage.BlockManagerMaster - Removed 2 successfully in removeExecutor
INFO : org.apache.spark.streaming.scheduler.ReceiverTracker - Registered receiver for stream 2 from 10.252.5.62:47255
WARN : org.apache.spark.scheduler.TaskSetManager - Lost task 2.1 in stage 103.0 (TID 421, 10.252.5.62): org.apache.spark.SparkException:
Could not read data from write ahead log record FileBasedWriteAheadLogSegment(tachyon-ft://10.252.5.113:19998/tachyon/checkpoint/receivedData/2/log-1431341091711-1431341151711,645603894,10891919)
at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:168)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IllegalArgumentException: Seek position is past EOF: 645603894, fileSize = 0
at tachyon.hadoop.HdfsFileInputStream.seek(HdfsFileInputStream.java:239)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
at org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.read(FileBasedWriteAheadLogRandomReader.scala:37)
at org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:104)
... 15 more
INFO : org.apache.spark.scheduler.TaskSetManager - Starting task 2.2 in stage 103.0 (TID 422, 10.252.5.61, ANY, 1909 bytes)
INFO : org.apache.spark.scheduler.TaskSetManager - Starting task 2.3 in stage 103.0 (TID 423, 10.252.5.62, ANY, 1909 bytes)
INFO : org.apache.spark.deploy.client.AppClient$ClientActor - Executor updated: app-20150511104442-0048/2 is now LOST (worker lost)
INFO : org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Executor app-20150511104442-0048/2 removed: worker lost
ERROR: org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Asked to remove non-existent executor 2
ERROR: org.apache.spark.scheduler.TaskSetManager - Task 2 in stage 103.0 failed 4 times; aborting job
INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 103.0, whose tasks have all completed, from pool
INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Cancelling stage 103
INFO : org.apache.spark.scheduler.DAGScheduler - ResultStage 103 (foreachRDD at Consumer.java:92) failed in 0.943 s
INFO : org.apache.spark.scheduler.DAGScheduler - Job 120 failed: foreachRDD at Consumer.java:92, took 0.953482 s
ERROR: org.apache.spark.streaming.scheduler.JobScheduler - Error running job streaming job 1431341145000 ms.0
at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:168)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IllegalArgumentException: Seek position is past EOF: 645603894, fileSize = 0
at tachyon.hadoop.HdfsFileInputStream.seek(HdfsFileInputStream.java:239)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
at org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.read(FileBasedWriteAheadLogRandomReader.scala:37)
at org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:104)
... 15 more