BlockIndex is out of bound when I run a Spark program on Tachyon. Why?


daisyd...@gmail.com

Jul 30, 2014, 12:22:15 AM
to tachyo...@googlegroups.com
Hi all,
  Recently, I ran an iterative algorithm on Spark, saving each iteration's result in Tachyon. The first few dozen iterations ran without error, but then the following exception occurred:

14/07/30 11:20:59 INFO BlockManagerInfo: Added rdd_322_12 on tachyon on slave1:34801 (size: 19.8 MB)
14/07/30 11:20:59 INFO BlockManagerInfo: Added rdd_172_18 on tachyon on slave1:34801 (size: 19.8 MB)
14/07/30 11:20:59 INFO BlockManagerInfo: Added rdd_237_15 on tachyon on slave1:34801 (size: 19.8 MB)
14/07/30 11:20:59 INFO BlockManagerInfo: Added rdd_497_12 on tachyon on slave1:34801 (size: 19.8 MB)
14/07/30 11:20:59 INFO BlockManagerInfo: Added rdd_482_9 on tachyon on slave1:34801 (size: 19.8 MB)
14/07/30 11:20:59 WARN TaskSetManager: Lost TID 6172 (task 509.0:19)
14/07/30 11:20:59 WARN TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: BlockIndex 0 is out of the bound in file ClientFileInfo(id:33697, name:rdd_517_19, path:/tmp_spark_tachyon/spark-828b9305-c646-4406-8fed-fa03b58cb858/1/spark-tachyon-20140730104735-9423/39/rdd_517_19, checkpointPath:, length:0, blockSizeByte:1073741824, creationTimeMs:1406690018055, complete:false, folder:false, inMemory:true, needPin:false, needCache:true, blockIds:[], dependencyId:-1, inMemoryPercentage:100)
        at tachyon.client.TachyonFS.getClientBlockInfo(TachyonFS.java:486)
        at tachyon.client.TachyonFile.getLocationHosts(TachyonFile.java:85)
        at org.apache.spark.storage.TachyonStore.getBytes(TachyonStore.scala:112)
        at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:398)
        at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:341)

This problem recurs every time. How can I solve it, and why does it happen?
Thanks in advance.

Haoyuan Li

Aug 2, 2014, 8:52:18 PM
to daisyd...@gmail.com, tachyo...@googlegroups.com, Rong Gu
Did you use the rdd.persist(OFF_HEAP) API?
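For context, OFF_HEAP storage in Spark 1.x is backed by Tachyon. A minimal sketch of how it is typically wired up, assuming a Tachyon master at master:19998 (a placeholder host); the property names below are from the Spark 1.x configuration documentation:

```properties
# spark-defaults.conf (Spark 1.x); the master host below is a placeholder
spark.tachyonStore.url      tachyon://master:19998
spark.tachyonStore.baseDir  /tmp_spark_tachyon
```

In the application, the RDD is then persisted with rdd.persist(StorageLevel.OFF_HEAP).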

Best,

Haoyuan


--
You received this message because you are subscribed to the Google Groups "Tachyon Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tachyon-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Haoyuan Li
AMPLab, EECS, UC Berkeley

Manku Timma

Sep 3, 2014, 9:54:05 PM
to tachyo...@googlegroups.com, daisyd...@gmail.com, gurong...@gmail.com
I am using rdd.persist(OFF_HEAP) and am facing exactly the same problem.
Any clues on how to debug this? I also tried setting tachyon.debug=true in conf/tachyon-env.sh,
but nothing extra shows up in the logs directory.

Haoyuan Li

Sep 16, 2014, 1:05:11 AM
to Manku Timma, tachyo...@googlegroups.com, daisyd...@gmail.com, Rong Gu
Could you please copy and paste the corresponding Tachyon worker's log?

Haoyuan

Haithem Turki

Oct 22, 2014, 8:36:10 PM
to tachyo...@googlegroups.com, manku....@gmail.com, daisyd...@gmail.com, gurong...@gmail.com
I just ran into this bug as well with Spark 1.1 and Tachyon 0.5. Things worked fine for a few minutes, and then one of my workers started spitting out the same error:

14/10/22 20:22:34 ERROR executor.Executor: Exception in task 50.8612 in stage 13.0 (TID 18187)
java.io.IOException: BlockIndex 0 is out of the bound in file ClientFileInfo(id:1560, name:rdd_15_50, path:/tmp_spark_tachyon/spark-31dc83d6-1a95-42e9-ae65-fb698bb5ba8a/4/spark-tachyon-20141022201558-81fb/37/rdd_15_50, ufsPath:, length:0, blockSizeByte:1073741824, creationTimeMs:1414023756663, isComplete:true, isFolder:false, isPinned:false, isCache:true, blockIds:[], dependencyId:-1, inMemoryPercentage:100)
	at tachyon.client.TachyonFS.getClientBlockInfo(TachyonFS.java:785)
	at tachyon.client.TachyonFile.getLocationHosts(TachyonFile.java:172)
	at org.apache.spark.storage.TachyonStore.getBytes(TachyonStore.scala:104)
	at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:435)
	at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:368)
	at org.apache.spark.storage.BlockManager.get(BlockManager.scala:552)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
	at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:54)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
14/10/22 20:22:34 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18188
14/10/22 20:22:34 INFO executor.Executor: Running task 52.8630 in stage 13.0 (TID 18188)
14/10/22 20:22:34 ERROR executor.Executor: Exception in task 52.8630 in stage 13.0 (TID 18188)
java.io.IOException: BlockIndex 0 is out of the bound in file ClientFileInfo(id:1561, name:rdd_15_52, path:/tmp_spark_tachyon/spark-31dc83d6-1a95-42e9-ae65-fb698bb5ba8a/4/spark-tachyon-20141022201558-81fb/35/rdd_15_52, ufsPath:, length:0, blockSizeByte:1073741824, creationTimeMs:1414023756720, isComplete:true, isFolder:false, isPinned:false, isCache:true, blockIds:[], dependencyId:-1, inMemoryPercentage:100)
	at tachyon.client.TachyonFS.getClientBlockInfo(TachyonFS.java:785)
	at tachyon.client.TachyonFile.getLocationHosts(TachyonFile.java:172)
	at org.apache.spark.storage.TachyonStore.getBytes(TachyonStore.scala:104)
	at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:435)
	at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:368)
	at org.apache.spark.storage.BlockManager.get(BlockManager.scala:552)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
	at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:54)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
14/10/22 20:22:34 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18189
Attached are the worker logs from around that time.
sadWorkerLog.txt

顾荣 (Rong Gu)

Oct 23, 2014, 8:18:39 AM
to tachyo...@googlegroups.com
Hey,

I can see from your log that the size of the RDD file stored in Tachyon is 0:
"java.io.IOException: BlockIndex 0 is out of the bound in file ClientFileInfo(id:33697, name:rdd_517_19, path:/tmp_spark_tachyon/spark-828b9305-c646-4406-8fed-fa03b58cb858/1/spark-tachyon-20140730104735-9423/39/rdd_517_19, checkpointPath:, length:0,"

Is that true for your application? If it is, then tachyon.client.TachyonFS.getClientBlockInfo can throw this error, because a zero-length file contains no blocks.
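The length:0 case can be illustrated with a small standalone sketch (a simplified illustration, not the actual Tachyon source; the class and method names here are hypothetical): a file's block count is the ceiling of its length divided by the block size, so a zero-length file has zero blocks and even BlockIndex 0 fails the bounds check.

```java
// Simplified illustration of why a zero-length file makes every block index invalid.
public class BlockIndexCheck {
    // Number of blocks in a file: ceiling of length / blockSize.
    static long numBlocks(long lengthBytes, long blockSizeBytes) {
        return (lengthBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Mirrors the kind of bounds check performed before returning block metadata.
    static void checkBlockIndex(long lengthBytes, long blockSizeBytes, int blockIndex) {
        if (blockIndex >= numBlocks(lengthBytes, blockSizeBytes)) {
            throw new IllegalArgumentException("BlockIndex " + blockIndex
                    + " is out of the bound in file of length " + lengthBytes);
        }
    }

    public static void main(String[] args) {
        long oneGiB = 1073741824L; // blockSizeByte from the log
        System.out.println(numBlocks(0L, oneGiB));                // a length:0 file has 0 blocks
        System.out.println(numBlocks(20L * 1024 * 1024, oneGiB)); // a ~20 MB partition has 1
        try {
            checkBlockIndex(0L, oneGiB, 0); // the failing case: index 0 of a 0-block file
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This is why the exception appears even though the requested index is 0: the question is not which index was asked for, but why the file was committed to Tachyon with length 0 in the first place.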

Best,
Rong

shang wu

Dec 18, 2014, 5:07:05 AM
to tachyo...@googlegroups.com
Hi, I got the same error.
Have you solved it?
Thanks.

On Wednesday, July 30, 2014 at 12:22:15 PM UTC+8, daisyd...@gmail.com wrote:

John Yost

Dec 23, 2014, 5:36:44 PM
to tachyo...@googlegroups.com
Yeah, I am seeing exactly the same thing. I was able to persist a couple of RDDs, but when I came back after a few minutes, subsequent persist invocations failed. Does anyone have an idea what's going on?

Mahmoud Hanafy

Mar 10, 2015, 8:47:04 AM
to tachyo...@googlegroups.com
I have the same problem. I printed the size of the RDD before persisting it, and the size is not zero.
Did anyone find a solution?

Haoyuan Li

Mar 10, 2015, 1:34:41 PM
to Mahmoud Hanafy, tachyo...@googlegroups.com
Mahmoud, which versions are you running?

Best,

Haoyuan

On Tue, Mar 10, 2015 at 5:47 AM, Mahmoud Hanafy <mahmoud...@badrit.com> wrote:
I have the same problem. I printed the size of the RDD before persisting it, and the size is not zero.
Did anyone find a solution?


Mahmoud Hanafy

Mar 10, 2015, 3:08:58 PM
to tachyo...@googlegroups.com, mahmoud...@badrit.com
I'm using:
Tachyon 0.5.0
Spark 1.1.0 
Hadoop 2.4

Calvin Jia

Mar 10, 2015, 6:06:04 PM
to tachyo...@googlegroups.com, mahmoud...@badrit.com
Hi Mahmoud,

Could you provide the worker logs for when this happened?

Thanks,
Calvin

Mahmoud Hanafy

Mar 11, 2015, 5:41:54 AM
to Calvin Jia, tachyo...@googlegroups.com
Hi Calvin,

Here is the log for one of the workers.

Thanks,
Mahmoud
worker.log

Calvin Jia

Mar 13, 2015, 12:24:31 AM
to tachyo...@googlegroups.com
Thanks for the info. Looking at the logs, it seems like the RDDs causing the issue are all 0 length. Do you mean that the size of the RDD is inconsistent (i.e., a non-zero RDD was written but now has 0 length)?