BlockIndex is out of bound when I run a Spark program on Tachyon. Why?


daisyd...@gmail.com

Jul 30, 2014, 12:22:15 AM
to tachyo...@googlegroups.com
Hi all,
  Recently, I ran an iterative algorithm on Spark, saving each iteration's result in Tachyon. The first few dozen iterations ran without error, but then the following exception occurred:

14/07/30 11:20:59 INFO BlockManagerInfo: Added rdd_322_12 on tachyon on slave1:34801 (size: 19.8 MB)
14/07/30 11:20:59 INFO BlockManagerInfo: Added rdd_172_18 on tachyon on slave1:34801 (size: 19.8 MB)
14/07/30 11:20:59 INFO BlockManagerInfo: Added rdd_237_15 on tachyon on slave1:34801 (size: 19.8 MB)
14/07/30 11:20:59 INFO BlockManagerInfo: Added rdd_497_12 on tachyon on slave1:34801 (size: 19.8 MB)
14/07/30 11:20:59 INFO BlockManagerInfo: Added rdd_482_9 on tachyon on slave1:34801 (size: 19.8 MB)
14/07/30 11:20:59 WARN TaskSetManager: Lost TID 6172 (task 509.0:19)
14/07/30 11:20:59 WARN TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: BlockIndex 0 is out of the bound in file ClientFileInfo(id:33697, name:rdd_517_19, path:/tmp_spark_tachyon/spark-828b9305-c646-4406-8fed-fa03b58cb858/1/spark-tachyon-20140730104735-9423/39/rdd_517_19, checkpointPath:, length:0, blockSizeByte:1073741824, creationTimeMs:1406690018055, complete:false, folder:false, inMemory:true, needPin:false, needCache:true, blockIds:[], dependencyId:-1, inMemoryPercentage:100)
        at tachyon.client.TachyonFS.getClientBlockInfo(TachyonFS.java:486)
        at tachyon.client.TachyonFile.getLocationHosts(TachyonFile.java:85)
        at org.apache.spark.storage.TachyonStore.getBytes(TachyonStore.scala:112)
        at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:398)
        at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:341)

This problem recurs every time. How can I solve it, and why does it happen?
Thanks in advance.

Haoyuan Li

Aug 2, 2014, 8:52:18 PM
to daisyd...@gmail.com, tachyo...@googlegroups.com, Rong Gu
Did you use the rdd.persist(OFF_HEAP) API?
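For context, OFF_HEAP storage in Spark 1.x is backed by Tachyon. A minimal sketch of how it is typically wired up, assuming a Tachyon master at master:19998 (a placeholder host); the property names below are from the Spark 1.x configuration documentation:

```properties
# spark-defaults.conf (Spark 1.x); the master host below is a placeholder
spark.tachyonStore.url      tachyon://master:19998
spark.tachyonStore.baseDir  /tmp_spark_tachyon
```

In the application, the RDD is then persisted with rdd.persist(StorageLevel.OFF_HEAP).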

Best,

Haoyuan


--
You received this message because you are subscribed to the Google Groups "Tachyon Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tachyon-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Haoyuan Li
AMPLab, EECS, UC Berkeley

Manku Timma

Sep 3, 2014, 9:54:05 PM
to tachyo...@googlegroups.com, daisyd...@gmail.com, gurong...@gmail.com
I am using rdd.persist(OFF_HEAP) and am facing exactly the same problem.
Any clues on how to debug this? I also tried setting tachyon.debug=true in conf/tachyon-env.sh,
but nothing extra shows up in the logs directory.

Haoyuan Li

Sep 16, 2014, 1:05:11 AM
to Manku Timma, tachyo...@googlegroups.com, daisyd...@gmail.com, Rong Gu
Could you please copy and paste the corresponding Tachyon worker's log?

Haoyuan

Haithem Turki

Oct 22, 2014, 8:36:10 PM
to tachyo...@googlegroups.com, manku....@gmail.com, daisyd...@gmail.com, gurong...@gmail.com
I just ran into this bug as well with Spark 1.1 and Tachyon 0.5. Things worked fine for a few minutes, and then one of my workers started spitting out the same error:

14/10/22 20:22:34 ERROR executor.Executor: Exception in task 50.8612 in stage 13.0 (TID 18187)
java.io.IOException: BlockIndex 0 is out of the bound in file ClientFileInfo(id:1560, name:rdd_15_50, path:/tmp_spark_tachyon/spark-31dc83d6-1a95-42e9-ae65-fb698bb5ba8a/4/spark-tachyon-20141022201558-81fb/37/rdd_15_50, ufsPath:, length:0, blockSizeByte:1073741824, creationTimeMs:1414023756663, isComplete:true, isFolder:false, isPinned:false, isCache:true, blockIds:[], dependencyId:-1, inMemoryPercentage:100)
	at tachyon.client.TachyonFS.getClientBlockInfo(TachyonFS.java:785)
	at tachyon.client.TachyonFile.getLocationHosts(TachyonFile.java:172)
	at org.apache.spark.storage.TachyonStore.getBytes(TachyonStore.scala:104)
	at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:435)
	at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:368)
	at org.apache.spark.storage.BlockManager.get(BlockManager.scala:552)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
	at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:54)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
14/10/22 20:22:34 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18188
14/10/22 20:22:34 INFO executor.Executor: Running task 52.8630 in stage 13.0 (TID 18188)
14/10/22 20:22:34 ERROR executor.Executor: Exception in task 52.8630 in stage 13.0 (TID 18188)
java.io.IOException: BlockIndex 0 is out of the bound in file ClientFileInfo(id:1561, name:rdd_15_52, path:/tmp_spark_tachyon/spark-31dc83d6-1a95-42e9-ae65-fb698bb5ba8a/4/spark-tachyon-20141022201558-81fb/35/rdd_15_52, ufsPath:, length:0, blockSizeByte:1073741824, creationTimeMs:1414023756720, isComplete:true, isFolder:false, isPinned:false, isCache:true, blockIds:[], dependencyId:-1, inMemoryPercentage:100)
	at tachyon.client.TachyonFS.getClientBlockInfo(TachyonFS.java:785)
	at tachyon.client.TachyonFile.getLocationHosts(TachyonFile.java:172)
	at org.apache.spark.storage.TachyonStore.getBytes(TachyonStore.scala:104)
	at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:435)
	at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:368)
	at org.apache.spark.storage.BlockManager.get(BlockManager.scala:552)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
	at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:54)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
14/10/22 20:22:34 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18189
Attached are the worker logs from around that time.
sadWorkerLog.txt

顾荣 (Rong Gu)

Oct 23, 2014, 8:18:39 AM
to tachyo...@googlegroups.com
Hey,

I can see from your log that the size of the RDD file stored in Tachyon is 0:
"java.io.IOException: BlockIndex 0 is out of the bound in file ClientFileInfo(id:33697, name:rdd_517_19, path:/tmp_spark_tachyon/spark-828b9305-c646-4406-8fed-fa03b58cb858/1/spark-tachyon-20140730104735-9423/39/rdd_517_19, checkpointPath:, length:0,"

Is that true for your application? If it is, then tachyon.client.TachyonFS.getClientBlockInfo can throw this error, because a zero-length file contains no blocks.
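The length:0 case can be illustrated with a small standalone sketch (a simplified illustration, not the actual Tachyon source; the class and method names here are hypothetical): a file's block count is the ceiling of its length divided by the block size, so a zero-length file has zero blocks and even BlockIndex 0 fails the bounds check.

```java
// Simplified illustration of why a zero-length file makes every block index invalid.
public class BlockIndexCheck {
    // Number of blocks in a file: ceiling of length / blockSize.
    static long numBlocks(long lengthBytes, long blockSizeBytes) {
        return (lengthBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Mirrors the kind of bounds check performed before returning block metadata.
    static void checkBlockIndex(long lengthBytes, long blockSizeBytes, int blockIndex) {
        if (blockIndex >= numBlocks(lengthBytes, blockSizeBytes)) {
            throw new IllegalArgumentException("BlockIndex " + blockIndex
                    + " is out of the bound in file of length " + lengthBytes);
        }
    }

    public static void main(String[] args) {
        long oneGiB = 1073741824L; // blockSizeByte from the log
        System.out.println(numBlocks(0L, oneGiB));                // a length:0 file has 0 blocks
        System.out.println(numBlocks(20L * 1024 * 1024, oneGiB)); // a ~20 MB partition has 1
        try {
            checkBlockIndex(0L, oneGiB, 0); // the failing case: index 0 of a 0-block file
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This is why the exception appears even though the requested index is 0: the question is not which index was asked for, but why the file was committed to Tachyon with length 0 in the first place.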

Best,
Rong

shang wu

Dec 18, 2014, 5:07:05 AM
to tachyo...@googlegroups.com
Hi, I got the same error.
Have you solved it?
Thanks.

On Wednesday, July 30, 2014 at 12:22:15 PM UTC+8, daisyd...@gmail.com wrote:

John Yost

Dec 23, 2014, 5:36:44 PM
to tachyo...@googlegroups.com
Yeah, I am seeing exactly the same thing. I was able to persist a couple of RDDs, but when I came back after a few minutes, subsequent persist invocations failed. Does anyone have an idea what's going on?

Mahmoud Hanafy

Mar 10, 2015, 8:47:04 AM
to tachyo...@googlegroups.com
I have the same problem. I printed the size of the RDD before persisting it, and the size is not zero.
Did anyone find a solution?

Haoyuan Li

Mar 10, 2015, 1:34:41 PM
to Mahmoud Hanafy, tachyo...@googlegroups.com
Mahmoud, which versions are you running?

Best,

Haoyuan

On Tue, Mar 10, 2015 at 5:47 AM, Mahmoud Hanafy <mahmoud...@badrit.com> wrote:
I have the same problem. I printed the size of the RDD before persisting it, and the size is not zero.
Did anyone find a solution?


Mahmoud Hanafy

Mar 10, 2015, 3:08:58 PM
to tachyo...@googlegroups.com, mahmoud...@badrit.com
I'm using:
Tachyon 0.5.0
Spark 1.1.0 
Hadoop 2.4

Calvin Jia

Mar 10, 2015, 6:06:04 PM
to tachyo...@googlegroups.com, mahmoud...@badrit.com
Hi Mahmoud,

Could you provide the worker logs for when this happened?

Thanks,
Calvin

Mahmoud Hanafy

Mar 11, 2015, 5:41:54 AM
to Calvin Jia, tachyo...@googlegroups.com
Hi Calvin,

Here is the log for one of the workers.

Thanks,
Mahmoud
worker.log

Calvin Jia

Mar 13, 2015, 12:24:31 AM
to tachyo...@googlegroups.com
Thanks for the info. Looking at the logs, it seems like the RDDs causing the issue are all 0 length. Do you mean that the size of the RDD is inconsistent (i.e., a non-zero RDD was written but now has 0 length)?