Failed to get broadcast_0_piece0 of broadcast_0 when querying a large dataset with the Stratio Lucene index

siddharth verma

Apr 29, 2016, 8:42:41 AM

to spark-conn...@lists.datastax.com

Hi,

I wrote a Spark job that queries a Stratio Lucene index on Cassandra and does some processing on the loaded dataset.

I run it with spark-submit.

No broadcast variables are used; I use final variables in the transformations instead.

Spark runs in standalone cluster mode: spark-1.6.1-bin-hadoop2.6

When the dataset is large, the job fails with the following stack trace:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 7, 10.41.55.57): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
	... 11 more
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:912)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:910)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.RDD.foreach(RDD.scala:910)
	at org.apache.spark.api.java.JavaRDDLike$class.foreach(JavaRDDLike.scala:332)
	at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:46)
	at myjob.SparkJob1.main(SparkJob1.java:126)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
	... 11 more

Could someone help me understand why I am getting this error and how to fix it?

Thanks

Siddharth Verma

Russell Spitzer

Apr 29, 2016, 12:07:26 PM

to spark-conn...@lists.datastax.com

Usually I see this on OOMs or other system-level failures. You may want to check all the executor logs for signs of the OOM killer or something like that.
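A note on why this can show up without explicit broadcast variables: in the trace, `ResultTask.runTask` calls `Broadcast.value`, which means the executor is fetching the task binary that Spark broadcasts internally. If an executor dies (for example from an OOM), later task attempts can fail to fetch that block, so the broadcast error may be a downstream symptom. Checking every executor log is tedious by hand, so below is a minimal sketch that scans a directory tree for memory-related markers. Assumptions: standalone executors write `stdout`/`stderr` under the worker directory (`$SPARK_HOME/work` when `SPARK_WORKER_DIR` is unset), and the marker list is illustrative, not exhaustive. OOM-killer lines normally live in the kernel log (e.g. `dmesg`) rather than in executor logs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Scans executor log files under a directory for memory-related failure markers. */
class OomLogScan {
    // Marker strings to look for; this list is an illustrative assumption, not exhaustive.
    static final List<String> MARKERS = Arrays.asList(
            "java.lang.OutOfMemoryError", // heap or GC-overhead exhaustion inside a JVM
            "Killed process",             // Linux OOM killer line (normally in the kernel log)
            "Failed to get broadcast");   // the symptom reported in this thread

    // Only text logs: standalone executors write "stdout"/"stderr"; jars and other files are skipped.
    static boolean isLogFile(Path p) {
        String name = p.getFileName().toString();
        return name.equals("stderr") || name.equals("stdout") || name.endsWith(".log");
    }

    /** Returns "file:lineNumber: text" for every log line that contains a marker. */
    static List<String> scan(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).filter(OomLogScan::isLogFile)
                    .collect(Collectors.toList());
        }
        List<String> hits = new ArrayList<>();
        for (Path f : files) {
            // ISO-8859-1 maps every byte to a char, so unexpected bytes never abort the scan.
            List<String> lines = Files.readAllLines(f, StandardCharsets.ISO_8859_1);
            for (int i = 0; i < lines.size(); i++) {
                String line = lines.get(i);
                for (String marker : MARKERS) {
                    if (line.contains(marker)) {
                        hits.add(f + ":" + (i + 1) + ": " + line);
                        break; // report each line once even if several markers match
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Default path is an assumption: the standalone worker directory relative to $SPARK_HOME.
        Path root = Paths.get(args.length > 0 ? args[0] : "work");
        for (String hit : scan(root)) {
            System.out.println(hit);
        }
    }
}
```

Run it on each worker node (e.g. `java OomLogScan /path/to/spark/work`), since executor logs stay on the node that ran the executor.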

--
You received this message because you are subscribed to the Google Groups "DataStax Spark Connector for Apache Cassandra" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-connector-...@lists.datastax.com.

--

Russell Spitzer
Software Engineer

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md
http://spark-packages.org/package/datastax/spark-cassandra-connector

siddharth verma

Apr 29, 2016, 3:24:40 PM

to spark-conn...@lists.datastax.com

Hi Russell,

Is there a way to prevent this?
The cluster configuration is: 3 nodes (16 GB RAM each) running Cassandra (datacentre2).

Each of these nodes also serves as a Spark slave and runs 2 Spark instances, with 4 GB RAM and 2 cores each.

Every time I run the job on that particular data, I get this error on one of the nodes, and on that node "free -m" shows approximately 9 GB free.
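To put the question of over-provisioning in numbers, here is a minimal per-node budget sketch. Assumptions: Cassandra's heap was left to the auto-sizing rule in `cassandra-env.sh` (max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8 GB))), which gives about 4 GB on a 16 GB node; if `MAX_HEAP_SIZE` was set explicitly, substitute that value. The remaining headroom still has to cover Cassandra's off-heap memory, the OS page cache, and each JVM's non-heap overhead, so a single "free -m" reading taken outside the peak does not show how much slack exists while the job runs.

```java
/** Rough per-node memory budget: Cassandra heap plus all Spark executor heaps. */
class NodeMemoryBudget {
    /** Cassandra's default max heap from cassandra-env.sh (applies only if MAX_HEAP_SIZE is unset). */
    static long defaultCassandraHeapMb(long systemMb) {
        long half = Math.min(systemMb / 2, 1024);    // 1/2 RAM, capped at 1 GB
        long quarter = Math.min(systemMb / 4, 8192); // 1/4 RAM, capped at 8 GB
        return Math.max(half, quarter);
    }

    /** RAM left after Cassandra's heap and every Spark instance's heap on one node. */
    static long headroomMb(long systemMb, long cassandraHeapMb, int sparkInstances, long executorHeapMb) {
        return systemMb - cassandraHeapMb - (long) sparkInstances * executorHeapMb;
    }

    public static void main(String[] args) {
        long node = 16 * 1024;                               // 16 GB node from this thread
        long cassandra = defaultCassandraHeapMb(node);      // assumption: default heap sizing
        long left = headroomMb(node, cassandra, 2, 4 * 1024); // 2 Spark instances x 4 GB
        System.out.println("Cassandra heap (default rule): " + cassandra + " MB");
        System.out.println("Left for OS, page cache, off-heap and JVM overhead: " + left + " MB");
    }
}
```

Under these assumptions the heaps alone claim 12 GB of the 16 GB, leaving about 4 GB for everything else on the node.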

siddharth verma

Apr 30, 2016, 7:14:30 AM

to spark-conn...@lists.datastax.com

I reduced the RAM per instance from 4 GB to 1 GB, and the same exception still occurs.

If memory fills up with data, won't it spill to disk automatically? If it does, an OOM shouldn't occur, so it shouldn't be the cause.

When I load the data, should I call persist/cache before applying foreach?

Thanks

Siddharth Verma

Russell Spitzer

Apr 30, 2016, 10:51:41 AM

to spark-conn...@lists.datastax.com

Spark isn't able to automatically spill to disk unless the partitions of the RDD are already smaller than the RAM provided to the executors. Increasing the memory or decreasing the number of cores may help. Also, if your foreach or other code does something that increases heap usage, you can OOM that way. From the description of your setup, it sounds like you may be over-provisioning.

Have you looked at your executor logs yet?
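The effect of "increase memory or decrease cores" can be estimated with a short sketch. Assumptions: Spark 1.6 defaults for the unified memory manager, i.e. 300 MB of reserved memory and `spark.memory.fraction` = 0.75, and one running task per core. In 1.6 each active task can claim at most 1/N of the execution pool when N tasks are running, so the per-task figure below is an upper bound; objects built by your own code live in the rest of the heap and are not covered by it.

```java
/** Rough per-task memory estimate for a Spark 1.6 executor (unified memory manager). */
class ExecutorMemoryEstimate {
    static final double RESERVED_MB = 300.0;    // fixed reserved memory in Spark 1.6
    static final double MEMORY_FRACTION = 0.75; // spark.memory.fraction default in Spark 1.6

    /** Size of the unified (execution + storage) region for a given executor heap. */
    static double unifiedRegionMb(double heapMb) {
        return Math.max(0.0, (heapMb - RESERVED_MB) * MEMORY_FRACTION);
    }

    /** Upper bound on unified memory per task when every core runs a task concurrently. */
    static double perTaskMb(double heapMb, int cores) {
        return unifiedRegionMb(heapMb) / cores;
    }

    public static void main(String[] args) {
        // The two configurations tried in this thread: 4 GB and 1 GB per instance, 2 cores each.
        for (double heap : new double[] {4096, 1024}) {
            System.out.printf("heap=%.0f MB unified=%.1f MB per-task=%.1f MB%n",
                    heap, unifiedRegionMb(heap), perTaskMb(heap, 2));
        }
    }
}
```

This prints a per-task bound of 1423.5 MB for the 4 GB setup and 271.5 MB for the 1 GB setup, which suggests that shrinking each instance to 1 GB moved in the wrong direction: a partition that did not fit before has even less room now. Dropping to 1 core at 4 GB would double the bound to 2847.0 MB.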

siddharth verma

May 2, 2016, 1:54:56 AM

to spark-conn...@lists.datastax.com

In this case, foreach doesn't do anything that could increase memory usage.

I checked the executor logs and saw an exception thrown by the query generated in some cases.

This is speculation, but I think that when that exception was thrown, the task was killed and the SparkContext may have been stopped; since the other executors were still running when the SparkContext stopped, this caused the broadcast exception (http://stackoverflow.com/questions/34457486/why-does-spark-fail-with-failed-to-get-broadcast-0-piece0-of-broadcast-0-in-lo).

This is the best guess I could come up with; I'm still not sure about it.
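If the root cause is a per-record exception from the generated query, the task fails, Spark retries it (the trace shows "failed 4 times", which matches the default `spark.task.maxFailures` of 4), and the broadcast error is only what the final attempt reported. One option, assuming skipping a bad record is acceptable for this job, is to guard each record so one failure does not abort the whole partition. Below is a minimal plain-Java sketch of that guard; in Spark 1.6 its body would sit inside a `JavaRDD.foreachPartition` function, which receives the partition's rows as an `Iterator`. The names `forEachGuarded` and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Applies an action to each record, collecting failures instead of aborting the loop. */
class GuardedForeach {
    /** Hypothetical helper: returns a description of every record whose action threw. */
    static <T> List<String> forEachGuarded(Iterator<T> records, Consumer<T> action) {
        List<String> failures = new ArrayList<>();
        while (records.hasNext()) {
            T record = records.next();
            try {
                action.accept(record);
            } catch (RuntimeException e) {
                // Keep the failing record and the cause so they can be logged or counted.
                failures.add(record + " -> " + e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Hypothetical input: the empty id stands in for a record that yields a bad query.
        List<String> ids = Arrays.asList("a", "b", "", "d");
        List<String> failed = forEachGuarded(ids.iterator(), id -> {
            if (id.isEmpty()) {
                throw new IllegalArgumentException("empty id in generated query");
            }
            System.out.println("processed " + id);
        });
        System.out.println("failed records: " + failed);
    }
}
```

Swallowing errors hides data problems, so the failures should still be logged on the executor or counted (for example with an accumulator) and reviewed, rather than silently dropped.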
