Loss was due to java.util.NoSuchElementException


surfer

Apr 10, 2013, 4:34:40 AM
to spark...@googlegroups.com
Hi all,

I'm running a job, and after a lot of computation there is a failure that
seems related to the shuffle stage.
I set System.setProperty("spark.akka.frameSize", "30") because of
the ~12 MB output size being shuffled.
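For context, the property only takes effect if it is set before the SparkContext (and therefore its Akka actor system) is created. A minimal sketch of what I mean, where the master URL and job name are just placeholders:

    import spark.SparkContext

    object ShuffleJob {
      def main(args: Array[String]) {
        // Assumption: the frame size (in MB) has to be raised before the
        // SparkContext is constructed, otherwise the setting is ignored.
        System.setProperty("spark.akka.frameSize", "30")

        // Placeholder master URL and job name.
        val sc = new SparkContext("spark://master:7077", "ShuffleJob")
        // ... the cogroup-heavy job whose shuffle later fails ...
        sc.stop()
      }
    }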
Any idea what is going wrong?
thanks
giovanni


This is the excerpt from the output:

13/04/10 06:52:53 INFO spark.MapOutputTrackerActor: Asked to send map output locations for shuffle 2 to entu151
13/04/10 06:53:02 INFO spark.MapOutputTracker: Size of output statuses for shuffle 2 is 11730944 bytes
13/04/10 06:53:02 INFO spark.MapOutputTrackerActor: Asked to send map output locations for shuffle 2 to entu122
13/04/10 06:53:03 INFO cluster.TaskSetManager: Lost TID 40071 (task 1.0:71)
13/04/10 06:53:03 INFO cluster.TaskSetManager: Loss was due to java.util.NoSuchElementException
        at spark.util.TimeStampedHashMap.apply(TimeStampedHashMap.scala:56)
        at spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:145)
        at spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:14)
        at spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:105)
        at spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:95)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
        at scala.collection.immutable.List.foreach(List.scala:76)
        at spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:95)
        at spark.RDD.computeOrReadCheckpoint(RDD.scala:206)
        at spark.RDD.iterator(RDD.scala:195)
        at spark.MappedValuesRDD.compute(PairRDDFunctions.scala:649)
        at spark.RDD.computeOrReadCheckpoint(RDD.scala:206)
        at spark.RDD.iterator(RDD.scala:195)
        at spark.FlatMappedValuesRDD.compute(PairRDDFunctions.scala:659)
        at spark.RDD.computeOrReadCheckpoint(RDD.scala:206)
        at spark.RDD.iterator(RDD.scala:195)
        at spark.rdd.MappedRDD.compute(MappedRDD.scala:12)
        at spark.RDD.computeOrReadCheckpoint(RDD.scala:206)
        at spark.RDD.iterator(RDD.scala:195)
        at spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:19)
        at spark.RDD.computeOrReadCheckpoint(RDD.scala:206)
        at spark.RDD.iterator(RDD.scala:195)
        at spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:125)
        at spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:74)
        at spark.executor.Executor$TaskRunner.run(Executor.scala:101)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

Matei Zaharia

Apr 10, 2013, 1:08:26 PM
to spark...@googlegroups.com
Is this with Spark Streaming or just Spark?

Matei

surfer

Apr 11, 2013, 1:18:17 AM
to spark...@googlegroups.com
On 04/10/2013 07:08 PM, Matei Zaharia wrote:
> Is this with Spark Streaming or just Spark?
It's just Spark.

giovanni


Nathan

Jul 19, 2013, 9:51:49 AM
to spark...@googlegroups.com, sur...@crs4.it
I'm getting a very similar error.  On the client, the first error I get (and, really, all the other errors except the last "failed more than 4 times") is:

[INFO] 19 Jul 2013 09:21:03 - spark.Logging$class - Lost TID 4713 (task 1.0:205)
[INFO] 19 Jul 2013 09:21:03 - spark.Logging$class - Loss was due to java.util.NoSuchElementException
        at spark.util.TimeStampedHashMap.apply(TimeStampedHashMap.scala:56)
        at spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:135)
        at spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:16)
        at spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:10)
        at spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:31)
        at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
        at spark.RDD.iterator(RDD.scala:196)
        at spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:19)
        at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
        at spark.RDD.iterator(RDD.scala:196)
        at spark.rdd.MappedRDD.compute(MappedRDD.scala:12)
        at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
        at spark.RDD.iterator(RDD.scala:196)
        at spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:19)
        at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
        at spark.RDD.iterator(RDD.scala:196)
        at spark.scheduler.ResultTask.run(ResultTask.scala:77)
        at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)


In the worker node logs, I see three kinds of output, interleaved:

13/07/19 13:20:53 INFO executor.StandaloneExecutorBackend: Got assigned task 4813
13/07/19 13:20:53 INFO executor.Executor: Running task ID 4813

I see this for many tasks. A bunch of these, for different task IDs, are interleaved with

13/07/19 13:20:53 INFO executor.Executor: Its generation is -1
13/07/19 13:20:53 INFO spark.MapOutputTracker: Don't have map outputs for shuffle 1, fetching them

in a block, and the block is followed by a bunch of exceptions:

13/07/19 13:21:03 ERROR executor.Executor: Exception in task ID 4813
java.util.NoSuchElementException
	at spark.util.TimeStampedHashMap.apply(TimeStampedHashMap.scala:56)
	at spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:135)
	at spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:16)
	at spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:10)
	at spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:31)
	at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
	at spark.RDD.iterator(RDD.scala:196)
	at spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:19)
	at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
	at spark.RDD.iterator(RDD.scala:196)
	at spark.rdd.MappedRDD.compute(MappedRDD.scala:12)
	at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
	at spark.RDD.iterator(RDD.scala:196)
	at spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:19)
	at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
	at spark.RDD.iterator(RDD.scala:196)
	at spark.scheduler.ResultTask.run(ResultTask.scala:77)
	at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)

Interspersed among about 15 NoSuchElementExceptions is a single

13/07/19 13:21:03 ERROR executor.Executor: Exception in task ID 4513
spark.SparkException: Error communicating with MapOutputTracker
	at spark.MapOutputTracker.askTracker(MapOutputTracker.scala:68)
	at spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:147)
	at spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:16)
	at spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:10)
	at spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:31)
	at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
	at spark.RDD.iterator(RDD.scala:196)
	at spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:19)
	at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
	at spark.RDD.iterator(RDD.scala:196)
	at spark.rdd.MappedRDD.compute(MappedRDD.scala:12)
	at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
	at spark.RDD.iterator(RDD.scala:196)
	at spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:19)
	at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
	at spark.RDD.iterator(RDD.scala:196)
	at spark.scheduler.ResultTask.run(ResultTask.scala:77)
	at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000] milliseconds
	at akka.dispatch.DefaultPromise.ready(Future.scala:870)
	at akka.dispatch.DefaultPromise.result(Future.scala:874)
	at akka.dispatch.Await$.result(Future.scala:74)
	at spark.MapOutputTracker.askTracker(MapOutputTracker.scala:65)
	... 20 more

This then repeats (a block of missing shuffles, then a block of exceptions). Occasionally another exception is thrown in for good measure:

13/07/19 13:21:15 WARN storage.BlockManagerMaster: Error sending message to BlockManagerMaster in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [10000] milliseconds
	at akka.dispatch.DefaultPromise.ready(Future.scala:870)
	at akka.dispatch.DefaultPromise.result(Future.scala:874)
	at akka.dispatch.Await$.result(Future.scala:74)
	at spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:136)
	at spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:39)
	at spark.storage.BlockManager.spark$storage$BlockManager$$heartBeat(BlockManager.scala:115)
	at spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:142)
	at akka.actor.DefaultScheduler$$anon$1.run(Scheduler.scala:142)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:94)
	at akka.jsr166y.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1381)
	at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
	at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
	at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
	at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

All of this happens when I split a job from around 200 partitions into around 4000.
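For illustration, this is roughly the kind of change I mean; "pairs", the key/value types, and the exact counts are placeholders:

    val coarse = pairs.groupByKey(200)    // before: shuffle into ~200 partitions
    val fine   = pairs.groupByKey(4000)   // after: shuffle into ~4000 partitions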

I didn't see an answer to the original question, and I think this is the same issue. Does anyone know what is going on here, and why?

Thanks,
                    -Nathan Kronenfeld
