H2OContextUtils: SpreadRDD failure - IPs are not equal


peter.s...@googlemail.com

Jul 12, 2015, 5:59:45 AM
to h2os...@googlegroups.com
Hi h2o community,

I encountered some behaviour that is strange, at least to me, while running a Sparkling Water application on a cluster.

I am running Spark 1.3.1 on Hadoop 2.4, using the Spark standalone master, not YARN.

After excluding some Maven dependencies when building my application, I got the following exception on the cluster:


15/07/12 08:35:03 WARN TaskSetManager: Lost task 2.1 in stage 2.0 (TID 266, xxxx-9.xxx.tu-berlin.de): java.lang.AssertionError: assertion failed: SpreadRDD failure - IPs are not equal: (2,xxxx-3.xxx.tu-berlin.de,-1) != (5, xxxx-9.xxx.tu-berlin.de)
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.h2o.H2OContextUtils$$anonfun$5.apply(H2OContextUtils.scala:112)
at org.apache.spark.h2o.H2OContextUtils$$anonfun$5.apply(H2OContextUtils.scala:111)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1176)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.lang.Thread.run(Thread.java:853)

15/07/12 08:35:03 INFO TaskSetManager: Starting task 2.2 in stage 2.0 (TID 269, xxxx-8.xxx.tu-berlin.de, ANY, 2414 bytes)
15/07/12 08:35:03 INFO AppClient$ClientActor: Executor updated: app-20150712083442-0124/6 is now LOADING
15/07/12 08:35:03 ERROR TaskSchedulerImpl: Lost executor 0 on xxxx-5.xxx.tu-berlin.de: remote Akka client disassociated


What I am wondering is: why does excluding some dependencies from the final jar influence the behaviour of the app? If something breaks, shouldn't it fail with something like a 'NoClassDefFoundError'?
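
For what it's worth, here is a quick diagnostic sketch I can run in spark-shell to check which jar a class is actually loaded from, i.e. whether the cluster's copy or my fat jar's copy is in use (akka.actor.ActorSystem is just an example class to inspect, not specific to my problem):

// Print the code source of a class; a null code source means it came
// from the bootstrap classpath. Any class on the classpath works here.
val source = classOf[akka.actor.ActorSystem].getProtectionDomain.getCodeSource
println(s"akka-actor loaded from: ${Option(source).map(_.getLocation)}")

With the exclusion in place, I would expect the location to point at a Spark/cluster jar rather than my application jar.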


Here are the exclusions from my Maven build plugin:

<!--artifactSet>
  <excludes>
    <exclude>org.scala-lang:scala-library</exclude>
    <exclude>org.scala-lang:scala-compiler</exclude>
    <exclude>org.scala-lang:scala-reflect</exclude>
    <exclude>com.amazonaws:aws-java-sdk</exclude>
    <exclude>com.typesafe.akka:akka-actor_*</exclude>
    <exclude>com.typesafe.akka:akka-remote_*</exclude>
    <exclude>com.typesafe.akka:akka-slf4j_*</exclude>
    <exclude>io.netty:netty-all</exclude>
    <exclude>io.netty:netty</exclude>
    <exclude>org.eclipse.jetty:jetty-server</exclude>
    <exclude>org.eclipse.jetty:jetty-continuation</exclude>
    <exclude>org.eclipse.jetty:jetty-http</exclude>
    <exclude>org.eclipse.jetty:jetty-io</exclude>
    <exclude>org.eclipse.jetty:jetty-util</exclude>
    <exclude>org.eclipse.jetty:jetty-security</exclude>
    <exclude>org.eclipse.jetty:jetty-servlet</exclude>
    <exclude>commons-fileupload:commons-fileupload</exclude>
    <exclude>org.apache.avro:avro</exclude>
    <exclude>commons-collections:commons-collections</exclude>
    <exclude>org.codehaus.jackson:jackson-core-asl</exclude>
    <exclude>org.codehaus.jackson:jackson-mapper-asl</exclude>
    <exclude>com.thoughtworks.paranamer:paranamer</exclude>
    <exclude>org.xerial.snappy:snappy-java</exclude>
    <exclude>org.apache.commons:commons-compress</exclude>
    <exclude>org.tukaani:xz</exclude>
    <exclude>com.esotericsoftware.kryo:kryo</exclude>
    <exclude>com.esotericsoftware.minlog:minlog</exclude>
    <exclude>org.objenesis:objenesis</exclude>
    <exclude>com.twitter:chill_*</exclude>
    <exclude>com.twitter:chill-java</exclude>
    <exclude>com.twitter:chill-avro_*</exclude>
    <exclude>com.twitter:chill-bijection_*</exclude>
    <exclude>com.twitter:bijection-core_*</exclude>
    <exclude>com.twitter:bijection-avro_*</exclude>
    <exclude>commons-lang:commons-lang</exclude>
    <exclude>junit:junit</exclude>
    <exclude>de.javakaffee:kryo-serializers</exclude>
    <exclude>joda-time:joda-time</exclude>
    <exclude>org.apache.commons:commons-lang3</exclude>
    <exclude>org.slf4j:slf4j-api</exclude>
    <exclude>org.slf4j:slf4j-log4j12</exclude>
    <exclude>log4j:log4j</exclude>
    <exclude>org.apache.commons:commons-math</exclude>
    <exclude>org.apache.sling:org.apache.sling.commons.json</exclude>
    <exclude>commons-logging:commons-logging</exclude>
    <exclude>org.apache.httpcomponents:httpclient</exclude>
    <exclude>org.apache.httpcomponents:httpcore</exclude>
    <exclude>commons-codec:commons-codec</exclude>
    <exclude>com.fasterxml.jackson.core:jackson-core</exclude>
    <exclude>com.fasterxml.jackson.core:jackson-databind</exclude>
    <exclude>com.fasterxml.jackson.core:jackson-annotations</exclude>
    <exclude>org.codehaus.jettison:jettison</exclude>
    <exclude>stax:stax-api</exclude>
    <exclude>com.typesafe:config</exclude>
    <exclude>org.uncommons.maths:uncommons-maths</exclude>
    <exclude>com.github.scopt:scopt_*</exclude>
    <exclude>org.mortbay.jetty:servlet-api</exclude>
    <exclude>commons-io:commons-io</exclude>
    <exclude>commons-cli:commons-cli</exclude>
  </excludes>
</artifactSet-->

Michal Malohlava

Jul 13, 2015, 3:39:03 AM
to h2os...@googlegroups.com
Hi Peter,

Did you observe any change of Spark executors during launch?
And which Sparkling Water version are you using? The latest 1.3.x?

The error comes from a safety assertion that detects a change of the Spark cluster during H2O cloud formation.
It should not happen in the normal case, only if Spark decides to change executors (i.e., restart them
on a different node).
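
Roughly, the idea behind the check is the following (a simplified sketch only, not the actual H2OContextUtils code; the real assertion also compares IPs and ports):

import java.net.InetAddress
import org.apache.spark.SparkContext

// Spread a dummy RDD over all executors and record which host each
// partition runs on (simplified illustration of the clouding check).
def executorHosts(sc: SparkContext, n: Int): Array[(Int, String)] =
  sc.parallelize(0 until n, n).mapPartitionsWithIndex { (idx, _) =>
    Iterator((idx, InetAddress.getLocalHost.getHostName))
  }.collect()

def assertStableCluster(sc: SparkContext, n: Int): Unit =
  executorHosts(sc, n).zip(executorHosts(sc, n)).foreach {
    case ((i1, h1), (i2, h2)) =>
      // If Spark restarts an executor on another node between the two
      // runs, the hostnames differ and the assertion fires, as in your log.
      assert(h1 == h2, s"SpreadRDD failure - IPs are not equal: ($i1,$h1) != ($i2,$h2)")
  }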

Michal

On 7/12/15 at 11:59 AM, peter.s...@googlemail.com wrote: