ERROR YarnScheduler: Lost executor 3 on spark-slave.net: remote Akka client disassociated
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkE...@spark-slave.net:39288] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 35, big26-itrc.bmwgroup.net): java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at org.apache.spark.h2o.H2OContextUtils$$anonfun$5.apply(H2OContextUtils.scala:112)
at org.apache.spark.h2o.H2OContextUtils$$anonfun$5.apply(H2OContextUtils.scala:111)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
I am trying to run Sparkling Water on Hadoop with MASTER="yarn-client", following the steps provided here: http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.3/1/index.html
It launched a Spark cluster, but when I tried to create an H2O cloud inside the Spark cluster, it failed on the command bin/sparkling-shell --num-executors 3 --executor-memory 2g --master yarn-client with the following errors:
ERROR YarnScheduler: Lost executor 3 on spark-slave.net: remote Akka client disassociated
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-slave.net:39288] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
Missing password-less SSH caused this problem. You can either set up password-less SSH to all the machines in your cluster, or set the environment variable SPARK_SSH_FOREGROUND and provide a password for each worker one after another.
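If password-less SSH is indeed what is missing, the setup suggested above could look roughly like the following sketch. It only prints the commands instead of running them, so they can be reviewed first. The function name passwordless_ssh_commands, the key path ~/.ssh/id_rsa, and the host names worker1.example.com and worker2.example.com are assumptions made for illustration; none of them come from this thread.

```shell
#!/usr/bin/env bash
# Sketch: print the commands that would set up password-less SSH from the
# machine running the driver to each worker. Nothing is executed here.
# Host names passed as arguments are hypothetical placeholders.

passwordless_ssh_commands() {
  local key="$HOME/.ssh/id_rsa"   # assumed key location
  local host
  # Create a key pair without a passphrase only if none exists yet.
  echo "test -f $key || ssh-keygen -t rsa -N '' -f $key"
  for host in "$@"; do
    # ssh-copy-id appends the public key to the worker's authorized_keys.
    echo "ssh-copy-id -i $key.pub $host"
  done
}

passwordless_ssh_commands worker1.example.com worker2.example.com
```

After reviewing the printed commands, they can be run by hand or piped into a shell. ssh-copy-id will still ask for each worker's password once.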
--
You received this message because you are subscribed to the Google Groups "H2O & Open Source Scalable Machine Learning - h2ostream" group.
Hi Mei,
In this case, the cluster died because an assertion in the code failed.
I am now checking whether that assertion actually holds in a YARN environment.
Let me remove it and make a new release for the rel-1.3 branch.
Is that OK with you?
Thank you!
Michal
On 6/2/15 at 2:12 PM, Mei Liang wrote:
I am trying to run Sparkling Water on Hadoop with MASTER="yarn-client", following the steps provided here: http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.3/1/index.html
It launched a Spark cluster, but when I tried to create an H2O cloud inside the Spark cluster, it failed on the command bin/sparkling-shell --num-executors 3 --executor-memory 2g --master yarn-client with the following errors:
ERROR YarnScheduler: Lost executor 3 on spark-slave.net: remote Akka client disassociated
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://s...@spark-slave.net:39288] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
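To see which application code raised the AssertionError mentioned in the reply above, one option is to filter the executor stack trace down to its first frame that is not a Scala or Java library frame. The sketch below assumes the trace is available as plain text on standard input; the function name first_app_frame is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: read a stack trace on stdin and print the first "at ..." frame
# that does not belong to the Scala or Java standard libraries.

first_app_frame() {
  grep '^ *at ' | grep -v -e 'at scala\.' -e 'at java\.' | head -n 1
}

# Example with the top frames of the trace posted in this thread.
printf '%s\n' \
  'at scala.Predef$.assert(Predef.scala:165)' \
  'at org.apache.spark.h2o.H2OContextUtils$$anonfun$5.apply(H2OContextUtils.scala:112)' \
  'at org.apache.spark.h2o.H2OContextUtils$$anonfun$5.apply(H2OContextUtils.scala:111)' \
  | first_app_frame
# prints: at org.apache.spark.h2o.H2OContextUtils$$anonfun$5.apply(H2OContextUtils.scala:112)
```

For the trace in this thread, the result points at H2OContextUtils.scala:112, which is the line that calls assert.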
Hi Michal,
Have you removed it and released the new version yet?
While waiting, I found something interesting that I cannot explain yet; maybe you can help me understand it better.
After going through all the log files to find the problem, I noticed that the Spark executors changed after I tried, unsuccessfully, to launch the H2O cloud (with the command: val h2oContext = new H2OContext(sc).start() ). Do you know why the executors changed?
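When going through the logs, it can help to list which executors the driver reported as lost and on which hosts, so they can be compared with the executors seen afterwards. The following is a minimal sketch that assumes the driver log uses the "Lost executor N on HOST:" format shown in this thread; the function name lost_executors is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: read a driver log on stdin and print "<executor-id> <host>"
# for every "Lost executor" message. Other lines are ignored.

lost_executors() {
  sed -n 's/.*Lost executor \([0-9][0-9]*\) on \([^:]*\):.*/\1 \2/p'
}

# Example with the error line posted in this thread.
printf '%s\n' \
  'ERROR YarnScheduler: Lost executor 3 on spark-slave.net: remote Akka client disassociated' \
  | lost_executors
# prints: 3 spark-slave.net
```

To process a saved log, redirect the file into the function instead of using printf.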