cannot run sparkling-water on Hadoop ERROR YarnScheduler: Lost executor remote Akka client disassociated


Mei Liang

Jun 2, 2015, 5:12:20 PM
to h2os...@googlegroups.com
I am trying to run sparkling-water on Hadoop with MASTER="yarn-client". I followed the steps provided here: http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.3/1/index.html

It launched a Spark cluster, but when I tried to create an H2O cloud inside the Spark cluster, it failed on the command bin/sparkling-shell --num-executors 3 --executor-memory 2g --master yarn-client with these errors:

ERROR YarnScheduler: Lost executor 3 on spark-slave.net: remote Akka client disassociated

WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkE...@spark-slave.net:39288] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].


Then it showed that the Spark job had been aborted:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 35, big26-itrc.bmwgroup.net): java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:165)
    at org.apache.spark.h2o.H2OContextUtils$$anonfun$5.apply(H2OContextUtils.scala:112)
    at org.apache.spark.h2o.H2OContextUtils$$anonfun$5.apply(H2OContextUtils.scala:111)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
    at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)


Does anyone know what is wrong with my cluster? I was able to run sparkling-water with MASTER=local[*].

Thanks,
Mei


Mei Liang

Jun 2, 2015, 5:16:07 PM
to h2os...@googlegroups.com
P.S. I did not set up passwordless SSH to all the machines in my cluster. Could this have any effect and be causing this problem?



Mei Liang

Jun 2, 2015, 5:28:16 PM
to h2os...@googlegroups.com
Never mind, I just solved this problem.

To anyone who runs into this problem as well:

The missing passwordless SSH caused it. You can either set up passwordless SSH to all the machines within your cluster, or set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker.
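A sketch of the two options (the worker hostname is a placeholder; SPARK_SSH_FOREGROUND is the switch documented in Spark's standalone-mode guide for providing passwords serially):

```shell
# Option 1: set up passwordless SSH from the master to every worker.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # generate a key if you have none
ssh-copy-id user@spark-slave.net           # repeat for each worker host

# Option 2: have Spark's launch scripts prompt for each worker's
# password serially instead of running ssh in the background.
export SPARK_SSH_FOREGROUND=yes
sbin/start-slaves.sh
```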

On Tuesday, June 2, 2015 at 5:12:20 PM UTC-4, Mei Liang wrote:
I am trying to run sparkling-water on Hadoop with MASTER="yarn-client", followed the step provided in here http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.3/1/index.html

It launched a spark cluster, but when I tried to create h2o cloud inside spark cluster, it failed on the command bin/sparkling-shell --num-executors 3 --executor-memory 2g --master yarn-client with the errors: 

ERROR YarnScheduler: Lost executor 3 on spark-slave.net: remote Akka client disassociated

WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-slave.net:39288] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

Mei Liang

Jun 3, 2015, 12:32:50 PM
to h2os...@googlegroups.com
Never mind, I thought I had fixed it, but the error came back, and this time it happened on some other nodes. Can someone help me, please?



Michal Malohlava

Jun 3, 2015, 1:26:51 PM
to h2os...@googlegroups.com
Hi Mei,

In this case, the cluster died because it failed an assertion in the code.

I am now checking whether that assertion holds in a YARN environment. Let me remove it and make a new release on the rel-1.3 branch.

Is that OK with you?

Thank you!
Michal


On 6/2/15 at 2:12 PM, Mei Liang wrote:
--
You received this message because you are subscribed to the Google Groups "H2O & Open Source Scalable Machine Learning - h2ostream" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h2ostream+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mei Liang

Jun 3, 2015, 2:01:51 PM
to h2os...@googlegroups.com, mic...@h2oai.com
Yes, Michal. Thank you so much; please let me know when you have it.




Mei Liang

Jun 3, 2015, 6:06:37 PM
to h2os...@googlegroups.com, mic...@h2oai.com
Hi Michal,

Did you get it removed and released yet?

While I was waiting, I found something interesting that I cannot explain yet; maybe you can help me understand it better.

After going through all the log files to find the problem, I noticed that the Spark executors got replaced after I launched the H2O cloud unsuccessfully (with the command: val h2oContext = new H2OContext(sc).start()). Do you know why this happened (the executors changed)?

NOTE: I was able to run regular Spark jobs with this Spark cluster, which I launched with the command:

     bin/sparkling-shell --num-executors 3 --executor-memory 2g --master yarn-client
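For reference, a minimal sketch of the session described above, as run inside sparkling-shell (the import is the one the Sparkling Water 1.3 docs use; treat it as an assumption for your exact build):

```scala
// Launch the shell against YARN first, from the sparkling-water directory:
//   bin/sparkling-shell --num-executors 3 --executor-memory 2g --master yarn-client

// Inside the shell, `sc` is the pre-created SparkContext.
import org.apache.spark.h2o._

// Start an H2O cloud on top of the Spark executors; this is the call
// that triggered the assertion failure shown in the stack trace above.
val h2oContext = new H2OContext(sc).start()
```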



Michal Malohlava

Jun 3, 2015, 6:46:52 PM
to Mei Liang, h2os...@googlegroups.com
Hi Mei,

On 6/3/15 at 3:06 PM, Mei Liang wrote:
Hi Michal,

Did you get it removed and released yet?
Not yet; I would like to include a few more changes plus an update of H2O (we are in the middle of releasing a new H2O version).


While I was waiting, I found something interesting that I cannot explain yet; maybe you can help me understand it better.

After going through all the log files to find the problem, I noticed that the Spark executors got replaced after I launched the H2O cloud unsuccessfully (with the command: val h2oContext = new H2OContext(sc).start()). Do you know why this happened (the executors changed)?
Yes: the H2OContext failed an internal assertion, which killed the JVM; Spark+YARN then simply relaunched the JVM.

Michal

Mei Liang

Jun 4, 2015, 8:48:05 AM
to h2os...@googlegroups.com, mic...@h2oai.com, me...@g.clemson.edu
Thanks Michal, but as far as my problem goes, is there a quick fix so I can continue with my tasks?

Thanks, 
Mei 

Mei Liang

Jun 4, 2015, 3:21:23 PM
to h2os...@googlegroups.com
I got this error because some of the nodes in my cluster had a bad network interface. H2O tried to use the bad interface (which was cached somewhere on the node) to send UDP packets, and this caused the cluster to die. After I rebooted the machines, things work great now!
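A hedged sketch for anyone diagnosing a similar interface problem. The `ip addr` check is standard Linux tooling; the `-network` option is H2O's documented flag for restricting which subnet it binds to, but whether and how Sparkling Water 1.3 forwards it to the embedded nodes is an assumption worth verifying against its docs:

```shell
# On each worker node, list the interfaces and their addresses so you can
# spot a stale or misconfigured one (e.g. a leftover virtual interface):
ip addr show

# Standalone H2O can be restricted to a given subnet (CIDR notation),
# so it never picks a bad interface:
java -jar h2o.jar -network 10.0.0.0/24
```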


Michal Malohlava

Jun 4, 2015, 3:40:44 PM
to h2os...@googlegroups.com
Wow, thanks for debugging the issue, Mei!

In H2O we use the same interface that Spark's Akka is using, so I expect Spark was also confused. We will have to create a JIRA issue for this situation.

Thanks again for the help, and let us know if you have any questions,
Michal



On 6/4/15 at 12:21 PM, Mei Liang wrote: