Could not connect spark-shell to standalone remote cluster

e...@ooyala.com

Mar 15, 2013, 4:27:59 PM
to spark...@googlegroups.com
Hi guys, when I try to connect my local spark-shell to a remote standalone cluster I set up, spark-shell seems to hang, like this:

13/03/14 15:59:29 INFO spark.SparkContext: Starting job: reduce at <console>:21
13/03/14 15:59:29 INFO scheduler.DAGScheduler: Registering RDD 1 (map at <console>:17)
13/03/14 15:59:29 INFO spark.CacheTracker: Registering RDD ID 1 with cache
13/03/14 15:59:29 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 partitions
13/03/14 15:59:29 INFO scheduler.DAGScheduler: Registering parent RDD 1 (map at <console>:17)
13/03/14 15:59:29 INFO scheduler.DAGScheduler: Registering parent RDD 0 (parallelize at <console>:17)
13/03/14 15:59:29 INFO spark.CacheTracker: Registering RDD ID 0 with cache
13/03/14 15:59:29 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 partitions
13/03/14 15:59:29 INFO spark.CacheTrackerActor: Asked for current cache locations
13/03/14 15:59:29 INFO scheduler.DAGScheduler: Got job 0 (reduce at <console>:21) with 2 output partitions
13/03/14 15:59:29 INFO scheduler.DAGScheduler: Final stage: Stage 0 (map at <console>:17)
13/03/14 15:59:29 INFO scheduler.DAGScheduler: Parents of final stage: List()
13/03/14 15:59:29 INFO scheduler.DAGScheduler: Missing parents: List()
13/03/14 15:59:29 INFO scheduler.DAGScheduler: Submitting Stage 0 (map at <console>:17), which has no missing parents
13/03/14 15:59:29 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0
13/03/14 15:59:29 INFO cluster.ClusterScheduler: Adding task set 0.0 with 2 tasks

After that it does nothing, seemingly forever, until I disconnect the network connection, at which point it prints:

13/03/14 22:53:00 INFO actor.ActorSystemImpl: RemoteClientShutdown@akka://spark@cassandra-staging1:7077
13/03/14 22:53:00 ERROR client.Client$ClientActor: Connection to master failed; stopping client
13/03/14 22:53:00 ERROR cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster!

On the master, here are the corresponding log lines:

13/03/14 22:56:53 INFO ActorSystemImpl: RemoteClientStarted@akka://sp...@192.168.1.82:64106
13/03/14 22:56:53 ERROR NettyRemoteTransport(null): dropping message RegisterJob(JobDescription(Spark shell)) for non-local recipient akka://spark@cassandra-staging1:7077/user/Master at akka://sp...@u9-r1.mtv:7077 local is akka://sp...@u9-r1.mtv:7077
13/03/14 22:56:53 ERROR NettyRemoteTransport(null): dropping message DaemonMsgWatch(Actor[akka://sp...@192.168.1.82:64106/user/$a],Actor[akka://spark@cassandra-staging1:7077/user/Master]) for non-local recipient akka://spark@cassandra-staging1:7077/remote at akka://sp...@u9-r1.mtv:7077 local is akka://sp...@u9-r1.mtv:7077
13/03/14 22:57:54 INFO ActorSystemImpl: RemoteClientShutdown@akka://sp...@192.168.1.82:64106
13/03/14 22:58:02 INFO ActorSystemImpl: RemoteClientStarted@akka://sp...@192.168.1.82:64122
13/03/14 22:58:02 ERROR NettyRemoteTransport(null): dropping message RegisterJob(JobDescription(Spark shell)) for non-local recipient akka://spark@cassandra-staging1:7077/user/Master at akka://sp...@u9-r1.mtv:7077 local is akka://sp...@u9-r1.mtv:7077
13/03/14 22:58:02 ERROR NettyRemoteTransport(null): dropping message DaemonMsgWatch(Actor[akka://sp...@192.168.1.82:64122/user/$a],Actor[akka://spark@cassandra-staging1:7077/user/Master]) for non-local recipient akka://spark@cassandra-staging1:7077/remote at akka://sp...@u9-r1.mtv:7077 local is akka://sp...@u9-r1.mtv:7077

I assume this is a supported use case, right?

thanks,
Evan

Shivaram Venkataraman

Mar 15, 2013, 4:53:16 PM
to spark...@googlegroups.com
It looks like the workers are not able to connect to the master. Do you see workers registered in the Spark web UI (master_hostname:8080)? From the logs it looks like the workers are trying to connect to cassandra-staging1:7077, while the master identifies itself as u9-r1.mtv:7077.

If they are referring to the same machine, you can usually bind the master to a specific IP to fix this. http://spark-project.org/docs/latest/spark-standalone.html has details on how to set SPARK_MASTER_IP.
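For example, you could put something like the following in conf/spark-env.sh on the master and then restart the master and workers. The address here is just a placeholder -- use whichever hostname or IP the workers and your shell will actually reach the master on:

# conf/spark-env.sh on the master -- example value only
export SPARK_MASTER_IP=cassandra-staging1

That way the master advertises the same address that everything else connects to.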

Thanks
Shivaram

Evan Chan

Mar 15, 2013, 5:11:12 PM
to spark...@googlegroups.com
I was able to connect the workers to the master; they show up in the master UI. I did have to specify the exact same IP address.
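For what it's worth, I'm launching the shell roughly like this, with the host and port copied from the master web UI:

MASTER=spark://u9-r1.mtv:7077 ./spark-shell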

Specifying that exact URL (as listed on the master web UI) got me farther, but it still errors out; it seems the master restarted after some failure, and the web UI is now not operative:

13/03/15 21:06:27 INFO Master: Registering job Spark shell
13/03/15 21:06:27 INFO Master: Registered job Spark shell with ID job-20130315210627-0000
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/0 on worker worker-20130314192829-u11-r1.mtv-46401
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/1 on worker worker-20130314192749-u10-r1.mtv-43827
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/2 on worker worker-20130314192342-u9-r1.mtv-52555
13/03/15 21:06:27 INFO Master: Removing executor job-20130315210627-0000/0 because it is FAILED
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/3 on worker worker-20130314192829-u11-r1.mtv-46401
13/03/15 21:06:27 INFO Master: Removing executor job-20130315210627-0000/1 because it is FAILED
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/4 on worker worker-20130314192749-u10-r1.mtv-43827
13/03/15 21:06:27 INFO Master: Removing executor job-20130315210627-0000/2 because it is FAILED
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/5 on worker worker-20130314192342-u9-r1.mtv-52555
13/03/15 21:06:27 INFO Master: Removing executor job-20130315210627-0000/3 because it is FAILED
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/6 on worker worker-20130314192829-u11-r1.mtv-46401
13/03/15 21:06:27 INFO Master: Removing executor job-20130315210627-0000/4 because it is FAILED
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/7 on worker worker-20130314192749-u10-r1.mtv-43827
13/03/15 21:06:27 INFO Master: Removing executor job-20130315210627-0000/5 because it is FAILED
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/8 on worker worker-20130314192342-u9-r1.mtv-52555
13/03/15 21:06:27 INFO Master: Removing executor job-20130315210627-0000/6 because it is FAILED
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/9 on worker worker-20130314192829-u11-r1.mtv-46401
13/03/15 21:06:27 INFO Master: Removing executor job-20130315210627-0000/7 because it is FAILED
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/10 on worker worker-20130314192749-u10-r1.mtv-43827
13/03/15 21:06:27 INFO Master: Removing executor job-20130315210627-0000/8 because it is FAILED
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/11 on worker worker-20130314192342-u9-r1.mtv-52555
13/03/15 21:06:27 INFO Master: Removing executor job-20130315210627-0000/9 because it is FAILED
13/03/15 21:06:27 INFO Master: Launching executor job-20130315210627-0000/12 on worker worker-20130314192829-u11-r1.mtv-46401
13/03/15 21:06:27 INFO Master: Removing executor job-20130315210627-0000/10 because it is FAILED
13/03/15 21:06:27 ERROR Master: Job Spark shell wth ID job-20130315210627-0000 failed 11 times.
spark.SparkException: Job Spark shell wth ID job-20130315210627-0000 failed 11 times.
at spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:106)
at spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:65)
at akka.actor.Actor$class.apply(Actor.scala:318)
at spark.deploy.master.Master.apply(Master.scala:18)
at akka.actor.ActorCell.invoke(ActorCell.scala:626)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
at akka.dispatch.Mailbox.run(Mailbox.scala:179)
at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
13/03/15 21:06:27 ERROR Master: Job Spark shell wth ID job-20130315210627-0000 failed 11 times.
spark.SparkException: Job Spark shell wth ID job-20130315210627-0000 failed 11 times.
at spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:106)
at spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:65)
at akka.actor.Actor$class.apply(Actor.scala:318)
at spark.deploy.master.Master.apply(Master.scala:18)
at akka.actor.ActorCell.invoke(ActorCell.scala:626)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
at akka.dispatch.Mailbox.run(Mailbox.scala:179)
at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
13/03/15 21:06:27 INFO Master: Starting Spark master at spark://u9-r1.mtv:7077
13/03/15 21:06:27 INFO IoWorker: IoWorker thread 'spray-io-worker-1' started
13/03/15 21:06:27 ERROR Master: Failed to create web UI
akka.actor.InvalidActorNameException:actor name HttpServer is not unique!
[339049e0-8db4-11e2-900b-003048c63b0c]
at akka.actor.ActorCell.actorOf(ActorCell.scala:392)
at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.liftedTree1$1(ActorRefProvider.scala:394)
at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:394)
at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:392)
at akka.actor.Actor$class.apply(Actor.scala:318)
at akka.actor.LocalActorRefProvider$Guardian.apply(ActorRefProvider.scala:388)
at akka.actor.ActorCell.invoke(ActorCell.scala:626)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
at akka.dispatch.Mailbox.run(Mailbox.scala:179)
at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

I'm getting a bunch of errors like this in each worker:

java.io.IOException: Cannot run program "/Users/ev/src/vendor/spark-0.6.2/run" (in directory "/opt/spark-0.6.2/work/job-20130315210627-0000/2"): java.io.IOException: error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
        at spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:126)
        at spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:36)
Caused by: java.io.IOException: java.io.IOException: error=2, No such file or directory
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)

It seems that the Spark workers on the cluster somehow got my local Spark run-script path, which is weird. Does the local spark-shell installation need to be at the same path as the one on the remote workers?

thanks,
Evan


Shivaram Venkataraman

Mar 15, 2013, 5:42:41 PM
to spark...@googlegroups.com
spark-shell tries to start executors using the 'run' command on the slaves. The path for the run command is passed as $SPARK_HOME/run -- so when you start spark-shell, it passes along the path from your local machine. I haven't tried this before, so I am not sure if it will work, but you could try setting SPARK_HOME on your local machine to the path of the Spark directory on the slaves.
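In other words, something like this before launching the shell (just a sketch -- the path below is the one that shows up in your worker logs):

export SPARK_HOME=/opt/spark-0.6.2      # where Spark is installed on the slaves
MASTER=spark://u9-r1.mtv:7077 ./spark-shell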

Thanks
Shivaram

Evan Chan

Mar 15, 2013, 6:14:45 PM
to spark...@googlegroups.com
Yeah, and now every time I start spark-shell, the cluster automatically tries to re-run the failed job from before, with the wrong path, and then the master and workers all exit. It seems like that state needs to be cleared somehow.

Evan Chan

Mar 15, 2013, 6:18:52 PM
to spark...@googlegroups.com
Even if I remove the work subdir in my Spark install directory, it gets recreated every time with a new job whenever I start my local spark-shell.

Maybe someone who runs spark-shell remotely can comment on what setup works for them?   Or do you all run spark-shell locally on the cluster?

Jim Donahue

Mar 16, 2013, 1:25:34 AM
to spark...@googlegroups.com
I tried to do something like this and, when I couldn't get it to work, got the following response -- basically, you can't do it unless you're careful:

The spark shell itself launches a few internal servers (inside of SparkEnv/SparkContext) that the slaves make TCP connections to. This will, in general, only work if the shell is being run on a machine with a publicly routable IP address (i.e. not behind a home or office NAT) and the hostname the machine broadcasts is understandable by the slaves.

I wanted to run a cluster of EC2 instances and talk to them from outside AWS -- that just isn't going to work. So what I'm doing now is running an Apache SSH server on an EC2 instance and having the SSH server run the Spark shell when you connect to it. It took some work to set up, but it works.
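The effect is roughly the following (the host name and path here are placeholders for my setup):

# run the shell on a machine inside the cluster's network instead of locally
ssh -t myuser@my-ec2-gateway '/root/spark/spark-shell'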

Jim Donahue


Evan Chan

Mar 19, 2013, 3:11:58 AM
to spark...@googlegroups.com
By the way, I was able to connect to the Spark cluster, either by running spark-shell from one of the worker nodes, or by running it from my laptop -- but in the latter case the local Spark directory had to have the same path as the directory on the worker nodes.

What I'm really hoping to do is run the shell from SBT itself, so that it can have some of my own classes on the classpath. Has anyone managed to do this?

thanks,
Evan


Matei Zaharia

Mar 20, 2013, 5:34:22 PM
to spark...@googlegroups.com
You can't run the Scala shell out of SBT, unfortunately, because we modified the Scala shell a little to work with a class server (to ship the code it generates to the cluster). Your best bet is to package your project into a single JAR and add it to SPARK_CLASSPATH on both the head node and the workers.
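For example, roughly (the jar path is just an example), on the machine running the shell and on each worker:

# e.g. in conf/spark-env.sh, or exported before starting the shell and the workers
export SPARK_CLASSPATH=/path/to/my-project.jar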

Matei


Evan Chan

Mar 22, 2013, 6:37:46 PM
to spark...@googlegroups.com
We could just run the Spark REPL out of SBT, though, right?

Matei Zaharia

Mar 24, 2013, 4:22:30 PM
to spark...@googlegroups.com
Ah, that might work, but it would be good to make sure SBT forks it into a separate JVM. Otherwise, SBT's input wrapping stuff combined with the REPL's use of jline might get a bit hairy.
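For example, something along these lines in the SBT build might do it (sbt 0.12-style settings, untested with the modified REPL):

// build.sbt -- sketch only
fork in run := true                    // run in a separate JVM
connectInput in run := true            // forward stdin so the REPL (jline) can read it
outputStrategy := Some(StdoutOutput)   // send the forked process's output straight to stdout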

Matei