Not able to run spark on mesos cluster


tushar todi

Oct 5, 2012, 8:11:39 AM
to spark...@googlegroups.com
Hi,

I just started using Spark. Following the wiki, I was able to bring up a Mesos cluster, but when I try to run the example program (./run spark.examples.SparkPi master) on the cluster, I get the following error:

12/10/06 02:03:38 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes = 339585269
12/10/06 02:03:38 INFO spark.CacheTrackerActor: Registered actor on port 7077
12/10/06 02:03:38 INFO spark.CacheTrackerActor: Started slave cache (size 323.9MB) on BLRCSLTBDV03
12/10/06 02:03:38 INFO spark.MapOutputTrackerActor: Registered actor on port 7077
12/10/06 02:03:38 INFO spark.ShuffleManager: Shuffle dir: /tmp/spark-local-e208a879-73c5-45c6-b62b-bbf2857de26e/shuffle
12/10/06 02:03:38 INFO server.Server: jetty-7.5.3.v20111011
12/10/06 02:03:38 INFO server.AbstractConnector: Started SelectChann...@0.0.0.0:56694 STARTING
12/10/06 02:03:38 INFO spark.ShuffleManager: Local URI: http://127.0.0.1:56694
12/10/06 02:03:38 INFO server.Server: jetty-7.5.3.v20111011
12/10/06 02:03:38 INFO server.AbstractConnector: Started SelectChann...@0.0.0.0:34963 STARTING
12/10/06 02:03:38 INFO broadcast.HttpBroadcast: Broadcast server started at http://127.0.0.1:34963
12/10/06 02:03:38 INFO spark.MesosScheduler: Temp directory for JARs: /tmp/spark-76712b65-452b-4cd4-a8d2-0f4aa628947c
12/10/06 02:03:38 INFO server.Server: jetty-7.5.3.v20111011
12/10/06 02:03:38 INFO server.AbstractConnector: Started SelectChann...@0.0.0.0:42154 STARTING
12/10/06 02:03:38 INFO spark.MesosScheduler: JAR server started at http://127.0.0.1:42154
12/10/06 02:03:38 ERROR spark.MesosScheduler: Mesos error: Cannot parse '@0.0.0.0:0'
12/10/06 02:03:38 INFO spark.MesosScheduler: driver.run() returned with code DRIVER_ABORTED

I am also attaching the masters and slaves files that I put under /mesos/deploy, mesos.conf under /mesos/conf, and spark-env.sh under /spark/conf.

I am able to run the following successfully:
./run spark.examples.SparkPi local

Please help me find out where I am going wrong.

Thanks,
Tushar


masters.txt
mesos.conf.txt
slaves.txt
spark-env.sh.txt

Matei Zaharia

Oct 5, 2012, 8:51:45 PM
to spark...@googlegroups.com
I think you gave it the wrong master URL. What did you pass for it? It should just be host:port, where host is the machine running the Mesos master and port is 5050 by default. You seem to have an @ in there.

Matei
> <masters.txt><mesos.conf.txt><slaves.txt><spark-env.sh.txt>
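
A minimal sketch of checking a master argument against the host:port shape described above before launching a job. The helper name is_host_port is made up for illustration and is not part of Spark or Mesos. It only covers the Mesos case, not the local master, and the example hostname is a placeholder.

```shell
# Hypothetical helper, not part of Spark or Mesos: succeeds only when the
# argument looks like host:port (letters, digits, dots and hyphens, then a
# colon, then a numeric port).
is_host_port() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9.-]+:[0-9]+$'
}

# "master" and "@0.0.0.0:0" are the values seen earlier in this thread;
# mesos-master.example.com is a placeholder hostname.
for master in "master" "@0.0.0.0:0" "mesos-master.example.com:5050"; do
  if is_host_port "$master"; then
    echo "$master: looks like host:port"
  else
    echo "$master: not host:port"
  fi
done
```

Only the last of the three values passes, which matches the advice above: a bare word such as master or anything containing an @ is not a usable Mesos master address here.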

tushar todi

Oct 6, 2012, 11:38:35 AM
to spark...@googlegroups.com
Thanks for the reply, Matei. I tried:
./run spark.examples.SparkPi master 

Please have a look at the conf files. I am trying to set up a Mesos cluster of just two machines. I changed the file extensions to .txt so that I could upload them.

I also tried ./run spark.examples.SparkPi 10.129.146.13:5050, but that failed too.

The Mesos cluster seems to come up, as I can see connections being established between the slave and master nodes. The UI pages come up on the master's port 8080 and the slave's port 8081, but no slaves show up on the master's page on port 8080. Please have a look at the conf files.

Would really appreciate your help. 

Thanks,
Tushar

Matei Zaharia

Oct 7, 2012, 3:07:35 AM
to spark...@googlegroups.com
Hi Tushar,

It looks like you didn't attach any config files. However, my guess is that either the master IP address is wrong (maybe the machine has multiple network interfaces and it is binding to the wrong one), or the worker's or job client's address is wrong. If you can't see the slave in the master web UI then you have a connectivity problem between them. You can use the --ip argument to mesos-master or mesos-slave to have them bind to a specific IP address. Make sure that's an address that the other machine can reach.

Matei

tushar todi

Oct 8, 2012, 3:28:18 AM
to spark...@googlegroups.com
Thanks Matei. I attached the conf files to the first post. I tried bringing up the cluster using:

./mesos-master --ip=10.129.146.13 (on master)

and
 
./mesos-slave --ip=10.129.146.12 --master=10.129.146.13:5050 (on slave)

In the slave node's run directory (/tmp/mesos/slaves/.../runs/..) I see the following exception:

java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:529)
        at java.net.Socket.connect(Socket.java:478)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
        at sun.net.www.http.HttpClient.New(HttpClient.java:306)
        at sun.net.www.http.HttpClient.New(HttpClient.java:323)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1172)
        at java.net.URL.openStream(URL.java:1010)
        at spark.Executor.spark$Executor$$downloadFile(Executor.scala:161)
        at spark.Executor$$anonfun$createClassLoader$2.apply(Executor.scala:132)
        at spark.Executor$$anonfun$createClassLoader$2.apply(Executor.scala:129)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:34)
        at scala.collection.mutable.ArrayOps.foreach(ArrayOps.scala:38)
        at spark.Executor.createClassLoader(Executor.scala:129)
        at spark.Executor.registered(Executor.scala:42)
java.lang.NullPointerException
        at spark.Executor.launchTask(Executor.scala:61)
java.lang.NullPointerException
        at spark.Executor.launchTask(Executor.scala:61)
Exception in thread "Thread-1" Exception in thread "Thread-2" Exception in thread "Thread-3"

If I try to start ./spark-executor directly, it shows:
Running spark-executor with framework dir = .
expecting MESOS_SLAVE_PID in environment (exec/exec.cpp:391)

Please help me work out where I am going wrong.

Thanks,
Tushar



Matei Zaharia

Oct 8, 2012, 1:20:21 PM
to spark...@googlegroups.com
Do you see the slave on the master's web UI?

Matei

tushar todi

Oct 9, 2012, 6:35:04 AM
to spark...@googlegroups.com
Hi Matei,

When I run mesos-master it starts properly and the console shows:


[bigdatalab@BLRCSLTBDV03 sbin]$ ./mesos-master --ip=10.129.146.13
I1009 23:41:30.909895 31698 main.cpp:115] Build: 2012-10-04 21:45:22 by bigdatalab
I1009 23:41:30.910416 31698 main.cpp:116] Starting Mesos master
I1009 23:41:30.910719 31701 master.cpp:303] Master started on 10.129.146.13:5050
I1009 23:41:30.910820 31701 master.cpp:318] Master ID: 201210092341-227705098-5050-31698
W1009 23:41:30.911164 31702 master.cpp:78] No whitelist given. Advertising offers for all slaves
I1009 23:41:30.913694 31701 master.cpp:560] Elected as master!
I1009 23:41:30.950111 31704 webui.cpp:61] Loading webui script at '/home/bigdatalab/mesos/share/mesos/webui/master/webui.py'
Bottle server starting up (using WSGIRefServer())...
Listening on http://0.0.0.0:8080/
Use Ctrl-C to quit.


Similarly, when I run the slave, the console shows:

[bigdatalab@BLRCSLTBDV02 sbin]$ ./mesos-slave --master=10.129.146.13:5050 --ip=10.129.146.12
I1009 23:41:54.936717 11538 main.cpp:123] Creating "process" isolation module
I1009 23:41:54.937088 11538 main.cpp:131] Build: 2012-10-04 21:44:11 by bigdatalab
I1009 23:41:54.937139 11538 main.cpp:132] Starting Mesos slave
I1009 23:41:54.937572 11540 slave.cpp:162] Slave started on 1)@10.129.146.12:58054
I1009 23:41:54.937607 11540 slave.cpp:163] Slave resources: cpus=4; mem=14517; disk=164109
I1009 23:41:54.939043 11540 slave.cpp:353] New master detected at mas...@10.129.146.13:5050
I1009 23:41:54.941954 11542 slave.cpp:373] Registered with master; given slave ID 201210092341-227705098-5050-31698-0
I1009 23:41:54.942158 11542 gc.cpp:97] Scheduling /tmp/mesos/slaves/201210082101-227705098-5050-25738-0 for removal
I1009 23:41:54.942301 11542 gc.cpp:97] Scheduling /tmp/mesos/slaves/201210082105-227705098-5050-25958-1 for removal
I1009 23:41:54.960240 11544 webui.cpp:61] Loading webui script at '/home/bigdatalab/mesos/share/mesos/webui/slave/webui.py'
Bottle server starting up (using WSGIRefServer())...
Listening on http://0.0.0.0:8081/
Use Ctrl-C to quit.


After starting the slave, the console on the master node shows:

I1010 00:38:20.669260 31702 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 13.33us
I1010 00:38:21.670289 31702 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 14.58us

But when I try to open the web UI of the master or the slave, it shows a 500 error:

Error 500: Internal Server Error

Sorry, the requested URL http://10.129.146.12:8081/ caused an error:

Unhandled exception

Exception:

IOError('socket error', error(111, 'Connection refused'))

Traceback:

Traceback (most recent call last):
  File "/home/bigdatalab/mesos/share/mesos/webui/bottle-0.8.3/bottle.py", line 499, in handle
    return handler(**args)
  File "/home/bigdatalab/mesos/share/mesos/webui/slave/webui.py", line 61, in index
    slave_port = slave_port, log_dir = log_dir)
  File "/home/bigdatalab/mesos/share/mesos/webui/bottle-0.8.3/bottle.py", line 1796, in template
    return TEMPLATES[tpl].render(**kwargs)
  File "/home/bigdatalab/mesos/share/mesos/webui/bottle-0.8.3/bottle.py", line 1775, in render
    self.execute(stdout, **args)
   
Error 500: Internal Server Error

Sorry, the requested URL http://10.129.146.13:8080/ caused an error:

Unhandled exception

Exception:

IOError('socket error', error(111, 'Connection refused'))

Traceback:

Traceback (most recent call last):
  File "/home/bigdatalab/mesos/share/mesos/webui/bottle-0.8.3/bottle.py", line 499, in handle
    return handler(**args)
  File "/home/bigdatalab/mesos/share/mesos/webui/master/webui.py", line 51, in index
    return template("index", master_port = master_port, log_dir = log_dir)
  File "/home/bigdatalab/mesos/share/mesos/webui/bottle-0.8.3/bottle.py", line 1796, in template
    return TEMPLATES[tpl].render(**kwargs)
  File "/home/bigdatalab/mesos/share/mesos/webui/bottle-0.8.3/bottle.py", line 1775, in rende


I hope this makes it clearer where I am going wrong.

Thanks,
Tushar

Matei Zaharia

Oct 9, 2012, 5:46:00 PM
to spark...@googlegroups.com
I think the web UI bug is unrelated, but it is probably due to the machine's hostname not resolving to its IP address. If you run hostname on each machine and then ping that hostname, does it ping the right IP address?

Anyway, it seems like the slave did register with the cluster. Does running jobs on it work?

Matei
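
The hostname check suggested above can be scripted roughly as follows. This is a sketch under assumptions: it uses getent hosts rather than ping, so it only tests name resolution and not reachability, and the helper names resolved_ip and check_resolves_to are made up for illustration. On a machine with both IPv4 and IPv6 entries, the first address returned may not be the one you expect.

```shell
# Print the first address a name resolves to (empty if it does not resolve).
# getent consults /etc/hosts and DNS in the order the system is configured.
resolved_ip() {
  getent hosts "$1" 2>/dev/null | awk 'NR == 1 { print $1 }'
}

# Hypothetical helper: compare what a name resolves to with the address
# you expect the other machines to use for it.
check_resolves_to() {
  name="$1"
  expected="$2"
  actual="$(resolved_ip "$name")"
  if [ "$actual" = "$expected" ]; then
    echo "ok: $name resolves to $expected"
  else
    echo "mismatch: $name resolves to '$actual', expected $expected"
    return 1
  fi
}

# Example, to be run on the master from this thread (address is the one
# passed to --ip there):
#   check_resolves_to "$(hostname)" 10.129.146.13
```

A mismatch such as the hostname resolving to 127.0.0.1 would be consistent with the "Connection refused" errors from the web UI, since the UI would then be contacting a loopback address instead of the one given to --ip.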

tushar todi

Oct 10, 2012, 12:01:31 AM
to spark...@googlegroups.com
Hi Matei,

Thanks for the response.

I am not using hostnames. I use the --ip option when starting the master and slave.
If I don't use the --ip option, the UI comes up but the slave does not register with the master.

With --ip option used to bring up master and slave and slave registered running ./run spark.examples.SparkPi 10.129.146.13:5050

slave node's (/tmp/mesos/slaves/.../runs/..
Thanks,
Tushar
 

Matei Zaharia

Oct 12, 2012, 2:00:37 AM
to spark...@googlegroups.com
Ah, got it. In this case you also need to call System.setProperty("spark.master.host", "<YOUR_IP>") in the Spark driver program before you create a SparkContext. The issue is that Spark itself also binds to the default IP, which is wrong in this case.

Matei
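
When editing the driver program is inconvenient (for example when running the bundled SparkPi), a JVM system property can also be supplied with a -D option, which has the same effect as calling System.setProperty before any other code runs. The sketch below is not confirmed anywhere in this thread. It assumes that your copy of the ./run script passes the SPARK_JAVA_OPTS environment variable to the JVM, and that the driver's reachable address is 10.129.146.13 (the master machine in this thread). The helper name spark_master_host_opt is made up for illustration.

```shell
# Assumption: the driver runs on the machine whose reachable address is
# 10.129.146.13. Substitute the address the slaves can actually reach.
DRIVER_IP="10.129.146.13"

# Hypothetical helper: build the -D option that sets the
# spark.master.host system property.
spark_master_host_opt() {
  printf '%s%s\n' '-Dspark.master.host=' "$1"
}

# Assumption: ./run forwards SPARK_JAVA_OPTS to the JVM. Check the script
# in your Spark checkout before relying on this.
SPARK_JAVA_OPTS="$(spark_master_host_opt "$DRIVER_IP")"
export SPARK_JAVA_OPTS
echo "SPARK_JAVA_OPTS=$SPARK_JAVA_OPTS"

# Then launch the job as before, for example:
#   ./run spark.examples.SparkPi 10.129.146.13:5050
```

If the variable is not picked up by your version of the script, setting the property in the driver code as described above remains the approach recommended in this thread.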

tushar todi

Oct 15, 2012, 6:49:52 AM
to spark...@googlegroups.com
Thanks Matei. I also tried Spark/Shark on an EC2 cluster.