Some questions about GPS on cluster


Lyuwei

Mar 13, 2014, 9:02:10 AM
to stanford...@googlegroups.com
Hello,
When I run GPS on my cluster, the following error occurs on slave1:

ERROR gps.communication.mina.ClientConnectionsEstablisher  - Failed to connect to slave1 at port: 2347. Waiting for: 1000 millis.
org.apache.mina.core.RuntimeIoException: Failed to get the session.

The error repeats over and over, even if I change the port or wait for several minutes. How can I solve it? (The firewall on slave1 is turned off.)

Also, after I run GPS, I see two strange lines in the Ubuntu bash shell:

/bin/sh: 2: -hcf: not found  (this line appears on both the master and the slave)

/home/hadoop/trunk/scripts/start_gps_node.sh: line 3: /home/hadoop/trunk/scripts/../conf/gps-env.sh: No such file or directory    (this line appears on the slaves, but according to the GPS documentation this file does not have to be set)


So I would like to know whether these two lines indicate problems. Thank you!

Best wishes.
Lyuwei

Semih Salihoglu

Mar 13, 2014, 2:33:16 PM
to Lyuwei, stanford...@googlegroups.com
Hi,

The first error, "Failed to connect", should go away after a while. The slaves and master print it until they connect successfully. So give the system some time, say 20 seconds; if the line keeps appearing after that, then your slaves are somehow not connecting to each other.

You do need to fix the other errors. For the missing gps-env.sh error, I'm not sure why the script can't locate the file, but you may have to change either the place you call start_gps_node.sh from or the start_gps_node.sh script itself so that it points to gps-env.sh. For the -hcf error, you should specify that flag so that it points to your Hadoop core-site.xml file (in the directory where you downloaded Hadoop).
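To see which file start_gps_node.sh is actually looking for, the relative path from the error message can be normalized and checked on the slave that printed it. This is only a minimal sketch: the path is copied from the error above, and the helper name `describe_env_path` is made up for illustration. Note that `normpath` collapses the `..` textually and does not follow symlinks.

```python
import os.path
import posixpath

# Path copied verbatim from the "No such file or directory" message above.
ENV_PATH = "/home/hadoop/trunk/scripts/../conf/gps-env.sh"


def describe_env_path(path):
    """Return the normalized path and whether a regular file exists there."""
    resolved = posixpath.normpath(path)  # collapses "scripts/.." textually
    return resolved, os.path.isfile(resolved)


resolved, exists = describe_env_path(ENV_PATH)
print(f"{resolved}: {'found' if exists else 'missing'}")
```

If the file is reported missing on a slave but present on the master, the conf directory was probably not copied to that slave at the same location.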

Hope this helps,

semih 



Lyuwei

Mar 15, 2014, 7:34:39 AM
to stanford...@googlegroups.com, Lyuwei, se...@stanford.edu
Thank you. My slave and master can SSH to each other without a password, and the path to core-site.xml is correct. Is there any other possible cause of these two problems?

On Friday, March 14, 2014 at 2:33:16 AM UTC+8, Semih Salihoglu wrote:

lyu...@sslab.cs.nthu.edu.tw

Mar 18, 2014, 2:51:23 AM
to stanford...@googlegroups.com, Lyuwei, se...@stanford.edu
Thank you. However, my slave and master can SSH to each other without a password, and the path to core-site.xml is correct. Is there any other possible cause of these two problems?

Semih Salihoglu

Mar 18, 2014, 6:29:21 AM
to lyu...@sslab.cs.nthu.edu.tw, stanford...@googlegroups.com, Lyuwei

I can't think of anything off the top of my head. Are you sure your job is failing? As I said, it's OK if you see some failed connection attempts at the beginning, but the nodes should eventually connect.

If you can't resolve it, can you attach your log file? I'm traveling, though, so it might be a few days before I can take a look.

Best,

semih

lyu...@sslab.cs.nthu.edu.tw

Mar 19, 2014, 4:44:31 AM
to stanford...@googlegroups.com, lyu...@sslab.cs.nthu.edu.tw, Lyuwei, se...@stanford.edu
I am sure it is failing, because I usually wait for half a minute. Thank you very much.
quick-start-machine0-output.txt
quick-start-machine-1-output.txt

Semih Salihoglu

Mar 19, 2014, 11:49:05 AM
to lyu...@sslab.cs.nthu.edu.tw, stanford...@googlegroups.com, Lyuwei
I can take a look at this on Sunday. I only have internet access through my phone, and my phone has trouble opening this file. Can someone else try to help Lyuwei until then?

Arash Jalal Zadeh Fard

Mar 19, 2014, 1:04:03 PM
to Semih Salihoglu, lyu...@sslab.cs.nthu.edu.tw, stanford...@googlegroups.com, Lyuwei

Hi Lyuwei,


I will try to help you while Semih is travelling; however, I am only a user of GPS and therefore not an expert at reading its log files.


Your problem is clearly a network problem. First, please make sure that no firewall is enabled on any of your nodes. If you want to keep the firewalls enabled for any reason, you need to unblock incoming and outgoing traffic on all the required ports. Successful SSH between all the nodes only shows that the network connection works on the SSH port; it says nothing about other ports. Then please make sure that you have set up your machine_config and slaves files correctly.
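One way to test ports other than SSH is to try opening a plain TCP connection from each node to every other node. The following is a minimal sketch, not part of GPS; the function names are made up for illustration. Keep in mind that a connection can only succeed while some process is listening on the target port, so run it while GPS (or any test listener) is up on the other node.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hostnames.
        return False


def check_ports(hosts, ports, timeout=3.0):
    """Map every (host, port) pair to whether it is reachable from here."""
    return {(h, p): can_connect(h, p, timeout) for h in hosts for p in ports}
```

For example, calling `check_ports(["master", "slave1", "slave2"], [22, 2346, 2347])` on each node (hostnames and ports as used in this thread) would show whether port 22 is reachable while the GPS ports are not, which would point to a firewall or a process that is not listening.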


Looking at your log files, I can see that the node with localMachineId 0, which should be slave1, has bound to port 2346. It then tries to connect to the node with localMachineId 1, slave2, on the same port and fails. I can also see that numEstablishedConnection on your master increases from 0 to 1, which is a good sign. As far as I understand, it indicates that one slave has connected to the master. The supersteps will start when all the slaves have connected to the master. Can you please also send your machine_config and slaves files? They may help us understand the problem.


Best,

Arash



Arash Jalal Zadeh Fard

Mar 19, 2014, 1:31:01 PM
to Semih Salihoglu, lyu...@sslab.cs.nthu.edu.tw, stanford...@googlegroups.com, Lyuwei

By the way, depending on your system, you may need to wait more than half a minute for the connections to be established.



lyu...@sslab.cs.nthu.edu.tw

Mar 20, 2014, 7:20:05 AM
to stanford...@googlegroups.com, Semih Salihoglu, lyu...@sslab.cs.nthu.edu.tw, Lyuwei, ar...@uga.edu
Thanks, everyone. In the last experiment the ports were all set to 2346, but the problem is the same when the ports are different. So this time I set the port of slave2 to 2222; the logs and config file are attached. By the way, in my other experiments I first set up 3 VMs, and the port of slave1 could not be reached while master and slave2 seemed to run successfully. So I removed slave1 from both GPS and Hadoop, and here is the result.
double.cfg
quick-start-machine0-output.txt
quick-start-machine-1-output.txt

ARASH JALAL ZADEH FARD

Mar 20, 2014, 7:39:42 PM
to lyu...@sslab.cs.nthu.edu.tw, stanford...@googlegroups.com, Semih Salihoglu, Lyuwei

Hi Lyuwei,


Looking at your double.cfg file, I expect that you have "master" and "slave2" at the beginning of your slaves file, in that order and on consecutive lines. Looking at your log files, we can see that the first worker, which has machineId 0 and runs on "master", has successfully connected to the GPS master process, which has machineId -1. We can also see that both fail to connect to the second worker on slave2 (machineId 1) on port 2222. There should be another log file on slave2 that records the activity of machine 1. Apparently, you have a connectivity problem between the two separate nodes of your cluster. A connection problem can have many different causes, but the trickiest one is a firewall. You mentioned that you run GPS on a set of VMs, so you need to check whether they are free to connect to each other on different ports.
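To see quickly which host and port a node keeps failing to reach, the "Failed to connect" lines in an output file can be counted per target. This is only a sketch: the regular expression is based on the single log line quoted at the top of this thread, so it assumes the other failure lines have the same shape, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the log line quoted in the first message:
# "... Failed to connect to slave1 at port: 2347. Waiting for: 1000 millis."
FAILED = re.compile(r"Failed to connect to (\S+) at port: (\d+)")


def count_failed_connections(lines):
    """Count failed connection attempts per (host, port) target."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

Passing an open log file, for instance `count_failed_connections(open("quick-start-machine0-output.txt"))`, returns a counter whose largest entries are the targets that never accepted a connection.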


Best,

Arash



