Hi Lyuwei,
I will try to help while Semih is travelling. Please keep in mind, though, that I am only a user of GPS and therefore no expert at reading its log files.
Clearly, your problem is a network problem. First, please make sure that no firewall is enabled on any of your nodes. If you want to keep the firewalls enabled for any reason, you need to unblock incoming and outgoing traffic for all the required ports.
A successful ssh between all the nodes only shows that the network connection works on the ssh port; it says nothing about the other ports. Next, please make sure that you have set up your machine_config and slaves files correctly.
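To check ports other than ssh, something like the following Python sketch can be run from each node against the others. It is not part of GPS; the function names `probe` and `probe_all` are made up for illustration, and the host names and ports you pass in should be the ones from your own machine_config. Note that "refused" usually means the host is reachable but nothing is listening on that port (for example, the GPS worker is not running yet), while "timeout" is more typical of a firewall silently dropping packets.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connection to host:port and describe the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host answered, but no process is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout; often a dropped packet.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. the host name could not be resolved.
        return "error: {}".format(exc)


def probe_all(hosts, ports, timeout=3.0):
    """Probe every host/port combination; returns {(host, port): outcome}."""
    return {(h, p): probe(h, p, timeout) for h in hosts for p in ports}
```

For example, calling `probe_all(["master", "slave2"], [2222, 2346])` from each machine in turn should show which direction and which port is failing.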
Looking at your log files, I can see that the node with localMachineId 0, which should be slave1, has bound to port 2346. It then tries to connect to the node with localMachineId 1 (slave2) on the same port and fails. I can also see that numEstablishedConnection on your master increases from 0 to 1, which is a good sign. As far as I understand, it indicates that one slave has connected to the master. The supersteps will start once all the slaves have connected to the master. Can you please also send your machine_config and slaves files? They may help to understand the problem.
Best,
Arash
By the way, depending on your system, you may need to wait more than half a minute for the connections to be established.
Hi Lyuwei,
Looking at your double.cfg file, I expect your slaves file to begin with "master" and "slave2", in that order and on consecutive lines. Your log files show that the first worker, which has machineId 0 and runs on "master", has successfully connected to the GPS master process, which has machineId -1. They also show that both of them fail to connect to the second worker on slave2 (machineId 1) on port 2222. There should be another log file on slave2 that records the activity of machine 1. Apparently you have a connectivity problem between the two nodes of your cluster. There can be many reasons for this, but the trickiest one is a firewall. You mentioned that you run GPS on a set of VMs, so you need to check whether they are allowed to reach each other on the ports GPS uses.
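One way to separate a firewall problem from a GPS problem is to stop GPS, start a throwaway listener on slave2 on the port in question, and then try to connect to it from master (for instance with telnet). The Python sketch below is only an illustration and has nothing to do with GPS itself; `listen_once` is a made-up name, and the 60 second default timeout is an arbitrary choice.

```python
import socket


def listen_once(port, host="0.0.0.0", timeout=60.0):
    """Accept a single TCP connection on host:port.

    Returns the peer's (ip, port) tuple, or None if nobody
    connected before the timeout expired.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        # Allow quick re-runs on the same port.
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        srv.settimeout(timeout)
        try:
            conn, peer = srv.accept()
        except socket.timeout:
            return None
        conn.close()
        return peer
```

If `listen_once(2222)` on slave2 returns None while master is trying to connect, the traffic is most likely being blocked between the two VMs. If it returns master's address, the port is reachable and the problem is more likely in the GPS configuration.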
Best,
Arash