Handling mpd startup errors in mpich2


milan

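Jun 22, 2007, 4:12:15 AM
to China High Performance Computing Forum (中国高性能计算论坛)
I have run into this error at least twice when starting mpd, so I am writing it down for everyone's reference.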
Error message:

$ mpdboot -f file
failed to ping mpd on ... ; recvd output={}

Solution:

1. Make sure you have a correct, working /etc/hosts file ...

2. Verify that you are using the SAME "secretword" on every machine. It is
defined in ~/.mpd.conf in your home directory, which should contain a line
such as secretword=12345

3. Verify that mpich2 is installed in the same directory on every machine
and that the environment variables PATH and LD_LIBRARY_PATH are set
properly.
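
Point 2 can be checked mechanically once the contents of each machine's ~/.mpd.conf have been collected (for example over ssh). The following is only a minimal sketch: the function names are made up for illustration, and it assumes the file holds plain key=value lines as in the secretword=12345 example above.

```python
def read_secretword(conf_text):
    """Return the secretword value from mpd.conf-style text, or None if absent."""
    for line in conf_text.splitlines():
        key, sep, value = line.strip().partition("=")
        # Assumption: the file consists of simple key=value lines.
        if sep and key.strip() == "secretword":
            return value.strip()
    return None


def mismatched_hosts(conf_by_host):
    """Return hosts whose secretword is missing or differs from the first host's.

    conf_by_host maps a host name to the text of that host's ~/.mpd.conf.
    """
    words = {host: read_secretword(text) for host, text in conf_by_host.items()}
    reference = next(iter(words.values()), None)
    return sorted(host for host, word in words.items()
                  if word is None or word != reference)


if __name__ == "__main__":
    # Hypothetical input: two nodes with different secretwords.
    print(mismatched_hosts({
        "pnode1": "secretword=12345\n",
        "pnode2": "secretword=123456\n",
    }))
```

An empty result means every collected file carries the same secretword as the first host.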

milan

Jun 22, 2007, 4:15:03 AM
to China High Performance Computing Forum
A forum member, CZ, ran into the following problem:

I keep hitting the problem you posted here... it is going to kill me...

My hosts file looks like:

127.0.0.1 localhost.localdomain localhost
192.168.0.1 pnode1
192.168.0.2 pnode2
192.168.0.3 pnode3
192.168.0.4 pnode4
192.168.0.5 pnode5
192.168.0.6 pnode6
192.168.0.7 pnode7
192.168.0.8 pnode8

It works well with both ssh and rsh.

My /etc/mpd.conf file is:

secretword=123456

All machines are cloned from one machine, and I am pretty sure that
PATH and LD_LIBRARY_PATH are exactly the same and both correct. I can
start mpd on each node, but when I try to start multiple processes, I get
your error message.

I am, in fact, going mad over this.

Do you have any suggestions?

Thanks

CZ


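Reply: Editing /etc/mpd.conf may not work. Try adding a .mpd.conf file to your home directory
and putting the secretword in it. This is the standard configuration method in the mpich documentation. Note that the file name is: .mpd.conf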
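
As a rough illustration of that advice, the sketch below creates a .mpd.conf in a given directory. The function name is hypothetical, and the owner-only permissions (mode 600) are based on my recollection that the MPICH2 installation guide asks for `chmod 600 .mpd.conf`; treat that as an assumption and check your version of the guide.

```python
import os
import stat
import tempfile


def write_mpd_conf(directory, secretword):
    """Create directory/.mpd.conf holding the secretword, private to the owner."""
    path = os.path.join(directory, ".mpd.conf")
    # Create the file with owner-only permissions from the start.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as conf:
        conf.write(f"secretword={secretword}\n")
    # If the file already existed, tighten its permissions as well.
    os.chmod(path, 0o600)
    return path


if __name__ == "__main__":
    # Demonstrate in a throwaway directory rather than the real home directory.
    with tempfile.TemporaryDirectory() as scratch:
        created = write_mpd_conf(scratch, "12345")
        mode = stat.S_IMODE(os.stat(created).st_mode)
        print(created, oct(mode))
```

For real use, pass `os.path.expanduser("~")` as the directory on every node, with the same secretword each time.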

P.S. The previous thread has been closed, so I posted this new one with some corrections.

lyu...@gmail.com

Jun 22, 2007, 9:20:33 AM
to China High Performance Computing Forum
I have exactly the same problem, and I am pretty sure that 1., 2. and
3. are correct!

any suggestions?

Is this a bug in MPICH2?

Zhou, Chenggang

Jun 22, 2007, 10:28:32 AM
to chin...@googlegroups.com
It is probably because your firewall is running. Turn off your firewall and it should work.

Zhou, Chenggang
Ph. D. Candidate.
Inst. of Theo. Chem. & Comput. Mat. Sci.
China Uni. of Geosciences, Wuhan, China P.R.C.
Lumo Street 388#,Wuhan,Hubei,China P.R.C. 430074
Cell Phone: 86-27-6104-8010
Office Phone: 86-27-6788-3049
Fax Number:  86-27-6788-3431
Email address: cgz...@cug.edu.cn
******************************************************
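
Before switching the firewall off entirely, it may be worth confirming that it really is blocking traffic between the nodes. Here is a minimal sketch, assuming you know a TCP port that something is listening on at the other node (for example the port printed by `mpdcheck -s`); the function name is made up for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For instance, `can_connect("pnode2", 1234)` run from pnode1 (both values hypothetical) returning False while a server is listening on pnode2 would point to a firewall or routing problem rather than to mpd itself.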

lyu...@gmail.com

Jun 22, 2007, 10:30:50 AM
to China High Performance Computing Forum
Guys, it works. I just followed the steps in Appendix A of the
installation guide:

A Troubleshooting MPDs
A.1.1 Following the steps
1. Install mpich2, and thus mpd.
2. Make sure the mpich2 bin directory is in your path. Below, we will
refer to it as MPDDIR.
3. Kill old mpd processes. If you are coming to this guide from
elsewhere, e.g. a Quick Start guide for mpich2, because you encountered
mpd problems, you should make sure that all mpd processes are terminated
on the hosts where you have been testing. mpdallexit may assist in
this, but probably not if you were having problems. You may need to
use the Unix kill command to terminate the processes.
4. Run a first mpd (alone on a first node). As mentioned above, mpd
uses client-server communications to perform its work. So, before
running an mpd, let's run a simpler program (mpdcheck) to verify that
these communications are likely to be successful. Even on hosts where
communications are well supported, sometimes there are problems
associated with hostname resolution, etc. So, it is worth the effort to
proceed a bit slowly. Below, we assume that you have installed mpd
and have it in your path.
Select a test node, let's call it n1. Log in to n1.
First, we will run mpdcheck as a server and a client. To run it as a
server, get into a window with a command-line and run this:
n1 $ mpdcheck -s
It will print something like this:
server listening at INADDR_ANY on: n1 1234
Now, run the client side (in another window if convenient) and see
if it can find the server and communicate. Be sure to use the same
hostname and portnumber printed by the server (above: n1 1234):
n1 $ mpdcheck -c n1 1234
If all goes well, the server will print something like:
server has conn on
<socket._socketobject object at 0x40200f2c>
from ('192.168.1.1', 1234)
server successfully recvd msg from client:
hello_from_client_to_server
and the client will print:
client successfully recvd ack from server:
ack_from_server_to_client
If the experiment failed, you have some network or machine
configuration problem which will also be a problem later when you try
to use mpd. Even if the experiment succeeded, but the hostname printed by
the server was localhost, then you will probably have problems later
if you try to use mpd on n1 in conjunction with other hosts. In either
case, skip to Section A.2 "Debugging host/network configuration
problems."
If the experiment succeeded, then you should be ready to try mpd on
this one host. To start an mpd, you will use the mpd command. To
run parallel programs, you will use the mpiexec program. All mpd
commands accept the -h or --help arguments, e.g.:
n1 $ mpd --help
n1 $ mpiexec --help
Try a few tests:
n1 $ mpd &
n1 $ mpiexec -n 1 /bin/hostname
n1 $ mpiexec -l -n 4 /bin/hostname
n1 $ mpiexec -n 2 PATH_TO_MPICH2_EXAMPLES/cpi
where PATH_TO_MPICH2_EXAMPLES is the path to the mpich2-1.0.3/examples
directory.
To terminate the mpd:
n1 $ mpdallexit
5. Run a second mpd (alone on a second node). To verify that things
are fine on a second host (say n2 ), login to n2 and perform the same
set of tests that you did on n1. Make sure that you use mpdallexit to
terminate the mpd so you will be ready for further tests.
6. Run a ring of two mpds on two hosts. Before running a ring of mpds
on n1 and n2, we will again use mpdcheck, but this time between the
two machines. We do this because the two nodes may have trouble
locating each other or communicating between them and it is easier
to check this out with the smaller program.
First, we will make sure that a server on n1 can service a client from
n2. On n1:
n1 $ mpdcheck -s
which will print a hostname (hopefully n1) and a portnumber (say
3333 here). On n2:
n2 $ mpdcheck -c n1 3333
If this experiment fails, skip to Section A.2 "Debugging host/network
configuration problems".
Second, we will make sure that a server on n2 can service a client
from n1. On n2:
n2 $ mpdcheck -s
which will print a hostname (hopefully n2) and a portnumber (say
7777 here). On n1:
n1 $ mpdcheck -c n2 7777
If this experiment fails, skip to Section A.2 "Debugging host/network
configuration problems".
If all went well, we are ready to try a pair of mpds on n1 and n2.
First, make sure that all mpds have terminated on both n1 and n2.
Use mpdallexit or simply kill them with:
kill -9 PID_OF_MPD
where you have obtained the PID_OF_MPD by some means such as the
ps command.
On n1:
n1 $ mpd &
n1 $ mpdtrace -l
This will print a list of machines in the ring, in this case just n1.
The output will be something like:
n1_6789 (192.168.1.1)
The 6789 is the port that the mpd is listening on for connections
from other mpds wishing to enter the ring. We will use that port in a
moment to get an mpd from n2 into the ring. The value in parentheses
should be the IP address of n1.
On n2:
n2 $ mpd -h n1 -p 6789 &
where 6789 is the listening port on n1 (from mpdtrace above). Now
try:
n2 $ mpdtrace -l
You should see both mpds in the ring.
To run some programs in parallel:
n1 $ mpiexec -n 2 /bin/hostname
n1 $ mpiexec -n 4 /bin/hostname
n1 $ mpiexec -l -n 4 /bin/hostname
n1 $ mpiexec -l -n 4 PATH_TO_MPICH2_EXAMPLES/cpi
where PATH_TO_MPICH2_EXAMPLES is the path to the mpich2-1.0.5/examples
directory.
To bring down the ring of mpds:
n1 $ mpdallexit
7. Boot a ring of two mpds via mpdboot. Please be aware that mpdboot
uses ssh by default to start remote mpds. It will expect that you can
run ssh from n1 to n2 (and from n2 to n1) without entering a password.
First, make sure that you terminate the mpd processes from any prior
tests.
On n1, create a file named mpd.hosts containing the name of n2:
n2
Then, on n1 run:
n1 $ mpdboot -n 2
n1 $ mpdtrace -l
n1 $ mpiexec -l -n 2 /bin/hostname
The mpdboot command should read the mpd.hosts file created above
and run an mpd on each of the two machines. The mpdtrace and
mpiexec show the ring up and functional. Options that may be useful
are:
· --help use this one for extra details on all options
· -v (verbose)
· --chkup tries to verify that the hosts are up before starting mpds
· --chkuponly only performs the verify step, then ends
To bring the ring down:
n1 $ mpdallexit
If mpdboot works on the two machines n1 and n2, it will probably work
on your others as well. But there could be configuration problems
on a new machine on which you have not yet tested mpd. An
easy way to check is to gradually add the new machines to mpd.hosts and
try an mpdboot with a -n arg that uses them all each time. Use mpdallexit
after each test.
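
The gradual approach in the last paragraph can be planned out in advance. The sketch below is illustrative only: the function names are hypothetical, and it assumes (as in step 7, where mpd.hosts lists only n2 and mpdboot is run with -n 2) that the -n value counts the local mpd plus every host in mpd.hosts.

```python
def incremental_rings(other_hosts):
    """Yield (mpd_hosts_lines, mpdboot_n) for rings that grow by one host."""
    for count in range(1, len(other_hosts) + 1):
        # -n is assumed to count the local mpd as well as the listed hosts.
        yield list(other_hosts[:count]), count + 1


def describe(other_hosts):
    """Return one human-readable instruction per test ring."""
    return [
        f"mpd.hosts = {hosts}; run: mpdboot -n {n}, mpdtrace -l, mpdallexit"
        for hosts, n in incremental_rings(other_hosts)
    ]


if __name__ == "__main__":
    # Hypothetical node names; n1 is the machine the commands are run from.
    for step in describe(["n2", "n3", "n4"]):
        print(step)
```

If one ring fails to boot, the host added in that step is the first one to investigate.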

THANKS A LOT!

Zhou, Chenggang

Jun 22, 2007, 12:44:13 PM
to chin...@googlegroups.com
Great!

I've successfully configured and compiled all the necessary programs (i.e. PGI, MPI, GotoBLAS, ...) to run VASP. I bought a few PCs to build such a cluster. Unfortunately, although serial jobs and parallel jobs within one node work well, parallel runs across machines behave very badly.

Once a job runs, it kills one of my PCs. I used mpdboot to start up the mpd ring, but after three attempts I had lost three machines.

I think the problem is due to the poor quality of my gigabit switch. I may need to replace it with a better one. Do you guys have any suggestions on either my problem or a new switch?

Thanks,

CZ

milan

Jun 23, 2007, 12:45:45 AM
to China High Performance Computing Forum
What do you mean by "lost 3 machines"?

Zhou, Chenggang

Jun 23, 2007, 3:37:04 AM
to chin...@googlegroups.com
The systems on those 3 machines halted and needed a reboot.