xmvapich w/ mpich2 on xcpu-cluster


Thomas Eckert
Aug 11, 2009, 7:18:19 AM
to xcpu mailing list
Hi list,

There was a similar thread last year, subject "(s)xcpu and MPI", but no
solution was ever posted.

Daniel Gruner reported problems running MPI jobs (cpi) on multiple
nodes. Part of the problem was name resolution / hostnames, but the
thread ended with at least two open problems:
a) running a job twice on the same node did not work, i.e. it hung:
"xmvapich node1,node1 ./cpi"
b) running a job on the headnode did not work either

My setup is as follows:
- 1 headnode (the headnode is part of the xcpu-cluster as "n00")
- 2 compute-nodes (n01 and n02)
- interconnect: ethernet
- c-nodes boot an initramfs via PXE
- mpich2-1.0.3

- hostnames are ok:
xcpu-head01 examples # xrx -pa hostname
n00: xcpu-head01.local
n02: n02
n01: n01
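Since the nodes only ever see each other's hostnames (not IPs), each
compute node must be able to resolve every other node's name, not just
the headnode. A quick check I could push to every node via xrx (my own
sketch; the node list below is just my cluster, adjust as needed):

```python
import socket

def resolve_ok(host):
    """Return True if 'host' resolves to an IP address on this node."""
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False

# node names from my setup; run this on each node, not just the head
for peer in ('n00', 'n01', 'n02', 'xcpu-head01.local'):
    print(peer, resolve_ok(peer))
```

If any compute node prints False for a peer it is supposed to talk to,
that alone would explain a silent hang when ranks try to connect.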

- basic MPI jobs work too:
xcpu-head01 examples # xmvapich -a ./hellow
Hello world from process 0 of 3
Hello world from process 2 of 3
Hello world from process 1 of 3

- but more complex ones are not:
xcpu-head01 examples # xmvapich -a ./cpi
Process 0 of 3 is on xcpu-head01.local
Process 2 of 3 is on n02
Process 1 of 3 is on n01
(hang)
^C

- BUT running on only one node is ok:
xcpu-head01 examples # xmvapich n01 ./cpi
Process 0 of 1 is on n01
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000258

- running 2 procs on the head works:
xcpu-head01 examples # xmvapich n00,n00 ./cpi
Process 0 of 2 is on xcpu-head01.local
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.001787
Process 1 of 2 is on xcpu-head01.local

- while running 2 on either n01 or n02 hangs:
xcpu-head01 examples # xmvapich -D n01,n02 ./cpi
-pmi-> 0: cmd=initack pmiid=1
<-pmi- 0: cmd=initack rc=0
<-pmi- 0: cmd=set rc=0 size=2
<-pmi- 0: cmd=set rc=0 rank=0
<-pmi- 0: cmd=set rc=0 debug=0
-pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
<-pmi- 0: cmd=response_to_init rc=0
-pmi-> 0: cmd=get_maxes
<-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
-pmi-> 1: cmd=initack pmiid=1
<-pmi- 1: cmd=initack rc=0
<-pmi- 1: cmd=set rc=0 size=2
<-pmi- 1: cmd=set rc=0 rank=1
<-pmi- 1: cmd=set rc=0 debug=0
-pmi-> 0: cmd=get_appnum
<-pmi- 0: cmd=appnum rc=0 appnum=0
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=init pmi_version=1 pmi_subversion=1
<-pmi- 1: cmd=response_to_init rc=0
-pmi-> 1: cmd=get_maxes
<-pmi- 1: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=get_appnum
<-pmi- 1: cmd=appnum rc=0 appnum=0
-pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
value=port#39217$description#n01$
<-pmi- 0: cmd=put_result rc=0
-pmi-> 0: cmd=barrier_in
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard
value=port#55159$description#n02$
<-pmi- 1: cmd=put_result rc=0
-pmi-> 1: cmd=barrier_in
<-pmi- 0: cmd=barrier_out rc=0
<-pmi- 1: cmd=barrier_out rc=0
-pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
<-pmi- 0: cmd=get_result rc=0 value=port#55159$description#n02$
Process 0 of 2 is on n01
Process 1 of 2 is on n02
^C
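In case it helps with debugging: the businesscard values in the trace
use a simple key#value$ encoding, so the last successful exchange tells
rank 0 (on n01) to open a TCP connection to host "n02" on port 55159,
and the hang happens right after that. A minimal parser (my own sketch,
not MPICH code) makes the exchange explicit:

```python
def parse_businesscard(card):
    """Split an MPICH businesscard like 'port#39217$description#n01$'
    into a {key: value} dict; fields are 'key#value$' pairs."""
    fields = {}
    for pair in card.strip('$').split('$'):
        key, _, value = pair.partition('#')
        fields[key] = value
    return fields

# the value rank 0 fetched for P1-businesscard in the trace above
card = parse_businesscard('port#55159$description#n02$')
# rank 0 now tries to connect to host card['description'] on
# port card['port']; if n01 cannot resolve or reach 'n02', that
# connect blocks and cpi appears to hang after the barrier
print(card['description'], card['port'])
```

So my current suspicion is that the PMI exchange itself completes fine
and the hang is in the subsequent rank-to-rank TCP connect.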

I assume that Daniel found a solution, but unfortunately it did not
make it to the list.

Any ideas?

Thanks,

Thomas

