Thomas Eckert
Aug 11, 2009, 7:18:19 AM
to xcpu mailing list
Hi list,
there was a similar thread last year, subject "(s)xcpu and MPI", that
ended w/o a solution being posted.
Daniel Gruner reported problems w/ running mpi-jobs (cpi) on multiple
nodes. Part of the problem was name resolution / hostnames, but the
thread ended with at least 2 open problems:
a) running a job with two procs on the same node did not work, i.e. it
hung: "xmvapich node1,node1 ./cpi"
b) running a job on the headnode did not work either
My setup is as follows:
- 1 headnode (the headnode is part of the xcpu-cluster as "n00")
- 2 compute-nodes (n01 and n02)
- interconnect: ethernet
- c-nodes boot an initramfs via PXE
- mpich2-1.0.3
- hostnames are ok:
xcpu-head01 examples # xrx -pa hostname
n00: xcpu-head01.local
n02: n02
n01: n01
- basic mpi-jobs work too:
xcpu-head01 examples # xmvapich -a ./hellow
Hello world from process 0 of 3
Hello world from process 2 of 3
Hello world from process 1 of 3
- more complex ones do not:
xcpu-head01 examples # xmvapich -a ./cpi
Process 0 of 3 is on xcpu-head01.local
Process 2 of 3 is on n02
Process 1 of 3 is on n01
(hang)
^C
- BUT running on only one node is ok:
xcpu-head01 examples # xmvapich n01 ./cpi
Process 0 of 1 is on n01
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000258
- running 2 procs on the head works:
xcpu-head01 examples # xmvapich n00,n00 ./cpi
Process 0 of 2 is on xcpu-head01.local
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.001787
Process 1 of 2 is on xcpu-head01.local
- while running 2 procs across n01 and n02 hangs:
xcpu-head01 examples # xmvapich -D n01,n02 ./cpi
-pmi-> 0: cmd=initack pmiid=1
<-pmi- 0: cmd=initack rc=0
<-pmi- 0: cmd=set rc=0 size=2
<-pmi- 0: cmd=set rc=0 rank=0
<-pmi- 0: cmd=set rc=0 debug=0
-pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
<-pmi- 0: cmd=response_to_init rc=0
-pmi-> 0: cmd=get_maxes
<-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
-pmi-> 1: cmd=initack pmiid=1
<-pmi- 1: cmd=initack rc=0
<-pmi- 1: cmd=set rc=0 size=2
<-pmi- 1: cmd=set rc=0 rank=1
<-pmi- 1: cmd=set rc=0 debug=0
-pmi-> 0: cmd=get_appnum
<-pmi- 0: cmd=appnum rc=0 appnum=0
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=init pmi_version=1 pmi_subversion=1
<-pmi- 1: cmd=response_to_init rc=0
-pmi-> 1: cmd=get_maxes
<-pmi- 1: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=get_appnum
<-pmi- 1: cmd=appnum rc=0 appnum=0
-pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard value=port#39217$description#n01$
<-pmi- 0: cmd=put_result rc=0
-pmi-> 0: cmd=barrier_in
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard value=port#55159$description#n02$
<-pmi- 1: cmd=put_result rc=0
-pmi-> 1: cmd=barrier_in
<-pmi- 0: cmd=barrier_out rc=0
<-pmi- 1: cmd=barrier_out rc=0
-pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
<-pmi- 0: cmd=get_result rc=0 value=port#55159$description#n02$
Process 0 of 2 is on n01
Process 1 of 2 is on n02
^C
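For anyone who wants to pick the trace apart: in the -D output each rank publishes a "businesscard" (a port plus a description that looks like the hostname), and rank 0 then fetches rank 1's card, presumably so it can open a connection to that host and port. That reading is an assumption based only on the trace, not on the xmvapich or mpich2 sources. Below is a minimal Python sketch that extracts the published endpoints so they can be checked by hand (name resolution, firewall, etc.). It assumes each PMI message sits on a single line (a mail client may wrap the long "put" lines), and the function names are made up for illustration.

```python
import re

# One line of `xmvapich -D` output, e.g.
#   -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard value=port#39217$description#n01$
# Assumption: "-pmi->" marks a request from a rank, "<-pmi-" the reply to it.
PMI_LINE = re.compile(r"^(?P<dir>-pmi->|<-pmi-)\s+(?P<rank>\d+):\s+(?P<body>.*)$")


def parse_pmi_line(line):
    """Return (direction, rank, {key: value}) or None for non-PMI lines."""
    m = PMI_LINE.match(line.strip())
    if m is None:
        return None
    fields = dict(tok.split("=", 1) for tok in m.group("body").split() if "=" in tok)
    return m.group("dir"), int(m.group("rank")), fields


def parse_business_card(value):
    """Decode 'port#39217$description#n01$' into a dict of its '#'-separated pairs."""
    card = {}
    for item in value.split("$"):
        if "#" in item:
            key, val = item.split("#", 1)
            card[key] = val
    return card


def published_endpoints(trace):
    """Map rank -> (description, port) for every businesscard 'put' in the trace."""
    endpoints = {}
    for line in trace.splitlines():
        parsed = parse_pmi_line(line)
        if parsed is None:
            continue
        direction, _rank, fields = parsed
        key = fields.get("key", "")
        if direction == "-pmi->" and fields.get("cmd") == "put" and key.endswith("-businesscard"):
            card = parse_business_card(fields.get("value", ""))
            owner = int(key[1:key.index("-")])  # "P1-businesscard" -> 1
            endpoints[owner] = (card.get("description"), int(card.get("port", 0)))
    return endpoints


SAMPLE = """\
-pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard value=port#39217$description#n01$
<-pmi- 0: cmd=put_result rc=0
-pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard value=port#55159$description#n02$
<-pmi- 1: cmd=put_result rc=0
Process 0 of 2 is on n01
"""

if __name__ == "__main__":
    for rank, (host, port) in sorted(published_endpoints(SAMPLE).items()):
        print(f"rank {rank} published {host}:{port}")
```

With the endpoints in hand, one thing to check is whether n01 can resolve and reach "n02" on the published port (and vice versa), since the hang starts right after rank 0 reads rank 1's card.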
I assume Daniel found a solution, but unfortunately it did not make it
to the list.
Any ideas?
Thanks,
Thomas