hello-mpi works if we run it on a compute node:
smcdougall@sf0-m0n0:~/mpi>srun -p sf0 hello-mpi
Hello from mpi task number 0
Task 0 is running on the processor named sf0-m0n0.scsystem
It also works if we submit it as a batch job on a compute node:
smcdougall@sf0-m0n0:~/mpi>cat hello-mpi.sh
#!/bin/bash
srun ./hello-mpi
smcdougall@sf0-m0n0:~/mpi>srun -p sf0 -b hello-mpi.sh
srun: jobid 27133 submitted
smcdougall@sf0-m0n0:~/mpi>cat slurm-27133.out
Hello from mpi task number 0
Task 0 is running on the processor named sf0-m0n0.scsystem
It works if we submit it directly from the management node:
smcdougall@ssp024:~/mpi>srun -p sf0 hello-mpi
Hello from mpi task number 0
Task 0 is running on the processor named sf0-m0n0.scsystem
But it fails if we submit it as a batch job from the management node:
smcdougall@ssp024:~/mpi>srun -p sf0 -b hello-mpi.sh
srun: jobid 27135 submitted
smcdougall@ssp024:~/mpi>cat slurm-27135.out
hello-mpi: error: slurm_send_kvs_comm_set: Connection refused
Fatal error in MPI_Init: Error message texts are not available,
error stack:
(unknown)(): Error message texts are not available
srun: error: sf0-m0n0: task0: Exited with exit code 13
I've poked around a little and don't see any obvious problems.
SLURM_SRUN_COMM_IFHN is set correctly on the management node
(otherwise hello-mpi wouldn't run at all when launched from there).
Any suggestions?
--
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Morris "Moe" Jette jet...@llnl.gov 925-423-4856
Integrated Computational Resource Management Group fax 925-423-6961
Livermore Computing Lawrence Livermore National Laboratory
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
You should probably not be using the SLURM_SRUN_COMM_IFHN setting when
you do a batch job from the management node. Even though you run the
"srun -b" command on the management node, your batch script will run
somewhere else. Your batch script will always run on the first compute
node of your job allocation, regardless of where you ran the "srun -b"
command. So if you set SLURM_SRUN_COMM_IFHN before calling "srun -b",
the batch script inherits it, and it will be the wrong value for the
srun command inside the batch script.
Chris
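
One way to act on the advice above, as a rough sketch only: keep SLURM_SRUN_COMM_IFHN set for interactive runs from the management node, but drop it for the single "srun -b" submission. The helper name without_ifhn is hypothetical (it is not part of SLURM), and the sketch assumes that "srun -b" hands the submitting shell's environment to the batch script, which is what the reply implies.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): run a command with
# SLURM_SRUN_COMM_IFHN removed from its environment. The subshell keeps
# the caller's own setting untouched for later interactive srun calls.
without_ifhn() {
    (
        unset SLURM_SRUN_COMM_IFHN
        "$@"
    )
}

# Intended use on the management node (assumption: "srun -b" passes the
# submitting environment on to the batch script):
#   without_ifhn srun -p sf0 -b hello-mpi.sh
```

An alternative that does not depend on how the job was submitted would be to put "unset SLURM_SRUN_COMM_IFHN" in hello-mpi.sh just before its srun line, so the inner srun on the compute node never sees the management node's value.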