[slurm-dev] MPI batch job problem

Steven McDougall

Jul 25, 2007, 1:59:48 PM

to slur...@lists.llnl.gov

We have a cluster with a management node and a bunch of compute nodes.
We have a hello-mpi program.

hello-mpi works if we run it on a compute node:

smcdougall@sf0-m0n0:~/mpi>srun -p sf0 hello-mpi
Hello from mpi task number 0
Task 0 is running on the processor named sf0-m0n0.scsystem

It also works if we submit it as a batch job on a compute node:

smcdougall@sf0-m0n0:~/mpi>cat hello-mpi.sh
#!/bin/bash
srun ./hello-mpi

smcdougall@sf0-m0n0:~/mpi>srun -p sf0 -b hello-mpi.sh
srun: jobid 27133 submitted

smcdougall@sf0-m0n0:~/mpi>cat slurm-27133.out
Hello from mpi task number 0
Task 0 is running on the processor named sf0-m0n0.scsystem

It works if we submit it directly from the management node:

smcdougall@ssp024:~/mpi>srun -p sf0 hello-mpi
Hello from mpi task number 0
Task 0 is running on the processor named sf0-m0n0.scsystem

But it fails if we submit it as a batch job from the management node:

smcdougall@ssp024:~/mpi>srun -p sf0 -b hello-mpi.sh
srun: jobid 27135 submitted

smcdougall@ssp024:~/mpi>cat slurm-27135.out
hello-mpi: error: slurm_send_kvs_comm_set: Connection refused
Fatal error in MPI_Init: Error message texts are not available,
error stack:
(unknown)(): Error message texts are not available
srun: error: sf0-m0n0: task0: Exited with exit code 13

I've poked around a little bit and don't see any obvious problems.

We have SLURM_SRUN_COMM_IFHN set correctly on the management node
(otherwise hello-mpi wouldn't run at all when launched from the
management node).

Any suggestions?

jet...@llnl.gov

Jul 25, 2007, 2:20:14 PM

to slur...@lists.llnl.gov, Steven McDougall

When you submit a batch job from any node in the cluster, the
only difference would be the environment being propagated.
Take a look at your environment on the management node and
the compute nodes. For example, the PATH, LD_LIBRARY_PATH,
etc. will get propagated. If those environment variables
are not applicable for the node where the script runs,
there could be a problem. This shouldn't be an issue, but
the ulimits also get propagated (although that can be disabled
in the slurm.conf or by the user).

--
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Morris "Moe" Jette jet...@llnl.gov 925-423-4856
Integrated Computational Resource Management Group fax 925-423-6961
Livermore Computing Lawrence Livermore National Laboratory
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
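
One way to check this is to have the batch script print what it actually received on the node where it runs, before it launches the MPI program. The following is a minimal sketch and not part of the original thread: the function name dump_job_env is hypothetical, and the variable list is just the names mentioned above plus anything starting with SLURM_.

```shell
#!/bin/bash
# Diagnostic sketch: show the environment that reached this node.
# dump_job_env is a hypothetical helper name, not a SLURM command.

dump_job_env() {
    # Which node is the script really running on?
    echo "host: $(uname -n)"
    # Variables named in the reply, plus everything prefixed SLURM_.
    env | grep -E '^(PATH|LD_LIBRARY_PATH|SLURM_[A-Za-z0-9_]*)=' | sort
    # Resource limits are propagated too, so record them as well.
    echo "--- ulimit -a ---"
    ulimit -a
}

dump_job_env
```

Submitting the same script once from a compute node and once from the management node, then comparing the two output files with diff, should show which propagated values differ between the working and failing cases.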

Christopher J. Morrone

Jul 25, 2007, 3:13:24 PM

to slur...@lists.llnl.gov

You should probably not be using the SLURM_SRUN_COMM_IFHN setting when
you do a batch job from the management node. Even though you run the
"srun -b" command on the management node, your batch script will run
somewhere else. Your batch script will always run on the first compute
node of your job allocation, regardless of where you ran the "srun -b"
command. So if you set SLURM_SRUN_COMM_IFHN before calling "srun -b",
it will be the wrong value for the srun command inside of the batch script.

Chris
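
Following this explanation, one possible workaround is to clear the variable inside the batch script, so that the inner srun does not inherit the value exported on the management node. This is a sketch under that assumption and has not been verified on the cluster in question; the check for srun is only there so the script does not fail on a machine without SLURM installed.

```shell
#!/bin/bash
# hello-mpi.sh (sketch): drop the submit-host setting before the inner srun.

# The value was exported on the management node and propagated with the job.
# Per the reply above, it is the wrong value for an srun started from the
# first compute node of the allocation, so remove it here.
unset SLURM_SRUN_COMM_IFHN

if command -v srun >/dev/null 2>&1; then
    srun ./hello-mpi
else
    # Assumption: only reached when testing the script outside the cluster.
    echo "srun not found; skipping launch" >&2
fi
```

An alternative with the same intent is to leave the script unchanged and strip the variable only for the submission, for example `env -u SLURM_SRUN_COMM_IFHN srun -p sf0 -b hello-mpi.sh` (assuming an env that supports -u, such as the GNU coreutils one). That keeps the setting available for interactive srun runs from the management node.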