[slurm-users] Submitting jobs across multiple nodes fails

Andrej Prsa

Feb 2, 2021, 12:15:26 AM
to slurm...@lists.schedmd.com
Dear list,

I'm struggling with what seems to be very similar to this thread:

https://lists.schedmd.com/pipermail/slurm-users/2019-July/003746.html

I'm using slurm 20.11.3 patched with this fix to detect pmixv4:

    https://bugs.schedmd.com/show_bug.cgi?id=10683

and this is what I'm seeing:

andrej@terra:~$ salloc -N 2 -n 2
salloc: Granted job allocation 841
andrej@terra:~$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=841.0 aborted before
step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 0 launch failed: Unspecified error
srun: error: task 1 launch failed: Unspecified error

In slurmctld.log I have this:

[2021-02-01T23:58:13.683] sched: _slurm_rpc_allocate_resources JobId=841
NodeList=node[9-10] usec=572
[2021-02-01T23:58:19.817] error: mpi_hook_slurmstepd_prefork failure for
0x557e7480bcb0s on node9
[2021-02-01T23:58:19.829] error: mpi_hook_slurmstepd_prefork failure for
0x55f568e00cb0s on node10

and in slurmd.log I have this for node9:

[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000
GID:1000 HOST:192.168.1.1 PORT:35508
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841
implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic:
_task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind:
_lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_utils.c:108
[pmixp_usock_create_srv] mpi/pmix: ERROR: Cannot bind() UNIX socket
/var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_server.c:387
[pmixp_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
[2021-02-01T23:58:19.814] [841.0] error: (null) [0] mpi_pmix.c:169
[p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.817] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.845] [841.0] error: job_manager exiting abnormally,
rc = -1
[2021-02-01T23:58:19.892] [841.0] done with job

and this for node10:

[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000
GID:1000 HOST:192.168.1.1 PORT:38918
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841
implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic:
_task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind:
_lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.825] [841.0] error: node10 [1]
pmixp_client_v2.c:246 [pmixp_lib_init] mpi/pmix: ERROR: PMIx_server_init
failed with error -2
: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_client.c:518
[pmixp_libpmix_init] mpi/pmix: ERROR: PMIx_server_init failed with error -1
: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_server.c:423
[pmixp_stepd_init] mpi/pmix: ERROR: pmixp_libpmix_init() failed
[2021-02-01T23:58:19.826] [841.0] error: (null) [1] mpi_pmix.c:169
[p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.829] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.853] [841.0] error: job_manager exiting abnormally,
rc = -1
[2021-02-01T23:58:19.899] [841.0] done with job

It seems that the culprit is the bind() failure, but I can't make much
sense of it. I checked that /etc/hosts has everything correct and
consistent with the info in slurm.conf.

Other potentially relevant info: all compute nodes are diskless; they
are PXE-booted from a NAS image and run Ubuntu Server 20.04. Running
jobs on a single node works fine.
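
One thing I still want to rule out, given the shared boot image, is
whether /var/spool/slurmd ends up node-local on every node or shared
between them; something along these lines should tell (spool path taken
from the slurmd log below, node names from the allocation):

scontrol show config | grep SlurmdSpoolDir
ssh node9 'df -hT /var/spool/slurmd; ls -l /var/spool/slurmd'
ssh node10 'df -hT /var/spool/slurmd; ls -l /var/spool/slurmd'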

Thanks for any insight and suggestions.

Cheers,
Andrej


Andrej Prsa

Feb 4, 2021, 4:21:27 PM
to slurm...@lists.schedmd.com
A gentle bump on this, in case anyone has suggestions while I weed through the scattered slurm docs. :)

Thanks, 
Andrej

Brian Andrus

Feb 4, 2021, 5:47:04 PM
to slurm...@lists.schedmd.com

Did you compile slurm with mpi support?

Your MPI libraries should match the version slurm was built against, and they should be available in the same locations on all nodes.
Also, ensure they are accessible (PATH, LD_LIBRARY_PATH, etc. are set).
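
A quick sanity check along these lines (substituting your own node names)
can confirm that:

for n in node9 node10; do
    ssh "$n" 'which mpirun; mpirun --version | head -n1; echo "$LD_LIBRARY_PATH"'
done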

Brian Andrus

Andrej Prsa

Feb 4, 2021, 7:56:15 PM
to slurm...@lists.schedmd.com
Hi Brian,

Thanks for your response!

> Did you compile slurm with mpi support?
>

Yep:

andrej@terra:~$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v4
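
If it helps to narrow things down, I can also check which libpmix the
plugin actually links against, e.g. (the plugin path below is just a
guess; it depends on where slurm got installed):

ldd /usr/lib/slurm/mpi_pmix_v4.so | grep -i pmix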

> Your mpi libraries should be the same as that version and they should
> be available in the same locations for all nodes. Also, ensure they
> are accessible (PATH, LD_LIBRARY_PATH, etc are set)
>

They are: I have openmpi-4.1.0 installed cluster-wide. Running jobs via
rsh across multiple nodes works just fine, but running them through
slurm does not (within salloc):

mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96
python testmpi.py # works
mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96
python testmpi.py # doesn't work

Thus, I believe that mpi itself works just fine. I ran this by the
ompi-devel folks and they are convinced that the issue is in the slurm
configuration. I'm trying to figure out what's causing this error to pop
up in the logs:

mpi/pmix: ERROR: Cannot bind() UNIX socket
/var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
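
Since that is EADDRINUSE on the step daemon's UNIX socket, one thing I
mean to check is whether stale stepd.slurm.pmix.* sockets linger in the
spool directory on the compute nodes between runs, e.g.:

ssh node9 'ls -l /var/spool/slurmd/stepd.slurm.pmix.* 2>/dev/null'   # any leftover step sockets?
ssh node9 'stat -f -c %T /var/spool/slurmd'                          # local filesystem, or network-mounted?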

I wonder if the culprit is how srun calls openmpi's --bind-to?

Thanks again,
Andrej


Brian Andrus

Feb 4, 2021, 9:34:43 PM
to slurm...@lists.schedmd.com
try:

export SLURM_OVERLAP=1
export SLURM_WHOLE=1

before your salloc and see if that helps. I have seen some mpi issues
that were resolved with that.

You can also try running the regular mpirun directly on the allocated
nodes. That will provide a useful data point as well.
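
For example, something like this from inside the allocation (pulling
the hostnames from the job's node list):

salloc -N 2 -n 2
mpirun -np 2 -H $(scontrol show hostnames "$SLURM_JOB_NODELIST" | paste -sd, -) hostname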

Brian Andrus

Andrej Prsa

Feb 4, 2021, 9:57:43 PM
to slurm...@lists.schedmd.com
Hi Brian,


> try:
>
> export SLURM_OVERLAP=1
> export SLURM_WHOLE=1
>
> before your salloc and see if that helps. I have seen some mpi issues
> that were resolved with that.

Unfortunately no dice:

andrej@terra:~$ export SLURM_OVERLAP=1
andrej@terra:~$ export SLURM_WHOLE=1
andrej@terra:~$ salloc -N2 -n2
salloc: Granted job allocation 864
andrej@terra:~$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=864.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 1 launch failed: Unspecified error
srun: error: task 0 launch failed: Unspecified error

> You can also try running the regular mpirun directly on the allocated
> nodes. That will provide a useful data point as well.

Same as above, unfortunately.

But: I can get it to work correctly if I replace MpiDefault=pmix with MpiDefault=none in slurm.conf. It looks like there's something amiss with pmix support in slurm?

andrej@terra:~$ salloc -N2 -n2
salloc: Granted job allocation 866
andrej@terra:~$ srun hostname
node11
node10
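
With MpiDefault=none I suppose I can still request pmix explicitly per
step, which should make it easy to poke at just the pmix path, e.g.:

srun --mpi=pmix_v4 -N2 -n2 hostname   # exercise the pmix plugin explicitly
srun --mpi=none -N2 -n2 hostname      # bypass it entirely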

Cheers,
Andrej
