[slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05


Andrés Marín Díaz

Jun 5, 2019, 1:05:22 PM
to slurm...@lists.schedmd.com
Hello, since we updated to the new Slurm version (19.05), every time a
job step is launched with mpirun it fails with the following error message:

    An ORTE daemon has unexpectedly failed after launch and before
    communicating back to mpirun. This could be caused by a number
    of factors, including an inability to create a connection back
    to mpirun due to lack of common network interfaces and / or no
    route found between them. Please check network connectivity
    (including firewalls and network routing requirements).

This only happens when the job is launched across more than one node. If all
tasks run within a single node it works without problems.

We have tested different versions of OpenMPI (2.1.2, 3.1.1, 3.1.3),
all of them compiled with the flags --with-slurm and --with-pmi. In all
cases, if the job is launched on nodes running Slurm 18.08 it works with
both srun and mpirun, but if it is launched on nodes running Slurm 19.05 it
works with srun and fails with mpirun.
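
For reference, a minimal reproducer looks like the following (the module
name and binary are only placeholders for our environment):

    #!/bin/bash
    #SBATCH --job-name=ompi-test
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1

    # Placeholder module for whichever OpenMPI build is being tested
    module load openmpi/3.1.3

    # Works with "srun ./hello_mpi"; with mpirun it ends with the ORTE error above
    mpirun ./hello_mpi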

Can it be a bug in the new version?

Thank you.


--
Andrés Marín Díaz

Servicio de Infraestructura e Innovación
Universidad Politécnica de Madrid

Centro de Supercomputación y Visualización de Madrid (CeSViMa)
Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
ama...@cesvima.upm.es | tel 910679676

www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima



Chris Samuel

Jun 6, 2019, 2:41:22 AM
to slurm...@lists.schedmd.com
On Wednesday, 5 June 2019 10:04:11 AM PDT Andrés Marín Díaz wrote:

> Can it be a bug in the new version?

If it's working with srun but not with mpirun it sounds like there's some
incompatibility between how mpirun is calling srun to launch orted and what
Slurm is doing now.

You'd need to find a way to trace mpirun - I think it's just a shell script so
running it with "bash -x mpirun {etc}" would probably do it.

That said you're probably better off just using srun anyway.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA




Andrés Marín Díaz

Jun 6, 2019, 6:12:29 AM
to slurm...@lists.schedmd.com
Thank you very much for the help; here is some updated information.

- If we use Intel MPI (IMPI) mpirun it works correctly.
- If we use mpirun without using the scheduler it works correctly.
- If we use srun with software compiled with OpenMPI it works correctly.
- If we use SLURM 18.08.6 it works correctly.
- If we use SLURM 19.05.0 and mpirun inside the sbatch script then we
get the error:
--------------------------------------------------------------------------
    An ORTE daemon has unexpectedly failed after launch and before
    communicating back to mpirun. This could be caused by a number
    of factors, including an inability to create a connection back
    to mpirun due to lack of common network interfaces and / or no
    route found between them. Please check network connectivity
    (including firewalls and network routing requirements).
--------------------------------------------------------------------------

Trying to trace the problem:
- mpirun is a binary, so it cannot be traced with bash -x.
- I have run "strace mpirun hostname" to see if it helps, but I am not
able to see where the problem may be.

Here is the output from strace:
https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW

And here is the slurmd log with verbose level 5:
Main node (slurmd log):
    2019-06-06T09:51:54.255743+00:00 r1n1 slurmd[108517]: _run_prolog:
run job script took usec=7
    2019-06-06T09:51:54.256118+00:00 r1n1 slurmd[108517]: _run_prolog:
prolog with lock for job 11057 ran for 0 seconds
    2019-06-06T09:51:54.258887+00:00 r1n1 slurmd[108517]:
task_p_slurmd_batch_request: 11057
    2019-06-06T09:51:54.259317+00:00 r1n1 slurmd[108517]:
task/affinity: job 11057 CPU input mask for node: 0x0000000001
    2019-06-06T09:51:54.259680+00:00 r1n1 slurmd[108517]:
task/affinity: job 11057 CPU final HW mask for node: 0x0000000001
    2019-06-06T09:51:54.279614+00:00 r1n1 slurmstepd[108548]: task
affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
    2019-06-06T09:51:54.280312+00:00 r1n1 slurmstepd[108548]: Munge
credential signature plugin loaded
    2019-06-06T09:51:54.302921+00:00 r1n1 slurmstepd[108548]:
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB
memsw.limit=unlimited
    2019-06-06T09:51:54.303592+00:00 r1n1 slurmstepd[108548]:
task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB
mem.limit=1024MB memsw.limit=unlimited
    2019-06-06T09:51:54.330322+00:00 r1n1 slurmd[108517]: Launching
batch job 11057 for UID 2000
    2019-06-06T09:51:54.353196+00:00 r1n1 slurmstepd[108556]: task
affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
    2019-06-06T09:51:54.353899+00:00 r1n1 slurmstepd[108556]: Munge
credential signature plugin loaded
    2019-06-06T09:51:54.366478+00:00 r1n1 slurmstepd[108556]:
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB
memsw.limit=unlimited
    2019-06-06T09:51:54.366755+00:00 r1n1 slurmstepd[108556]:
task/cgroup: /slurm/uid_2000/job_11057/step_batch: alloc=1024MB
mem.limit=1024MB memsw.limit=unlimited
    2019-06-06T09:51:54.393325+00:00 r1n1 slurmstepd[108556]: debug
level = 2
    2019-06-06T09:51:54.393754+00:00 r1n1 slurmstepd[108556]: starting
1 tasks
    2019-06-06T09:51:54.401243+00:00 r1n1 slurmstepd[108556]: task 0
(108561) started 2019-06-06T09:51:54
    2019-06-06T09:51:54.416396+00:00 r1n1 slurmstepd[108561]:
task_p_pre_launch: Using sched_affinity for tasks
    2019-06-06T09:51:56.514908+00:00 r1n1 slurmstepd[108556]: task 0
(108561) exited with exit code 1.
    2019-06-06T09:51:56.554430+00:00 r1n1 slurmstepd[108556]: job 11057
completed with slurm_rc = 0, job_rc = 256
    2019-06-06T09:51:56.554847+00:00 r1n1 slurmstepd[108556]: sending
REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
    2019-06-06T09:51:56.559856+00:00 r1n1 slurmstepd[108556]: done with job
    2019-06-06T09:51:56.596762+00:00 r1n1 slurmstepd[108548]: Sent
signal 18 to 11057.4294967295
    2019-06-06T09:51:56.598072+00:00 r1n1 slurmstepd[108548]: Sent
signal 15 to 11057.4294967295
    2019-06-06T09:51:56.599141+00:00 r1n1 slurmstepd[108548]:
_oom_event_monitor: oom-kill event count: 1
    2019-06-06T09:51:56.641170+00:00 r1n1 slurmstepd[108548]: done with job

Secondary node (slurmd log):
    2019-06-06T09:51:54.256047+00:00 r1n2 slurmd[84916]: _run_prolog:
run job script took usec=7
    2019-06-06T09:51:54.256432+00:00 r1n2 slurmd[84916]: _run_prolog:
prolog with lock for job 11057 ran for 0 seconds
    2019-06-06T09:51:54.279763+00:00 r1n2 slurmstepd[84954]: task
affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
    2019-06-06T09:51:54.280448+00:00 r1n2 slurmstepd[84954]: Munge
credential signature plugin loaded
    2019-06-06T09:51:54.313852+00:00 r1n2 slurmstepd[84954]:
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB
memsw.limit=unlimited
    2019-06-06T09:51:54.314502+00:00 r1n2 slurmstepd[84954]:
task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB
mem.limit=1024MB memsw.limit=unlimited
    2019-06-06T09:51:56.564758+00:00 r1n2 slurmstepd[84954]: Sent
signal 18 to 11057.4294967295
    2019-06-06T09:51:56.608028+00:00 r1n2 slurmstepd[84954]: Sent
signal 15 to 11057.4294967295
    2019-06-06T09:51:56.609259+00:00 r1n2 slurmstepd[84954]:
_oom_event_monitor: oom-kill event count: 1
    2019-06-06T09:51:56.638334+00:00 r1n2 slurmstepd[84954]: done with job

Thank you very much again.

Sean Crosby

Jun 6, 2019, 6:46:38 AM
to Slurm User Community List
Hi Andrés,

Did you recompile OpenMPI after updating to SLURM 19.05?

Sean

--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne

Andrés Marín Díaz

Jun 6, 2019, 7:12:40 AM
to slurm...@lists.schedmd.com

Hello,

Yes, we have recompiled OpenMPI with SLURM 19.05 integration, but the problem remains.

We have also tried recompiling OpenMPI without SLURM integration. In that case executions fail with srun, but with mpirun they continue to work on SLURM 18.08 and fail on 19.05 in the same way.
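
(For clarity, by "integration" we mean a configure line of roughly this
form; the prefix and PMI path are only placeholders for our setup:)

    ./configure --prefix=/opt/openmpi/3.1.3 \
                --with-slurm \
                --with-pmi=/usr
    make -j && make install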

Thank you very much once more.

Sean Crosby

Jun 6, 2019, 7:28:31 AM
to Slurm User Community List
How did you compile SLURM? Did you add the contribs/pmi and/or contribs/pmi2 plugins to the install? Or did you use PMIx?

Sean

--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne

Andrés Marín Díaz

Jun 6, 2019, 9:59:22 AM
to slurm...@lists.schedmd.com

Hello,

We have tried compiling it in two ways. Initially we compiled it with PMIx as follows:
rpmbuild -ta slurm-19.05.0.tar.bz2 --define '_with_pmix --with-pmix=/opt/pmix/3.1.2/'

But we have also tried compiling it without PMIx:
rpmbuild -ta slurm-19.05.0.tar.bz2

In both cases the result is the same.

In slurm.conf we have defined:
MpiDefault=pmi2
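
(As a sanity check, the MPI plugin types that the Slurm build actually
provides can be listed with:)

    srun --mpi=list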

Thank you,
Best regards.

Levi Morrison

Jun 6, 2019, 1:16:59 PM
to slurm...@schedmd.com
Slurm 19.05 removed support for `--cpu_bind`, which is what /all/
released versions of OpenMPI are using when they call into srun. This
issue was fixed 24 days ago in [OpenMPI's git repo][1].

This means /all/ OpenMPI programs that end up calling `srun` on Slurm
19.05 will fail.
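
The breakage is easy to reproduce by hand on a 19.05 node, since the old
spelling is rejected while the new one is accepted (exact error text may
differ):

    srun --cpu_bind=none hostname   # fails on 19.05: option no longer recognized
    srun --cpu-bind=none hostname   # works on 19.05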

This enormous amount of breakage for such a minor "gain" seems unwise. I
think this [change][2] should be backed out and converted to a warning
message to allow time for the OpenMPI changes to be backported,
released, and adopted.

Levi Morrison
Brigham Young University

  [1]:
https://github.com/open-mpi/ompi/commit/7dad74032e30259506da7fa582dd8c4351e6e0a1
  [2]:
https://github.com/SchedMD/slurm/commit/d78af893e4a60e933a2319b0c36a0e40c7dd1b02


Levi Morrison

Jun 6, 2019, 1:21:43 PM
to slurm...@lists.schedmd.com
Slurm 19.05 removed support for `--cpu_bind`, which is what all released
versions of OpenMPI are using when they call into srun. This issue was
fixed 24 days ago in [OpenMPI's git repo][1].

This means all OpenMPI programs that end up calling `srun` on Slurm
19.05 will fail.

This enormous amount of breakage for such a minor "gain" seems unwise. I
think this [change][2] should be backed out and converted to a warning
message to allow time for the OpenMPI changes to be backported,
released, and adopted. Theoretically they were given time with the 17.11
release (I think?) but since it's only just landed...

Christopher Samuel

Jun 6, 2019, 2:15:23 PM
to slurm...@lists.schedmd.com
On 6/6/19 10:21 AM, Levi Morrison wrote:

> This means all OpenMPI programs that end up calling `srun` on Slurm
> 19.05 will fail.

Sounds like a good reason to file a bug. We're not on 19.05 yet so
we're not affected (yet) but this may cause us some pain when we get to
that point (though at least "use srun" should fix it).

Kilian Cavalotti

Jun 6, 2019, 3:03:04 PM
to Slurm User Community List
On Thu, Jun 6, 2019 at 11:16 AM Christopher Samuel <ch...@csamuel.org> wrote:
> Sounds like a good reason to file a bug.

Levi did already. Everybody can vote at
https://bugs.schedmd.com/show_bug.cgi?id=7191 :)

Cheers,
--
Kilian

Christopher Samuel

Jun 6, 2019, 3:43:57 PM
to slurm...@lists.schedmd.com
On 6/6/19 12:01 PM, Kilian Cavalotti wrote:

> Levi did already.

Aha, race condition between searching bugzilla and writing the email. ;-)

Andrés Marín Díaz

Jun 7, 2019, 4:24:48 AM
to slurm...@lists.schedmd.com
Good morning, and thank you very much to all for helping us find the problem.

I support Levi's proposal to revert the change.

Is there any way to temporarily patch the Slurm 19.05 code while the
proposal is being considered, so that we do not have to patch and recompile
the different versions of OpenMPI?
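
(One stopgap we have considered, completely untested, would be a small srun
wrapper placed ahead of the real binary in PATH that rewrites the removed
spelling before handing off; shown only as a sketch:)

    #!/bin/bash
    # Hypothetical wrapper: translate the removed --cpu_bind spelling to
    # --cpu-bind and exec the real srun (path assumed to be /usr/bin/srun).
    args=()
    for a in "$@"; do
        args+=("${a/--cpu_bind/--cpu-bind}")
    done
    exec /usr/bin/srun "${args[@]}"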

Levi Morrison

Jun 7, 2019, 10:16:23 AM
to slurm...@lists.schedmd.com
See Tim Wickberg's comment and patch from this morning
(https://bugs.schedmd.com/show_bug.cgi?id=7191#c7); especially:

> Some variant of this patch - albeit with a warning message added in
> to note that --cpu-bind is the correct spelling - will be in 19.05.1
> when released, and supported through the 19.05 release cycle.

Levi Morrison

Andrés Marín Díaz

Jun 10, 2019, 4:26:02 AM
to slurm...@lists.schedmd.com
Hello, I have already applied the patch and recompiled, and everything
works correctly.

Now we just have to wait for 19.05.1.

Thank you.