[slurm-users] Question about PMIX ERROR messages being emitted by some child of srun process


Pritchard Jr., Howard

unread,
May 19, 2023, 3:09:50 PM5/19/23
to Slurm User Community List

Hi,

 

So I’m testing the use of an Open MPI 5.0.0 pre-release with the Slurm/PMIx setup currently on the NERSC Perlmutter system.

First off, if I use the PRRTE launch system, I don’t see the issue I’m raising here.

 

But many NERSC users prefer to use the srun “native” launch method with applications compiled against Open MPI, hence this email.

 

The Slurm version on Perlmutter is currently 23.02.2.

 

The PMIx version that the admins used to build Slurm against is pmix-4.2.3. I’ve attached the output of pmix_info.

 

I’ve tested Open MPI 5.0.0rc11 (and the HEAD of 5.0.x) both with the PMIx embedded in the Open MPI tarball and with an external PMIx 4.2.3 install.

I get the same results below whether my app is linked against the system PMIx or the embedded one.
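For reference, the two builds were configured roughly like this (the prefixes and the PMIx install path are placeholders, not the actual Perlmutter paths):

```shell
# Hedged sketch of the two Open MPI 5.0.x builds tested; paths are assumptions.
# 1) Link against the external (system) PMIx 4.2.3 install:
./configure --prefix=$HOME/ompi-ext --with-pmix=/path/to/pmix-4.2.3
# 2) Use the PMIx embedded in the Open MPI tarball:
./configure --prefix=$HOME/ompi-int --with-pmix=internal
```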

 

My test application “works” but if I use srun, I get these types of messages:

 

srun -n 2 -N 2 --mpi=pmix ./ring_c

[cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750

[cn315:1037721] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750

[cn315:1037721] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750

[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268

[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624

[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417

[cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750

[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268

[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624

[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417

 

After a lot of stracing and adding debug statements to the PMIx I have control over (the one embedded in the Open MPI tarball), I realized that these
messages are not coming from the app, but from some transient process between the srun/slurmd processes and the application processes.
The PIDs in these error messages are the parents of the MPI processes.

 

I’ve tried various things like turning off the PMIx GDS shmem component, but that doesn’t help. I’ve also toggled the various SLURM_PMIX environment variables, to no effect.
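Concretely, the sorts of knobs I toggled look like this (the exact variables and values here are illustrative, not an exhaustive record of what I tried):

```shell
# Hedged sketch; the values shown are illustrative assumptions.
# Disable the PMIx shared-memory GDS component by forcing the hash component:
export PMIX_MCA_gds=hash
# Slurm's mpi/pmix plugin also reads per-job SLURM_PMIX_* overrides, e.g.:
export SLURM_PMIX_DIRECT_CONN=false
srun -n 2 -N 2 --mpi=pmix ./ring_c
```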

This problem does not appear to be related to a recent Slurm/PMIx patch (https://bugs.schedmd.com/show_bug.cgi?id=16306#a0); in any case, it looks like that patch should already be in 23.02.2.

 

Another bit of info:

 

scontrol show config | grep -i pmix

PMIxCliTmpDirBase       = (null)

PMIxCollFence           = (null)

PMIxDebug               = 0

PMIxDirectConn          = yes

PMIxDirectConnEarly     = no

PMIxDirectConnUCX       = no

PMIxDirectSameArch      = no

PMIxEnv                 = (null)

PMIxFenceBarrier        = no

PMIxNetDevicesUCX       = (null)

PMIxTimeout             = 300

PMIxTlsUCX              = (null)
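(These PMIx* settings come from Slurm’s mpi.conf, which was introduced in 23.02. If the admins were willing, a hypothetical mpi.conf tweak to get more logging out of the plugin would be something like the following; I have not tried this myself.)

```
# Hypothetical mpi.conf fragment (admin-side, untested):
PMIxDebug = 1
```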

 

Now I myself don’t care too much about these messages.

But users might find them disconcerting, and they may also cause automated regression-testing frameworks to report lots of errors.

 

Should I ask NERSC to file a ticket with SchedMD? Or does someone know how to turn these messages off (if in fact they are unimportant), or, better yet, why a Slurm process may be emitting these errors and how to fix it?

 

Thanks for any ideas,

 

Howard

 


 


Howard Pritchard

Research Scientist

HPC-ENV

 

Los Alamos National Laboratory

how...@lanl.gov

 


 

 

pmix_info.pmutter.txt

Tommi Tervo

unread,
May 22, 2023, 3:16:48 AM5/22/23
to Slurm User Community List
> So I’m testing the use of Open MPI 5.0.0 pre-release with the Slurm/PMIx setup
> currently on NERSC Perlmutter system.
>
> The SLURM version on Perlmutter is currently 2023.02.2
>
> The PMIx version that the admins used to build slurm against is pmix-4.2.3.
> I’ve attached the output of pmix_info.
>
> My test application “works” but if I use srun, I get these types of messages:
>
> srun -n 2 -N 2 --mpi=pmix ./ring_c
>
> [cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at
> line 750

Hi,

23.02.2 contains a PMIx permissions regression; it may be worth checking whether that's the cause here:

https://bugs.schedmd.com/show_bug.cgi?id=16687

commit 1f9386909230cd73506d88f02f75126924d3f41e
Author: Danny Auble <d...@schedmd.com>
Date: Mon May 15 18:35:25 2023 +0200

mpi/pmix - fix PMIx shmem backed files permissions regression.

Introduced in 23.02.2 commit d23cad68df.

Bug 16687


BR,
Tommi

Christopher Samuel

unread,
May 22, 2023, 5:32:23 PM5/22/23
to slurm...@lists.schedmd.com
Hi Tommi, Howard,

On 5/22/23 12:16 am, Tommi Tervo wrote:

> 23.02.2 contains a PMIx permissions regression; it may be worth checking whether that's the cause here?

I confirmed I could replicate the UNPACK-INADEQUATE-SPACE messages
Howard is seeing on a test system, so I tried that patch on that same
system; it made no difference. :-(

Looking at the PMIx code base, the messages appear to come from that code
(the triggers are in src/mca/bfrops/). I saw I could set
PMIX_DEBUG=verbose to get more info on the problem, but when I set that,
these messages go away entirely. :-/

Very odd.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA


Pritchard Jr., Howard

unread,
May 23, 2023, 1:34:23 PM5/23/23
to Slurm User Community List
Thanks Christopher,

This doesn't seem to be related to Open MPI at all, except that for our 5.0.0 and newer releases one has to use PMIx to talk to the job launcher.
I built MPICH 4.1 on Perlmutter using the --with-pmix option and see a similar message from srun --mpi=pmix:

hpp@nid008589:~/ompi/examples> (v5.0.x *)srun -u -n 2 --mpi=pmix ./hello_c
srun: Job 9369984 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for StepId=9369984.2
[nid008589:104119] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid008593:11389] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
Hello, world, I am 0 of 2, (MPICH Version: 4.1

I too noticed that if I set PMIX_DEBUG=1 the chatter from srun stops.
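That is, something along these lines (reusing my earlier test case; the value assigned to PMIX_DEBUG doesn't seem to matter):

```shell
# Workaround sketch: setting PMIX_DEBUG to any value appears to
# suppress the PMIX ERROR chatter from the srun-side processes.
PMIX_DEBUG=1 srun -u -n 2 --mpi=pmix ./hello_c
```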

Howard

Christopher Samuel

unread,
May 23, 2023, 3:59:54 PM5/23/23
to slurm...@lists.schedmd.com
On 5/23/23 10:33 am, Pritchard Jr., Howard wrote:

> Thanks Christopher,

No worries!

> This doesn't seem to be related to Open MPI at all, except that for our 5.0.0 and newer releases one has to use PMIx to talk to the job launcher.
> I built MPICH 4.1 on Perlmutter using the --with-pmix option and see a similar message from srun --mpi=pmix:

That's right, these messages are coming from PMIx code rather than MPI.

> I too noticed that if I set PMIX_DEBUG=1 the chatter from srun stops.

Yeah, it looks like setting PMIX_DEBUG to anything (I tried "hello")
stops these messages from being emitted.

Slurm RPMs with that patch will go on to Perlmutter in the Thursday
maintenance.

All the best,
Chris