Hi,
I’m testing the Open MPI 5.0.0 pre-release with the Slurm/PMIx setup currently on the NERSC Perlmutter system.
First off, if I use the PRRTE launch system, I don’t see the issue I’m raising here.
But many NERSC users prefer to use the srun “native” launch method with applications compiled against Open MPI, hence this email.
The Slurm version on Perlmutter is currently 23.02.2.
The PMIx version the admins built Slurm against is 4.2.3. I’ve attached the output of pmix_info.
I’ve tested Open MPI 5.0.0rc11 (and the HEAD of the 5.0.x branch) both with the PMIx embedded in the Open MPI tarball and with an external PMIx 4.2.3 install.
I get the same results shown below whether my app is linked against the system PMIx or the embedded one.
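For reference, the two builds were configured roughly along these lines (the prefixes and the PMIx path are placeholders, not the exact paths used on Perlmutter):

# build using the PMIx embedded in the Open MPI tarball
./configure --prefix=$HOME/ompi-5.0.x-internal --with-pmix=internal
# build against the external PMIx install
./configure --prefix=$HOME/ompi-5.0.x-external --with-pmix=/path/to/pmix-4.2.3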
My test application “works,” but when I launch it with srun I get messages like these:
srun -n 2 -N 2 --mpi=pmix ./ring_c
[cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn315:1037721] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn315:1037721] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417
[cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417
After a lot of stracing and adding debug statements to the PMIx I have control over (the one embedded in the Open MPI tarball), I realized that these
messages are not coming from the app itself, but from some transient process between the srun/slurmd processes and the application processes.
The pids in these error messages are the parents of the MPI processes.
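A quick way to confirm that kind of parent/child relationship (a sketch of the idea, not exactly what I ran) is to have each task print its pid and ppid before exec’ing the app, then compare the ppid against the pids in the error messages:

srun -n 2 -N 2 --mpi=pmix sh -c 'echo "$(hostname) pid=$$ ppid=$PPID"; exec ./ring_c'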
I’ve tried various things, such as turning off the PMIx GDS shmem component, but that doesn’t help. I’ve also toggled the various SLURM_PMIX environment variables, to no effect (a sketch of the kinds of settings I mean is below).
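These are my understanding of the standard PMIx MCA and Slurm mpi/pmix knobs; the exact values varied between runs:

# restrict PMIx to the hash GDS component, i.e. no shmem-backed store
export PMIX_MCA_gds=hash
# Slurm mpi/pmix plugin settings
export SLURM_PMIX_DIRECT_CONN=false
export SLURM_PMIX_DIRECT_CONN_EARLY=false
srun -n 2 -N 2 --mpi=pmix ./ring_c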
This problem does not appear to be related to a recent Slurm/PMIx patch (https://bugs.schedmd.com/show_bug.cgi?id=16306#a0), and in any case that patch should already be in 23.02.2.
Another bit of info:
scontrol show config | grep -i pmix
PMIxCliTmpDirBase = (null)
PMIxCollFence = (null)
PMIxDebug = 0
PMIxDirectConn = yes
PMIxDirectConnEarly = no
PMIxDirectConnUCX = no
PMIxDirectSameArch = no
PMIxEnv = (null)
PMIxFenceBarrier = no
PMIxNetDevicesUCX = (null)
PMIxTimeout = 300
PMIxTlsUCX = (null)
Now, I myself don’t care too much about these messages.
But users may find them disconcerting, and they could also cause automated regression-testing frameworks to report lots of errors.
Should I ask NERSC to file a ticket with SchedMD? Or does someone know how to turn these messages off, if they are in fact unimportant? Or, better yet, does anyone know why a Slurm process might be emitting these errors and how to fix it?
Thanks for any ideas,
Howard
—