Hello all:
We upgraded from 20.11.8 to 21.08.5 (CentOS 7.9, Slurm built without
pmix support) recently. After that, we found that in many cases,
'mpirun' was causing multi-node MPI jobs to have all MPI ranks within
a node run on the same core. We've moved on to 'srun'.
Now we see a problem in which the OOM killer is in some cases
predictably killing job steps that don't seem to deserve it. In some
cases these are job scripts and input files which ran fine before our
Slurm upgrade. More details follow, but that's the issue in a
nutshell.
Other than the version, our one Slurm config change was to remove the
deprecated 'TaskPluginParam=Sched' from slurm.conf, giving it its
default 'null' value. Our TaskPlugin remains
'task/affinity,task/cgroup'.
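For reference, the relevant slurm.conf lines now look like this
(excerpt only):

    # slurm.conf (excerpt)
    TaskPlugin=task/affinity,task/cgroup
    # TaskPluginParam=Sched   <-- removed; now left at its default (null)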
We've had apparently correct cgroup-based mem limit enforcement in
place for a long time, so the OOM-killing of the jobs I’m referencing is a
change in behavior.
Below are some of our support team's findings. I haven't finished
trying to correlate the anomalous job events with specific OOM
complaints, or recorded job resource usage at those times. I'm just
throwing out this message in case what we've seen so far, or the
painfully obvious thing I’m missing, looks familiar to anyone. Thanks!
Application: VASP 6.1.2 launched with srun
MPI libraries: intel/2019b
Observations:
Test 1. QDR-fabric Intel nodes (20 nodes x 10 cores/node)
  outcome: job failed right away, no output generated
  error text: 20 occurrences of the form "[13:ra8-10] unexpected
  reject event from 9:ra8-9"
Test 2. EDR-fabric Intel nodes (20 nodes x 10 cores/node)
  outcome: job ran for 12 minutes, generated some output data that
  look fine
  error text: no error messages; the job just failed
Test 3. AMD Rome nodes (20 nodes x 10 cores/node)
  outcome: job completed successfully after 31 minutes; user
  confirmed the results are fine
Application: Quantum Espresso 6.5 launched with srun
MPI libraries: intel/2019b
Observations:
- Works correctly when using: 1 node x 64 cores (64 MPI processes) or
  1 node x 128 cores (128 MPI processes) (other QE parameters:
  -nk 1 -nt 4, mem-per-cpu=1500mb; see the sketch after this list)
- A few processes get OOM-killed after a while when using: 4 nodes x
  32 cores (128 MPI processes) or 4 nodes x 64 cores (256 MPI
  processes)
- Job fails within seconds when using: 16 nodes x 8 cores
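For context, the single-node "works" case was launched with a job
script along these lines (a minimal sketch from memory; the executable
name pw.x, the input/output file names, and the module names are
placeholders, not a verbatim copy of the user's script):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=64
    #SBATCH --mem-per-cpu=1500mb
    module load intel/2019b QuantumESPRESSO/6.5   # module names assumed
    # one MPI rank per Slurm task; QE's own parallelization flags as above
    srun pw.x -nk 1 -nt 4 -input qe.in > qe.out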
--
Paul Brunk, system administrator
Georgia Advanced Resource Computing Center
Enterprise IT Svcs, the University of Georgia
I'm not sure if this is the case, but it might help to keep in mind
the difference between mpirun and srun.
With srun you let Slurm create the tasks with the appropriate
memory/CPU/etc. limits, and the MPI ranks run directly in those tasks.
With mpirun you usually let your MPI distribution start one task per
node, which spawns the MPI process manager, which in turn starts the
actual MPI program.
You might very well end up with different memory limits per process,
which could be the cause of your OOM issue, especially if not all MPI
ranks use the same amount of memory.
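As a concrete illustration (a sketch only; the node/task counts and
./my_mpi_app are placeholders, not your actual job):

    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=32
    #SBATCH --mem-per-cpu=1500mb

    # Case A: Slurm starts every MPI rank as its own task, each within
    # the limits derived from the job request.
    srun ./my_mpi_app

    # Case B: the MPI launcher typically starts one task per node and
    # spawns the ranks under it, so the effective per-process limits
    # can differ from what you asked Slurm for.
    # mpirun -np 128 ./my_mpi_app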
Ward
We also noticed the same thing with 21.08.5. In the 21.08 series
SchedMD changed the way they handle cgroups to set the stage for
cgroups v2 (see: https://slurm.schedmd.com/SLUG21/Roadmap.pdf).
21.08.5 introduced a bug fix which then caused mpirun not to pin
properly (particularly for older versions of MPI):
https://github.com/SchedMD/slurm/blob/slurm-21-08-5-1/NEWS
What we've recommended to users who have hit this is to swap over to
using srun instead of mpirun, and the situation clears up.
-Paul Edmon-
Hi:
Thanks for your feedback guys :).
We continue to find srun behaving properly re: core placement.
BTW, we've further established that only MVAPICH (and therefore also Intel MPI) jobs are encountering the OOM issue.
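For anyone curious, the quick check we've been using to eyeball core
placement is just something like this (node/task counts arbitrary):

    srun --nodes=2 --ntasks-per-node=4 \
        bash -c 'echo "$(hostname) rank=$SLURM_PROCID $(taskset -cp $$)"'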
--
Paul Brunk, system administrator
Georgia Advanced Resource Computing Center
Enterprise IT Svcs, the University of Georgia