Hello all:
We upgraded from 20.11.8 to 21.08.5 (CentOS 7.9, Slurm built without
pmix support) recently. After that, we found that in many cases,
'mpirun' was causing multi-node MPI jobs to have all MPI ranks within
a node run on the same core. We've moved on to 'srun'.
Now we see a problem in which the OOM killer is in some cases
predictably killing job steps that don't seem to deserve it. In some
cases these are job scripts and input files which ran fine before our
Slurm upgrade. More details follow, but that's the issue in a
nutshell.
Other than the version, our one Slurm config change was to remove the
deprecated 'TaskPluginParam=Sched' from slurm.conf, giving it its
default 'null' value. Our TaskPlugin remains
'task/affinity,task/cgroup'.
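For reference, the relevant slurm.conf lines now look like this
(excerpt only):

    # slurm.conf (excerpt)
    TaskPlugin=task/affinity,task/cgroup
    # TaskPluginParam=Sched   <-- removed; now left at its default (null)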
We've had apparently correct cgroup-based mem limit enforcement in
place for a long time, so the OOM-killing of the jobs I’m referencing is a
change in behavior.
Below are some of our support team's findings. I haven't finished
trying to correlate the anomalous job events with specific OOM
complaints, or recorded job resource usage at those times. I'm just
throwing out this message in case what we've seen so far, or the
painfully obvious thing I’m missing, looks familiar to anyone. Thanks!
Application: VASP 6.1.2 launched with srun
MPI libraries: intel/2019b
Observations:
Test 1. QDR-fabric Intel nodes (20 nodes x 10 cores/node)
  outcome: job failed right away, no output generated
  error text: 20 occurrences of the form "[13:ra8-10] unexpected
  reject event from 9:ra8-9"
Test 2. EDR-fabric Intel nodes (20 nodes x 10 cores/node)
  outcome: job ran for 12 minutes, generated some output data that
  look fine
  error text: no error messages; the job just failed
Test 3. AMD Rome nodes (20 nodes x 10 cores/node)
  outcome: job completed successfully after 31 minutes; user
  confirmed the results are fine
Application: Quantum Espresso 6.5 launched with srun
MPI libraries: intel/2019b
Observations:
- Works correctly when using: 1 node x 64 cores (64 MPI processes) or
  1 node x 128 cores (128 MPI processes) (other QE parameters:
  -nk 1 -nt 4, mem-per-cpu=1500mb; see the sketch after this list)
- A few processes get OOM-killed after a while when using: 4 nodes x
  32 cores (128 MPI processes) or 4 nodes x 64 cores (256 MPI
  processes)
- Job fails within seconds when using: 16 nodes x 8 cores
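For context, the single-node "works" case was launched with a job
script along these lines (a minimal sketch from memory; the executable
name pw.x, the input/output file names, and the module names are
placeholders, not a verbatim copy of the user's script):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=64
    #SBATCH --mem-per-cpu=1500mb
    module load intel/2019b QuantumESPRESSO/6.5   # module names assumed
    # one MPI rank per Slurm task; QE's own parallelization flags as above
    srun pw.x -nk 1 -nt 4 -input qe.in > qe.out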
--
Paul Brunk, system administrator
Georgia Advanced Resource Computing Center
Enterprise IT Svcs, the University of Georgia
I'm not sure if this is the case, but it might help to keep in mind
the difference between mpirun and srun.
With srun you let Slurm create the tasks with the appropriate
memory/CPU/etc. limits, and the MPI ranks run directly in those tasks.
With mpirun you usually let your MPI distribution start one task per
node, which spawns the MPI process manager, which in turn starts the
actual MPI program.
You might very well end up with different memory limits per process,
which could be the cause of your OOM issue, especially if not all MPI
ranks use the same amount of memory.
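As a concrete illustration (a sketch only; the node/task counts and
./my_mpi_app are placeholders, not your actual job):

    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=32
    #SBATCH --mem-per-cpu=1500mb

    # Case A: Slurm starts every MPI rank as its own task, each within
    # the limits derived from the job request.
    srun ./my_mpi_app

    # Case B: the MPI launcher typically starts one task per node and
    # spawns the ranks under it, so the effective per-process limits
    # can differ from what you asked Slurm for.
    # mpirun -np 128 ./my_mpi_app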
Ward
We also noticed the same thing with 21.08.5. In the 21.08 series
SchedMD changed the way they handle cgroups to set the stage for
cgroups v2 (see: https://slurm.schedmd.com/SLUG21/Roadmap.pdf).
21.08.5 introduced a bug fix which then caused mpirun not to pin
properly (particularly for older versions of MPI):
https://github.com/SchedMD/slurm/blob/slurm-21-08-5-1/NEWS
What we've recommended to users who have hit this is to swap over to
using srun instead of mpirun, and the situation clears up.
-Paul Edmon-
Hi:
Thanks for your feedback guys :).
We continue to find srun behaving properly re: core placement.
BTW, we've further established that only MVAPICH (and therefore also Intel MPI) jobs are encountering the OOM issue.
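For anyone curious, the quick check we've been using to eyeball core
placement is just something like this (node/task counts arbitrary):

    srun --nodes=2 --ntasks-per-node=4 \
        bash -c 'echo "$(hostname) rank=$SLURM_PROCID $(taskset -cp $$)"'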
--
Paul Brunk, system administrator
Georgia Advanced Resource Computing Center
Enterprise IT Svcs, the University of Georgia