Slurm is great to use, and I've developed several plugins for it. I'm now working on an issue in Slurm.
I'm using Slurm 15.08-11. After I enabled cgroups, some tasks of my training jobs get killed after a few hours. I have reproduced this several times. With cgroups turned off, the problem disappears.
Linux kernel: 3.10.0-327.36.3.el7.x86_64
Slurm version: 15.08-11
srun: error ip-65: task 42: Killed
srun: Terminating job step 10346.0
slurmstepd: *** STEP 10346.0 ON ip-54 CANCELLED AT 2021-06-07T02:35:36 ***
srun: error: ip-65: tasks 40,46 Killed
srun: error: ip-65: tasks 45 Killed
srun: error: ip-57: tasks 19-21 Killed
$ sacct -j 10310646 --format=JobID,State,ExitCode,DerivedExitCode,start
JobID State ExitCode DerivedExitCode Start
------------ ---------- -------- --------------- -------------------
10310646 COMPLETED 0:9 0:0 2021-06-06T19:34:04
I only enabled ConstrainCores:
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=no
#ConstrainKmemSpace=no #avoid known Kernel issues
#ConstrainRAMSpace=yes
#AllowedRAMSpace=80
#ConstrainSwapSpace=yes
TaskAffinity=no #use task/affinity plugin instead
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
Could this be Slurm's or the OS's OOM killer?
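For reference, sacct reports ExitCode as exit status, a colon, then the signal that terminated the process, so the 0:9 above reads as signal 9 (SIGKILL). That is what an OOM kill delivers, but it does not by itself identify the sender. Below is a minimal Python sketch of that decoding; the function name decode_sacct_exit_code is hypothetical and not part of Slurm.

```python
import signal


def decode_sacct_exit_code(field):
    """Split a sacct ExitCode value such as '0:9' into (exit_status, signal_name).

    Hypothetical helper for illustration; not part of Slurm.
    signal_name is None when no signal is reported.
    """
    status_text, _, signal_text = field.strip().partition(":")
    status = int(status_text)
    signum = int(signal_text) if signal_text else 0
    if signum == 0:
        return status, None
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Signal number not known on this platform; report it numerically.
        name = "signal %d" % signum
    return status, name


print(decode_sacct_exit_code("0:9"))
```

The same helper can be applied to the DerivedExitCode column, which uses the same status:signal layout.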
I checked the logs on the worker nodes with grep -i 'killed process' /var/log/messages and grep -i 'oom' /var/log/messages, and found nothing.
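The two greps can also be folded into one pass that additionally looks for the "out of memory" wording. The following is only a sketch: the pattern list is my assumption about what the kernel's OOM killer typically logs, not an exhaustive list, and the names find_oom_lines and scan_log are hypothetical.

```python
import re

# Assumed OOM-killer signatures; extend as needed for a given kernel.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|out of memory|killed process",
    re.IGNORECASE,
)


def find_oom_lines(lines):
    """Return the lines (without trailing newline) that match an OOM signature."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path):
    """Scan one log file, e.g. /var/log/messages or saved dmesg output."""
    with open(path, errors="replace") as handle:
        return find_oom_lines(handle)
```

Calling scan_log("/var/log/messages") on each worker node returns an empty list when none of the patterns match, which corresponds to the empty grep results above.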
So, any clues about how to fix this?
PS: Upgrading Slurm is almost impossible. I'm familiar with the Slurm code, so I want to fix this in Slurm 15.08.