[slurm-users] ML training task killed (SIGKILL) when cgroup CPU limit enabled in Slurm 15.08


Jack Chen

Jul 2, 2021, 1:35:01 AM
to slurm...@lists.schedmd.com

Slurm is great to use, and I've developed several plugins for it. I'm now trying to track down an issue.

I'm using Slurm 15.08-11. After I enabled cgroup, some tasks of our training jobs get killed after a few hours. I have reproduced this several times. After turning cgroup off, the problem disappears.

Linux kernel: 3.10.0-327.36.3.el7.x86_64

Slurm version: 15.08-11

Example log from a killed job:

srun: error ip-65: task 42: Killed
srun: Terminating job step 10346.0
slurmstepd: *** STEP 10346.0 ON ip-54 CANCELLED AT 2021-06-07T02:35:36 ***
srun: error: ip-65: tasks 40,46 Killed
srun: error: ip-65: tasks 45 Killed
srun: error: ip-57: tasks 19-21 Killed

sacct output:

$ sacct -j 10310646 --format=JobID,State,ExitCode,DerivedExitCode,start
       JobID      State ExitCode DerivedExitCode               Start
------------ ---------- -------- --------------- -------------------
10310646      COMPLETED      0:9             0:0 2021-06-06T19:34:04
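For reference, the ExitCode column in sacct has the form exit_code:signal, so 0:9 here means a task was terminated by signal 9 (SIGKILL) rather than exiting with a non-zero status. A minimal sketch of decoding that field follows; decode_exit_code is a hypothetical helper for illustration, not something shipped with Slurm.

```python
import signal


def decode_exit_code(field):
    """Split a sacct ExitCode field ("exit_code:signal") into its parts.

    Hypothetical helper, not part of Slurm.
    Returns (exit_status, signal_number, signal_name_or_None).
    """
    exit_str, _, sig_str = field.strip().partition(":")
    exit_status = int(exit_str)
    sig_num = int(sig_str) if sig_str else 0
    if sig_num == 0:
        return exit_status, 0, None
    try:
        sig_name = signal.Signals(sig_num).name
    except ValueError:
        # Signal number not defined on this platform.
        sig_name = None
    return exit_status, sig_num, sig_name
```

Calling decode_exit_code("0:9") on a Linux node gives the exit status, the signal number, and the symbolic signal name for the job above.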

cgroup.conf:

I only enabled ConstrainCores:

AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=no
#ConstrainKmemSpace=no #avoid known Kernel issues
#ConstrainRAMSpace=yes
#AllowedRAMSpace=80
#ConstrainSwapSpace=yes
TaskAffinity=no #use task/affinity plugin instead

Changes in slurm.conf to enable the cgroup CPU constraint:

 ProctrackType=proctrack/cgroup
 TaskPlugin=task/cgroup,task/affinity

Could the tasks be killed by Slurm itself, or by the OS's OOM killer?

I checked the system logs on the worker nodes:

grep -i 'killed process' /var/log/messages
grep -i 'oom' /var/log/messages

and found nothing.
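The same check can be run in one pass over a log file. The sketch below is only an illustration: the pattern list is an assumption about what the kernel typically logs when the OOM killer fires ("invoked oom-killer", "Out of memory", "Killed process"), and find_oom_lines is a hypothetical helper name.

```python
import re

# Assumed substrings of typical kernel OOM-killer messages; matched
# case-insensitively, like the grep -i commands above.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|out of memory|killed process",
    re.IGNORECASE,
)


def find_oom_lines(lines):
    """Return the lines that look like OOM-killer activity.

    `lines` is any iterable of strings, for example an open log file.
    Hypothetical helper for illustration only.
    """
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

It could be used as find_oom_lines(open("/var/log/messages", errors="replace")). An empty result only shows that nothing matching these patterns was logged to that file. It does not rule out other sources of SIGKILL.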

Any clues about how to fix this?


PS: Upgrading Slurm is almost impossible for us. I'm familiar with the Slurm code, so I would like to fix this in Slurm 15.08.

Ole Holm Nielsen

Jul 2, 2021, 2:11:02 AM
to slurm...@lists.schedmd.com
On 7/2/21 7:34 AM, Jack Chen wrote:
> Slurm is great to use, I've developed several plugins on it. Now I'm
> working on an issue in slurm.
>
> I'm using Slurm 15.08-11, after I enabled cgroup, some training job's task
> is killed after a few hours. This can be reproduced several times. After
> turning off cgroup, it disappears.
>
> Linux kernel: 3.10.0-327.36.3.el7.x86_64
>
> Slurm version: 15.08-11

For Cgroups support I believe you need to upgrade to a much more recent
Slurm version!! Probably Slurm 17.02.5 or later, see
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#cgroup-configuration

> PS: upgrading the slurm version is almost impossible. I'm familiar with
> slurm code, so I want to fix it in slurm 15.08

IMHO, you will suffer many problems if you stick with this old 15.08
release. It is definitely feasible to upgrade Slurm, although you have to
take great care with the database upgrade if upgrading from 17.02 or
older. Upgrading between recent versions is quite straightforward, but it
is imperative that you upgrade by at most 2 versions at a time!
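To illustrate what the two-versions-at-a-time rule implies when starting from 15.08, here is a rough sketch. The release list is the sequence of Slurm major releases up to 20.11 as far as I know it, and plan_upgrade is a hypothetical helper, so please verify the path against the official upgrade notes before acting on it.

```python
# Assumed sequence of Slurm major releases, oldest first, up to 20.11.
RELEASES = ["15.08", "16.05", "17.02", "17.11",
            "18.08", "19.05", "20.02", "20.11"]


def plan_upgrade(current, target, max_hop=2, releases=RELEASES):
    """Return a list of releases to step through, never jumping more
    than `max_hop` major releases at once.

    Hypothetical helper for illustration only.
    """
    i = releases.index(current)
    j = releases.index(target)
    if i > j:
        raise ValueError("target release is older than current release")
    path = [current]
    while i < j:
        i = min(i + max_hop, j)  # largest allowed jump, capped at target
        path.append(releases[i])
    return path
```

With these assumptions, plan_upgrade("15.08", "20.11") steps through 17.02, 18.08 and 20.02 before reaching 20.11, which means four separate upgrades.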

I have collected upgrading experience and documentation here:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

Best regards,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

Jack Chen

Jul 2, 2021, 6:40:20 AM
to Slurm User Community List
OK, thanks for your quick response. I will find a way to upgrade.