[slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120


Robert Kudyba

Nov 30, 2020, 12:52:59 PM
to Slurm User Community List
I've seen that this was a bug that was supposedly fixed (https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens occasionally: a user cancels his/her job and a node gets drained. UnkillableStepTimeout=120 is set in slurm.conf.

Slurm 20.02.3 on CentOS 7.9, running on Bright Cluster 8.2

Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED, ExitCode 0
Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
update_node: node node001 reason set to: Kill task failed
update_node: node node001 state set to DRAINING
error: slurmd error running JobId=6908 on node(s)=node001: Kill task failed

update_node: node node001 reason set to: hung
update_node: node node001 state set to DOWN
update_node: node node001 state set to IDLE
error: Nodes node001 not responding

scontrol show config | grep kill
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec

Do we just increase the timeout value? 
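
If increasing it is the way to go, I assume the change would just be something like the following in slurm.conf (the UnkillableStepProgram script path below is only a hypothetical example; we don't have one set):

# give slurmd more time to clean up the step before it declares "Kill task failed"
UnkillableStepTimeout=300
# optionally run a script when a step can't be killed, e.g. to log what is hung
# (hypothetical path)
UnkillableStepProgram=/etc/slurm/unkillable_step.sh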

Paul Edmon

Nov 30, 2020, 1:01:28 PM
to slurm...@lists.schedmd.com
That can help. Usually this happens because the storage the job is using is laggy and takes time to flush the job's data. So making sure that your storage is up, responsive, and stable will also cut these down.
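
A quick way to confirm the storage theory is to look for processes stuck in
uninterruptible sleep (D state) on the drained node while the kill is pending,
something along these lines:

# processes in D state are usually blocked on I/O; wchan shows what they are waiting on
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'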

-Paul Edmon-

Robert Kudyba

Nov 30, 2020, 1:48:09 PM
to Slurm User Community List
Sure, I've seen that in some of the posts here, e.g., a NAS. But in this case it's an NFS share to the local RAID10 storage. Aren't there any other settings that deal with this so a node doesn't get drained?

On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon <ped...@cfa.harvard.edu> wrote:
That can help. Usually this happens because the storage the job is using is laggy and takes time to flush the job's data. So making sure that your storage is up, responsive, and stable will also cut these down.

-Paul Edmon-

On 11/30/2020 12:52 PM, Robert Kudyba wrote:
> I've seen that this was a bug that was supposedly fixed

Alex Chekholko

Nov 30, 2020, 1:55:00 PM
to Slurm User Community List
This may be more "cargo cult," but I've advised users to add a "sleep 60" to the end of their job scripts if they are I/O-intensive. Sometimes they somehow manage to generate I/O in a way that makes Slurm think the job is finished while the OS is still catching up on the I/O, and then Slurm tries to kill the job...
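
Roughly something like this at the bottom of the batch script (the srun line is just a placeholder for the real work):

#!/bin/bash
#SBATCH --job-name=io_heavy
srun ./my_io_heavy_job     # placeholder for the actual workload
sleep 60                   # let the filesystem catch up before the job ends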

William Markuske

Dec 1, 2020, 12:24:21 PM
to slurm...@lists.schedmd.com

Hello Robert,

I've been having the same issue with BCM: CentOS 8.2, BCM 9.0, Slurm 20.02.3. It seems to have started when I enabled proctrack/cgroup and changed select/linear to select/cons_tres.
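
Roughly speaking, the change amounted to lines like these in slurm.conf (the SelectTypeParameters value is site-specific and just an example here):

ProctrackType=proctrack/cgroup
SelectType=select/cons_tres
# site-specific; CR_CPU, CR_Core, CR_Core_Memory, etc. are all possible
SelectTypeParameters=CR_Core_Memory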

Are you using cgroup process tracking and have you manipulated the cgroup.conf file? Do jobs complete correctly when not cancelled?

Regards,

Willy Markuske

HPC Systems Engineer

Research Data Services

P: (858) 246-5593


Robert Kudyba

Dec 2, 2020, 10:19:44 AM
to Slurm User Community List

been having the same issue with BCM: CentOS 8.2, BCM 9.0, Slurm 20.02.3. It seems to have started when I enabled proctrack/cgroup and changed select/linear to select/cons_tres.

Our slurm.conf has the same setting:
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
EnforcePartLimits=YES

We enabled MPS too. Not sure if that's relevant.
 

Are you using cgroup process tracking and have you manipulated the cgroup.conf file?

Here's what we have in ours: 
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=no
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
TaskAffinity=no
ConstrainCores=no
ConstrainRAMSpace=no
ConstrainSwapSpace=no
ConstrainDevices=no
ConstrainKmemSpace=yes
AllowedRamSpace=100
AllowedSwapSpace=0
MinKmemSpace=30
MaxKmemPercent=100
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

  Do jobs complete correctly when not cancelled?  

Yes, they do, and cancelling a job doesn't always result in a node draining.

So would this be a Slurm issue or a Bright issue? For now I'm telling users to add 'sleep 60' as the last line in their sbatch files.
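
In the meantime, I assume the drained node can just be returned to service by hand once whatever was hung clears, something like:

scontrol update NodeName=node001 State=RESUME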