[slurm-users] Node in drain state


Gestió Servidors via slurm-users

Sep 16, 2025, 1:40:52 AM
to slurm...@lists.schedmd.com

Hello,

I have a node in “drain” state after a job that was running on it finished. The slurmd log on the node reports:

[...]

[2025-09-07T11:09:26.980] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 59238

[2025-09-07T11:09:26.980] task/affinity: batch_bind: job 59238 CPU input mask for node: 0xFFF

[2025-09-07T11:09:26.980] task/affinity: batch_bind: job 59238 CPU final HW mask for node: 0xFFF

[2025-09-07T11:09:26.980] Launching batch job 59238 for UID 21310

[2025-09-07T11:09:27.006] cred/munge: init: Munge credential signature plugin loaded

[2025-09-07T11:09:27.007] [59238.batch] debug:  auth/munge: init: loaded

[2025-09-07T11:09:27.009] [59238.batch] debug:  Reading cgroup.conf file /soft/slurm-23.11.0/etc/cgroup.conf

[2025-09-07T11:09:27.025] [59238.batch] debug:  cgroup/v1: init: Cgroup v1 plugin loaded

[2025-09-07T11:09:27.025] [59238.batch] debug:  hash/k12: init: init: KangarooTwelve hash plugin loaded

[2025-09-07T11:09:27.026] [59238.batch] debug:  task/cgroup: init: core enforcement enabled

[2025-09-07T11:09:27.026] [59238.batch] debug:  task/cgroup: init: device enforcement enabled

[2025-09-07T11:09:27.026] [59238.batch] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded

[2025-09-07T11:09:27.026] [59238.batch] task/affinity: init: task affinity plugin loaded with CPU mask 0xfff

[2025-09-07T11:09:27.027] [59238.batch] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded

[2025-09-07T11:09:27.027] [59238.batch] topology/default: init: topology Default plugin loaded

[2025-09-07T11:09:27.030] [59238.batch] debug:  gpu/generic: init: init: GPU Generic plugin loaded

[2025-09-07T11:09:27.031] [59238.batch] debug:  laying out the 12 tasks on 1 hosts clus09 dist 2

[2025-09-07T11:09:27.031] [59238.batch] debug:  close_slurmd_conn: sending 0: No error

[2025-09-07T11:09:27.031] [59238.batch] debug:  Message thread started pid = 910040

[2025-09-07T11:09:27.031] [59238.batch] debug:  Setting slurmstepd(910040) oom_score_adj to -1000

[2025-09-07T11:09:27.031] [59238.batch] debug:  spank: opening plugin stack /soft/slurm-23.11.0/etc/plugstack.conf

[2025-09-07T11:09:27.031] [59238.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-11'

[2025-09-07T11:09:27.031] [59238.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-11'

[2025-09-07T11:09:27.031] [59238.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-11'

[2025-09-07T11:09:27.031] [59238.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-11'

[2025-09-07T11:09:27.090] [59238.batch] debug levels are stderr='error', logfile='debug', syslog='fatal'

[2025-09-07T11:09:27.090] [59238.batch] starting 1 tasks

[2025-09-07T11:09:27.090] [59238.batch] task 0 (910044) started 2025-09-07T11:09:27

[2025-09-07T11:09:27.098] [59238.batch] debug:  task/affinity: task_p_pre_launch: affinity StepId=59238.batch, task:0 bind:mask_cpu

[2025-09-07T11:09:27.098] [59238.batch] _set_limit: RLIMIT_NPROC  : reducing req:255366 to max:159631

[2025-09-07T11:09:27.398] [59238.batch] task 0 (910044) exited with exit code 2.

[2025-09-07T11:09:27.399] [59238.batch] debug:  task/affinity: task_p_post_term: affinity StepId=59238.batch, task 0

[2025-09-07T11:09:27.399] [59238.batch] debug:  signaling condition

[2025-09-07T11:09:27.399] [59238.batch] debug:  jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded

[2025-09-07T11:09:27.400] [59238.batch] debug:  task/cgroup: fini: Tasks containment cgroup plugin unloaded

[2025-09-07T11:09:27.400] [59238.batch] job 59238 completed with slurm_rc = 0, job_rc = 512

[2025-09-07T11:09:27.410] [59238.batch] debug:  Message thread exited

[2025-09-07T11:09:27.410] [59238.batch] stepd_cleanup: done with step (rc[0x200]:Unknown error 512, cleanup_rc[0x0]:No error)

[2025-09-07T11:09:27.411] debug:  _rpc_terminate_job: uid = 1000 JobId=59238

[2025-09-07T11:09:27.411] debug:  credential for job 59238 revoked

[...]

“sinfo” shows:

[root@login-node ~]# sinfo

    PARTITION     TIMELIMIT      AVAIL      STATE NODELIST                                 CPU_LOAD   NODES(A/I) NODES(A/I/O/T)       CPUS  CPUS(A/I/O/T) REASON                

      node.q*       4:00:00         up    drained clus09                                   0.00              0/0        0/0/1/1         12      0/0/12/12 Kill task faile

      node.q*       4:00:00         up  allocated clus[10-11]                              13.82-15.8        2/0        2/0/0/2         12      24/0/0/24 none                  

      node.q*       4:00:00         up       idle clus[01-06,12]                           0.00              0/7        0/7/0/7         12      0/84/0/84 none                  

 

 

But there seems to be no error on the node... slurmctld.log on the server looks correct, too.

Is there any way to reset the node to “state=idle” after errors like this?

Thanks.

Ole Holm Nielsen via slurm-users

Sep 16, 2025, 3:06:59 AM
to slurm...@lists.schedmd.com
On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
> [root@login-node ~]# sinfo
>
>     PARTITION     TIMELIMIT      AVAIL      STATE NODELIST                                 CPU_LOAD   NODES(A/I) NODES(A/I/O/T)       CPUS  CPUS(A/I/O/T) REASON
>
>       node.q*       4:00:00         up    drained clus09                                   0.00              0/0        0/0/1/1         12      0/0/12/12 Kill task faile

The *Kill task failed* reason is due to the UnkillableStepTimeout [1]
configuration:

> The length of time, in seconds, that Slurm will wait before deciding that processes in a job step are unkillable (after they have been signaled with SIGKILL) and execute UnkillableStepProgram. The default timeout value is 60 seconds or five times the value of MessageTimeout, whichever is greater. If exceeded, the compute node will be drained to prevent future jobs from being scheduled on the node.
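
The values currently in effect on the cluster can be checked with, for example:

$ scontrol show config | grep -E 'UnkillableStepTimeout|MessageTimeout'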


> But there seems to be no error on the node... slurmctld.log on the server
> looks correct, too.

The slurmctld won't have any errors. The node has errors due to
UnkillableStepTimeout and therefore Slurm has drained it.
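
You can also see the full drain reason recorded on the node itself, for example:

$ scontrol show node clus09 | grep -iE 'State=|Reason='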

> Is there any way to reset the node to “state=idle” after errors like this?

First you have to investigate if the jobid's user has any processes left
behind on the compute node. It may very well be stale I/O from the job to
a network file server.
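
For example, something along these lines from the login node (substitute the actual user of JobId 59238):

$ ssh clus09 'ps -o pid,stat,etime,cmd -u <username>'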

It may also happen that the I/O was actually completed *after* Slurm
drained the node, and all user processes have completed. In this case you
may simply "resume" the node xxx:

$ scontrol update nodename=xxx state=resume

However, if stale user processes continue to exist, your only choice is to
reboot the node and tell Slurm to resume node xxx:

$ scontrol reboot asap nextstate=resume reason="Kill task failed" xxx

IHTH,
Ole

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepTimeout


Gestió Servidors via slurm-users

Sep 18, 2025, 6:14:45 AM
to slurm...@lists.schedmd.com

Hi,

After reading Ole Holm Nielsen's answer, I have increased “MessageTimeout” to 20s (the default is 5s) and “UnkillableStepTimeout” to 150s (the default is 60s, or five times “MessageTimeout”, whichever is greater). However, I have also read that “UnkillableStepProgram” specifies a program to run in these cases, but by default no program is assigned to that parameter. So my question is: does anyone use a customized “UnkillableStepProgram”, and could they explain how?
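
For reference, the relevant slurm.conf lines now look like this:

MessageTimeout=20
UnkillableStepTimeout=150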

 

Thanks a lot!

 

Lorenzo Bosio via slurm-users

Sep 18, 2025, 6:41:13 AM
to slurm...@lists.schedmd.com
Hello,

As an example, my UnkillableStepProgram is just a bash script that collects recent logs and processes and mails me about the error. Nothing special.
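
Roughly along these lines (a minimal sketch rather than my exact script; the mail address and slurmd log path are placeholders, and whether SLURM_JOB_ID is exported to the program depends on your Slurm version):

#!/bin/bash
# Minimal UnkillableStepProgram sketch: collect context and mail it to the admin.
# Pointed to from slurm.conf, e.g. UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh
ADMIN="hpc-admin@example.com"             # placeholder address
{
  echo "Unkillable step reported on $(hostname) at $(date)"
  echo "Job: ${SLURM_JOB_ID:-unknown}"    # may not be set in all Slurm versions
  echo
  echo "--- current processes ---"
  ps auxw
  echo
  echo "--- recent slurmd log ---"
  tail -n 100 /var/log/slurmd.log         # adjust to your slurmd log location
} | mail -s "Unkillable step on $(hostname)" "$ADMIN"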

Best regards,
--
Lorenzo Bosio
Tecnico di Ricerca - Laboratorio HPC4AI
Dipartimento di Informatica


Università degli Studi di Torino
Corso Svizzera, 185 - 10149 Torino

Ole Holm Nielsen via slurm-users

Sep 19, 2025, 3:04:23 AM
to slurm...@lists.schedmd.com
On 9/18/25 12:39, Lorenzo Bosio via slurm-users wrote:
> as an example, my UnkillableStepProgram is just a bash script collecting
> recent logs and processes and mailing me about the error. Nothing special.

We use Slurm "triggers" to get alerts from many different types of events, see
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers

Relevant here is the "notify_nodes_drained" trigger script for the node drained state.
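
Such a script is registered with the strigger command, for example along these lines (the program path is just an illustration; see strigger(1) for the exact options in your Slurm version):

$ strigger --set --drained --program=/usr/local/bin/notify_nodes_drained --flags=PERM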

We don't use an UnkillableStepProgram. In my experience the *Kill task
failed* events discussed earlier in this thread require a manual
examination of why the job failed to die, and I think it will be hard to
write a script to examine all kinds of possible errors.

The most common scenario is stale I/O from the job to a network file
server, and I described in a previous post how we deal with this.

BTW we use this parameter: UnkillableStepTimeout = 180 sec

> On Thu, 18 Sep 2025 at 12:22, Gestió Servidors via slurm-users
> <slurm...@lists.schedmd.com <mailto:slurm...@lists.schedmd.com>> wrote:
>
>     After reading Ole Holm Nielsen's answer, I have increased
>     “MessageTimeout” to 20s (the default is 5s) and “UnkillableStepTimeout”
>     to 150s (the default is 60s, or five times “MessageTimeout”, whichever
>     is greater). However, I have also read that “UnkillableStepProgram”
>     specifies a program to run in these cases, but by default no program
>     is assigned to that parameter. So my question is: does anyone use a
>     customized “UnkillableStepProgram”, and could they explain how?

IHTH,
Ole

Ole Holm Nielsen via slurm-users

Sep 19, 2025, 2:10:18 PM
to slurm...@lists.schedmd.com
> On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
>> Is there any way to reset the node to “state=idle” after errors like this?
>
> First you have to investigate if the jobid's user has any processes left
> behind on the compute node.  It may very well be stale I/O from the job
> to a network file server.
>
> It may also happen that the I/O was actually completed *after* Slurm
> drained the node, and all user processes have completed.  In this case
> you may simply "resume" the node xxx:
>
> $ scontrol update nodename=xxx state=resume
>
> However, if stale user processes continue to exist, your only choice is
> to reboot the node and tell Slurm to resume node xxx:
>
> $ scontrol reboot asap nextstate=resume reason="Kill task failed" xxx

We just now had a "Kill task failed" event on a node which caused it to
drain, and Slurm Triggers then sent an E-mail alert to the sysadmin.

Logging in to the node I found a user process left behind after the
Slurm job had been killed:

$ ps auxw | sed /root/d
USER       PID %CPU %MEM      VSZ      RSS TTY STAT START    TIME COMMAND
username 29160 97.4  1.3 13770416 10415916 ?   D    Sep17 2926:25 /home/username/...

As you can see, the process state is "D". According to the "ps" manual
D means "uninterruptible sleep (usually IO)".
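
A quick way to list only such processes on a node is, for example:

$ ps -eo pid,user,stat,cmd | awk '$3 ~ /^D/'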

In this case the only possible fix is to reboot the node, thereby
forcibly terminating the frozen I/O on the network file server.

IHTH,
Ole

Patrick Begou via slurm-users

Sep 22, 2025, 1:41:35 AM
to slurm...@lists.schedmd.com
Hi,

I have also seen a node reach this "drain" state twice in the last few weeks. It is the first time on this cluster (Slurm 24.05 on the latest setup), and I have been running Slurm for many years (Slurm 20.11 on the oldest cluster). No user processes were found, so I just resumed the node.

Patrick

Ole Holm Nielsen via slurm-users

Sep 22, 2025, 2:20:13 AM
to slurm...@lists.schedmd.com
Hi Patrick,

On 9/22/25 07:39, Patrick Begou via slurm-users wrote:
> I have also seen a node reach this "drain" state twice in the last few weeks.
> It is the first time on this cluster (Slurm 24.05 on the latest setup), and I
> have been running Slurm for many years (Slurm 20.11 on the oldest cluster).
> No user processes were found, so I just resumed the node.

This may happen when the job's I/O takes too long, so the
UnkillableStepTimeout gets exceeded, but the I/O later actually
completes and the user's processes ultimately exit.

It is informative to ask Slurm for any events on the affected nodes by
using the sacctmgr command:

$ sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=<nodenames>

This will show you why the node became "drained".

By default the event period starts at 00:00:00 of the previous day, but
this can be changed with the Start= option.
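
For example, to look further back at the node from this thread (the date is just an illustration):

$ sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=clus09 Start=2025-09-01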

Best regards,
Ole