Hello,
I have a node that went into “drain” state after a job that was running on it finished. The slurmd log on the node reports the following:
[...]
[2025-09-07T11:09:26.980] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 59238
[2025-09-07T11:09:26.980] task/affinity: batch_bind: job 59238 CPU input mask for node: 0xFFF
[2025-09-07T11:09:26.980] task/affinity: batch_bind: job 59238 CPU final HW mask for node: 0xFFF
[2025-09-07T11:09:26.980] Launching batch job 59238 for UID 21310
[2025-09-07T11:09:27.006] cred/munge: init: Munge credential signature plugin loaded
[2025-09-07T11:09:27.007] [59238.batch] debug: auth/munge: init: loaded
[2025-09-07T11:09:27.009] [59238.batch] debug: Reading cgroup.conf file /soft/slurm-23.11.0/etc/cgroup.conf
[2025-09-07T11:09:27.025] [59238.batch] debug: cgroup/v1: init: Cgroup v1 plugin loaded
[2025-09-07T11:09:27.025] [59238.batch] debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
[2025-09-07T11:09:27.026] [59238.batch] debug: task/cgroup: init: core enforcement enabled
[2025-09-07T11:09:27.026] [59238.batch] debug: task/cgroup: init: device enforcement enabled
[2025-09-07T11:09:27.026] [59238.batch] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
[2025-09-07T11:09:27.026] [59238.batch] task/affinity: init: task affinity plugin loaded with CPU mask 0xfff
[2025-09-07T11:09:27.027] [59238.batch] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2025-09-07T11:09:27.027] [59238.batch] topology/default: init: topology Default plugin loaded
[2025-09-07T11:09:27.030] [59238.batch] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2025-09-07T11:09:27.031] [59238.batch] debug: laying out the 12 tasks on 1 hosts clus09 dist 2
[2025-09-07T11:09:27.031] [59238.batch] debug: close_slurmd_conn: sending 0: No error
[2025-09-07T11:09:27.031] [59238.batch] debug: Message thread started pid = 910040
[2025-09-07T11:09:27.031] [59238.batch] debug: Setting slurmstepd(910040) oom_score_adj to -1000
[2025-09-07T11:09:27.031] [59238.batch] debug: spank: opening plugin stack /soft/slurm-23.11.0/etc/plugstack.conf
[2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-11'
[2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-11'
[2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-11'
[2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-11'
[2025-09-07T11:09:27.090] [59238.batch] debug levels are stderr='error', logfile='debug', syslog='fatal'
[2025-09-07T11:09:27.090] [59238.batch] starting 1 tasks
[2025-09-07T11:09:27.090] [59238.batch] task 0 (910044) started 2025-09-07T11:09:27
[2025-09-07T11:09:27.098] [59238.batch] debug: task/affinity: task_p_pre_launch: affinity StepId=59238.batch, task:0 bind:mask_cpu
[2025-09-07T11:09:27.098] [59238.batch] _set_limit: RLIMIT_NPROC : reducing req:255366 to max:159631
[2025-09-07T11:09:27.398] [59238.batch] task 0 (910044) exited with exit code 2.
[2025-09-07T11:09:27.399] [59238.batch] debug: task/affinity: task_p_post_term: affinity StepId=59238.batch, task 0
[2025-09-07T11:09:27.399] [59238.batch] debug: signaling condition
[2025-09-07T11:09:27.399] [59238.batch] debug: jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
[2025-09-07T11:09:27.400] [59238.batch] debug: task/cgroup: fini: Tasks containment cgroup plugin unloaded
[2025-09-07T11:09:27.400] [59238.batch] job 59238 completed with slurm_rc = 0, job_rc = 512
[2025-09-07T11:09:27.410] [59238.batch] debug: Message thread exited
[2025-09-07T11:09:27.410] [59238.batch] stepd_cleanup: done with step (rc[0x200]:Unknown error 512, cleanup_rc[0x0]:No error)
[2025-09-07T11:09:27.411] debug: _rpc_terminate_job: uid = 1000 JobId=59238
[2025-09-07T11:09:27.411] debug: credential for job 59238 revoked
[...]
“sinfo” shows:
[root@login-node ~]# sinfo
PARTITION TIMELIMIT AVAIL STATE NODELIST CPU_LOAD NODES(A/I) NODES(A/I/O/T) CPUS CPUS(A/I/O/T) REASON
node.q* 4:00:00 up drained clus09 0.00 0/0 0/0/1/1 12 0/0/12/12 Kill task faile
node.q* 4:00:00 up allocated clus[10-11] 13.82-15.8 2/0 2/0/0/2 12 24/0/0/24 none
node.q* 4:00:00 up idle clus[01-06,12] 0.00 0/7 0/7/0/7 12 0/84/0/84 none
But there does not seem to be any error on the node... The slurmctld.log on the server looks fine, too.
Is there any way to reset the node to “state=idle” automatically after errors like this one?
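For reference, since sinfo truncates the REASON column, what I would run by hand to see the full drain reason and then clear it is something like the following (node name taken from the output above):

[root@login-node ~]# scontrol show node clus09 | grep -i reason
[root@login-node ~]# scontrol update NodeName=clus09 State=RESUME

(Setting State=IDLE directly also works, but RESUME is the documented way to return a drained node to idle or allocated as appropriate.) What I am looking for is a way to avoid this manual step.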
Thanks.
--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com
Hi,
After reading the answer from Ole Holm Nielsen, I have increased “MessageTimeout” to 20s (the default is 5s) and “UnkillableStepTimeout” to 150s (the default is 60s, and it should always be at least 5 times larger than “MessageTimeout”). However, I have also read that “UnkillableStepProgram” specifies a program to run in these cases... but by default no program is assigned to that parameter (nothing to run). So my question is whether anyone uses a customized “UnkillableStepProgram” and, if so, whether they could explain their setup.
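For reference, the relevant slurm.conf lines now look roughly like this (the UnkillableStepProgram path is only a placeholder; nothing is assigned there yet):

MessageTimeout=20
UnkillableStepTimeout=150
# UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh   (placeholder, not set)

And just to make the question concrete, a rough, untested sketch of the kind of script I imagine hanging off UnkillableStepProgram (I am not sure which SLURM_* variables slurmstepd exports when it runs this, so the job/step IDs below are an assumption):

#!/bin/bash
# Sketch of an UnkillableStepProgram: record which processes were stuck.
# SLURM_JOB_ID / SLURM_STEP_ID are assumed to be in the environment; not verified.
LOG=/var/log/slurm/unkillable_steps.log
{
    echo "=== $(date '+%F %T') host=$(hostname -s) job=${SLURM_JOB_ID:-unknown} step=${SLURM_STEP_ID:-unknown} ==="
    # Processes in uninterruptible sleep (D state) are the usual unkillable suspects.
    ps axo pid,stat,wchan:32,comm | awk '$2 ~ /D/'
} >> "$LOG" 2>&1

Is something along these lines what people actually use, or do your scripts do more than just log?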
Thanks a lot!