[slurm-users] Nodes do not return to service after scontrol reboot


David Baker

Jun 16, 2020, 11:17:09 AM
to Slurm User Community List
Hello,

We are running Slurm v19.05.5 and I am experimenting with the scontrol reboot command. I find that compute nodes reboot, but they are not returned to service. Rather, they remain down following the reboot.

navy55         1    debug*        down   80   2:20:2 192000        0   2000   (null) Reboot ASAP : reboot

This is a diskful node, so it doesn't take too long to reboot. For the sake of argument, I have set ResumeTimeout to 1000 seconds, which is well over what's needed...

[root@navy51 slurm]# grep -i resume slurm.conf
ResumeTimeout=1000
[root@navy51 slurm]# grep -i return slurm.conf
ReturnToService=0
[root@navy51 slurm]# grep -i nhc slurm.conf
# LBNL Node Health Check (NHC)
#HealthCheckProgram=/usr/sbin/nhc

For this experiment I have disabled the health checker, and I don't think setting ReturnToService=1 helps. Could anyone please help with this? We are about to update the node firmware and ensuring that the nodes are returned to service following their reboot would be useful.

Best regards,
David

Christopher Samuel

Jun 16, 2020, 1:16:51 PM
to slurm...@lists.schedmd.com
On 6/16/20 8:16 am, David Baker wrote:

> We are running Slurm v19.05.5 and I am experimenting with the scontrol
> reboot command. I find that compute nodes reboot, but they are not
> returned to service. Rather they remain down following the reboot.

How are you using "scontrol reboot" ?

We do:

scontrol reboot ASAP nextstate=resume reason=$REASON $NODE

Which works for us (and we have health checks in our epilog that can
trigger this for known issues like running low on unfragmented huge pages).
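As a rough sketch of how the command above might be wrapped for a batch of reboots (for example before a firmware update), something like the following could be used. The function name reboot_node, the DRY_RUN variable, and the reason string are assumptions made for illustration and are not part of Slurm; only the scontrol invocation itself comes from the command shown above. By default the sketch prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical wrapper around the scontrol reboot form shown above.
# reboot_node and DRY_RUN are illustrative names, not Slurm features.

reboot_node() {
    reason="$1"
    node="$2"
    # ASAP drains the node so no new jobs start on it before the reboot;
    # nextstate=resume asks Slurm to return it to service afterwards.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (the default in this sketch): print the command only.
        echo "scontrol reboot ASAP nextstate=resume reason=${reason} ${node}"
    else
        scontrol reboot ASAP nextstate=resume "reason=${reason}" "${node}"
    fi
}

# Example: show what would be run for node navy55.
reboot_node "firmware-update" "navy55"
```

Setting DRY_RUN=0 would run the command for real; it is worth checking the printed command first.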

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

David Baker

Jun 18, 2020, 2:32:53 AM
to slurm...@lists.schedmd.com
Hello Chris,

Thank you for your comments. The scontrol reboot command is now working as expected. 

Best regards,
David



Chris Samuel

Jun 18, 2020, 2:36:03 AM
to slurm...@lists.schedmd.com
On 17/6/20 11:32 pm, David Baker wrote:

> Thank you for your comments. The scontrol reboot command is now working
> as expected.

Fantastic!

For those who don't know, using scontrol reboot in this way also lets
Slurm take the rebooting nodes into account when scheduling. If a large,
high-priority job is waiting for many nodes and you need to reboot some
of them, Slurm won't give up on those nodes and fill the rest of the
system with smaller jobs, delaying the larger job for no good reason.

All the best,
Chris
--