[slurm-users] Rolling reboot with at most N machines down simultaneously?


Phil Chiu

Aug 3, 2022, 11:38:04 AM
to slurm...@schedmd.com
Occasionally I need to reboot all the compute nodes in my system. However, I have a parallel file system which is converged, i.e., each compute node contributes a disk to the file system. The file system can tolerate having N nodes down simultaneously.

Therefore my problem is this - "Reboot all nodes, permitting N nodes to be rebooting simultaneously."

I have thought about the following options
  • A mass scontrol reboot - There doesn't seem to be a way to control how many nodes are rebooted at once.
  • A job array - Job arrays can be easily configured to allow at most N jobs to be running simultaneously. However, I would need each array task to execute on a specific node, which does not appear to be possible.
  • Individual slurm jobs which reboot nodes - With a for loop, I could submit a reboot job for each node. But I'm not sure how to limit this so at most N jobs are running simultaneously. Perhaps a special partition is needed for this?
Open to hearing any other ideas.

Thanks!
Phil

Benjamin Arntzen

Aug 3, 2022, 2:48:29 PM
to Slurm User Community List
At risk of being a heretic, why not something like Ansible to handle this? Slurm "should" be able to do it but feels like a bit of a weird fit for the job.



Brian Andrus

Aug 3, 2022, 5:21:12 PM
to slurm...@lists.schedmd.com


So, an example of using Slurm to reboot all nodes three at a time:

    sinfo -h -o '%n' | xargs -I{} --max-procs=3 scontrol reboot {}

If you want to get fancy, make a script that does the reboot and waits for the node to be back up before exiting, and use that in place of the 'scontrol reboot' part.
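
A rough sketch of what such a wrapper could look like (untested; the
node-state check is an assumption and may need adjusting for your Slurm
version and site):

    #!/bin/bash
    # reboot-and-wait.sh <nodename> -- hypothetical helper, used e.g. as:
    #   sinfo -h -o '%n' | xargs -I{} --max-procs=3 ./reboot-and-wait.sh {}
    node="$1"
    scontrol reboot ASAP nextstate=resume "$node"
    # Poll until the node no longer reports a reboot- or down-like state.
    while scontrol show node "$node" | grep -Eq 'State=[^ ]*(REBOOT|DOWN|NOT_RESPONDING)'; do
        sleep 30
    done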

Brian Andrus

Christopher Samuel

Aug 4, 2022, 1:08:59 AM
to slurm...@lists.schedmd.com
On 8/3/22 8:37 am, Phil Chiu wrote:

> Therefore my problem is this - "Reboot all nodes, permitting N nodes to
> be rebooting simultaneously."

I think currently the only way to do that would be to have a script that
does:

* issue the `scontrol reboot ASAP nextstate=resume [...]` for 3 nodes
* wait for 1 to come back to being online
* issue an `scontrol reboot` for another node
* wait for 1 more to come back
* lather rinse repeat.

This does assume you've got your nodes configured to come back cleanly
on a reboot with slurmd up and no manual intervention required (which is
what we do).
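
A minimal sketch of that loop (illustrative only; the "is it back yet"
test is an assumption and will likely need tuning for your site):

    #!/bin/bash
    # Keep at most MAX nodes queued/rebooting at any one time (sketch, untested).
    MAX=3
    mapfile -t nodes < <(sinfo -h -N -o '%N' | sort -u)

    still_rebooting() {
        # Assumed check: the node state still mentions REBOOT, DOWN or NOT_RESPONDING.
        scontrol show node "$1" | grep -Eq 'State=[^ ]*(REBOOT|DOWN|NOT_RESPONDING)'
    }

    in_flight=()
    for node in "${nodes[@]}"; do
        scontrol reboot ASAP nextstate=resume "$node"
        in_flight+=("$node")
        # Block until fewer than MAX of the touched nodes are still on their way back.
        while true; do
            busy=()
            for n in "${in_flight[@]}"; do
                still_rebooting "$n" && busy+=("$n")
            done
            in_flight=("${busy[@]}")
            (( ${#in_flight[@]} < MAX )) && break
            sleep 30
        done
    done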

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA


Christopher Samuel

Aug 4, 2022, 1:11:43 AM
to slurm...@lists.schedmd.com
On 8/3/22 11:47 am, Benjamin Arntzen wrote:

> At risk of being a heretic, why not something like Ansible to handle this?

Nothing heretical about that, but for us the reason is that `scontrol
reboot ASAP` is integrated nicely into the scheduling of jobs. We have
health checks and node epilogs that can recognise certain conditions
that require a node reboot (too many fragmented huge pages, for instance)
and can trigger that automatically without disrupting the scheduling of
large jobs.

What used to happen was that when a node was rebooted, Slurm would
consider it indefinitely unavailable, conclude it couldn't schedule a
large job, and instead pack in smaller jobs, pushing back the start time
of the large job.

Gerhard Strangar

Aug 4, 2022, 1:21:08 AM
to slurm...@lists.schedmd.com
Phil Chiu wrote:

> - Individual slurm jobs which reboot nodes - With a for loop, I could
> submit a reboot job for each node. But I'm not sure how to limit this so at
> most N jobs are running simultaneously.

With a fake license called reboot?
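
Something along those lines could look like this (a sketch only; the
license name, count and reboot command are made up):

    # slurm.conf: a cluster-wide pool of 3 "reboot" licenses
    Licenses=reboot:3

    # one reboot job per node, each holding a license while it runs
    for node in $(sinfo -h -N -o '%N' | sort -u); do
        sbatch --licenses=reboot:1 --nodelist="$node" --job-name="reboot-$node" \
               --wrap 'sudo reboot'
    done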

Tina Friedrich

Aug 4, 2022, 6:23:59 AM
to slurm...@lists.schedmd.com
I'm currently thinking of something like that - setting up some kind of
TRES resource that limits how many nodes are rebooted at any one time.

I usually do this sort of thing more or less manually; as in, I generate
a list of sbatch commands with the reboot job (one job per node,
specifying the node name) - ordered to my liking (making sure I always
have GPUs of type X available, that sort of thing) - and then submit
that in batches, waiting for one batch to finish before the next goes in.

Tina
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk

Tina Friedrich

Aug 4, 2022, 7:54:14 AM
to slurm...@lists.schedmd.com
...thinking about this, job dependencies are also an option. You could
carve the nodes up into N 'sets', with each node-specific reboot job
depending on the previous job in the same set finishing (see the sketch
below).
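
One way that could look (illustrative only; the reboot command and the
round-robin split into sets are assumptions):

    #!/bin/bash
    # Split nodes round-robin into N "sets"; within a set each reboot job waits
    # for the previous one (afterany), so roughly N nodes reboot at once.
    N=3
    i=0
    declare -a last_jobid
    for node in $(sinfo -h -N -o '%N' | sort -u); do
        set_idx=$(( i % N )); i=$(( i + 1 ))
        dep=""
        [ -n "${last_jobid[$set_idx]}" ] && dep="--dependency=afterany:${last_jobid[$set_idx]}"
        last_jobid[$set_idx]=$(sbatch --parsable $dep --nodelist="$node" \
                               --job-name="reboot-$node" --wrap 'sudo reboot')
    done

(As with the licence idea above, note that a reboot job ending is not
quite the same thing as the node being back up and serving the file
system.)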

Tina

Brian Andrus

Aug 4, 2022, 9:47:31 AM
to slurm...@lists.schedmd.com
This is actually brilliant!

Brian Andrus

David Simpson

Aug 4, 2022, 12:04:07 PM
to Slurm User Community List

Another way might be to implement Slurm's power off/on (if you haven't already) and trigger it as required.

-------------
David Simpson - Senior Systems Engineer
ARCCA, Redwood Building,
King Edward VII Avenue,
Cardiff, CF10 3NB


Chris Samuel

Aug 5, 2022, 2:04:22 AM
to slurm...@lists.schedmd.com
On 3/8/22 10:20 pm, Gerhard Strangar wrote:

> With a fake license called reboot?

It's a neat idea, but I think there is a catch:

* 3 jobs start, each taking 1 license
* Other reboot jobs are all blocked
* Running reboot jobs trigger node reboot
* Running reboot jobs end when either the script exits and slurmd cleans
it up before the reboot kills it, or it gets killed as NODE_FAIL when
the node has been unresponsive for too long and is marked as down
* Licenses for those jobs are released
* 3 more reboot jobs start whilst the original 3 are rebooting
* 6 nodes are now rebooting
* Filesystem fall down go boom
* Also your rebooted nodes are now drained as "Node unexpectedly rebooted"

I guess you could change your Slurm config to not mark nodes as down if
they stop responding, and make sure the job that's launched stays around
until the node is back up, but that feels wrong to me.

Corentin Mercier

Aug 5, 2022, 5:27:40 AM
to whoph...@gmail.com, slurm-users
Hello,

I think you could use Slurm's power saving mechanism to shut down all your nodes simultaneously.
Then running srun -N<nb_nodes> -C <your_node_group> true (or any other small job) will wake up N nodes simultaneously.
You can even run srun while your nodes are powering down; Slurm will boot them again as soon as they have powered down.
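
A rough illustration of the idea (assumes power saving is already set up
with a working SuspendProgram/ResumeProgram; the node range and feature
name are made up):

    # Ask Slurm to power the nodes down via the configured SuspendProgram
    scontrol update NodeName=node[001-100] State=POWER_DOWN

    # Wake 3 of them back up by giving them a trivial job to run
    srun -N3 -C mygroup true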

I hope this is helpful!

Regards,
C.Mercier