[slurm-users] Power saving method selection for different kinds of hardware

Ole Holm Nielsen

Nov 8, 2022, 9:36:50 AM
to Slurm User Community List
I'm thinking about the best way to configure power saving (see
https://slurm.schedmd.com/power_save.html) when we have different types of
node hardware whose power states have to be managed differently:

1. Nodes with a BMC NIC interface where "ipmitool chassis power ..."
commands can be used.

2. Nodes where the BMC cannot be used for powering up due to the shared
NICs going down when the node is off :-(

3. Cloud nodes where special cloud CLI commands must be used (such as
Azure CLI).

slurm.conf permits only one SuspendProgram and one ResumeProgram, which
then have to distinguish the cases listed above and perform the appropriate
actions.

I was thinking of adding a node feature to indicate the kind of power
control mechanism available, for example along these lines for the 3 cases
above:

Nodename=node001 Feature=power_ipmi
Nodename=node002 Feature=power_none
Nodename=node003 Feature=power_azure

The SuspendProgram and ResumeProgram could then query the node feature and
jump to separate branches of the script for the power control commands.
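
The feature-based branching described above could be sketched as a
ResumeProgram along the following lines. This is a minimal sketch, not a
tested production script: the function names, the "-bmc" hostname suffix,
the IPMI user and password file, the Azure resource group, and the idea
that the Azure VM name equals the Slurm node name are all assumptions made
for illustration. Slurm passes the nodes to resume as a hostlist expression
in the first argument, which "scontrol show hostnames" expands into one
name per line. Setting the hypothetical DRY_RUN variable prints the
commands instead of running them.

```shell
#!/bin/bash
# Sketch of a ResumeProgram that branches on a node feature.
# All site-specific values below are assumptions for illustration.
BMC_SUFFIX="-bmc"                          # assumed: BMC of node001 is node001-bmc
IPMI_USER="admin"                          # assumed IPMI user name
IPMI_PASSFILE="/etc/slurm/ipmi_password"   # assumed password file for ipmitool -f
AZURE_RG="my-resource-group"               # assumed Azure resource group

# Run a command, or only print it when DRY_RUN is set (useful for testing).
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Map a comma-separated feature list to a power control method.
power_method() {
    case ",$1," in
        *,power_ipmi,*)  echo ipmi ;;
        *,power_azure,*) echo azure ;;
        *,power_none,*)  echo none ;;
        *)               echo unknown ;;
    esac
}

# Look up the AvailableFeatures field of one node.
node_features() {
    scontrol -o show node "$1" | tr ' ' '\n' | sed -n 's/^AvailableFeatures=//p'
}

# Power on a single node using the given method.
# Assumes the Azure VM name is the same as the Slurm node name.
resume_node() {
    local node="$1" method="$2"
    case "$method" in
        ipmi)  run ipmitool -I lanplus -H "${node}${BMC_SUFFIX}" -U "$IPMI_USER" \
                   -f "$IPMI_PASSFILE" chassis power on ;;
        azure) run az vm start --resource-group "$AZURE_RG" --name "$node" ;;
        *)     echo "$node: no usable power control method ($method)" >&2
               return 1 ;;
    esac
}

# Entry point: expand the hostlist from Slurm and resume each node.
resume_nodes() {
    local node
    for node in $(scontrol show hostnames "$1"); do
        resume_node "$node" "$(power_method "$(node_features "$node")")"
    done
}
```

A real ResumeProgram would end with a call such as resume_nodes "$1".
Nodes tagged power_none fall through to the error branch, since nothing
can power them up remotely; it may be simpler to keep such nodes out of
power saving entirely, for example with SuspendExcNodes in slurm.conf.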

Question: Has anyone thought of a similar or better way to handle power
saving for different types of nodes?

Thanks,
Ole

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,


Prentice Bisbal

Mar 27, 2023, 1:35:35 PM
to slurm...@lists.schedmd.com
I'm just catching up on old mailing list messages now. Why not make your
SuspendProgram and ResumeProgram shell scripts that look at some
node information in Slurm (look at the features as in your example) or
some other source (use a case statement based on node names) and call
the correct suspend/resume command based on that?

I agree that attaching this metadata to the node definition and having
Slurm act on it directly would be the best solution, but in the meantime,
having a shell script that can figure out the correct way to
suspend/resume each host should be very doable, if not ideal.
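
The name-based variant mentioned above (a case statement on node names)
might look roughly like this for the suspend side. Again this is only a
sketch under assumptions: the name patterns (node*, old*, azure*), the
"-bmc" suffix, the IPMI credentials, and the Azure resource group are
invented for illustration, and DRY_RUN is a hypothetical switch that
prints commands instead of executing them.

```shell
#!/bin/bash
# Sketch of a SuspendProgram that picks the method from the node name.
# Name patterns and site-specific values are assumptions for illustration.
IPMI_USER="admin"                          # assumed IPMI user name
IPMI_PASSFILE="/etc/slurm/ipmi_password"   # assumed password file for ipmitool -f
AZURE_RG="my-resource-group"               # assumed Azure resource group

# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Choose the power control method from the host name alone.
method_by_name() {
    case "$1" in
        node*)  echo ipmi ;;      # assumed: on-premise nodes with a working BMC
        old*)   echo none ;;      # assumed: nodes whose shared NIC goes down
        azure*) echo azure ;;     # assumed: cloud nodes
        *)      echo unknown ;;
    esac
}

# Power down a single node with the method that fits its name.
suspend_node() {
    local node="$1"
    case "$(method_by_name "$node")" in
        ipmi)  run ipmitool -I lanplus -H "${node}-bmc" -U "$IPMI_USER" \
                   -f "$IPMI_PASSFILE" chassis power soft ;;
        azure) run az vm deallocate --resource-group "$AZURE_RG" --name "$node" ;;
        none)  echo "$node: cannot be powered up remotely, leaving it on" >&2 ;;
        *)     echo "$node: unknown node type" >&2
               return 1 ;;
    esac
}

# Entry point: expand the hostlist from Slurm and suspend each node.
suspend_nodes() {
    local node
    for node in $(scontrol show hostnames "$1"); do
        suspend_node "$node"
    done
}
```

A real SuspendProgram would end with a call such as suspend_nodes "$1".
Keying on node names avoids a scontrol lookup per node, at the cost of
keeping the patterns in the script in sync with the cluster's naming.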

Prentice

Ole Holm Nielsen

Mar 27, 2023, 2:32:52 PM
to slurm...@lists.schedmd.com
Hi Prentice,

Since the last message I figured out a way to implement power_save:

I've documented our setup in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
This page contains a link to power_save scripts on GitHub.

Best regards,
Ole