[slurm-users] Power Save: When is RESUME an invalid node state?

Xaver Stiensmeier

Dec 6, 2023, 3:28:50 AM
to slurm...@lists.schedmd.com

Dear Slurm User list,

using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in

slurm_update error: Invalid node state specified

when we called:

scontrol update NodeName="$1" state=RESUME reason=FailedStartup

in the fail script. We run this to make sure that the instances, which are created on demand, are `~idle` again after being removed by the fail program. They are set to RESUME before the actual instance is destroyed. I remember running into this error once before when issuing the command manually, but I don't remember under which conditions it occurs.
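
One way to narrow down when the update is rejected is to record the node's state at the moment of the failure and retry a few times. The following is only a minimal sketch, assuming the fail script receives the node name as `$1` as in the command above; the function name `resume_node` and the `RETRIES` and `DELAY` values are hypothetical, and whether a retry ever succeeds depends on the node state involved, which is not established here.

```shell
#!/usr/bin/env bash
# Sketch of a fail-script helper: try to set a node to RESUME and, if
# scontrol rejects the update, log the node's State= field at that moment.
# RETRIES and DELAY are illustrative values, not Slurm settings.

RETRIES="${RETRIES:-3}"
DELAY="${DELAY:-5}"

resume_node() {
    local node="$1" attempt state
    for ((attempt = 1; attempt <= RETRIES; attempt++)); do
        if scontrol update NodeName="$node" state=RESUME reason=FailedStartup; then
            return 0
        fi
        # Pick the exact "State=..." token from "scontrol show node" output.
        state="$(scontrol show node "$node" |
            awk '{for (i = 1; i <= NF; i++) if ($i ~ /^State=/) {print $i; exit}}')"
        echo "attempt $attempt: RESUME rejected for $node (${state:-State=unknown})" >&2
        if [ "$attempt" -lt "$RETRIES" ]; then
            sleep "$DELAY"
        fi
    done
    return 1
}

# Run only when a node name is passed, as the fail script would be called.
if [ "$#" -gt 0 ]; then
    resume_node "$1"
fi
```

The logged state of the one failing case out of many can then be compared with the states seen in the successful cases.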

Maybe someone has a great idea how to tackle this problem.

Best regards
Xaver Stiensmeier

Ole Holm Nielsen

Dec 6, 2023, 4:30:53 AM
to slurm...@lists.schedmd.com
Hi Xaver,
Probably you can't assign a "reason" when you update a node with
state=RESUME. The scontrol manual page says:

Reason=<reason> Identify the reason the node is in a "DOWN", "DRAINED",
"DRAINING", "FAILING" or "FAIL" state.

Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH,
Ole


Xaver Stiensmeier

Dec 6, 2023, 4:54:33 AM
to slurm...@lists.schedmd.com
Hi Ole,

I will double check, but I am very sure that giving a reason is possible,
as it was done at least 20 other times without error during that
exact run. It might be ignored, though. You can also give a reason when
setting the states POWER_UP and POWER_DOWN. Slurm's documentation does
not always give complete information. We have been running our solution for
about a year now, so I don't think there's a general problem (as in something
that necessarily occurs) with the command. But I will take a closer look. I
really feel like it has to be something more conditional, though, as
otherwise the error would've occurred more often (i.e. every time a
fail is handled and the command is executed).

Your repository would've been really helpful for me when we started
implementing the cloud scheduling, but I feel like we have implemented
most things you mention there already. But I will take a look at
`DebugFlags=Power`. `PrivateData=cloud` was an annoying thing to find
out; Slurm plans/planned to change that in the future (the cloud key behaves
differently from any other key in PrivateData). Of course, our setup
differs a little in the details.
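
For reference, the two settings mentioned above can be inspected and toggled from the command line. This is a rough sketch rather than a recommended setup: the function names are hypothetical, the list of keys is just an illustrative selection of power-save related parameters, and it assumes `scontrol show config` prints one `Key = value` pair per line.

```shell
#!/usr/bin/env bash
# Sketch: inspect power-save related settings of the running slurmctld
# and switch on the Power debug flag without editing slurm.conf.
# Function names and the key selection are illustrative.

# Print only the power-save related lines of the running configuration.
power_settings() {
    scontrol show config |
        grep -E '^(DebugFlags|PrivateData|ResumeFailProgram|ResumeProgram|ResumeTimeout|SuspendProgram|SuspendTime)[[:space:]]*='
}

# Add the Power flag to the controller's DebugFlags at runtime.
enable_power_debug() {
    scontrol setdebugflags +Power
}
```

Calling `enable_power_debug` followed by `power_settings` should then list `Power` among the `DebugFlags`.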

Best regards
Xaver

Ole Holm Nielsen

Dec 6, 2023, 5:10:22 AM
to slurm...@lists.schedmd.com
Hi Xaver,

Your version of Slurm may matter for your power saving experience. Do you
run an updated version?

/Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: Ole.H....@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620

Xaver Stiensmeier

Dec 6, 2023, 5:51:44 AM
to slurm...@lists.schedmd.com
Hi Ole,

Good idea. Here's our current version:

```
sinfo -V
slurm 22.05.7
```

Quick googling told me that the latest version is 23.11. Does the
upgrade change anything in that regard? I will keep reading.

Xaver

Ole Holm Nielsen

Dec 6, 2023, 6:03:59 AM
to slurm...@lists.schedmd.com
On 12/6/23 11:51, Xaver Stiensmeier wrote:
> Good idea. Here's our current version:
>
> ```
> sinfo -V
> slurm 22.05.7
> ```
>
> Quick googling told me that the latest version is 23.11. Does the
> upgrade change anything in that regard? I will keep reading.

There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving
Power with Slurm" at https://slurm.schedmd.com/publications.html

For reasons of security and functionality it is recommended to follow
Slurm's releases (maybe not the first few minor versions of new major
releases like 23.11). FYI, I've collected information about upgrading
Slurm in the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm

/Ole

Xaver Stiensmeier

Dec 6, 2023, 6:15:11 AM
to slurm...@lists.schedmd.com
Hi Ole,

for multiple reasons we build it ourselves. I am not really involved
in that process, but I will contact the person who is. Thanks for the
recommendation! We should probably implement a regular check for
new Slurm versions. I am not 100% sure whether this will fix our
issue, but it's worth a try.
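
A regular check of this kind could start from the `sinfo -V` output shown earlier in the thread. The sketch below only compares the installed version against a minimum; it does not look up new releases upstream. `MIN_VERSION` and the function names are hypothetical, and the comparison relies on GNU `sort -V`.

```shell
#!/usr/bin/env bash
# Sketch: compare the installed Slurm version with a minimum version.
# MIN_VERSION is an illustrative threshold chosen for this example.

MIN_VERSION="${MIN_VERSION:-23.02.0}"

# Succeeds if version $1 is at least version $2 (uses GNU sort -V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# "sinfo -V" prints e.g. "slurm 22.05.7"; keep only the version field.
installed_version() {
    sinfo -V | awk '{print $2}'
}

check_slurm_version() {
    local current
    current="$(installed_version)"
    if version_at_least "$current" "$MIN_VERSION"; then
        echo "Slurm $current is at least $MIN_VERSION"
    else
        echo "Slurm $current is older than $MIN_VERSION"
        return 1
    fi
}
```

Run from cron or a monitoring job, `check_slurm_version` returns a non-zero status when the installed version falls below the chosen minimum.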

Best regards
Xaver

Stefan Staeglich

Dec 7, 2023, 9:47:37 AM
to slurm...@lists.schedmd.com
Hi Xaver,

we also had a similar problem with Slurm 21.08 (see thread "error: power_save
module disabled, NULL SuspendProgram").

Fortunately, we have not yet observed this since the upgrade to 23.02. But the
time period (about a month) is still too short to know if the problem is
really fixed as we are still in the normal recurrence period of that event.

Best regards,
Stefan
--
Albert-Ludwigs-Universität Freiburg
Institut für Informatik
Professur für Maschinelles Lernen

Stefan Stäglich
System-Administrator

T +49 761 203-8223

stae...@informatik.uni-freiburg.de
https://ml.informatik.uni-freiburg.de

Georges-Köhler-Allee 52
D-79110 Freiburg