[slurm-users] Power save doesn't start nodes

Michael Gutteridge

Jul 17, 2018, 1:15:16 PM
to Slurm User Community List
Hi

I'm running a cluster at a cloud provider and have run up against an odd problem with power save.  I've got several hundred nodes that Slurm won't power up even though they appear idle and powered down.  I suspect they're in a "not-so-idle" state: `scontrol` shows all of the nodes that aren't being powered up in state "IDLE*+CLOUD+POWER".  The asterisk is throwing me off here- that state doesn't appear to be documented in the scontrol manpage (I want to say I've seen it discussed on the list, but Google searches haven't turned up much yet).
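
For reference, I'm checking the state with something like this (nodef74 is one of the affected nodes):

scontrol show node nodef74 | grep -i state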

The other nodes in the cluster are being powered up and down as we'd expect- it's just these nodes that Slurm doesn't touch.  In fact, the controller doesn't even appear to _try_ to power them up: neither the controller logs (with DebugFlags=Power) nor the power management script logs show any attempt to start a node when one is requested.

I haven't figured out a way to reliably reset the nodes to "IDLE".  Some relevant configs are:

SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SuspendProgram=/var/lib/slurm-llnl/suspend
SuspendTime=300
SuspendRate=10
ResumeRate=10
ResumeProgram=/var/lib/slurm-llnl/resume
ResumeTimeout=300
BatchStartTimeout=300
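
For context, the resume program is basically a thin wrapper around the provider's start-instance call.  A simplified sketch of the idea (the tag-based instance lookup below is illustrative, not our actual code):

#!/bin/bash
# slurmctld passes the nodes to power up as a hostlist in $1;
# expand it in case it's in compressed form (e.g. nodef[70-74]).
for host in $(scontrol show hostnames "$1"); do
    # Map the Slurm node name to an instance ID - site-specific;
    # a Name-tag lookup is shown purely for illustration.
    id=$(aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=$host" \
        --query 'Reservations[].Instances[].InstanceId' --output text)
    aws ec2 start-instances --instance-ids "$id"
done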

A typical node is configured thus:

NodeName=nodef74 NodeAddr=nodef74.fhcrc.org Feature=c5.2xlarge CPUs=4 RealMemory=16384 Weight=40 State=CLOUD

Thanks for your time- any advice or hints are greatly appreciated.

Michael


Antony Cleave

Jul 18, 2018, 3:49:01 AM
to Slurm User Community List
I've not seen the IDLE* issue before, but when my nodes have got stuck I've always been able to fix them with this:

[root@cloud01 ~]# scontrol update nodename=cloud01 state=down reason=stuck
[root@cloud01 ~]# scontrol update nodename=cloud01 state=idle
[root@cloud01 ~]# scontrol update nodename=cloud01 state=power_down
[root@cloud01 ~]# scontrol update nodename=cloud01 state=power_up
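
If you have a lot of nodes in that state, note that scontrol update accepts a hostlist expression, so you can do the whole batch in one go - something like this (the range is illustrative):

[root@cloud01 ~]# scontrol update nodename=cloud[01-99] state=down reason=stuck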

Antony

John Hearns

Jul 18, 2018, 4:05:32 AM
to Slurm User Community List

If it is any help,  https://slurm.schedmd.com/sinfo.html

NODE STATE CODES

Node state codes are shortened as required for the field size. These node states may be followed by a special character to identify state flags associated with the node. The following node suffixes and states are used:

*
The node is presently not responding and will not be allocated any new work. If the node remains non-responsive, it will be placed in the DOWN state (except in the case of COMPLETING, DRAINED, DRAINING, FAIL, FAILING nodes).
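
You can see that suffix in node-oriented sinfo output, e.g.:

sinfo -N -l

which prints a per-node STATE column with the * appended for non-responding nodes.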

Michael Gutteridge

Jul 18, 2018, 1:45:10 PM
to Slurm User Community List
John: thanks for the link.  Curiously, sinfo doesn't show the asterisk but does document it, while scontrol shows the asterisk and doesn't document it...  at least for the state my cluster is in.

Antony: thanks for the steps- I tried them out, but there was no change.  It seems like they should do the trick, but the controller would never run the resume or suspend script; the logs indicated "the nodes already up" or similar.  I ran the resume script manually, and that seems to have restored the node to service.
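
(Running it manually was just a matter of passing the node name as the argument, the same way slurmctld invokes ResumeProgram - something like:

/var/lib/slurm-llnl/resume nodef74

where nodef74 was the node I'd been fighting with.)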

However, it may have just been that one node (I'd been working almost exclusively with it and may have put it into a weird state).  On the other nodes it now seems sufficient to just run "scontrol update nodename=<node> state=power_up".  My controller may have been in a bad state, or perhaps I happened to pick a couple of bad apples to work with.

Anyway: it seems to be working again.  Thanks for the help and advice.

Michael