to Slurm User Community List
Hi
I'm running a cluster in a cloud provider and have run up against an odd problem with power save. I've got several hundred nodes that Slurm won't power up even though they appear idle and in the powered-down state. I suspect that they are in a "not-so-idle" state: `scontrol` shows the state of every node that isn't being powered up as "IDLE*+CLOUD+POWER". The asterisk is throwing me off here: that state doesn't appear to be documented in the scontrol manpage (I want to say I'd seen it discussed on the list, but Google searches haven't turned up much yet).
The other nodes in the cluster are being powered up and down as we'd expect. It's just these nodes that Slurm doesn't power up. In fact, it appears that the controller doesn't even _try_ to power up these nodes: the logs (both the controller's, with DebugFlags=Power, and the power management script's) don't indicate even an attempt to start a node when requested.
I haven't figured out a way to reliably reset the nodes to "IDLE". Some relevant configs are:
Node state codes are shortened as required for the field size.
These node states may be followed by a special character to identify
state flags associated with the node.
The following node suffixes and states are used:
*
The node is presently not responding and will not be allocated
any new work. If the node remains non-responsive, it will
be placed in the DOWN state (except in the case of
COMPLETING, DRAINED, DRAINING, FAIL, FAILING nodes).
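To make the suffix concrete, here is a minimal sketch of how a state string like the one reported above could be picked apart. It assumes the asterisk sits directly after the base state and that flags are joined with "+", as in "IDLE*+CLOUD+POWER"; the function name `parse_node_state` is made up for illustration and is not part of Slurm.

```python
def parse_node_state(state):
    """Split a node state string such as 'IDLE*+CLOUD+POWER'.

    Assumption: the base state comes first, optionally followed by '*'
    (not responding), and any flags are appended with '+'.
    """
    base, *flags = state.split("+")
    return {
        "base": base.rstrip("*"),              # e.g. 'IDLE'
        "not_responding": base.endswith("*"),  # the '*' suffix
        "flags": flags,                        # e.g. ['CLOUD', 'POWER']
    }


print(parse_node_state("IDLE*+CLOUD+POWER"))
# {'base': 'IDLE', 'not_responding': True, 'flags': ['CLOUD', 'POWER']}
```

Read this way, the stuck nodes are idle, powered-down cloud nodes that the controller also considers non-responding, which would fit with them not being allocated any new work.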
Michael Gutteridge
Jul 18, 2018, 1:45:10 PM
to Slurm User Community List
John: thanks for the link. Curiously, sinfo doesn't show the asterisk, but has it documented. scontrol shows the asterisk and doesn't document it... at least for the state my cluster is in.
Antony: Thanks for the steps. I tried them out, but there was no change. It seems like they should do the trick, but the controller would never run the resume or suspend script. The logs indicated that "the nodes already up" or similar. I ran the resume script manually and that seems to have restored the node to service.
However, it may have just been that one node (I had been working almost exclusively with this one node and I may have put it in a weird state). On these other nodes it now seems sufficient to just run "scontrol update nodename=<node> state=power_up". My controller may have just been in a bad state, or perhaps it was just a couple of bad apples I happened to pick to work with.
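For anyone with a long list of stuck nodes, a rough sketch of scripting that power-up request follows. The node names are hypothetical placeholders and the helper names are made up; it defaults to a dry run that only prints the commands, so nothing is sent to the controller unless `dry_run=False` is passed.

```python
import subprocess


def power_up_commands(nodes):
    """Build one 'scontrol update' command per node name."""
    return [
        ["scontrol", "update", f"nodename={node}", "state=power_up"]
        for node in nodes
    ]


def run(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if scontrol fails


# Hypothetical node names, for illustration only.
run(power_up_commands(["cloud-node-001", "cloud-node-002"]))
```

Issuing one command per node keeps any failure tied to a specific node name, which is handy when only a few nodes turn out to be the bad apples.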
Anyway: it seems to be working again. Thanks for the help and advice.