[slurm-users] Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING


Xaver Stiensmeier via slurm-users

Feb 23, 2024, 12:57:33 PM
to slurm...@lists.schedmd.com
Dear slurm-user list,

I have a cloud node that is powered up and down on demand. In rare cases
Slurm's ResumeTimeout is reached and the node is therefore powered
down. We have set ReturnToService=2 to avoid the node being marked
down: the instance behind the node is created on demand, so after a
failure nothing stops the system from starting the node again as a
different instance.

I thought this would be enough, but apparently the node is still marked
with "NOT_RESPONDING", which leads to Slurm not trying to schedule on it.

After a while NOT_RESPONDING is removed, but I would like to remove it
directly from within my fail script if possible so that the node can
return to service immediately and not be blocked by "NOT_RESPONDING".

Best regards,
Xaver


--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Xaver Stiensmeier via slurm-users

Feb 29, 2024, 4:10:43 AM
to slurm...@lists.schedmd.com
I am wondering why my question (below) didn't catch anyone's attention.
Purely as feedback for me: is it unclear where my problem lies, or is it
clear but no solution is known? I looked through the documentation and
have now searched the Slurm repository, but I am still unable to
identify clearly how to handle "NOT_RESPONDING".

I would really like to improve my question if necessary.

Best regards,
Xaver

nate--- via slurm-users

Sep 19, 2024, 9:06:40 AM
to slurm...@lists.schedmd.com
Hi Xaver,

I found your thread while searching for a solution to the same issue with cloud nodes. In the past I have always used POWER_UP to get the node to register and clear the NOT_RESPONDING flag, but this necessarily creates an instance regardless of whether I need one. It turns out that updating with UNDRAIN accomplishes the same without booting an instance. Setting UNDRAIN allows the node to be scheduled, which causes the resume program to run and once booted and registered, NOT_RESPONDING is cleared.
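A minimal sketch of the UNDRAIN update described above, written as a small Python wrapper around scontrol. The node name "cloud-node-1" and the helper names are hypothetical; the sketch assumes scontrol is on the PATH and that the caller has permission to update node state.

```python
import subprocess


def undrain_command(node_name):
    """Build the scontrol call that sets State=UNDRAIN on one node."""
    return ["scontrol", "update", f"NodeName={node_name}", "State=UNDRAIN"]


def undrain(node_name, runner=subprocess.run):
    """Run the update; check=True raises if scontrol exits non-zero."""
    return runner(undrain_command(node_name), check=True)


if __name__ == "__main__":
    # "cloud-node-1" is a hypothetical node name used for illustration.
    print(" ".join(undrain_command("cloud-node-1")))
```

Calling undrain("cloud-node-1") from a fail script would issue the update without powering the node up. The runner parameter exists only so the call can be exercised without a Slurm installation.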

Unfortunately, the node state still displays NOT_RESPONDING, so it still shows up in sinfo --dead and as far as I can tell there is no way to separate "will boot" from "won't boot" nodes. Clearly there is still some internal state there that does not appear to be user-visible, at least from scontrol show node. And if there is a way to administratively clear NOT_RESPONDING entirely, I have not found it. But hopefully this helps.
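To report which nodes still carry the flag, a script could split the State field printed by scontrol show node into its individual flags. This is only a sketch: it assumes the field has the '+'-separated form shown in the thread subject, the helper name is hypothetical, and it does not expose the internal "will boot" versus "won't boot" distinction mentioned above.

```python
import re


def node_state_flags(show_node_output):
    """Return the set of '+'-separated flags from the State= field."""
    match = re.search(r"\bState=(\S+)", show_node_output)
    if match is None:
        return set()
    return set(match.group(1).split("+"))


# Sample line modeled on the state string in the thread subject;
# the trailing field is illustrative.
sample = "   State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING ThreadsPerCore=1"
flags = node_state_flags(sample)
print("NOT_RESPONDING" in flags)
```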

--nate

Xaver Stiensmeier via slurm-users

Sep 20, 2024, 4:06:11 AM
to slurm...@lists.schedmd.com
Hey Nate,

we actually fixed the underlying issue that caused the NOT_RESPONDING
flag: on failures we automatically terminated the node ourselves instead
of letting Slurm call the terminate script. That led to Slurm believing
the node should still be there when it had already been terminated.

Therefore, we do not have the issue any more as we no longer see nodes
with NOT_RESPONDING.

Nice to hear that you found a solution though.

Best,
Xaver