[slurm-users] nodes going to down* and getting stuck in that state


Herc Silverstein

May 20, 2021, 12:15:41 AM5/20/21
to slurm...@schedmd.com, Felix Wolfheimer
Hi,

We have a cluster (in Google GCP) which has a few partitions set up to
auto-scale, but one partition is set up not to autoscale. The desired
state is for all of the nodes in this non-autoscaled partition
(SuspendExcParts=gpu-t4-4x-ondemand) to keep running uninterrupted.
However, we are finding that nodes periodically end up in the down*
state and that we cannot get them back into a usable state. This is
with Slurm 19.05.7.

We have a script that runs periodically, checks the state of the
nodes, and takes action based on that state. If a node is in a down
state, it gets terminated, and if the termination succeeds its state
is set to POWER_DOWN. After a short one-second pause, the nodes that
are in the POWERING_DOWN state and not drained are set to RESUME.
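
Roughly, the loop looks like the sketch below (simplified; the gcloud
call, the state-string matching, and the partition variable are
illustrative placeholders, not our exact script):

    #!/bin/bash
    PART=gpu-t4-4x-ondemand

    # find nodes reported as down in the non-autoscaled partition
    down_nodes=$(sinfo -h -N -p "$PART" -t down -o "%N" | sort -u)

    for node in $down_nodes; do
        # terminate the backing GCP instance; only mark the node
        # POWER_DOWN in Slurm if the termination succeeded
        if gcloud compute instances delete "$node" --quiet; then
            scontrol update NodeName="$node" State=POWER_DOWN
        fi
    done

    sleep 1

    # resume nodes that are powering down and not drained
    for node in $down_nodes; do
        state=$(scontrol show node "$node" | grep -o 'State=[^ ]*')
        if echo "$state" | grep -q POWERING_DOWN && \
           ! echo "$state" | grep -qi DRAIN; then
            scontrol update NodeName="$node" State=RESUME
        fi
    done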

Sometimes, after we start a node back up and slurmd is running on it,
we cannot get it back into a usable Slurm state even after manually
fiddling with its state. It seems to bounce between idle* and down*,
but the node is there and we can log into it.

Does anyone have an idea of what might be going on?  And what we can do
to get these nodes back into a usable (I guess "idle") state?

Thanks,

Herc



bbene...@goodyear.com

May 20, 2021, 8:06:58 AM5/20/21
to Slurm User Community List, Felix Wolfheimer
We had a situation recently where a desktop was turned off for a week. When
we brought it back online (in a different part of the network with a different
IP), everything came up fine (slurmd and munge).

But it kept going into DOWN* for no apparent reason (nothing obvious
from either the daemons or the logs).

As part of another issue, we ran "scontrol reconfigure" (and, as it
turned out, restarted slurmctld as well). That seems to have stopped it
going to DOWN*: it switched to IDLE and stayed there.
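
For completeness, the two commands were essentially the following (the
systemd unit name is an assumption about how slurmctld is run on your
controller):

    scontrol reconfigure
    systemctl restart slurmctld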

Not that this necessarily has anything to do with your issue...
But it does sound similar.

--
- Bill
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Bill Benedetto <bbene...@goodyear.com> The Goodyear Tire & Rubber Co.
I don't speak for Goodyear and they don't speak for me. We're both happy.
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+




Brian Andrus

May 20, 2021, 12:26:13 PM5/20/21
to slurm...@lists.schedmd.com
Does it tell you the reason for it being down?

sinfo -R

I have seen cases where a node comes up, but the amount of memory
slurmd sees is a little less than what was configured in slurm.conf.
You should always set aside some of the memory when defining a node in
slurm.conf, so there is room for the operating system and things don't
choke if the node comes up with a bit less memory because some driver
took more when it loaded.
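
As a sketch (the node name, CPU count, and memory numbers below are
made up for illustration): "slurmd -C" on the node prints the hardware
it actually detects, and the RealMemory you put in slurm.conf should
sit a bit below that:

    # slurmd -C on this node reports roughly RealMemory=192000, so we
    # configure less to leave headroom for the OS and drivers
    NodeName=gpu-t4-ondemand-[01-04] CPUs=48 RealMemory=180000 Gres=gpu:t4:4 State=CLOUD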

Brian Andrus

Tim Carlson

May 20, 2021, 10:26:52 PM5/20/21
to Slurm User Community List
The Slurm controller AND all the compute nodes need to know which nodes are in the cluster. If you add a node, or a node changes IP address, you need to let all of the nodes know about it, which for me usually means restarting slurmd on the compute nodes.

I mention this because I get caught by it all the time: I add some nodes and, for whatever reason, miss restarting one of the slurmd processes on the compute nodes.
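
For example (pdsh and the node list here are just placeholders for
whatever you use to reach the compute nodes):

    # after pushing the updated slurm.conf to every node
    pdsh -w 'node[001-064]' systemctl restart slurmd
    # and on the controller
    systemctl restart slurmctld    # or: scontrol reconfigure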

Tim

Christopher Samuel

May 20, 2021, 11:58:43 PM5/20/21
to slurm...@lists.schedmd.com
On 5/19/21 9:15 pm, Herc Silverstein wrote:

> Does anyone have an idea of what might be going on?

To add to the other suggestions, I would say that checking the slurmctld
and slurmd logs to see what they report as the problem is a good place
to start.
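
For example (the log paths and the node name below are placeholders;
the first command tells you where your logs actually live):

    scontrol show config | grep -iE 'SlurmctldLogFile|SlurmdLogFile'
    grep -i 'gpu-t4-4x-ondemand-01' /var/log/slurm/slurmctld.log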

Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
