[slurm-users] Node switching randomly to down state


Julien Tailleur via slurm-users

Sep 23, 2025, 2:15:06 PM
to slurm...@lists.schedmd.com
Dear all,

I am maintaining a small computing cluster and I am seeing a weird behavior
that I am failing to debug.

My cluster comprises one master node and 16 computing servers, organized
into two queues of 8 servers each. All servers run up-to-date
Debian bullseye. All but 3 servers work flawlessly.

From the master node, I can see that 3 servers in one of the queues
appear down:

jtailleu@kandinsky:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  down* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11

These servers are reachable by SSH/ping:

jtailleu@kandinsky:~$ ping -c 1 FX12
PING FX12 (192.168.6.22) 56(84) bytes of data.
64 bytes from FX12 (192.168.6.22): icmp_seq=1 ttl=64 time=0.070 ms

--- FX12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.070/0.070/0.070/0.000 ms

#####

I can also put these nodes back into idle mode:

root@kandinsky:~# scontrol update nodename=FX[12-14] state=idle
root@kandinsky:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  idle* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11

But then they switch back to the down state a few minutes later:

root@kandinsky:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  down* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11

root@kandinsky:~# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2025-09-08T15:04:39 FX[12-14]

I do not understand where the "not responding" comes from, nor how I can
investigate that. Any idea what could trigger this behavior?

Best wishes,

Julien


--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Davide DelVento via slurm-users

Sep 23, 2025, 4:49:08 PM
to Julien Tailleur, slurm...@lists.schedmd.com
As the great Ole just taught us in another thread, this should tell you why:

sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=FX[12-14]

However I suspect you'd only get "not responding" again ;-)

Are you sure that all the slurm services are running correctly on those servers? Maybe try rebooting them?
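Something along these lines, run on one of the affected nodes, should show whether slurmd is actually alive and can still see the controller (I'm assuming the stock Debian systemd units and that scontrol is installed on the compute nodes):

systemctl status slurmd                      # is the daemon actually running?
journalctl -u slurmd --since "1 hour ago"    # recent slurmd messages
scontrol ping                                # can this node reach slurmctld?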



Ryan Novosielski via slurm-users

Sep 23, 2025, 5:03:43 PM
to Julien Tailleur, slurm...@lists.schedmd.com
First place to look, IMO, would be confirming connectivity on the Slurm-related ports (e.g. a firewall issue). In my experience this is especially likely when you see it work for a little while and then stop after some period of time.
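Something like the following would confirm the daemons can reach each other on those ports (6817/6818 are only the defaults; the real values are SlurmctldPort/SlurmdPort in your slurm.conf, and nc is just one convenient way to test):

scontrol show config | grep -Ei 'slurmctldport|slurmdport'
nc -zv FX12 6818                  # master -> node, default SlurmdPort
ssh FX12 nc -zv kandinsky 6817    # node -> master, default SlurmctldPort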

The logs may also tell you what's going on.

Julien Tailleur via slurm-users

Sep 23, 2025, 6:40:38 PM
to slurm...@lists.schedmd.com
On 9/23/25 16:44, Davide DelVento wrote:
> As the great Ole just taught us in another thread, this should tell
> you why:
>
> sacctmgr show event
> Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where
> nodes=FX[12-14]
>
> However I suspect you'd only get "not responding" again ;-)

Good prediction!

sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User
       NodeName           TimeStart      Duration State  Reason                                   User
--------------- ------------------- ------------- ------ ---------------------------------------- ----------
                2021-08-25T11:13:56 1490-12:21:12        Cluster Registered TRES
FX12            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
FX13            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
FX14            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+

> Are you sure that all the slurm services are running correctly on
> those servers? Maybe try rebooting them?

The services were all running. "Correctly" is harder to say :-) I did not
see anything obviously interesting in the logs, but I am not sure what
to look for.
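For anyone searching the archives later, here is roughly the kind of thing I should probably grep for next time it happens (log paths are whatever SlurmctldLogFile/SlurmdLogFile point to in slurm.conf; /var/log/slurm/ is just a common default, so adjust as needed):

# on the master, around the time the nodes were flagged
grep -i "not responding" /var/log/slurm/slurmctld.log
# on FX12/FX13/FX14 themselves
grep -iE "error|timeout|registered" /var/log/slurm/slurmd.log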

Anyway, I've followed your advice and rebooted the servers; they are
idle for now. I will see how long it lasts. If that fixed it, I will
fall on my sword and apologize for disturbing the ML...

Best,

John Hearns via slurm-users

Sep 24, 2025, 2:32:43 PM
to Julien Tailleur, Slurm User Community List
Look at the slurmd logs on these nodes, or try running slurmd in the foreground (non-daemon mode).

And, as I said in another thread, check the time on these nodes.
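For example (per the slurmd man page, -D keeps it in the foreground and repeating -v increases verbosity; the clock check is just a quick-and-dirty comparison, use whatever tooling you prefer):

# on an affected node
systemctl stop slurmd
slurmd -D -vvv                    # watch the registration attempts live
# quick clock comparison, run from the master
date; ssh FX12 date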