[slurm-users] slurm nodes showing down*


Steven Jones via slurm-users

Dec 8, 2024, 4:59:16 PM
to slurm...@schedmd.com
I have just rebuilt all my nodes and I see

Only 1 & 2 seem available, while 3~6 are not.

3's log,

[root@node3 log]# tail slurmd.log
[2024-12-08T21:45:51.250] CPU frequency setting not configured for this node
[2024-12-08T21:45:51.251] slurmd version 20.11.9 started
[2024-12-08T21:45:51.252] slurmd started on Sun, 08 Dec 2024 21:45:51 +0000
[2024-12-08T21:45:51.252] CPUs=20 Boards=1 Sockets=20 Cores=1 Threads=1 Memory=48269 TmpDisk=23324 Uptime=30 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[root@node3 log]#

And 7 doesn't want to talk to the controller.

[root@node7 slurm]# sinfo
slurm_load_partitions: Zero Bytes were transmitted or received
[root@node7 slurm]# 

These are all rebuilt and 1~3 are identical and 4~7 are identical.

7's log keeps saying,

[2024-12-08T21:49:17.246] error: Unable to register: Zero Bytes were transmitted or received
[2024-12-08T21:49:18.263] error: Unable to register: Zero Bytes were transmitted or received
[2024-12-08T21:49:19.278] error: Unable to register: Zero Bytes were transmitted or received
[2024-12-08T21:49:20.294] error: Unable to register: Zero Bytes were transmitted or received
[2024-12-08T21:49:21.310] error: Unable to register: Zero Bytes were transmitted or received

[root@vuwunicoslurmd1 slurm]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2  idle* node[1-2]
debug*       up   infinite      4  down* node[3-6]
[root@vuwunicoslurmd1 slurm]# 

regards

Steven 

Steffen Grunewald via slurm-users

Dec 9, 2024, 7:29:05 AM
to Steven Jones, slurm...@schedmd.com
Hi,

On Sun, 2024-12-08 at 21:57:11 +0000, Slurm users wrote:
> I have just rebuilt all my nodes and I see

Did they ever work before with Slurm? (Which version?)

> Only 1 & 2 seem available, while 3~6 are not.

Either you didn't wait long enough (5 minutes should be sufficient),
or the "down*" nodes don't have a slurmd that talks to the slurmctld.
The reasons for the latter can only be speculated about.
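
A quick way to see why is to ask the slurmctld directly, e.g. (standard commands, run on the controller):

  scontrol show node node3    # check the State= and Reason= fields
  sinfo -R                    # lists down/drained nodes with the recorded reason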

> 3's log,
>
> [root@node3 log]# tail slurmd.log
> [2024-12-08T21:45:51.250] CPU frequency setting not configured for this node
> [2024-12-08T21:45:51.251] slurmd version 20.11.9 started
> [2024-12-08T21:45:51.252] slurmd started on Sun, 08 Dec 2024 21:45:51 +0000
> [2024-12-08T21:45:51.252] CPUs=20 Boards=1 Sockets=20 Cores=1 Threads=1 Memory=48269 TmpDisk=23324 Uptime=30 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

Does this match (exceed, for Memory and TmpDisk) the node declaration
known by the slurmctld?
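
An easy way to compare, assuming you can run commands on both sides: "slurmd -C" on the node prints the hardware it detects as a ready-made NodeName line, which you can hold against what the controller has:

  slurmd -C                   # on node3: detected CPUs/Sockets/RealMemory/TmpDisk
  scontrol show node node3    # on the controller: the declaration slurmctld is using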

> And 7 doesnt want to talk to the controller.
>
> [root@node7 slurm]# sinfo
> slurm_load_partitions: Zero Bytes were transmitted or received

Does it have munge running, with the right key?
I've seen this message when authorization was lost.
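
Assuming the key sits in the default location, comparing it between controller and nodes (as root) should settle that:

  md5sum /etc/munge/munge.key    # checksum must be identical on all hosts
  systemctl status munge         # and munged must actually be running everywhere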

> These are all rebuilt and 1~3 are identical and 4~7 are identical.

Are the node declarations also identical, respectively?
Do they show the same features in slurmd.log?

> [root@vuwunicoslurmd1 slurm]# sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      2  idle* node[1-2]
> debug*       up   infinite      4  down* node[3-6]

What you see here is what the slurmctld sees.
The usual procedure to debug this is to run the daemons that don't cooperate
in debug mode.
Stop their services, start them manually one by one (ctld first), then
watch whether they talk to each other, and if they don't, learn what stops
them from doing so - then iterate editing the config, "scontrol reconfig",
lather, rinse, repeat.
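
In practice that means something like this (-D keeps the daemon in the foreground, the -v's add verbosity):

  systemctl stop slurmctld && slurmctld -D -vvvv    # on the controller
  systemctl stop slurmd && slurmd -D -vvvv          # on a problem node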

You're the only one who knows your node configuration lines (NodeName=...),
so we can't help any further. Ole's pages perhaps can.
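
Purely for illustration (values taken from the slurmd.log line you posted, not from your real slurm.conf), such lines might look like this, with one NodeName line per hardware group if 1~3 and 4~7 differ:

  NodeName=node[1-3] CPUs=20 Sockets=20 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=48000 State=UNKNOWN
  PartitionName=debug Nodes=node[1-7] Default=YES MaxTime=INFINITE State=UP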

Best,
S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Steven Jones via slurm-users

Dec 9, 2024, 2:42:34 PM
to slurm...@schedmd.com
Hi,

I have fixed a time skew.

The nodes are still down, so it wasn't the time skew.
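
(To double-check, assuming ssh to the hosts works from where I run it, comparing UTC time across them is a one-liner:

  for h in node{1..7} vuwunicoslurmd1; do ssh $h date -u; done
)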

I have run tests as per the munge docs and it all looks OK.

[root@node1 ~]# munge -n | unmunge | grep STATUS
STATUS:           Success (0)
[root@node1 ~]# 

[root@node1 ~]# munge -n | unmunge
STATUS:           Success (0)
ENCODE_HOST:      node1.ods.vuw.ac.nz (130.195.86.21)
ENCODE_TIME:      2024-12-09 19:37:19 +0000 (1733773039)
DECODE_TIME:      2024-12-09 19:37:19 +0000 (1733773039)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0

[root@node1 ~]# ssh admjo...@vuw.ac.nz@vuwunicoslurmd1.ods.vuw.ac.nz munge -n -t 10 | unmunge
Password:
STATUS:           Success (0)
ENCODE_HOST:      ??? (130.195.19.157)
ENCODE_TIME:      2024-12-09 19:37:52 +0000 (1733773072)
DECODE_TIME:      2024-12-09 19:37:52 +0000 (1733773072)
TTL:              10
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              ??? (1204805830)
GID:              ??? (1204805830)
LENGTH:           0

[root@node1 ~]# 

[root@node1 ~]# munge -n -t 10 | ssh admjo...@vuw.ac.nz@vuwunicoslurmd1.ods.vuw.ac.nz unmunge
Password:
STATUS:           Success (0)
ENCODE_HOST:      ??? (130.195.86.21)
ENCODE_TIME:      2024-12-10 08:38:11 +1300 (1733773091)
DECODE_TIME:      2024-12-10 08:38:17 +1300 (1733773097)
TTL:              10
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0

[root@node1 ~]#

The only obvious difference is that the nodes are on UTC and the slurm controller is on NZDT.
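
(The epoch values in the unmunge output above do line up, so munge itself shouldn't mind the mixed zones; if I want the displays to agree, something like

  timedatectl set-timezone Pacific/Auckland

on the nodes should do it.)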

Will go look at slurm

regards

Steven 



From: Steffen Grunewald <steffen....@aei.mpg.de>
Sent: Tuesday, 10 December 2024 1:27 am
To: Steven Jones <steven...@vuw.ac.nz>
Cc: slurm...@schedmd.com <slurm...@schedmd.com>
Subject: Re: [slurm-users] slurm nodes showing down*
 

Steven Jones via slurm-users

Dec 9, 2024, 4:40:11 PM
to slurm...@schedmd.com
I cannot get node3 to work.

After some minutes 4~6 stop, but that appears to be munge sulking.

Node7 never works; it seems the hwclock is faulty and I can't set it, so I'll ignore it.

My problem is node3. I can't fathom why, when 1 & 2 run, 3 won't work with slurm; it doesn't appear to be munge.

[root@vuwunicoslurmd1 ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* node6
debug*       up   infinite      1  down* node3
debug*       up   infinite      4   idle node[1-2,4-5]
[root@vuwunicoslurmd1 ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      3  idle* node[4-6]
debug*       up   infinite      2   idle node[1-2]
debug*       up   infinite      1   down node3
[root@vuwunicoslurmd1 ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      4  down* node[3-6]
debug*       up   infinite      2   idle node[1-2]

[root@node3 ~]# scontrol ping
Slurmctld(primary) at vuwunicoslurmd1.ods.vuw.ac.nz is UP
[root@node3 ~]#

[root@node3 ~]# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2024-12-09 20:41:47 UTC; 42min ago
     Docs: man:munged(8)
  Process: 673 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 686 (munged)
    Tasks: 4 (limit: 26213)
   Memory: 1.2M
   CGroup: /system.slice/munge.service
           └─686 /usr/sbin/munged

Dec 09 20:41:47 node3.ods.vuw.ac.nz systemd[1]: Starting MUNGE authentication service...
Dec 09 20:41:47 node3.ods.vuw.ac.nz systemd[1]: Started MUNGE authentication service.
[root@node3 ~]# munge -n -t 10 | ssh admjo...@vuw.ac.nz@vuwunicoslurmd1.ods.vuw.ac.nz unmunge
Password:
STATUS:           Success (0)
ENCODE_HOST:      ??? (130.195.87.23)
ENCODE_TIME:      2024-12-10 10:24:53 +1300 (1733779493)
DECODE_TIME:      2024-12-10 10:25:00 +1300 (1733779500)
TTL:              10
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0

[root@node3 ~]#

[root@vuwunicoslurmd1 log]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      3  down* node[4-6]
debug*       up   infinite      2   idle node[1-2]
debug*       up   infinite      1   down node3
[root@vuwunicoslurmd1 log]# 

I can't find anything to show why node3 refuses to run, yet 1 & 2 do.

regards

Steven


From: Steffen Grunewald <steffen....@aei.mpg.de>
Sent: Tuesday, 10 December 2024 1:27 am
To: Steven Jones <steven...@vuw.ac.nz>
Cc: slurm...@schedmd.com <slurm...@schedmd.com>
Subject: Re: [slurm-users] slurm nodes showing down*
 

Steven Jones via slurm-users

Dec 9, 2024, 4:57:21 PM
to slurm...@schedmd.com
Is the slurm version critical?


[root@node3 /]# sinfo -V
slurm 20.11.9
[root@node3 /]# uname -a
Linux node3.ods.vuw.ac.nz 4.18.0-553.30.1.el8_10.x86_64 #1 SMP Tue Nov 26 18:56:25 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
[root@node3 /]# 

[root@vuwunicoslurmd1 log]# sinfo -V
slurm 22.05.9
[root@vuwunicoslurmd1 log]# uname -a
Linux vuwunicoslurmd1.ods.vuw.ac.nz 5.14.0-503.15.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Nov 14 15:45:31 EST 2024 x86_64 x86_64 x86_64 GNU/Linux
[root@vuwunicoslurmd1 log]#


Though again, 1 & 2 run fine, so it seems unlikely.
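
To see the daemon versions themselves rather than the client tools, something like this should do:

  slurmd -V                                       # on each node
  scontrol show config | grep -i SLURM_VERSION    # on the controller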


regards

Steven 


From: Steven Jones via slurm-users <slurm...@lists.schedmd.com>
Sent: Tuesday, 10 December 2024 10:37 am
Cc: slurm...@schedmd.com <slurm...@schedmd.com>
Subject: [slurm-users] Re: slurm nodes showing down*
 