[slurm-users] Compute nodes cycling from idle to down on a regular basis ?


Jeremy Fix

unread,
Feb 1, 2022, 4:38:17 AM2/1/22
to slurm...@lists.schedmd.com

Hello everyone,

we are facing a weird issue. On a regular basis, some compute nodes go from idle -> idle* -> down and loop back to idle on their own. The Slurm master manages several nodes, and this state cycle appears only for some pools of nodes.

On the compute nodes, the slurmd log shows traces such as:

[2022-02-01T09:41:11.381] error: Munge decode failed: Invalid credential
[2022-02-01T09:41:11.381] ENCODED: Thu Jan 01 01:00:00 1970
[2022-02-01T09:41:11.381] DECODED: Thu Jan 01 01:00:00 1970
[2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward: REQUEST_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
[2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward: Protocol authentication error
[2022-02-01T09:41:11.391] error: service_connection: slurm_receive_msg: Protocol authentication error
[2022-02-01T09:41:11.392] debug2: Finish processing RPC: RESPONSE_FORWARD_FAILED

On the master, the only thing we sometimes get is:

- slurmctld.log:[2022-02-01T10:00:04.456] agent/is_node_resp: node:node45 RPC:REQUEST_PING : Can't find an address, check slurm.conf

On the Slurm master, the node IPs are not specified in /etc/hosts but resolved through DNS (/etc/resolv.conf). One hypothesis is that our DNS server is sometimes slow to respond.
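One quick way to test that hypothesis is to time a few lookups directly. A minimal sketch (localhost is used here as a stand-in; substitute a real node name such as node45):

```shell
# Time a few name lookups; slow or erratic timings would support the
# slow-DNS hypothesis. "localhost" stands in for a real node name.
host=localhost
for i in 1 2 3; do
  start=$(date +%s%N)
  getent hosts "$host" > /dev/null
  end=$(date +%s%N)
  echo "lookup $i: $(( (end - start) / 1000000 )) ms"
done
```

Consistently slow or occasionally spiking timings from the master toward the compute nodes would point at DNS rather than at Slurm itself.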

This happens on a very regular basis: exactly every 1h07, and for some nodes every 3 minutes.

We thought this might be due to munge, but:

- We tried to resync the munge keys.

- The time is correctly synchronized with an NTP server; running date as root on both nodes returns the same date.

- The munge uid/gid are correct:

root@node45:/var/log/slurm# ls -l /etc/munge/
-r-------- 1 munge munge 1024 janv. 27 18:49 munge.key

- We can encode/decode successfully:

root@slurmaster:~$  munge -n | ssh node45 unmunge

STATUS:           Success (0)
ENCODE_HOST:      node45 (127.0.1.1)
ENCODE_TIME:      2022-02-01 10:22:21 +0100 (1643707341)
DECODE_TIME:      2022-02-01 10:22:23 +0100 (1643707343)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              .....
GID:              ......
LENGTH:           0


Do you have any idea how to debug and hopefully solve this issue?

Thank you !

Jeremy

Bjørn-Helge Mevik

unread,
Feb 1, 2022, 6:17:06 AM2/1/22
to slurm...@schedmd.com
This might not apply to your setup, but historically when we've seen
similar behaviour, it was often due to the affected compute nodes
missing from /etc/hosts on some *other* compute nodes.

--
B/H

Brian Andrus

unread,
Feb 1, 2022, 10:17:47 AM2/1/22
to slurm...@lists.schedmd.com

That looks like a DNS issue.

Verify all your nodes are able to resolve the names of each other.

Check /etc/resolv.conf, /etc/hosts and /etc/slurm/slurm.conf on the nodes (including head/login nodes) to ensure they all match.
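Such a consistency check can be sketched by comparing checksums against a reference copy. The snippet below demonstrates the comparison logic with two temporary files as stand-ins; in practice you would fetch each node's /etc/slurm/slurm.conf, e.g. over ssh:

```shell
# Two temp files stand in for the master's and one node's slurm.conf.
ref=$(mktemp)
node=$(mktemp)
printf 'SlurmctldHost=master\n' > "$ref"
printf 'SlurmctldHost=oldmaster\n' > "$node"

# Flag the node if its checksum differs from the reference copy.
ref_sum=$(md5sum "$ref" | cut -d' ' -f1)
node_sum=$(md5sum "$node" | cut -d' ' -f1)
if [ "$node_sum" != "$ref_sum" ]; then
  echo "slurm.conf differs from reference"
fi

rm -f "$ref" "$node"
```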

Brian Andrus

Jeremy Fix

unread,
Feb 1, 2022, 2:29:34 PM2/1/22
to slurm...@lists.schedmd.com
Brian, Bjørn, thank you for your answers.

- From every compute node, I checked that I could nslookup the hostnames of other compute nodes as well as the Slurm master; that worked.

In the meantime we identified other issues. Fixing them apparently solved the problem for part of the nodes (kyle[46-68]) but not for the others (kyle[01-45]):

1) We are migrating from a previous Slurm master to a new one, and the old one still had its slurmctld running with the nodes listed. I think that explains the munge credential traces; they were certainly coming from the old master.
2) The compute nodes had two network interfaces, and DHCP requests were flip-flopping the IP between them. I'm not sure, but this unusual situation may have created trouble for the Slurm master; we simply deactivated one of the two interfaces to prevent it from happening.
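For what it's worth, a simple way to catch that kind of flip-flopping is to record which interface holds each IPv4 address before and after a DHCP renewal:

```shell
# Print each interface with its IPv4 address; if an address moves between
# NICs across DHCP renewals, the master's cached address for the node can
# go stale.
ip -4 -o addr show | awk '{print $2, $4}'
```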

Unfortunately, even after solving this (and restarting slurmctld and slurmd, and rebooting the compute nodes), we still have issues on 45 compute nodes, while 20 others are now fine. The difference I notice in the slurmd log on the compute nodes is:

- For nodes still cycling idle* -> drained, the last log entry is:

[2022-02-01T18:45:25.437] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS

- For nodes that are now staying in idle, the last log entries are:

[2022-02-01T18:45:25.477] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2022-02-01T19:18:45.835] debug3: in the service_connection
[2022-02-01T19:18:45.837] debug2: Start processing RPC: REQUEST_PING
[2022-02-01T19:18:45.837] debug2: Finish processing RPC: REQUEST_PING


So the "REQUEST_PING" RPC is missing on the draining nodes. On the Slurm master, I see, for all the drained nodes, a bunch of "RPC:REQUEST_PING : Can't find an address, check slurm.conf", then "Nodes kyle[01-45] not responding" and "error: Nodes kyle[01-45] not responding, setting DOWN".

Sometimes they come back to life; in the Slurm master logs I see entries such as "[2022-02-01T19:52:06.941] Node kyle47 now responding" and "[2022-02-01T19:52:06.941] Node kyle46 now responding".

Is there a timeout for waiting for a node to respond that might be too short? I do not see why they would not be responding.

Thank you for your help,

Jeremy.



Jeremy Fix

unread,
Feb 2, 2022, 12:57:22 AM2/2/22
to slurm...@lists.schedmd.com
Hi,

A follow-up. I thought some of the nodes were OK, but that's not the case:
this morning, another pool of consecutive compute nodes is idle* (why
consecutive, by the way? they always fail consecutively). And some of the
nodes which were drained came back to life in idle and have now switched
back to idle* again.

One thing I should mention is that the master is now handling a total of
148 nodes; it is the new pool of 100 nodes which has the cycling state.
The previous 48 nodes already handled by this master are OK.

I do not know whether this should be considered a large system, but we
tried adjusting settings such as the ARP cache [1] on the Slurm master.
I'm not very familiar with that; as I understand it, it enlarges the
kernel's ARP cache (the IP-to-MAC table for known hosts). This morning the
master has 125 lines in "arp -a" (before changing the settings in sysctl
it was more like 20). Do you think these settings are also necessary on
the compute nodes?
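For reference, the tuning we applied follows the pattern described in [1]; the exact threshold values below are only an example for a network of this size, not the ones the page mandates:

```shell
# Raise the kernel neighbour (ARP) table thresholds so entries for all
# nodes fit without constant garbage collection (example values; requires
# root, and a slurmctld restart is not needed for this).
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 16384
net.ipv4.neigh.default.gc_thresh3 = 32768
EOF
sysctl -p
```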

Best;

Jeremy.


[1]
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks




Tina Friedrich

unread,
Feb 2, 2022, 5:42:34 AM2/2/22
to slurm...@lists.schedmd.com
Hi Jeremy,

I haven't got anything very intelligent to contribute to solve your problem.

However, what I can tell you is that we run our production cluster with
one SLURM master running on a virtual machine handling just over 300
nodes. We have never seen the sort of problem you have other than when
there was a problem contacting the nodes.

The VM running slurmctld doesn't get any tuning; it's a stock CentOS 8
server, with no increased caching (ARP or otherwise). I just checked, and
I don't think I'm doing anything special about process or memory limits
for the user the SLURM processes run as.

I have, from time to time, had the controller go unresponsive for a
moment, but that's usually due to lots of prologs/epilogs happening at the
same time, and it does not cause node status to flap like that.

So unless you have indications of very high load or memory pressure on
the master, I wouldn't suspect the master not coping.

(I don't do host files, I use DNS. But that really shouldn't make a
difference.)

A lot of people have said name resolution, and yes, that could be it, but
I'm also wondering whether you might have a network problem somewhere.
Ethernet, I mean: congestion, corrupted packets, multipathing or path
failover, or spanning tree going wrong or flapping?

Tina
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk

Stephen Cousins

unread,
Feb 2, 2022, 10:28:16 AM2/2/22
to jerem...@centralesupelec.fr, Slurm User Community List
Hi Jeremy,

What is the value of TreeWidth in your slurm.conf? If there is no entry, I recommend setting it to a value a bit larger than the number of nodes in your cluster and then restarting slurmctld.
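For example, a hypothetical fragment for a 148-node cluster (the exact value is illustrative):

```
# slurm.conf: with TreeWidth at least the node count, slurmctld contacts
# every slurmd directly instead of forwarding messages through a
# spanning tree of slurmd daemons.
TreeWidth=150
```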

Best,

Steve
--
________________________________________________________________
 Steve Cousins             Supercomputer Engineer/Administrator
 Advanced Computing Group            University of Maine System
 244 Neville Hall (UMS Data Center)              (207) 581-3574
 Orono ME 04469                      steve.cousins at maine.edu

Jeremy Fix

unread,
Feb 2, 2022, 1:57:26 PM2/2/22
to Stephen Cousins, Slurm User Community List
Hello, thank you for your suggestion, and thanks also to Tina.

To answer your question, there is no TreeWidth entry in our slurm.conf.

But it seems we have figured out the issue, and I'm so sorry we did not think of it: we already had a pool of 48 nodes on this master, but their slurm.conf had diverged from the one on the pool of nodes with the dancing state; at the very least, their slurmd was not restarted.

And indeed, several people suggested that the slurmd daemons need to talk to each other. That's really our fault: 100 nodes were aware of all 148 nodes, while the other 48 nodes were only aware of themselves. I suppose that created issues for the master.

So even though we also had other issues, like the flip-flopping interfaces, the diverged slurm.conf was probably the main one.

Thank you all for your help. It is time to compute :)

Jeremy.