[slurm-users] Compute nodes cycling from idle to down on a regular basis ?

Hello everyone,

Brian, Bjorn, thank you for your answers;

- From every compute node, I checked I could  nslookup  some other compute nodes as well as the slurm master for their hostnames; That worked;

In the mean time we identified other issues . Apparently, that solved the problem for part of the nodes (kyle[46-68]) but not others (kyle[01-45])

1) we are migrating from a previous slurm master to a new one and ... the old one still had its slurmctld running with the nodes listed. I think that explained the munge credentials traces . This were certainly coming from the old master
2) we had 2 network interfaces on the compute nodes; It appears that requests on the DHCP were flip flopping the IP between the two interfaces. I'm not sure, but this unusual thing may have created trouble to the slurm master; We simply deactivated one of the two interfaces to prevent that from happening

Unfortunately, even after solving this (and restarting the slurmctld, slurmd, rebooting the compute nodes), we still have issues on 45 compute nodes, while 20 others are now fine. The difference I notice in the slurmd log on the compute node is that :

- for nodes still cycling in idle*->drain , the last entry of the log is :

[2022-02-01T18:45:25.437] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS

- for nodes that are now staying in idle, the last entry of the log is 

[2022-02-01T18:45:25.477] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2022-02-01T19:18:45.835] debug3: in the service_connection
[2022-02-01T19:18:45.837] debug2: Start processing RPC: REQUEST_PING
[2022-02-01T19:18:45.837] debug2: Finish processing RPC: REQUEST_PING

So, there is this missing "REQUEST_PING" RPC on the draining nodes.  On the slurm master, I see, for all the drained nodes, a bunch of : "RPC:REQUEST_PING : Can't find an address, check slurm.conf" and then "Nodes kyle[01-45] not responding", 'error: Nodes kyle[01-45] not responding, setting DOWN'

Sometimes, they come back to life. On the SLURM master logs, I see some "[2022-02-01T19:52:06.941] Node kyle47 now responding", "[2022-02-01T19:52:06.941] Node kyle46 now responding"

Is there any timeout for waiting for a node to respond that might be too short ? Actually, I do not see why they may not be responding;

Thank you for your help,

Jeremy.

>That looks like a DNS issue.
>
>Verify all your nodes are able to resolve the names of each other.

>Check /etc/resolv.conf, /etc/hosts and /etc/slurm/slurm.conf on the 
>nodes (including head/login nodes) to ensure they all match.

>Brian Andrus

On 2/1/2022 1:37 AM, Jeremy Fix wrote:

> Hello everyone,

[slurm-users] Compute nodes cycling from idle to down on a regular basis ?

Jeremy Fix

Bjørn-Helge Mevik

Brian Andrus

Jeremy Fix

Jeremy Fix

Tina Friedrich

Stephen Cousins

Jeremy Fix