[slurm-users] error: Unable to contact slurm controller (connect failure)

Daniel Rodriguez Lopez (ext) via slurm-users

Nov 18, 2024, 11:23:31 AM
to slurm...@lists.schedmd.com
Dear all,

We recently tried to fix the version of Slurm on every node of our
cluster. After the installation (Slurm 20.11.9) on one of the compute
nodes, most of the commands (squeue, sinfo, scontrol show config, etc.)
return this error:

 error: Unable to contact slurm controller (connect failure)

The log files don't show any errors, even though both debug levels are
set to debug5. The rest of the cluster works as usual.

I would appreciate any insight into what could be the cause.

Thank you and regards,
Daniel

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Sid Young via slurm-users

Nov 18, 2024, 6:32:22 PM
to Daniel Rodriguez Lopez (ext), slurm...@lists.schedmd.com

A few things to look at: make sure DNS/hostname resolution works, disable any firewalls for testing (you can lock things down afterwards), and make sure the slurm.conf file is the same on all nodes.
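
For the connectivity part of this checklist, a minimal Python sketch of a reachability test run from the affected node might look like the following. It assumes the controller listens on the default SlurmctldPort of 6817 (check slurm.conf if that was changed); `can_connect` and the hostname `ctl-host` in the note below are hypothetical names for illustration, not part of Slurm.

```python
import socket


def can_connect(host, port=6817, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    6817 is assumed here as the default SlurmctldPort; pass the value
    from your own slurm.conf if it differs.
    """
    try:
        # create_connection resolves the name and opens the socket,
        # so a DNS failure and a blocked port both end up as OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("ctl-host")` with the name from your SlurmctldHost line and getting False only says that something between name resolution and the listening daemon is broken; it does not tell you which of the points above is at fault.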

I've just done a 20.11.9 to 24.05.2 upgrade along with a CentOS 7.9 to RHEL 9.10 upgrade on all my nodes.

Sid

Steffen Grunewald via slurm-users

Nov 19, 2024, 2:57:09 AM
to Daniel Rodriguez Lopez (ext), slurm...@lists.schedmd.com
Hi Daniel,

>  error: Unable to contact slurm controller (connect failure)
>
> I appreciate any insight on what could be the cause.

Can you check that the slurmctld is up and running, and that the said
commands work on the controller machine itself?
If the slurmctld cannot be started as a service, try to run it in verbose
debug mode (-D -vvv) and find out what might be wrong with it.
If it runs in foreground, check the systemd service again.
Proceed to compute nodes only when you are sure that the ctld is OK.
(IIRC there was a flag in the systemd service definition that had to be
adjusted after an upgrade, maybe you're hitting the same?)

Best,
Steffen

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~

daniel.rodriguez--- via slurm-users

Nov 19, 2024, 4:19:02 AM
to slurm...@lists.schedmd.com
Hi,

Thank you all for the quick answers. We tried your suggestions, and the problem was in slurm.conf: we had not noticed that the name of the control server had a typo.

Thank you, I really appreciate the help.

Best,
Daniel
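
For anyone landing here with the same error, a typo like this can be caught by checking whether the controller name in slurm.conf resolves at all. The following is a rough Python sketch, not an official Slurm tool: `controller_hosts` and `unresolvable` are hypothetical helper names, the parsing only looks at `SlurmctldHost=` and the older `ControlMachine=` keys, and it ignores less common syntax such as escaped `#` characters.

```python
import re
import socket


def controller_hosts(conf_text):
    """Return hostnames named on SlurmctldHost= (or legacy ControlMachine=) lines."""
    hosts = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        # Keys are matched case-insensitively; "name(address)" yields "name".
        m = re.match(r"(?i)(SlurmctldHost|ControlMachine)\s*=\s*([^\s(]+)", line)
        if m:
            hosts.append(m.group(2))
    return hosts


def unresolvable(hosts, resolve=socket.getaddrinfo):
    """Return the subset of hosts whose names fail to resolve."""
    bad = []
    for host in hosts:
        try:
            resolve(host, None)
        except socket.gaierror:
            bad.append(host)
    return bad
```

Feeding it the contents of the node's slurm.conf (often /etc/slurm/slurm.conf, but that path is an assumption and depends on how Slurm was built) and printing `unresolvable(controller_hosts(text))` should list a misspelled controller name, unless the misspelling happens to match another real host.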