[slurm-users] Unable to contact slurm controller


Mahmood Naderan

Jul 31, 2018, 11:36:11 AM
to Slurm User Community List
Hi,
It seems that squeue is broken due to the following error:

[root@rocks7 ~]# squeue
slurm_load_jobs error: Unable to contact slurm controller (connect failure)
[root@rocks7 ~]#  systemctl restart slurmd
[root@rocks7 ~]#  systemctl restart slurmctld
[root@rocks7 ~]# squeue
slurm_load_jobs error: Unable to contact slurm controller (connect failure)
[root@rocks7 ~]# ps aux | grep slurm
root      2969  0.0  0.0 343112  3268 ?        Sl   Jul07   0:12 /usr/sbin/slurmdbd
kouhika+ 22930  0.0  0.0   4348   348 pts/2    S+   Jul30   0:00 /usr/libexec/slurm-spank-x11 -t compute-0-6 -i 803.0 -cgw -s ssh -o
kouhika+ 22931  9.7  0.0 192296 20292 pts/2    S+   Jul30 145:28 ssh -Y compute-0-6 /usr/libexec/slurm-spank-x11 -i 803.0 -c -g -w -s "ssh" -o ""
root     28532  0.0  0.0 143132  2072 ?        Sl   20:02   0:00 /usr/sbin/slurmd
root     29364  0.0  0.0 112712   964 pts/12   S+   20:03   0:00 grep --color=auto slurm


As you can see, I tried restarting the Slurm daemons, but it had no effect.
Any thoughts?


Regards,
Mahmood



Alex Chekholko

Jul 31, 2018, 1:03:43 PM
to Slurm User Community List
Seems like your slurmctld is not running.  Have you checked its log to see why?
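
A quick way to confirm whether the controller is up is sketched below. It assumes Slurm is managed by systemd under the unit name `slurmctld`, as the `systemctl restart slurmctld` call earlier in the thread suggests. The function name `ctld_state` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the systemd state of the controller unit.
# Prints "unknown" when systemctl is missing or cannot be queried.
ctld_state() {
    unit=${1:-slurmctld}
    # is-active prints a single word such as active, inactive or failed.
    # It exits non-zero for anything but active, hence the "|| true".
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    echo "${state:-unknown}"
}

# Example: ctld_state            # state of slurmctld
#          ctld_state slurmd     # state of the node daemon
```

Anything other than `active` means client commands such as `squeue` have nothing to connect to. In that case `journalctl -u slurmctld` shows what systemd captured from the daemon.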

Mahmood Naderan

Jul 31, 2018, 1:24:26 PM
to Slurm User Community List
I don't know what happened. It seems that it had already crashed:

[root@rocks7 ~]# systemctl status slurmctld -l
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2018-07-31 20:02:24 +0430; 1h 50min ago
  Process: 28578 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 28583 (code=exited, status=1/FAILURE)

Jul 31 20:02:23 rocks7.jupiterclusterscu.com systemd[1]: Starting Slurm controller daemon...
Jul 31 20:02:23 rocks7.jupiterclusterscu.com systemd[1]: PID file /var/run/slurmctld.pid not readable (yet?) after start.
Jul 31 20:02:23 rocks7.jupiterclusterscu.com systemd[1]: Started Slurm controller daemon.
Jul 31 20:02:24 rocks7.jupiterclusterscu.com systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
Jul 31 20:02:24 rocks7.jupiterclusterscu.com systemd[1]: Unit slurmctld.service entered failed state.
Jul 31 20:02:24 rocks7.jupiterclusterscu.com systemd[1]: slurmctld.service failed.


Regards,
Mahmood








Hadrian Djohari

Jul 31, 2018, 1:53:13 PM
to Slurm User Community List
Look at /var/log/slurm/slurmctld.log
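
A minimal sketch for pulling the relevant lines out of that log, assuming the daemon tags its messages with `error` or `fatal`. The function name `ctld_log_errors` and the 200-line window are arbitrary choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: show recent error/fatal lines from the controller log.
# $1 = log file (default: the path given in this thread)
# $2 = how many trailing lines to scan (default: 200, an arbitrary choice)
ctld_log_errors() {
    log=${1:-/var/log/slurm/slurmctld.log}
    lines=${2:-200}
    # Case-insensitive match on either keyword within the last $lines lines.
    tail -n "$lines" "$log" | grep -iE 'error|fatal'
}

# Example: ctld_log_errors
#          ctld_log_errors /var/log/slurm/slurmctld.log 50
```

If the log shows nothing useful, running `slurmctld -D -vvv` as root keeps the daemon in the foreground and prints its messages to the terminal, which can help when it exits right after startup.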
--
Hadrian Djohari
Manager of Research Computing Services, [U]Tech
Case Western Reserve University
(W): 216-368-0395
(M): 216-798-7490

Mahmood Naderan

Jul 31, 2018, 2:55:39 PM
to Slurm User Community List
Thank you very much. It turned out that there was an unknown control character in one of the config files, which I couldn't see in the editor.
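
Characters like this can be made visible from the shell. Below is a minimal sketch, assuming a grep that supports `-a` (GNU or BSD). The thread does not say which config file was affected, so the path in the example is only a guess, and `find_ctrl_chars` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: print every line of a file that contains a byte other
# than printable ASCII, space or tab, with those bytes made visible.
find_ctrl_chars() {
    # LC_ALL=C: compare raw bytes, so [:print:] means 0x20-0x7E only.
    # -a: treat the file as text even if it contains NUL bytes.
    # -n: prefix each hit with its line number.
    # cat -v: show control bytes in caret notation (e.g. ^A, ^M).
    LC_ALL=C grep -an '[^[:print:][:blank:]]' "$1" | cat -v
}

# Example (path is an assumption): find_ctrl_chars /etc/slurm/slurm.conf
```

A stray Windows line ending, for instance, shows up as a trailing `^M` on the reported line. No output means no such bytes were found.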

Regards,
Mahmood





