[slurm-users] Slurm - sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused

Zainul Abiddin

Feb 2, 2021, 8:05:49 AM
to slurm...@lists.schedmd.com
Hi All,
I have configured slurmdbd, but when I try to query accounting data with sacct I get the error below.

[root@smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root@smaster ~]#

My slurmdbd configuration:
[root@smaster ~]# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=password
StorageUser=slurm
StorageLoc=slurm_acct_db

[root@smaster ~]# chown slurm: /etc/slurm/slurmdbd.conf
[root@smaster ~]# chmod 600 /etc/slurm/slurmdbd.conf
[root@smaster ~]# mkdir /var/log/slurm
[root@smaster ~]# touch /var/log/slurm/slurmdbd.log
[root@smaster ~]# chown slurm: /var/log/slurm/slurmdbd.log
[root@smaster ~]# scontrol show config | grep AccountingStorageHost
AccountingStorageHost   = localhost

Note:
I edited /etc/slurm/slurm.conf and changed the line below:
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
Then I restarted all the services.

[root@smaster ~]# for i in munge slurmd slurmctld slurmdbd; do service $i status; done
Redirecting to /bin/systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 36min ago
     Docs: man:munged(8)
 Main PID: 20613 (munged)
   CGroup: /system.slice/munge.service
           └─20613 /usr/sbin/munged

Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Stopped MUNGE authentication service.
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Starting MUNGE authentication service...
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started MUNGE authentication service.
Redirecting to /bin/systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 36min ago
 Main PID: 20637 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─20637 /usr/sbin/slurmd -D

Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started Slurm node daemon.
Feb 02 15:30:47 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 7 for UID 0
Feb 02 15:31:46 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 8 for UID 0
Feb 02 15:33:43 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 9 for UID 0

Redirecting to /bin/systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:11 IST; 3h 36min ago
 Main PID: 20660 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─20660 /usr/sbin/slurmctld -D

Feb 02 13:21:11 smaster.calligotech.com systemd[1]: Started Slurm controller daemon.
Redirecting to /bin/systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 16:29:11 IST; 28min ago
 Main PID: 24146 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─24146 /usr/sbin/slurmdbd -D

Feb 02 16:29:11 smaster.calligotech.com systemd[1]: Started Slurm DBD accounting daemon.
[root@smaster ~]# srun --ntasks=2 --label /bin/hostname
srun: job 22 queued and waiting for resources
srun: job 22 has been allocated resources
1: smaster.calligotech.com
0: smaster.calligotech.com
[root@smaster ~]#


However, when I run the command below:

[root@smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root@smaster ~]#

I get the same error, so I tried the following troubleshooting steps:

[root@smaster ~]# telnet localhost 6819
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
[root@smaster ~]#

[root@smaster ~]# mysql -p -u slurm slurm_acct_db
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 9
Server version: 10.1.48-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [slurm_acct_db]> show tables;
Empty set (0.00 sec)

MariaDB [slurm_acct_db]>

Then I added DbdPort and restarted the services:
[root@smaster ~]# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=password
StorageUser=slurm
StorageLoc=slurm_acct_db
[root@smaster ~]#

[root@smaster ~]# for i in munge slurmd slurmctld slurmdbd; do service $i status; done
Redirecting to /bin/systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 55min ago
     Docs: man:munged(8)
 Main PID: 20613 (munged)
   CGroup: /system.slice/munge.service
           └─20613 /usr/sbin/munged

Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Stopped MUNGE authentication service.
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Starting MUNGE authentication service...
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started MUNGE authentication service.
Redirecting to /bin/systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 55min ago
 Main PID: 20637 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─20637 /usr/sbin/slurmd -D

Feb 02 15:30:47 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 7 for UID 0
Feb 02 15:31:46 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 8 for UID 0
Feb 02 15:33:43 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 9 for UID 0
Feb 02 15:38:45 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 12 for UID 0

Redirecting to /bin/systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:11 IST; 3h 55min ago
 Main PID: 20660 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─20660 /usr/sbin/slurmctld -D

Feb 02 13:21:11 smaster.calligotech.com systemd[1]: Started Slurm controller daemon.
Redirecting to /bin/systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 16:29:11 IST; 47min ago
 Main PID: 24146 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─24146 /usr/sbin/slurmdbd -D

Feb 02 16:29:11 smaster.calligotech.com systemd[1]: Started Slurm DBD accounting daemon.
[root@smaster ~]# ps -ef |grep slurm
root     20637     1  0 13:21 ?        00:00:00 /usr/sbin/slurmd -D
slurm    20660     1  0 13:21 ?        00:00:08 /usr/sbin/slurmctld -D
root     24146     1  0 16:29 ?        00:00:00 /usr/sbin/slurmdbd -D
root     25395 18378  0 17:17 pts/2    00:00:00 grep --color=auto slurm
[root@smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root@smaster ~]#

[root@smaster ~]# tail /var/log/slurm/slurmdbd.log
[2021-02-02T17:16:01.913] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:01.913] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:06.963] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:06.963] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:12.083] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:12.083] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:17.140] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:17.141] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:22.804] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:22.804] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[root@smaster ~]#
 
The problem still remains. Please help me resolve this issue.
 
Regards,
Zain

Marcus Wagner

Feb 3, 2021, 8:22:51 AM
to slurm...@lists.schedmd.com
Hi Zainul,

There seems to be a hostname problem. As far as I can see, your node is called "smaster", but the slurmdbd log complains about the MySQL server host "smater".
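
One way to track down where the misspelled name comes from is sketched below. It is only a sketch under assumptions: the helper names unknown_db_hosts and find_host_refs are hypothetical (not part of Slurm), and the paths in the example comments are the ones shown in the thread. The slurmdbd.conf printed in the thread has StorageHost=localhost while the log names 'smater', so the running daemon may have been started with a different or earlier configuration; the thread does not confirm this.

```shell
# Hypothetical helpers; the names are illustrative and not part of Slurm.

# Print each distinct hostname that slurmdbd reported as unknown.
# $1: path to the slurmdbd log file
unknown_db_hosts() {
    sed -n "s/.*Unknown MySQL server host '\([^']*\)'.*/\1/p" "$1" | sort -u
}

# Print file:line:text for every whole-word occurrence of a name.
# $1: name to search for; remaining arguments: files to search
# Unreadable or missing files are skipped silently.
find_host_refs() {
    name=$1
    shift
    grep -nHwF -- "$name" "$@" 2>/dev/null || true
}

# Example (paths taken from the thread):
#   unknown_db_hosts /var/log/slurm/slurmdbd.log
#   find_host_refs smater /etc/slurm/slurmdbd.conf /etc/hosts
```

If find_host_refs prints nothing for the files that are checked, the name is probably coming from somewhere else, for example a different configuration file than the one that was edited.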

Best
Marcus

On 02.02.2021 at 14:05, Zainul Abiddin wrote:
> Hi All,
> I have done slurmdbd configuration and while i am trying to run account manager with *sacct* i am getting below error.
--
Dipl.-Inf. Marcus Wagner

IT Center
Group: Linux Systems Group
Department: Systems and Operations
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ

Zainul Abiddin

Feb 4, 2021, 6:24:51 AM
to slurm...@lists.schedmd.com
Hi,

Thanks to Benson Muite, Michael Smith and Marcus Wagner for your support in configuring Slurm.

I was finally able to set up Slurm on the master and compute nodes by following the instructions given: I checked and corrected the NTP, hostname file and firewall settings.

[root@smaster ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug        up   infinite      1   idle snode
hpc*         up   infinite      1   idle smaster
[root@smaster ~]#
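
The hostname-file check mentioned above can be sketched roughly as follows. This assumes the "hostname file" means a hosts-format file such as /etc/hosts, which the thread does not state explicitly. The helper name hosts_lookup is hypothetical, and the node names in the example comment are the ones from the sinfo output.

```shell
# Hypothetical helper; the name hosts_lookup is illustrative.
# Print the address that a hosts-format file maps a name to, if any.
# $1: hosts file (for example /etc/hosts); $2: hostname or alias to look up
hosts_lookup() {
    awk -v name="$2" '
        { sub(/#.*/, "") }               # drop comments
        { for (i = 2; i <= NF; i++)      # fields 2..NF are names and aliases
              if ($i == name) { print $1; exit } }
    ' "$1"
}

# Example: report cluster names that have no entry in /etc/hosts.
#   for h in smaster snode; do
#       [ -n "$(hosts_lookup /etc/hosts "$h")" ] || echo "no entry for $h"
#   done
```

A hosts file is only one source of name resolution. On systems that also use DNS, `getent hosts NAME` shows what the system resolver actually returns.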

Regards,
Zain
