A few things to check here:
$ munge -n | unmunge
Mike
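For anyone who wants to script the local check above, here is a minimal sketch. It assumes unmunge prints a STATUS line in the form quoted later in this digest ("STATUS: Success (0)"). The helper name munge_ok is made up for illustration and is not part of MUNGE.

```shell
# Hypothetical helper: read `unmunge` output on stdin and succeed only if
# a STATUS line is present and reports code 0.
munge_ok() {
    awk '
        /STATUS:/ { found = 1; ok = ($NF == "(0)") }
        END       { exit !(found && ok) }'
}

# Example use on a host where munged is running:
#   munge -n | unmunge | munge_ok && echo "local credential decodes"
```

A passing local check only shows that munged works on that one host. The cross-node form (munge -n | ssh snode unmunge) is what exercises the shared key.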
From:
slurm-users <slurm-use...@lists.schedmd.com> on behalf of slurm-use...@lists.schedmd.com <slurm-use...@lists.schedmd.com>
Date: Tuesday, February 2, 2021 at 8:16 AM
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: slurm-users Digest, Vol 40, Issue 4
Today's Topics:
1. Slurm - sacct: error: slurm_persist_conn_open_without_init:
failed to open persistent connection to host:localhost:6819:
Connection refused (Zainul Abiddin)
2. Re: Slurm - Munge configuration details (Benson Muite)
----------------------------------------------------------------------
Message: 1
Date: Tue, 2 Feb 2021 18:35:20 +0530
From: Zainul Abiddin <zainu...@gmail.com>
To: slurm...@lists.schedmd.com
Subject: [slurm-users] Slurm - sacct: error:
slurm_persist_conn_open_without_init: failed to open persistent
connection to host:localhost:6819: Connection refused
Message-ID:
<CAA9R82u0L7VdZDhvP_1KfWmV...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi All,
I have configured slurmdbd, but when I try to query accounting data
with *sacct* I get the error below.
[root@smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open
persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root@smaster ~]#
My slurmdbd configuration :
[root@smaster ~]# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=password
StorageUser=slurm
StorageLoc=slurm_acct_db
[root@smaster ~]# chown slurm: /etc/slurm/slurmdbd.conf
[root@smaster ~]# chmod 600 /etc/slurm/slurmdbd.conf
[root@smaster ~]# mkdir /var/log/slurm
[root@smaster ~]# touch /var/log/slurm/slurmdbd.log
[root@smaster ~]# chown slurm: /var/log/slurm/slurmdbd.log
[root@smaster ~]# scontrol show config | grep AccountingStorageHost
AccountingStorageHost = localhost
Note:
I have edited /etc/slurm/slurm.conf and modified the line below
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
Then I restarted all the services
[root@smaster ~]# for i in munge slurmd slurmctld slurmdbd; do service $i status; done
Redirecting to /bin/systemctl status munge.service
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor
preset: disabled)
Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 36min ago
Docs: man:munged(8)
Main PID: 20613 (munged)
CGroup: /system.slice/munge.service
└─20613 /usr/sbin/munged
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Stopped MUNGE
authentication service.
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Starting MUNGE
authentication service...
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started MUNGE
authentication service.
Redirecting to /bin/systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor
preset: disabled)
Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 36min ago
Main PID: 20637 (slurmd)
CGroup: /system.slice/slurmd.service
└─20637 /usr/sbin/slurmd -D
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started Slurm node
daemon.
Feb 02 15:30:47 smaster.calligotech.com slurmd[20637]: slurmd: Launching
batch job 7 for UID 0
Feb 02 15:31:46 smaster.calligotech.com slurmd[20637]: slurmd: Launching
batch job 8 for UID 0
Feb 02 15:33:43 smaster.calligotech.com slurmd[20637]: slurmd: Launching
batch job 9 for UID 0
Redirecting to /bin/systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled;
vendor preset: disabled)
Active: active (running) since Tue 2021-02-02 13:21:11 IST; 3h 36min ago
Main PID: 20660 (slurmctld)
CGroup: /system.slice/slurmctld.service
└─20660 /usr/sbin/slurmctld -D
Feb 02 13:21:11 smaster.calligotech.com systemd[1]: Started Slurm
controller daemon.
Redirecting to /bin/systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled;
vendor preset: disabled)
Active: active (running) since Tue 2021-02-02 16:29:11 IST; 28min ago
Main PID: 24146 (slurmdbd)
CGroup: /system.slice/slurmdbd.service
└─24146 /usr/sbin/slurmdbd -D
Feb 02 16:29:11 smaster.calligotech.com systemd[1]: Started Slurm DBD
accounting daemon.
[root@smaster ~]# srun --ntasks=2 --label /bin/hostname
srun: job 22 queued and waiting for resources
srun: job 22 has been allocated resources
1: smaster.calligotech.com
0: smaster.calligotech.com
[root@smaster ~]#
However, when I run the command below
[root@smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open
persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root@smaster ~]#
I have tried the following troubleshooting steps
[root@smaster ~]# telnet localhost 6819
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
[root@smaster ~]#
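"Connection refused" on both ::1 and 127.0.0.1 suggests that nothing is accepting connections on port 6819, even though a slurmdbd process exists. One way to confirm this from the listening-socket table is sketched below. It assumes the usual `ss -ltn` column layout (state, queues, local address:port, peer address:port). The helper name port_listening is hypothetical.

```shell
# Hypothetical helper: read `ss -ltn` style output on stdin and succeed if
# some socket in LISTEN state is bound to the given TCP port ($1).
port_listening() {
    awk -v port="$1" '
        {
            for (i = 1; i <= NF; i++)
                if ($i == "LISTEN") {
                    # Local address is three fields after the state column,
                    # e.g. 0.0.0.0:6819 or [::]:6819; the port follows the last colon.
                    n = split($(i + 3), parts, ":")
                    if (parts[n] == port) found = 1
                }
        }
        END { exit !found }'
}

# Example use:
#   ss -ltn | port_listening 6819 || echo "nothing is listening on 6819"
```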
[root@smaster ~]# mysql -p -u slurm slurm_acct_db
Enter password:
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 9
Server version: 10.1.48-MariaDB MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input
statement.
MariaDB [slurm_acct_db]> show tables;
Empty set (0.00 sec)
MariaDB [slurm_acct_db]>
Then I added DbdPort and restarted the services
[root@smaster ~]# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=password
StorageUser=slurm
StorageLoc=slurm_acct_db
[root@smaster ~]#
[root@smaster ~]# for i in munge slurmd slurmctld slurmdbd; do service $i status; done
Redirecting to /bin/systemctl status munge.service
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor
preset: disabled)
Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 55min ago
Docs: man:munged(8)
Main PID: 20613 (munged)
CGroup: /system.slice/munge.service
└─20613 /usr/sbin/munged
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Stopped MUNGE
authentication service.
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Starting MUNGE
authentication service...
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started MUNGE
authentication service.
Redirecting to /bin/systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor
preset: disabled)
Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 55min ago
Main PID: 20637 (slurmd)
CGroup: /system.slice/slurmd.service
└─20637 /usr/sbin/slurmd -D
Feb 02 15:30:47 smaster.calligotech.com slurmd[20637]: slurmd: Launching
batch job 7 for UID 0
Feb 02 15:31:46 smaster.calligotech.com slurmd[20637]: slurmd: Launching
batch job 8 for UID 0
Feb 02 15:33:43 smaster.calligotech.com slurmd[20637]: slurmd: Launching
batch job 9 for UID 0
Feb 02 15:38:45 smaster.calligotech.com slurmd[20637]: slurmd: Launching
batch job 12 for UID 0
Redirecting to /bin/systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled;
vendor preset: disabled)
Active: active (running) since Tue 2021-02-02 13:21:11 IST; 3h 55min ago
Main PID: 20660 (slurmctld)
CGroup: /system.slice/slurmctld.service
└─20660 /usr/sbin/slurmctld -D
Feb 02 13:21:11 smaster.calligotech.com systemd[1]: Started Slurm
controller daemon.
Redirecting to /bin/systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled;
vendor preset: disabled)
Active: active (running) since Tue 2021-02-02 16:29:11 IST; 47min ago
Main PID: 24146 (slurmdbd)
CGroup: /system.slice/slurmdbd.service
└─24146 /usr/sbin/slurmdbd -D
Feb 02 16:29:11 smaster.calligotech.com systemd[1]: Started Slurm DBD
accounting daemon.
[root@smaster ~]# ps -ef |grep slurm
root 20637 1 0 13:21 ? 00:00:00 /usr/sbin/slurmd -D
slurm 20660 1 0 13:21 ? 00:00:08 /usr/sbin/slurmctld -D
root 24146 1 0 16:29 ? 00:00:00 /usr/sbin/slurmdbd -D
root 25395 18378 0 17:17 pts/2 00:00:00 grep --color=auto slurm
[root@smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open
persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root@smaster ~]#
[root@smaster ~]# tail /var/log/slurm/slurmdbd.log
[2021-02-02T17:16:01.913] error: mysql_real_connect failed: 2005 Unknown
MySQL server host 'smater' (-2)
[2021-02-02T17:16:01.913] error: The database must be up when starting the
MYSQL plugin. Trying again in 5 seconds.
[2021-02-02T17:16:06.963] error: mysql_real_connect failed: 2005 Unknown
MySQL server host 'smater' (-2)
[2021-02-02T17:16:06.963] error: The database must be up when starting the
MYSQL plugin. Trying again in 5 seconds.
[2021-02-02T17:16:12.083] error: mysql_real_connect failed: 2005 Unknown
MySQL server host 'smater' (-2)
[2021-02-02T17:16:12.083] error: The database must be up when starting the
MYSQL plugin. Trying again in 5 seconds.
[2021-02-02T17:16:17.140] error: mysql_real_connect failed: 2005 Unknown
MySQL server host 'smater' (-2)
[2021-02-02T17:16:17.141] error: The database must be up when starting the
MYSQL plugin. Trying again in 5 seconds.
[2021-02-02T17:16:22.804] error: mysql_real_connect failed: 2005 Unknown
MySQL server host 'smater' (-2)
[2021-02-02T17:16:22.804] error: The database must be up when starting the
MYSQL plugin. Trying again in 5 seconds.
[root@smaster ~]#
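The log excerpt above names the host 'smater', while the slurmdbd.conf shown earlier sets StorageHost=localhost and the machine itself is called smaster. The sketch below is one way to put the two values side by side. Both helper names are made up for illustration, and the parsing is a simplification: it does not handle inline comments. Whether the running daemon read a different or older configuration is an assumption the thread does not confirm.

```shell
# Hypothetical helper: print the value of KEY ($1) from a KEY=value file ($2).
# Comment lines are skipped, key matching ignores case, the last match is printed.
conf_value() {
    awk -F= -v key="$1" '
        /^[[:space:]]*#/ { next }
        { k = $1; gsub(/[[:space:]]/, "", k) }
        tolower(k) == tolower(key) { sub(/^[^=]*=/, ""); val = $0 }
        END { print val }' "$2"
}

# Hypothetical helper: print the host quoted in the most recent
# "MySQL server host '...'" message read from stdin.
log_unknown_host() {
    sed -n "s/.*MySQL server host '\([^']*\)'.*/\1/p" | tail -n 1
}

# Example use with the paths from this thread:
#   want=$(conf_value StorageHost /etc/slurm/slurmdbd.conf)
#   got=$(log_unknown_host < /var/log/slurm/slurmdbd.log)
#   [ "$want" = "$got" ] || echo "log mentions '$got', config says '$want'"
```

If the two values differ, it may be worth checking which configuration file the running slurmdbd actually loaded and whether it was restarted after the last edit.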
The problem remains the same. Please help me resolve this issue.
Regards,
Zain
------------------------------
Message: 2
Date: Tue, 2 Feb 2021 16:16:09 +0300
From: Benson Muite <benson...@emailplus.org>
To: slurm...@lists.schedmd.com
Subject: Re: [slurm-users] Slurm - Munge configuration details
Message-ID: <bd36d545-4fd7-05ec...@emailplus.org>
Content-Type: text/plain; charset=utf-8; format=flowed
On 2/2/21 4:00 PM, Zainul Abiddin wrote:
> Hi Benson,
>
> I am not able to do passwordless ssh between the master and compute nodes
> using the Munge service.
> When I run the command below, it asks for a password for
> the compute node.
>
> /Am I configuring this properly? I need clarity on this./
>
> [root@smaster ~]# munge -n | ssh snode unmunge
> root@snode's password:
> STATUS:           Success (0)
> ENCODE_HOST:      smaster.calligotech.com (192.168.1.195)
> ENCODE_TIME:      2021-02-01 13:58:16 +0530 (1612168096)
> DECODE_TIME:      2021-02-01 13:58:21 +0530 (1612168101)
> TTL:              300
> CIPHER:           aes128 (4)
> MAC:              sha1 (3)
> ZIP:              none (0)
> UID:              root (0)
> GID:              root (0)
> LENGTH:           0
>
> [root@smaster ~]#
>
> Regards,
> Zain
>
Hi Zain,
Perhaps try using the IP address instead of the hostname?
Also, are clocks synchronized? See
https://slurm.schedmd.com/quickstart_admin.html
Benson
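On the clock question, the unmunge output quoted above already carries both timestamps as epoch seconds, so the gap can be read off directly. A minimal sketch follows, assuming the ENCODE_TIME and DECODE_TIME lines end with the epoch value in parentheses as in that output. The helper name munge_time_delta is hypothetical.

```shell
# Hypothetical helper: read `unmunge` output on stdin and print
# DECODE_TIME minus ENCODE_TIME in seconds, using the epoch values
# given in parentheses at the end of each line.
munge_time_delta() {
    awk '
        /ENCODE_TIME:/ { enc = $NF }
        /DECODE_TIME:/ { dec = $NF }
        END {
            gsub(/[()]/, "", enc); gsub(/[()]/, "", dec)
            if (enc == "" || dec == "") exit 1
            print dec - enc
        }'
}

# Example use:
#   munge -n | ssh snode unmunge | munge_time_delta
```

For the quoted output this gives 1612168101 - 1612168096 = 5 seconds, well inside the TTL of 300. The figure includes both any clock offset between the nodes and the time spent at the ssh password prompt, so it is only a rough indicator.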