[slurm-users] SLURMDBD fails trying to talk to MariaDB - Help debugging configuration

Aravindh Sampathkumar

Oct 11, 2018, 4:59:57 PM
to slurm...@lists.schedmd.com
Hello.

I'm trying to set up a SLURM cluster in a virtual environment before actually deploying it for serious work. I hit a snag where slurmdbd fails soon after starting because of trouble connecting to MariaDB.

SlurmDBD service status:

[root@slmaster ~]# systemctl status slurmdbd
slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/etc/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Thu 2018-10-11 20:34:42 UTC; 14min ago
  Process: 1406 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)

Oct 11 20:33:11 slmaster systemd[1]: Starting Slurm DBD accounting daemon...
Oct 11 20:33:11 slmaster systemd[1]: PID file /var/run/slurmdbd.pid not readable (yet?) after start.
Oct 11 20:34:42 slmaster systemd[1]: slurmdbd.service start operation timed out. Terminating.
Oct 11 20:34:42 slmaster systemd[1]: Failed to start Slurm DBD accounting daemon.
Oct 11 20:34:42 slmaster systemd[1]: Unit slurmdbd.service entered failed state.
Oct 11 20:34:42 slmaster systemd[1]: slurmdbd.service failed.

MariaDB running just fine:

[root@slmaster ~]# systemctl status mariadb
mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-10-11 20:33:11 UTC; 18min ago
  Process: 991 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
  Process: 943 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
 Main PID: 989 (mysqld_safe)
   CGroup: /system.slice/mariadb.service
           ├─ 989 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
           └─1265 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/maria...

Oct 11 20:33:07 slmaster systemd[1]: Starting MariaDB database server...
Oct 11 20:33:07 slmaster mariadb-prepare-db-dir[943]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Oct 11 20:33:08 slmaster mysqld_safe[989]: 181011 20:33:08 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Oct 11 20:33:09 slmaster mysqld_safe[989]: 181011 20:33:09 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Oct 11 20:33:11 slmaster systemd[1]: Started MariaDB database server.

Logfile from Slurmdbd:

[2018-10-11T20:33:11.648] debug:  Log file re-opened
[2018-10-11T20:33:11.720] debug:  Munge authentication plugin loaded
[2018-10-11T20:33:11.749] debug2: mysql_connect() called for db slurm_acct_db
[2018-10-11T20:33:11.785] debug2: innodb_buffer_pool_size: 629145600
[2018-10-11T20:33:11.785] debug2: innodb_log_file_size: 67108864
[2018-10-11T20:33:11.786] debug2: innodb_lock_wait_timeout: 900
[2018-10-11T20:33:11.956] Accounting storage MYSQL plugin loaded
[2018-10-11T20:33:11.958] debug2: ArchiveDir        = /tmp
[2018-10-11T20:33:11.958] debug2: ArchiveScript     = (null)
[2018-10-11T20:33:11.958] debug2: AuthInfo          = (null)
[2018-10-11T20:33:11.958] debug2: AuthType          = auth/munge
[2018-10-11T20:33:11.958] debug2: CommitDelay       = 0
[2018-10-11T20:33:11.958] debug2: DbdAddr           = slmaster
[2018-10-11T20:33:11.958] debug2: DbdBackupHost     = (null)
[2018-10-11T20:33:11.958] debug2: DbdHost           = slmaster
[2018-10-11T20:33:11.958] debug2: DbdPort           = 6819
[2018-10-11T20:33:11.958] debug2: DebugFlags        = (null)
[2018-10-11T20:33:11.958] debug2: DebugLevel        = 6
[2018-10-11T20:33:11.958] debug2: DebugLevelSyslog  = 10
[2018-10-11T20:33:11.958] debug2: DefaultQOS        = (null)
[2018-10-11T20:33:11.958] debug2: LogFile           = /var/log/slurm/slurmdbd.log
[2018-10-11T20:33:11.958] debug2: MessageTimeout    = 10
[2018-10-11T20:33:11.958] debug2: Parameters        = (null)
[2018-10-11T20:33:11.958] debug2: PidFile           = /var/spool/slurm/slurmdbd.pid
[2018-10-11T20:33:11.958] debug2: PluginDir         = /usr/lib64/slurm
[2018-10-11T20:33:11.958] debug2: PrivateData       = none
[2018-10-11T20:33:11.958] debug2: PurgeEventAfter   = NONE
[2018-10-11T20:33:11.958] debug2: PurgeJobAfter     = NONE
[2018-10-11T20:33:11.958] debug2: PurgeResvAfter    = NONE
[2018-10-11T20:33:11.958] debug2: PurgeStepAfter    = NONE
[2018-10-11T20:33:11.958] debug2: PurgeSuspendAfter = NONE
[2018-10-11T20:33:11.958] debug2: PurgeTXNAfter = NONE
[2018-10-11T20:33:11.958] debug2: PurgeUsageAfter = NONE
[2018-10-11T20:33:11.958] debug2: SlurmUser         = slurm(982)
[2018-10-11T20:33:11.958] debug2: StorageBackupHost = (null)
[2018-10-11T20:33:11.958] debug2: StorageHost       = localhost
[2018-10-11T20:33:11.958] debug2: StorageLoc        = slurm_acct_db
[2018-10-11T20:33:11.958] debug2: StoragePort       = 3306
[2018-10-11T20:33:11.958] debug2: StorageType       = accounting_storage/mysql
[2018-10-11T20:33:11.958] debug2: StorageUser       = slurm
[2018-10-11T20:33:11.958] debug2: TCPTimeout        = 2
[2018-10-11T20:33:11.958] debug2: TrackWCKey        = 0
[2018-10-11T20:33:11.958] debug2: TrackSlurmctldDown= 0
[2018-10-11T20:33:11.958] debug2: acct_storage_p_get_connection: request new connection 1
[2018-10-11T20:33:11.974] slurmdbd version 18.08.1 started
[2018-10-11T20:33:11.986] debug2: running rollup at Thu Oct 11 20:33:11 2018
[2018-10-11T20:33:11.986] debug2: Everything rolled up
[2018-10-11T20:34:42.968] Terminate signal (SIGINT or SIGTERM) received
[2018-10-11T20:34:42.969] debug:  rpc_mgr shutting down

sacctmgr says:

[root@slmaster ~]# sacctmgr -vvvv
sacctmgr: Accounting storage SLURMDBD plugin loaded
sacctmgr: debug2: slurm_connect failed: Connection refused
sacctmgr: debug2: Error connecting slurm stream socket at 127.0.0.1:6819: Connection refused
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to slmaster:6819: Connection refused
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused
sacctmgr: error: Problem talking to the database: Connection refused

I am able to connect to MariaDB locally using the above settings...

[root@slmaster ~]# mysql -u slurm -h localhost -P 3306 -p
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 7
Server version: 5.5.60-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> 

The config files slurm.conf and slurmdbd.conf are attached.

The MariaDB configuration was not changed:

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under a different user or group,
# customize your systemd unit file for mariadb according to the
# instructions in http://fedoraproject.org/wiki/Systemd

[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid

#
# include all files from the config directory
#
!includedir /etc/my.cnf.d

I'd appreciate any help troubleshooting the "Connection refused" error.

Thanks,
--
  Aravindh Sampathkumar


slurm.conf
slurmdbd.conf

Lachlan Musicman

Oct 11, 2018, 6:55:26 PM
to Slurm User Community List
1. After systemctl restart slurmdbd, what does journalctl -xe say?
2. Your email is very hard to read because it was posted in HTML, with terminal colours etc. Could you send the next email in plain text, please?

Cheers
L.
--
------
'...postwork futures are dismissed with the claim that "it is not in our nature to be idle", thereby demonstrating at once an essentialist view of labor and an impoverished imagination of the possibilities of nonwork.'

Chris Samuel

Oct 11, 2018, 6:59:55 PM
to slurm...@lists.schedmd.com
On 12/10/18 07:58, Aravindh Sampathkumar wrote:

> I'm trying to set up a SLURM cluster in a virtual environment before
> actually deploying it for serious work. I hit a snag where Slurmdbd
> fails soon after starting because of trouble connecting to MariaDB.

I don't see any errors there, just that systemd is killing slurmdbd for
some reason.

What happens if you run slurmdbd by hand as root? Like this:

slurmdbd -D -vvvv

That should run it in the foreground and output debug info to the screen.

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Aravindh Sampathkumar

Oct 12, 2018, 1:58:47 AM
to slurm...@lists.schedmd.com
@Chris and @Lachlan,
Thanks for your responses.

I resolved the issue based on a hint from Jeffrey in an earlier email. I had changed the location of the PID files in the Slurm config files, but forgot to change them in the systemd service definition files.
Making both point at the same PID files did the trick.
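
For anyone who hits the same timeout: in the output above, systemd was waiting for /var/run/slurmdbd.pid while slurmdbd reported PidFile = /var/spool/slurm/slurmdbd.pid. Below is a rough sketch of a check that compares the two settings. The helper names conf_value and check_pidfiles are made up for illustration, and the default slurmdbd.conf path is an assumption, so pass your own paths if they differ.

```shell
#!/bin/sh
# Sketch: compare the PID file slurmdbd writes with the one systemd waits for.
# conf_value and check_pidfiles are hypothetical helper names.

# Print the value of the first "Key = Value" line for the given key.
# Keys are matched case-insensitively; comment lines are skipped.
conf_value() {
    awk -F= -v key="$1" '
        /^[ \t]*#/ { next }
        index($0, "=") == 0 { next }
        {
            k = $1
            gsub(/[ \t]/, "", k)
            if (tolower(k) == tolower(key)) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                print v
                exit
            }
        }' "$2"
}

# Usage: check_pidfiles [slurmdbd.conf] [slurmdbd.service]
check_pidfiles() {
    # Assumed default location of slurmdbd.conf; adjust for your install.
    conf=${1:-/etc/slurm/slurmdbd.conf}
    # Unit file path as shown by "systemctl status slurmdbd" above.
    unit=${2:-/etc/systemd/system/slurmdbd.service}

    daemon_pid=$(conf_value PidFile "$conf")
    systemd_pid=$(conf_value PIDFile "$unit")

    if [ "$daemon_pid" = "$systemd_pid" ]; then
        echo "OK: both use $daemon_pid"
    else
        echo "MISMATCH: slurmdbd writes $daemon_pid, systemd waits for $systemd_pid"
        return 1
    fi
}
```

Running check_pidfiles with no arguments uses the default paths. After editing the unit file, systemctl daemon-reload followed by systemctl restart slurmdbd makes systemd pick up the change.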

@Lachlan,
I understand the frustration with the HTML email. I'll use plaintext from now on whenever console content is involved.

Thanks!

--
Aravindh Sampathkumar
arav...@fastmail.com

Lachlan Musicman

Oct 14, 2018, 6:15:42 PM
to Slurm User Community List
On Fri, 12 Oct 2018 at 17:02, Aravindh Sampathkumar <arav...@fastmail.com> wrote:
> @Chris and @Lachlan,
> Thanks for your responses.
>
> I resolved the issue based on a hint from Jeffrey in an earlier email. I had changed the location of the PID files in the Slurm config files, but forgot to change them in the systemd service definition files.
> Making both point at the same PID files did the trick.


Great!