[slurm-users] Slurm does not start after (stupid) upgrade from 16.05.9 to 20.11.7

Julien Tailleur

Aug 25, 2021, 4:49:27 AM
to slurm...@lists.schedmd.com
Dear all,

We have been running a computing cluster with Slurm since 2016; I
installed it back then with some help from others. I was quite late on
upgrades and decided to upgrade the cluster from Debian Stretch, which
ships Slurm 16.05.9, to Debian Bullseye, which ships Slurm 20.11.7.

While the system upgrade itself went smoothly, Slurm is broken.
Of course, that's the stage at which I thought "Oh, I should have
checked if the upgrade is supposed to be harmless"... Now that the
self-bashing is rightfully done, I would be very happy with some help! I
am hesitating between two strategies: removing Slurm completely and
doing a fresh installation, or trying to save what can be saved... I am
tempted by the former since I remember suffering a bit to get the
installation right in the first place...

Munge still works fine, but when I run

slurmctld -Dvvvvv -c

everything goes smoothly until:

[...]
slurmctld: accounting_storage/slurmdbd: init: Accounting storage
SLURMDBD plugin loaded
slurmctld: debug3: Success.
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
127.0.1.1:6819: Connection refused
slurmctld: error: slurm_persist_conn_open_without_init: failed to open
persistent connection to host:kandinsky:6819: Connection refused
slurmctld: error: Sending PersistInit msg: Connection refused
slurmctld: accounting_storage/slurmdbd: _load_dbd_state: recovered 0
pending RPCs
slurmctld: accounting_storage/slurmdbd:
clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817
with slurmdbd
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
127.0.1.1:6819: Connection refused
slurmctld: error: Sending PersistInit msg: Connection refused
slurmctld: debug:  Association database appears down, reading from state
file.
slurmctld: debug:  create_mmap_buf: Failed to open file
`/var/spool/slurm.state/last_tres`, No such file or directory
slurmctld: debug2: No last_tres file (/var/spool/slurm.state/last_tres)
to recover
slurmctld: debug:  create_mmap_buf: Failed to open file
`/var/spool/slurm.state/assoc_mgr_state`, No such file or directory
slurmctld: debug2: No association state file
(/var/spool/slurm.state/assoc_mgr_state) to recover
slurmctld: fatal: You are running with a database but for some reason we
have no TRES from it.  This should only happen if the database is down
and you don't have any state files.
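
The repeated "Connection refused" for 127.0.1.1:6819 in the log above
suggests that nothing is listening on that port. A minimal Python sketch
to check this without involving Slurm; the helper name port_is_open is
hypothetical and not part of Slurm:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # timeout) when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call, using the address slurmctld tried in the log above:
#   port_is_open("127.0.1.1", 6819)
```

A False result only says that no process accepts connections there; it
does not say why the daemon is not running.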

6819 is the port on which slurmdbd is supposed to be listening, so I tried:

slurmdbd -Dvvvvv

which yields

slurmdbd: debug:  Log file re-opened
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: debug3: Trying to load plugin
/usr/lib/x86_64-linux-gnu/slurm-wlm/auth_munge.so
slurmdbd: debug:  auth/munge: init: Munge authentication plugin loaded
slurmdbd: debug3: Success.
slurmdbd: debug3: Trying to load plugin
/usr/lib/x86_64-linux-gnu/slurm-wlm/accounting_storage_mysql.so
slurmdbd: debug2: accounting_storage/as_mysql: init: mysql_connect()
called for db slurm_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane:
MySQL server version is: 10.5.11-MariaDB-1
slurmdbd: debug2: accounting_storage/as_mysql:
_check_database_variables: innodb_buffer_pool_size: 134217728
slurmdbd: debug2: accounting_storage/as_mysql:
_check_database_variables: innodb_log_file_size: 100663296
slurmdbd: debug2: accounting_storage/as_mysql:
_check_database_variables: innodb_lock_wait_timeout: 50
slurmdbd: error: Database settings not recommended values:
innodb_buffer_pool_size innodb_lock_wait_timeout
slurmdbd: debug4: accounting_storage/as_mysql: _set_db_curr_ver:
0(as_mysql_convert.c:128) query
select version from convert_version_table
slurmdbd: debug4: accounting_storage/as_mysql:
as_mysql_convert_tables_pre_create: as_mysql_convert_tables_pre_create:
No conversion needed, Horray!
slurmdbd: debug4: accounting_storage/as_mysql:
as_mysql_convert_tables_post_create:
as_mysql_convert_tables_post_create: No conversion needed, Horray!
slurmdbd: debug4: accounting_storage/as_mysql:
as_mysql_convert_non_cluster_tables_post_create:
as_mysql_convert_non_cluster_tables_post_create: No conversion needed,
Horray!
slurmdbd: error: mysql_query failed: 1558 Column count of mysql.proc is
wrong. Expected 21, found 20. Created with MariaDB 100126, now running
100511. Please use mariadb-upgrade to fix this error
drop procedure if exists get_parent_limits; create procedure
get_parent_limits(my_table text, acct text, cluster text, without_limits
int) begin set @par_id = NULL; set @mj = NULL; set @mja = NULL; set @mpt
= NULL; set @msj = NULL; set @mwpj = NULL; set @mtpj = ''; set @mtpn =
''; set @mtmpj = ''; set @mtrm = ''; set @prio = NULL; set @def_qos_id =
NULL; set @qos = ''; set @delta_qos = ''; set @my_acct = acct; if
without_limits then set @mj = 0; set @msj = 0; set @mwpj = 0; set @prio
= 0; set @def_qos_id = 0; set @qos = 1; end if; REPEAT set @s = 'select
'; if @par_id is NULL then set @s = CONCAT(@s, '@par_id := id_assoc, ');
end if; if @mj is NULL then set @s = CONCAT(@s, '@mj := max_jobs, ');
end if; if @mja is NULL then set @s = CONCAT(@s, '@mja :=
max_jobs_accrue, '); end if; if @mpt is NULL then set @s = CONCAT(@s,
'@mpt := min_prio_thresh, '); end if; if @msj is NULL then set @s =
CONCAT(@s, '@msj := max_submit_jobs, '); end if; if @mwpj is NULL then
set @s = CONCAT(@s, '@mwpj := max_wall_pj, '); end if; if @prio is NULL
then set @s = CONCAT(@s, '@prio := priority, '); end if; if @def_qos_id
is NULL then set @s = CONCAT(@s, '@def_qos_id := def_qos_id, '); end if;
if @qos = '' then set @s = CONCAT(@s, '@qos := qos, @delta_qos :=
REPLACE(CONCAT(delta_qos, @delta_qos), \',,\', \',\'), '); end if; set
@s = concat(@s, '@mtpj := CONCAT(@mtpj, if (@mtpj != \'\' && max_tres_pj
!= \'\', \',\', \'\'), max_tres_pj), @mtpn := CONCAT(@mtpn, if (@mtpn !=
\'\' && max_tres_pn != \'\', \',\', \'\'), max_tres_pn), @mtmpj :=
CONCAT(@mtmpj, if (@mtmpj != \'\' && max_tres_mins_pj != \'\', \',\',
\'\'), max_tres_mins_pj), @mtrm := CONCAT(@mtrm, if (@mtrm != \'\' &&
max_tres_run_mins != \'\', \',\', \'\'), max_tres_run_mins),
@my_acct_new := parent_acct from "', cluster, '_', my_table, '" where
acct = \'', @my_acct, '\' && user=\'\''); prepare query from @s; execute
query; deallocate prepare query; set @my_acct = @my_acct_new; UNTIL
without_limits || @my_acct = '' END REPEAT; END;
slurmdbd: error: mysql_query failed: 1558 Column count of mysql.proc is
wrong. Expected 21, found 20. Created with MariaDB 100126, now running
100511. Please use mariadb-upgrade to fix this error
drop procedure if exists get_coord_qos; create procedure
get_coord_qos(my_table text, acct text, cluster text, coord text) begin
set @qos = ''; set @delta_qos = ''; set @found_coord = NULL; set
@my_acct = acct; REPEAT set @s = 'select @qos := t1.qos, @delta_qos :=
REPLACE(CONCAT(t1.delta_qos, @delta_qos), \',,\', \',\'), @my_acct_new
:= parent_acct, @found_coord_curr := t2.user '; set @s = concat(@s,
'from "', cluster, '_', my_table, '" as t1 left outer join
acct_coord_table as t2 on t1.acct=t2.acct where t1.acct = @my_acct &&
t1.user=\'\' && (t2.user=\'', coord, '\' || t2.user is null)'); prepare
query from @s; execute query; deallocate prepare query; if
@found_coord_curr is not NULL then set @found_coord = @found_coord_curr;
end if; if @found_coord is NULL then set @qos = ''; set @delta_qos = '';
end if; set @my_acct = @my_acct_new; UNTIL @qos != '' || @my_acct = ''
END REPEAT; select REPLACE(CONCAT(@qos, @delta_qos), ',,', ','); END;
slurmdbd: accounting_storage/as_mysql: init: Accounting storage MYSQL
plugin failed
slurmdbd: error: Couldn't load specified plugin name for
accounting_storage/mysql: Plugin init() callback failed
slurmdbd: error: cannot create accounting_storage context for
accounting_storage/mysql
slurmdbd: fatal: Unable to initialize accounting_storage/mysql
accounting storage plugin

It thus seems that the database format is wrong. I do not care about
previous logs, so I would be happy to erase the existing tables and
create new ones, if possible, but I do not know how to do that :-)
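
The error 1558 in the log above refers to mysql.proc, a MariaDB system
table, and the two integers in it encode server versions. A small Python
sketch to decode them, assuming the usual major*10000 + minor*100 +
patch encoding; the function names are hypothetical:

```python
import re


def decode_mariadb_version(n: int) -> str:
    """Turn an integer such as 100511 into a dotted version string."""
    major, rest = divmod(n, 10000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def versions_from_error(message: str):
    """Extract (created_with, now_running) from an error 1558 message.

    Returns None if the message does not contain the expected text.
    """
    m = re.search(r"Created with MariaDB (\d+), now running (\d+)", message)
    if m is None:
        return None
    return tuple(decode_mariadb_version(int(g)) for g in m.groups())
```

Applied to the message in the log, this gives 10.1.26 and 10.5.11, the
latter matching the "10.5.11-MariaDB-1" server version reported earlier
by slurmdbd.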

I tried running

mariadb-upgrade

but got

Version check failed. Got the following error when calling the 'mysql'
command line client
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using
password: NO)
FATAL ERROR: Upgrade failed

I have to admit that I do not remember setting a root password, but
that was a long time ago and I was not the only one working on the
cluster... I tried to follow this guide to reset the root password:

https://linuxize.com/post/how-to-reset-a-mysql-root-password/

but this does not seem to work. I would be happy to get some
suggestions!

Best,

Julien Tailleur





Ole Holm Nielsen

Aug 25, 2021, 5:14:08 AM
to slurm...@lists.schedmd.com
On 8/25/21 10:48 AM, Julien Tailleur wrote:
> We have been running a computing cluster using slurm since 2016, that I
> installed back then, with some help from others. I was pretty late on
> upgrades and decided to upgrade the cluster up to debian Bullseye, which
> runs slurm 20.11.7, starting from stretch, that runs slurm 16.05.9.

SchedMD documents that a single upgrade may span at most 2 major
releases, see https://slurm.schedmd.com/quickstart_admin.html#upgrade.
So you would have to go through 16.05 -> 17.02 -> 18.08 -> 20.02 ->
20.11 (soon 21.08 will be out). I do not know whether Debian packages
for these old versions can still be found.
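
The rule can be sketched in Python as follows. The release list is an
assumption added here (it is not given in this thread), and the function
names are hypothetical; the sketch only counts releases per hop and does
not model package availability or database conversion issues:

```python
# Assumed list of Slurm major releases from 16.05 to 20.11, oldest first.
RELEASES = ["16.05", "17.02", "17.11", "18.08", "19.05", "20.02", "20.11"]
MAX_HOP = 2  # an upgrade may span at most 2 major releases


def is_valid_path(path, releases=RELEASES, max_hop=MAX_HOP):
    """Check that every step moves forward by 1..max_hop releases."""
    idx = [releases.index(v) for v in path]
    return all(0 < b - a <= max_hop for a, b in zip(idx, idx[1:]))


def shortest_path(start, target, releases=RELEASES, max_hop=MAX_HOP):
    """Greedy path from start to target using the largest allowed hops."""
    i, end = releases.index(start), releases.index(target)
    if i > end:
        raise ValueError("target is older than start")
    path = [start]
    while i < end:
        i = min(i + max_hop, end)
        path.append(releases[i])
    return path
```

Under these assumptions, is_valid_path accepts the 4-step path above and
rejects a direct 16.05 -> 20.11 jump, while shortest_path("16.05",
"20.11") yields a 3-step alternative through 17.11 and 19.05.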

I have collected some Slurm upgrading information in
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
It's written for CentOS, but the Slurm parts would be the same.

> While the update of the system in itself went smoothly, slurm is broken.
> Of course, that's the stage at which I thought "Oh, I should have checked
> if the upgrade is supposed to be harmless"... Now that the self-bashing
> is rightfully done, I would be very happy with some help! I hesitate
> between two strategies: removing slurm completely and a completely new
> installation, or trying to save what can be saved... I am tempted by the
> former since I remember suffering a bit to get the installation right in
> the first place...

A useable database dump from the old 16.05 is vital! You could start
again with Slurm 16.05 and upgrade in 4 steps as indicated above.

Beware of potential database issues:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#database-upgrade-from-slurm-17-02-and-older

If the 4-step upgrade doesn't work, starting from scratch seems to be the
only option :-( My Slurm Wiki page may perhaps be of a little help:
https://wiki.fysik.dtu.dk/niflheim/SLURM

/Ole

Julien Tailleur

Aug 25, 2021, 6:14:40 AM
to slurm...@lists.schedmd.com
Dear Ole,

thanks for your answer and your useful links. I finally managed to
change the root password of mariadb and successfully ran

mariadb-upgrade

I could then restart everything and the cluster seems to be running
again (!)

I will run more checks and will also ask questions on the debian-hpc
mailing list, because all packages have changed names and I am a bit
lost :-)

Best wishes,

Julien
