[slurm-users] MariaDB lock problems for sacctmgr delete query

Jessica Nettelblad

Feb 13, 2018, 5:45:04 PM
to Slurm User Community List

TL;DR: If you get a lock timeout from the Slurm database, and a longer time limit in InnoDB doesn't help, you might want to consider loosening the auto-increment lock mode in MariaDB.

The long story!

So, we've just upgraded our main cluster to 17.11.3 and moved our database to MariaDB. There have been some glitches, and this one falls into the category where it's not an actual bug, but our experience might still be interesting to anyone who runs sacctmgr delete and finds slurmdbd crashing. After changing the MariaDB configuration it worked again, and I didn't try to reproduce the issue or test it further. But here's what I saw while fixing the problem for us.

THE ERROR

Slurmdbd repeatedly died, with the error message “fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error.” Setting innodb_lock_wait_timeout in my.cnf to a higher value didn’t solve the problem.
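
For reference, the change we tried first looked roughly like this (the value itself is only an example). Also worth noting, as far as I understand the variables: innodb_lock_wait_timeout only governs InnoDB row-lock waits, while metadata lock waits, like the one shown under THE PROBLEM below, are governed by the separate lock_wait_timeout variable, which could be why raising the InnoDB one didn't help.

    # my.cnf, [mysqld] section -- the value is only an example
    innodb_lock_wait_timeout = 900

    -- the same setting can also be changed at runtime:
    SET GLOBAL innodb_lock_wait_timeout = 900;
    SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';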

A single query from a script seemed to be all that was needed to create this lock situation: sacctmgr -i delete account where account=$accountname cluster=$cluster_name

THE PROBLEM

A delete issued by sacctmgr is followed up with an ALTER TABLE in the same transaction: https://github.com/SchedMD/slurm/blob/master/src/plugins/accounting_storage/mysql/accounting_storage_mysql.c

This seems to be problematic with a pretty standard MariaDB configuration on CentOS 7. The query seems to create a lock conflict with itself: “Waiting for table metadata lock | alter table "milou_assoc_table" AUTO_INCREMENT=0”
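
For anyone debugging something similar, the shape of the statements is roughly this (a sketch, not Slurm's literal SQL; the WHERE clause and account name are illustrative):

    BEGIN;
    -- DML: remove the association rows (illustrative WHERE clause)
    DELETE FROM milou_assoc_table WHERE acct = 'someproject';
    -- DDL: resetting the auto-increment counter takes an exclusive
    -- metadata lock, and in MySQL/MariaDB it also commits implicitly
    ALTER TABLE milou_assoc_table AUTO_INCREMENT = 0;
    COMMIT;

The “Waiting for table metadata lock” state above is what you see in SHOW FULL PROCESSLIST while the ALTER TABLE is blocked.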

THE FIX

The Slurm code already postpones the ALTER TABLE call until the end of the transaction, noting that a rollback won’t be possible afterwards. Mixing DDL and DML statements in the same transaction, on the same table, might not be wise.

A quicker solution, which I opted for in the middle of a service stop with our systems down, was to change the MariaDB configuration. Instead of 1, I set innodb_autoinc_lock_mode=2, allowing for looser locks.
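
For completeness, the change in my.cnf (this variable can't be set at runtime, so it needs a server restart):

    # my.cnf, [mysqld] section
    # 0 = "traditional", 1 = "consecutive" (the default), 2 = "interleaved"
    innodb_autoinc_lock_mode = 2

As far as I know, mode 2 is safe with row-based binary logging, but with statement-based replication bulk inserts can get non-deterministic auto-increment values; see the comments below.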

OUR SETUP

We are running Slurm 17.11.3 on a 300-node CentOS 7 cluster with MariaDB 5.5.56-2. We have all old and new users in our LDAP and information on project expiration in a separate external structure. Only active (not expired) projects, and users belonging to at least one such project, are listed in the Slurm database. At regular intervals, expired data is removed using sacctmgr delete, along the lines of the sketch below.
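
A hypothetical sketch of that removal loop (list_expired_projects stands in for our site-specific lookup against the external structure):

    #!/bin/sh
    # hypothetical sketch -- list_expired_projects is a placeholder
    # for whatever produces the expired account names at your site
    cluster_name=milou
    for accountname in $(list_expired_projects); do
        sacctmgr -i delete account where account=$accountname cluster=$cluster_name
    done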

SOME COMMENTS

Since we moved the database to MariaDB and upgraded to 17.11 at the same time, I don’t know how MariaDB behaved with previous Slurm versions.

We got this issue with delete, and changing this configuration fixed it. There might be problems with other queries too.

Changing to a looser lock mode might introduce new issues, especially depending on what backup and recovery solutions you have planned for your database. I set innodb_autoinc_lock_mode=2, but it is possible that the “traditional” value of 0 will also work.
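
To see which mode a running server actually uses (it can only be changed in the configuration, followed by a restart):

    SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode';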

That’s it! It would be interesting to hear if someone else has encountered this problem and how you solved it.

Best regards,

Jessica Nettelblad, UPPMAX


Bjørn-Helge Mevik

Feb 14, 2018, 9:32:00 AM
to slurm...@schedmd.com
Thanks for the heads-up! We're currently running 17.02.7 with MariaDB,
and haven't seen this problem, but we are going to upgrade to 17.11 in
the not-too-far future.

--
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

Jessica Nettelblad

Feb 14, 2018, 3:33:37 PM
to Slurm User Community List, slurm...@schedmd.com
FYI - SchedMD has now solved the issue in the master branch.

https://github.com/SchedMD/slurm/commit/4a16541bf0e005e1984afd4201b97df482e269ee#diff-7649dde209b4e528e3ba8bb090b19f63

Best regards,
Jessica Nettelblad, UPPMAX

Ole Holm Nielsen

Feb 16, 2018, 6:24:25 AM
to slurm...@lists.schedmd.com
We're planning to upgrade Slurm 17.02 to 17.11 soon, so it's important
for us to test the slurmdbd and database upgrade before doing the actual
upgrade.

I've made a *successful* dry run of the database migration from 17.02
to 17.11 on an offlined compute node running CentOS 7.4.

The dry run procedure has been documented in my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#make-a-dry-run-database-upgrade
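
In rough outline (the Wiki page has the exact procedure; slurm_acct_db is the Slurm default database name, adjust to your StorageLoc):

    # on the production database server: dump the accounting database
    mysqldump -u root -p slurm_acct_db > slurm_acct_db.sql
    # on the offlined test node: create and load a copy of the database
    mysql -u root -p -e 'CREATE DATABASE slurm_acct_db;'
    mysql -u root -p slurm_acct_db < slurm_acct_db.sql
    # then start the new (17.11) slurmdbd in the foreground and watch
    # the schema migration run
    slurmdbd -D -vvv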

I tried Jessica's problem command, and it didn't cause any errors in my
case:

# sacctmgr -i delete account where account=XXX cluster=niflheim

At least it appears that we don't get hurt by Slurm Bug 4785, referred
to in the commit mentioned below.

Question: Is it safer to wait for 17.11.4 where the issue will
presumably be solved?

On 02/14/2018 09:32 PM, Jessica Nettelblad wrote:
> FYI - SchedMD has now solved the issue in the master branch.
>
> https://github.com/SchedMD/slurm/commit/4a16541bf0e005e1984afd4201b97df482e269ee#diff-7649dde209b4e528e3ba8bb090b19f63
>
> Best regards,
> Jessica Nettelblad, UPPMAX
>
>
> On Wed, Feb 14, 2018 at 3:31 PM, Bjørn-Helge Mevik
> <b.h....@usit.uio.no> wrote:
>
> Thanks for the heads-up!  We're currently running 17.02.7 with MariaDB,
> and haven't seen this problem, but we are going to upgrade to 17.11 in
> the not-too-far future.

/Ole

Christopher Samuel

Feb 16, 2018, 7:44:50 AM
to slurm...@lists.schedmd.com
Hi Ole,

On 16/02/18 22:23, Ole Holm Nielsen wrote:

> Question: Is it safer to wait for 17.11.4 where the issue will
> presumably be solved?

I don't think the commit has been backported to 17.11.x to date.

It's in master (for 18.08) here:

commit 4a16541bf0e005e1984afd4201b97df482e269ee
Author: Tim Wickberg <t...@schedmd.com>
Date: Tue Feb 13 21:34:33 2018 -0700

Remove AUTO_INCREMENT resets when removing an association from the
database.

These calls are apparently quite expensive on larger databases,
and nothing requires that the association id number not have any
skipped values. So stop doing this.

Performance problems first reported by Jessica Nettelblad.

Bug 4785.


All the best,
Chris
