[slurm-dev] Fwd: Slurmdbd tables and Galera (clustered MariaDB)

Christopher Samuel

Sep 17, 2013, 2:53:52 AM
to slurm-dev, Brett Pemberton

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi folks,

I've had the following query from Brett, our sysadmin who currently is
in the process of migrating us off MySQL onto Galera (clustered MySQL
or derivative, in our case MariaDB as that seems to be what the
distros are replacing MySQL with). Forwarded with his permission.

However, Galera needs primary keys defined on all tables, and this
is where Slurm has an issue with two tables in particular. Below are
his queries about working around this. Can we get some feedback on
these, please?

Thanks!
Chris

- -------- Original Message --------
Subject: Slurmdbd tables
Date: Tue, 17 Sep 2013 09:11:50 +1000
From: Brett Pemberton <b...@unimelb.edu.au>
To: Christopher Samuel <sam...@unimelb.edu.au>

Chris,

The situation:

We need all tables to have primary keys defined (for galera
replication). However slurm has two tables per cluster that don't have
primary keys:

mysql> describe avoca_last_ran_table;
+----------------+------------------+------+-----+---------+-------+
| Field          | Type             | Null | Key | Default | Extra |
+----------------+------------------+------+-----+---------+-------+
| hourly_rollup  | int(10) unsigned | NO   |     | 0       |       |
| daily_rollup   | int(10) unsigned | NO   |     | 0       |       |
| monthly_rollup | int(10) unsigned | NO   |     | 0       |       |
+----------------+------------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> describe avoca_suspend_table;
+------------+------------------+------+-----+---------+-------+
| Field      | Type             | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+-------+
| job_db_inx | int(11)          | NO   |     | NULL    |       |
| id_assoc   | int(11)          | NO   |     | NULL    |       |
| time_start | int(10) unsigned | NO   |     | 0       |       |
| time_end   | int(10) unsigned | NO   |     | 0       |       |
+------------+------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
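
For reference, one way to find every table that lacks a primary key is
to ask information_schema directly. This is only a sketch: the schema
name slurm_acct_db is an assumption (it is the usual slurmdbd default,
but the thread does not say what this site uses), so substitute your own.

```sql
-- Sketch: list base tables in one schema that have no PRIMARY KEY constraint.
-- 'slurm_acct_db' is an assumed schema name; replace it with your own.
SELECT t.table_schema, t.table_name
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS c
       ON c.table_schema = t.table_schema
      AND c.table_name = t.table_name
      AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'slurm_acct_db'
  AND t.table_type = 'BASE TABLE'
  AND c.constraint_name IS NULL;
```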

I'm not familiar with the contents of the 'suspend' table, as ours are
all empty, but it might be possible to make job_db_inx a primary key.
If not, can an auto_increment id field be added and made a primary key?
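
Whether job_db_inx can serve as a primary key depends on it being unique
per row. A quick check, sketched here against the avoca_suspend_table
shown above, is to look for repeated values: any row returned means a
key on job_db_inx alone would be rejected. On an empty table the query
returns nothing, so it says little about what slurmdbd may insert later
(for instance if one job were suspended more than once, which is an
assumption about Slurm's behaviour and not something stated here).

```sql
-- Sketch: report job_db_inx values that occur more than once.
-- An empty result means the existing rows would not block a primary key.
SELECT job_db_inx, COUNT(*) AS n
FROM avoca_suspend_table
GROUP BY job_db_inx
HAVING COUNT(*) > 1;
```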

The 'last_ran' table only ever has one entry in it: the times the
rollups were last run, funnily enough.

So again, hourly_rollup could be made a primary key, or they could add
an 'id' field which will have the value of 1.

Is it possible for you to bring this up and get some feedback as to
whether this will be possible?

cheers,

/ Brett

- --
Brett Pemberton - Systems Administrator - VLSCI


- --
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlI3/C8ACgkQO2KABBYQAh9UKACfZkMzD5JFKKH4mtfbOUZ/3LxF
0ekAnROTEC0eujdva6ZbgC9xVqZVYbtB
=0u+q
-----END PGP SIGNATURE-----

Olaf Gellert

Jan 16, 2014, 5:27:56 AM
to slurm-dev, Brett Pemberton

Hi there,

as there was no answer on the list to this question: Well,
I would definitely like to use MariaDB Galera for slurm.
Has anyone successfully tried that? Has anyone addressed the
issues raised by Brett?

To me it seems that they are showstoppers for Galera
(though maybe easily fixed by the slurm developers).
What would be the preferred database solution to
prevent a single point of failure running slurm?
MySQL cluster?

Regards, Olaf
--
Dipl. Inform. Olaf Gellert email gel...@dkrz.de
Deutsches Klimarechenzentrum GmbH phone +49 (0)40 460094 214
Bundesstrasse 45a fax +49 (0)40 460094 270
D-20146 Hamburg, Germany www http://www.dkrz.de

Sitz der Gesellschaft: Hamburg
Geschäftsführer: Prof. Dr. Thomas Ludwig
Registergericht: Amtsgericht Hamburg, HRB 39784

r...@q-leap.de

Jan 16, 2014, 1:05:00 PM
to slurm-dev, Olaf Gellert

>>>>> "Olaf" == Olaf Gellert <gel...@dkrz.de> writes:

Hi Olaf,

Olaf> as there was no answer on the list to this question: Well, I
Olaf> would definitely like to use MariaDB Galera for slurm. Has
Olaf> anyone successfully tried that? Has anyone addressed the
Olaf> issues raised by Brett?

Olaf> To me it seems that they are showstoppers for Galera (though
Olaf> maybe easily fixed by the slurm developers). What would be
Olaf> the preferred database solution to prevent a single point of
Olaf> failure running slurm? MySQL cluster?

in my opinion, using a replicated mysql is real overkill for a database
just serving SLURM. In Qlustar HA (pacemaker/corosync) installations,
we're using a simple failover mysql (actually mariadb) setup, where
pacemaker is responsible for starting mysql and slurm/slurmdbd in what
pacemaker calls a resource group. Since the SLURM generated DB load is
typically not very heavy, in most cases it's enough to put
the DB on a DRBD replicated local software RAID (of two head-nodes) that
is added as a shared storage in the pacemaker resource group. Such a
setup has no SPOF (single point of failure).

Hope this helps,

Roland

----
Roland Fehrenbacher, PhD
Founder/CEO
Q-Leap Networks GmbH
Tel. : +49(0)7034/277620
EMail: r...@q-leap.com
http://www.q-leap.com / http://qlustar.com

Christopher Samuel

Jan 16, 2014, 7:16:51 PM
to slurm-dev

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 16/01/14 21:27, Olaf Gellert wrote:

> as there was no answer on the list to this question: Well, I would
> definitely like to use MariaDB Galera for slurm. Has anyone
> successfully tried that? Has anyone addressed the issues raised by
> Brett?

Oops - Danny and I had a discussion at the Slurm User Group about
this last September, and what Brett did was to add indexes to the two
tables (you can't add extra columns as slurmdbd will delete them).

So for *_suspend_table he added job_db_inx as the primary key and for
*_last_ran_table he added hourly_rollup as the primary key.
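
The exact statements Brett ran are not given in this thread, but the
change described would look roughly like this in MySQL/MariaDB, using
the avoca cluster's table names from the original message. Treat it as
a sketch: ADD PRIMARY KEY fails if the column already holds duplicate
values, and nothing here says how slurmdbd treats the added keys on a
later upgrade.

```sql
-- Sketch only: table names assume the cluster is called "avoca".
-- Each statement is rejected if the chosen column contains duplicates.
ALTER TABLE avoca_suspend_table  ADD PRIMARY KEY (job_db_inx);
ALTER TABLE avoca_last_ran_table ADD PRIMARY KEY (hourly_rollup);
```

Running "describe" on each table afterwards should show PRI in the Key
column for the chosen field.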

Unfortunately he then left academia for a job in the Real World(TM)
and so we're still in the situation where one of the Galera cluster
nodes is a traditional secondary slave to our old production MySQL
primary server so we're not using Galera in anger (yet).

Some of our folks are off to a Galera training course soon, so
hopefully we can finish the transition soon.

All the best,
Chris
- --
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlLYdicACgkQO2KABBYQAh9GhACfUEqQGlm9L+hpYKbjauos4Tgk
7B8An3j6e2bufYIsjQRPzCJQFSoGQAft
=3Q2c
-----END PGP SIGNATURE-----

Olaf Gellert

Jan 17, 2014, 8:17:51 AM
to slurm-dev

Hi Roland,

r...@q-leap.de wrote:
> in my opinion, using a replicated mysql is real overkill for a database
> just serving SLURM. In Qlustar HA (pacemaker/corosync) installations,
> we're using a simple failover mysql (actually mariadb) setup, where
> pacemaker is responsible for starting mysql and slurm/slurmdbd in what
> pacemaker calls a resource group. Since the SLURM generated DB load is
> typically not very heavy, in most cases it's enough to put
> the DB on a DRBD replicated local software RAID (of two head-nodes) that
> is added as a shared storage in the pacemaker resource group. Such a
> setup has no SPOF (single point of failure).

well, I wonder how well this really works. drbd replicates the
filesystem (on disk block level). I would expect that when the
active mysql server dies (imagine a simple power failure), the
database never gets synced (neither on disk block level nor
on database transaction level) and would be left in a, well,
let's say, cluttered state. Did you test that drbd/mysql setting?

As the load on the slurm database is usually not too heavy,
a simpler (e.g. synchronous) replication mechanism than that of
galera would certainly be sufficient...

Regards, Olaf

--
Dipl. Inform. Olaf Gellert email gel...@dkrz.de
Deutsches Klimarechenzentrum GmbH phone +49 (0)40 460094 214
Bundesstrasse 45a fax +49 (0)40 460094 270
D-20146 Hamburg, Germany www http://www.dkrz.de

Sitz der Gesellschaft: Hamburg
Geschäftsführer: Prof. Dr. Thomas Ludwig
Registergericht: Amtsgericht Hamburg, HRB 39784

r...@q-leap.de

Jan 20, 2014, 10:37:55 AM
to slurm-dev, Olaf Gellert

>>>>> "Olaf" == Olaf Gellert <gel...@dkrz.de> writes:

Olaf> r...@q-leap.de wrote:
>> in my opinion, using a replicated mysql is real overkill for a
>> database just serving SLURM. In Qlustar HA (pacemaker/corosync)
>> installations, we're using a simple failover mysql (actually
>> mariadb) setup, where pacemaker is responsible for starting mysql
>> and slurm/slurmdbd in what pacemaker calls a resource
>> group. Since the SLURM generated DB load is typically not very
>> heavy, in most cases it's enough to put the DB on a DRBD
>> replicated local software RAID (of two head-nodes) that is added
>> as a shared storage in the pacemaker resource group. Such a setup
>> has no SPOF (single point of failure).

Olaf> well, I wonder how well this really works. drbd replicates the
Olaf> filesystem (on disk block level). I would expect that when the
Olaf> active mysql server dies (imagine a simple power failure), the
Olaf> database never gets synced (neither on disk block level nor
Olaf> on database transaction level) and would be left in a, well,
Olaf> let's say, cluttered state. Did you test that drbd/mysql
Olaf> setting?

Provided the correct sync protocol (C) is chosen, DRBD will make sure this
doesn't happen. From the DRBD guide
(http://www.drbd.org/users-guide/s-replication-protocols.html):

"""
Protocol C. Synchronous replication protocol. Local write operations on
the primary node are considered completed only after both the local and
the remote disk write have been confirmed. As a result, loss of a single
node is guaranteed not to lead to any data loss. Data loss is, of
course, inevitable even with this replication protocol if both nodes (or
their storage subsystems) are irreversibly destroyed at the same time.
"""

The worst that should happen is that you lose the last slurmdbd
transaction (if one was uncommitted when the active node crashed).
I don't know whether slurmdbd has a replay log that would make sure
anything not written successfully to the DB is retried.

Olaf> As the load on the slurm database is usually not too heavy, a
Olaf> simpler (e.g. synchronous) replication mechanism than that of
Olaf> galera would certainly be sufficient...

Exactly. We didn't experience any mysql load problems or corruption with
this kind of setup in > 8 years on many clusters.