[slurm-users] SlurmDBD losing connection to the backend MariaDB


Richard Chang

Nov 1, 2022, 12:20:33 AM
to slurm...@lists.schedmd.com
Hi,

Just for my information, I would like to know what happens when SlurmDBD
loses its connection to the backend database, for example MariaDB.

Does it cache the accounting info and keep it until the DB comes back
up, or does it panic and shut down?

Thank you,

RC.


Brian Andrus

Nov 1, 2022, 12:29:09 AM
to slurm...@lists.schedmd.com
It caches up to a point. As I understand it, that is about an hour
(depending on size and how busy the cluster is, as well as available
memory, etc).

Brian Andrus

Greg Wickham

Nov 1, 2022, 1:11:37 AM
to Slurm User Community List

Hi Richard,

Slurmctld caches the updates until slurmdbd comes back online.

You can see how many records are pending for the database by using the “sdiag” command and looking for “DBD Agent queue size”.

If this number grows significantly it means that slurmdbd isn’t available.

   -Greg
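
A minimal sketch of the check Greg describes, assuming `sdiag` is on the PATH and prints a line of the form `DBD Agent queue size: <number>`. The function names are illustrative and not part of Slurm.

```python
import re
import subprocess
from typing import Optional

# Matches only the "DBD Agent queue size" line, not the plain
# "Agent queue size" line that sdiag also prints.
_DBD_QUEUE_RE = re.compile(r"^\s*DBD Agent queue size:\s*(\d+)\s*$", re.MULTILINE)


def dbd_agent_queue_size(sdiag_output: str) -> Optional[int]:
    """Return the DBD Agent queue size from sdiag text, or None if absent."""
    match = _DBD_QUEUE_RE.search(sdiag_output)
    return int(match.group(1)) if match else None


def read_sdiag() -> str:
    """Run sdiag and return its standard output (raises if sdiag fails)."""
    result = subprocess.run(["sdiag"], capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `dbd_agent_queue_size(read_sdiag())` periodically and alerting when the value keeps rising is one way to notice that records are piling up in slurmctld. What counts as "too large" depends on the site, so any threshold is a local choice.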

Ole Holm Nielsen

Nov 1, 2022, 4:44:27 AM
to slurm...@lists.schedmd.com
Hi Brian,

On 11/1/22 05:28, Brian Andrus wrote:
> It caches up to a point. As I understand it, that is about an hour
> (depending on size and how busy the cluster is, as well as available
> memory, etc).

Have you found any documentation of slurmdbd caching? It's well-known
that slurmctld caches information while slurmdbd is down, see for example
page 30 in the talk "Field Notes Mark 2: Random Musings From Under A New
Hat"[1] by Tim Wickberg, SchedMD:

> For slurmdbd, the critical element in the failure domain is
> MySQL, not slurmdbd. slurmdbd itself is stateless.
> ● slurmctld will cache accounting records (up to a limit) if
> slurmdbd is unavailable. This can be hours+ to days+
> depending on your system without data loss.

The statelessness of slurmdbd makes me think that it can't cache any data.

Thanks,
Ole

[1] https://slurm.schedmd.com/publications.html
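
To put rough numbers on "up to a limit": as far as I understand, the queue that slurmctld keeps for slurmdbd is bounded by the `MaxDBDMsgs` parameter in slurm.conf (check the man page for your version). The sketch below is back-of-envelope arithmetic only. The function name and the message rates are illustrative assumptions, not measured values.

```python
def hours_until_queue_full(max_dbd_msgs: int, msgs_per_hour: float, queued: int = 0) -> float:
    """Estimate how long slurmctld can keep queueing accounting records.

    max_dbd_msgs:  the queue limit (assumed to be MaxDBDMsgs from slurm.conf)
    msgs_per_hour: how fast records accumulate while slurmdbd is unreachable
                   (site-specific; estimate it from how quickly the
                   "DBD Agent queue size" value in sdiag grows)
    queued:        records already waiting in the queue
    """
    if msgs_per_hour <= 0:
        return float("inf")  # nothing accumulating, so the queue never fills
    remaining = max(max_dbd_msgs - queued, 0)
    return remaining / msgs_per_hour
```

For example, a limit of 10000 records with 2500 new records per hour gives about four hours of headroom, while a quiet cluster producing 100 per hour would last about four days. That spread is consistent with the "hours+ to days+" wording in the slide.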

Richard Chang

Nov 1, 2022, 5:03:26 AM
to slurm...@lists.schedmd.com

Hello Greg,

I have a two-node setup: node1 is the primary slurmctld and backup slurmdbd; node2 is the primary slurmdbd, backup slurmctld, and MySQL database host.

My concern is: if node2 goes down, the backup slurmdbd will take over, and then what happens?

I have read that slurmctld can cache data, but what about slurmdbd? I am not sure.

I deliberately put slurmdbd + MariaDB on the second node because I didn't want to overload the primary slurmctld.

I hope this gives you all a clear picture of my setup.

Thanks,

RC

Greg Wickham

Nov 1, 2022, 7:10:58 AM
to Slurm User Community List

Hi Richard,

While trying to respond I was looking into the manual pages, and while it does appear that Slurm can support some kind of high availability(*), it doesn’t seem simple.

With multiple slurmctld only one can be active at any time as they share state information. It’s not clear how they know about each other, so this may require STONITH(*).

With slurmdbd, there’s “AccountingStorageHost” and “AccountingStorageBackupHost”; again, it’s not quite clear how these interact.

In slurmdbd.conf there is “StorageBackupHost” with the description:

> . . . . It is up to the backup solution to enforce the coherency of the
> accounting information between the two hosts. With clustered
> database solutions (active/passive HA), you would not need to use
> this feature. Default is none.

On our site we’re running only a simple setup: one VM with slurmctld and another VM with both slurmdbd+mariadbd.

Perhaps others who have dabbled with redundancy can reply.

   -greg

(* I say this trusting the best way to get a response on the Internet is to say something wrong and then wait for the avalanche of corrections).

Brian Andrus

Nov 1, 2022, 4:31:30 PM
to slurm...@lists.schedmd.com
Ole,

Fair enough, it is actually slurmctld that does the caching. Technical
typo on my part there.

Just trying to let the user know that there is a limited window during a
database outage before accounting information starts to be lost.

Brian Andrus

Richard Chang

Nov 1, 2022, 9:50:19 PM
to slurm...@lists.schedmd.com
Does it mean it is best to use a single slurmdbd host in my case?

My primary slurmctld is the backup slurmdbd host, and my worry is: if the
primary slurmdbd host (which is also the MariaDB server) goes down,
will the backup slurmdbd be able to cache data and wait until MariaDB
comes back?

Thanks,

RC

Brian Andrus

Nov 1, 2022, 10:39:31 PM
to slurm...@lists.schedmd.com
RC,

In that scenario, the backup slurmdbd would take over, but then its
database would not necessarily be in sync with the 'main' database
(hence the warnings/info about it in the documentation).

For my setup, I have 2 slurmdbd hosts, but they both connect to the
same, separate, MariaDB server, which is HA. Now, I can take down the
primary slurmdbd system and the other will take over, so I can bring them
up/down as needed for updates, etc.

If your two slurmdbd servers use different databases, you would need a
way to keep them in sync, regardless of which slurmdbd was processing
data. There are many ways to do that, but those designs fall under
MariaDB and not Slurm.

Brian Andrus
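
A hedged sketch of how the layout Brian describes (two slurmdbd hosts sharing one HA MariaDB endpoint) might look in the config files. The hostnames dbd1, dbd2, and mariadb-ha.example.org are hypothetical, and only the parameters relevant to failover are shown; this is not Brian's actual configuration.

```conf
# slurm.conf on the slurmctld host(s): where slurmctld sends accounting records
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd1
AccountingStorageBackupHost=dbd2

# slurmdbd.conf, the same on dbd1 and dbd2
DbdHost=dbd1
DbdBackupHost=dbd2
StorageType=accounting_storage/mysql
# Both daemons point at the single HA database endpoint, so there is
# only one copy of the accounting data to keep consistent.
StorageHost=mariadb-ha.example.org
```

With this shape, StorageBackupHost stays unset, in line with the slurmdbd.conf note quoted earlier in the thread that clustered (active/passive HA) database solutions do not need it.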

Richard Chang

Nov 2, 2022, 10:14:42 AM
to slurm...@lists.schedmd.com
Hello Brian,

Thank you for the reply and for sharing your design. Can you please share
the details of your MariaDB HA setup? (Off-list, by direct message, is fine.)

I would like to understand it so that I can replicate it here.

Thanks & regards,

Richard.