Hi Richard,
Slurmctld caches the updates until slurmdbd comes back online.
You can see how many records are pending for the database by using the “sdiag” command and looking for “DBD Agent queue size”.
If this number grows significantly it means that slurmdbd isn’t available.
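
For anyone who wants to watch that number automatically, here is a minimal Python sketch. It assumes the relevant line of sdiag output looks like "DBD Agent queue size: 0"; the function names and the sample text are made up for illustration, so check them against the output of your own Slurm version.

```python
import re
import subprocess
from typing import Optional


def dbd_agent_queue_size(sdiag_output: str) -> Optional[int]:
    """Return the DBD Agent queue size found in sdiag output, or None if absent."""
    # Assumed line format: "DBD Agent queue size: <number>"
    match = re.search(r"DBD Agent queue size\s*:\s*(\d+)", sdiag_output)
    return int(match.group(1)) if match else None


def read_sdiag() -> str:
    """Run sdiag (must be on PATH on a Slurm host) and return its text output."""
    result = subprocess.run(["sdiag"], capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical excerpt, used only to exercise the parser without a cluster.
SAMPLE_SDIAG = """\
Server thread count:  3
Agent queue size:     0
DBD Agent queue size: 128
"""

sample_size = dbd_agent_queue_size(SAMPLE_SDIAG)  # parses the sample above
```

On a real cluster you would call dbd_agent_queue_size(read_sdiag()) from cron or a monitoring check and alert when the value keeps climbing between runs. What counts as "significant" depends on your job throughput, so any threshold is a site-specific choice.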
-Greg
Hello Greg,
I have a two-node setup: node1 is the primary slurmctld and the backup
slurmdbd, and node2 is the primary slurmdbd, the backup slurmctld, and
the MySQL database host.
My concern is this: if node2 goes down, the backup slurmdbd will take over, but what happens then?
I have read that slurmctld can cache data, but what about
slurmdbd? Not sure.
I intentionally put slurmdbd + mariadb on the second node because I didn't want to overload the primary slurmctld.
I hope this gives you all a picture of my setup.
Thanks,
RC
Hi Richard,
While trying to respond I looked into the manual pages, and while it does appear that Slurm can support some kind of high availability (*), it doesn’t seem simple.
With multiple slurmctld daemons, only one can be active at any time because they share state information. It’s not clear how they know about each other, so this may require STONITH (*).
With slurmdbd, there are “AccountingStorageHost” and “AccountingStorageBackupHost”; again, it’s not quite clear how these interact.
In slurmdbd.conf there is “StorageBackupHost” with the description:
. . . . It is up to the backup solution to enforce the coherency of the
accounting information between the two hosts. With clustered
database solutions (active/passive HA), you would not need to use
this feature. Default is none.
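
To make the roles in a setup like Richard's explicit, here is a small Python sketch that reads slurm.conf text and reports which host is primary and which is backup for each daemon. It relies on my reading of the slurm.conf man page that, when SlurmctldHost is given more than once, the first entry is the primary and later entries are backups. The sample config, the function name ha_roles, and the returned key names are assumptions for illustration only, and the parser handles just one Key=Value pair per line.

```python
from typing import Dict, List, Optional, Union

Roles = Dict[str, Union[Optional[str], List[str]]]


def ha_roles(conf_text: str) -> Roles:
    """Summarize primary/backup hosts from slurm.conf-style text (simplified)."""
    ctld_hosts: List[str] = []
    settings: Dict[str, str] = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip().lower(), value.strip()
        if key == "slurmctldhost":
            # Entries may look like "host(address)"; keep only the host name.
            ctld_hosts.append(value.split("(", 1)[0])
        else:
            settings[key] = value
    return {
        "slurmctld_primary": ctld_hosts[0] if ctld_hosts else None,
        "slurmctld_backups": ctld_hosts[1:],
        "slurmdbd_primary": settings.get("accountingstoragehost"),
        "slurmdbd_backup": settings.get("accountingstoragebackuphost"),
    }


# Hypothetical slurm.conf excerpt mirroring the two-node layout in this thread.
SAMPLE_CONF = """\
SlurmctldHost=node1          # primary slurmctld
SlurmctldHost=node2          # backup slurmctld
AccountingStorageHost=node2  # primary slurmdbd
AccountingStorageBackupHost=node1
"""

roles = ha_roles(SAMPLE_CONF)
```

This only shows where each daemon is expected to run. It says nothing about whether the backup slurmdbd can still reach the database once node2 is down, which is exactly the coherency caveat in the man page text quoted above.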
On our site we’re running only a simple setup. One VM with slurmctld and another VM with both slurmdbd+mariadbd.
Perhaps others who have dabbled with redundancy can reply.
-greg
(* I say this trusting that the best way to get a response on the Internet is to say something wrong and then wait for the avalanche of corrections.)