[slurm-users] Effect of slurmctld and slurmdb going down on running/pending jobs

Amjad Syed

Jun 23, 2021, 6:55:03 PM
to slurm...@lists.schedmd.com
Hello all
We have a cluster running CentOS 7. Our Slurm scheduler runs on a VM, and we are running out of disk space on /var; the Slurm InnoDB data is taking up most of the space. We intend to expand the vdisk for the Slurm server, which will require a reboot for the change to take effect. Do we have to stop users from submitting jobs by draining all partitions before restarting the server (that is, slurmctld, slurmdbd, and mariadb)? Or will restarting the Slurm VM have no effect on running/pending jobs?

Sincerely

Amjad

Barbara Krašovec

Jun 24, 2021, 1:28:24 AM
to slurm...@lists.schedmd.com
Just in case, increase SlurmdTimeout in slurm.conf, so that when the
controller is back you have time to fix any communication issues
between slurmd and slurmctld, if there are any.
Otherwise it should not affect running and pending jobs. Stop the
controller first, then slurmdbd. When the disk arrangements are done,
start slurmdbd first and then slurmctld.
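The ordering above can be sketched roughly as follows (a sketch only: it assumes the daemons run under systemd with the standard unit names, and the timeout value is just an example):

```shell
# 1. In slurm.conf, temporarily raise SlurmdTimeout so compute nodes are
#    not marked DOWN while the controller is offline, e.g.:
#      SlurmdTimeout=600
#    then push the change with: scontrol reconfigure

# 2. Stop the controller first, then the database daemon:
sudo systemctl stop slurmctld
sudo systemctl stop slurmdbd

# 3. ...expand the vdisk and reboot the VM...

# 4. Bring the daemons back in the reverse order:
sudo systemctl start slurmdbd
sudo systemctl start slurmctld
```

Afterwards SlurmdTimeout can be lowered back to its original value.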

Cheers,

Barbara

Josef Dvoracek

Jun 24, 2021, 5:16:59 AM
to slurm...@lists.schedmd.com
hi,

Just set the partitions to "DOWN" to avoid unexpected behavior for users,
then reboot the slurmctld/slurmdbd + SQL box. In my experience, running
jobs are not affected.
There is no need to drain nodes.

josef
--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences
cell: +420 608 563 558 | https://telegram.me/jose_d | FZU phone nr. : 2669


Tina Friedrich

Jun 24, 2021, 5:26:44 AM
to slurm...@lists.schedmd.com
I thought setting partitions to DOWN would kill jobs?

Amjad - in my experience, the slurmdbd & slurmctld server can be
rebooted with no effect on running jobs. You can't submit while it's
down, and I'm not precisely sure what happens to jobs that are just
finishing, but really the impact should be minimal.

(I've done exactly what you're needing to do - reboot so a change in
disk size is picked up - at least once with the cluster running.)

It is absolutely safe to restart slurmctld (and slurmdbd) with jobs
running on the cluster; that is something I do all the time.

Tina


--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk

Josef Dvoracek

Jun 24, 2021, 5:44:27 AM
to slurm...@lists.schedmd.com
> I thought setting partitions to DOWN will kill jobs?
No, it just stops new jobs from the queue being started in the given partition.
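This is easy to verify before and after the reboot (a sketch; the partition name is an example, and output columns depend on site configuration):

```shell
# Show partition states; a DOWN partition still appears here and its
# running jobs keep running, it only blocks new job starts:
sinfo --summarize

# Running jobs in the partition remain visible while it is DOWN:
squeue --partition=batch --states=RUNNING
```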

josef