[slurm-users] How to get an estimate of job completion for planned maintenance?

Ahmad Khalifa

Nov 5, 2021, 6:17:38 PM
to Slurm User Community List
If I plan maintenance on a certain day, how long before that day should I set the queue to drain mode? Is there a way to estimate the completion date and time of the currently running jobs?

Regards. 

Carsten Beyer

Nov 7, 2021, 7:45:54 AM
to slurm...@lists.schedmd.com
Hi Ahmad,

you could use squeue -h -t r --format="%i %e" | sort -k2 to get a list
of all running jobs sorted by their endtime.
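
Building on that command, here is a minimal sketch for listing the jobs that would still be running at a planned maintenance start. The function name jobs_past_deadline and the example start time are hypothetical. It assumes squeue prints end times as ISO-style timestamps (YYYY-MM-DDTHH:MM:SS), which compare correctly as plain strings; other values that squeue may print for jobs without a time limit are not handled specially.

```shell
# Hypothetical helper: read "jobid endtime" lines on stdin and print the
# jobs whose end time is later than the maintenance start given as $1.
# Assumes ISO-style timestamps, so string comparison matches time order.
jobs_past_deadline() {
    # Appending "" forces awk to compare both sides as strings.
    awk -v start="$1" '($2 "") > (start "")'
}

# Intended use on a cluster (not run here):
#   squeue -h -t r --format="%i %e" | jobs_past_deadline 2021-12-01T08:00:00
```

An empty result would suggest that all currently running jobs end before the maintenance window.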

We normally use a maintenance reservation whose start time is the start of
the maintenance (or some lead time before it) to get the system free of
jobs. That makes things easier than draining, because once you drain your
cluster no new jobs can start. With the reservation, jobs with a shorter
wallclock time can still be backfilled until the reservation/maintenance
starts. You can create the reservation at any time, but no later than
"<starttime maintenance> minus <longest MaxTime of partition>", e.g.

scontrol create reservation=<name> starttime=<starttime> duration=<duration> user=root flags=maint nodes=ALL
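
The "<starttime maintenance> minus <longest MaxTime of partition>" rule can be worked out with a small helper. This is only a sketch: the function name latest_reservation_time is hypothetical, it assumes GNU date, and it takes the longest MaxTime as a whole number of hours that you look up yourself.

```shell
# Hypothetical helper: print the latest time at which the reservation should
# be created, i.e. <maintenance start> minus <longest MaxTime in hours>.
# $1: maintenance start as YYYY-MM-DDTHH:MM:SS, $2: longest MaxTime in hours.
# Assumes GNU date; the arithmetic is done in UTC, so DST shifts are ignored.
latest_reservation_time() {
    # Replace the "T" separator with a space so older date versions parse it.
    start_epoch=$(date -u -d "$(printf '%s' "$1" | tr 'T' ' ')" +%s) || return 1
    date -u -d "@$((start_epoch - $2 * 3600))" +%Y-%m-%dT%H:%M:%S
}

# Example: maintenance at 2021-12-01T08:00:00, longest MaxTime of 24 hours.
latest_reservation_time 2021-12-01T08:00:00 24
```

For this example the helper prints 2021-11-30T08:00:00, so the reservation would have to exist by then for every job to finish before the maintenance starts.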

Hope that helps a little bit,

Carsten

--
Carsten Beyer
Systems Department

Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a * D-20146 Hamburg * Germany

Phone: +49 40 460094-221
Fax: +49 40 460094-270
Email: be...@dkrz.de
URL: http://www.dkrz.de

Managing Director: Prof. Dr. Thomas Ludwig
Registered office: Hamburg
Hamburg District Court, HRB 39784

Diego Zuccato

Nov 8, 2021, 6:48:49 AM
to Slurm User Community List, Carsten Beyer
Hi.

I usually create a maintenance reservation with the IGNORE_JOBS flag, so I
can avoid new jobs interfering with it. Then I contact the job owners to
warn them that I'll kill their jobs if needed.
Actually, that's only necessary for nodes that allow unlimited-time jobs:
for the others it's sufficient to plan in advance (if the max run time is
24h, then the reservation should be created more than 24h in advance).

Just my $.02

Diego

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Marcus Wagner

Nov 9, 2021, 7:56:15 AM
to slurm...@lists.schedmd.com
I have written a script that loops through all running jobs to tell me when a job ends on a specific node. This can also be done for all nodes. For the longest job, the output would be, e.g.:

ncm0430 -> 2021-12-04T15:48:35
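
The script itself was not posted. A minimal sketch of the same idea, assuming input lines of the form "node endtime" with ISO-style timestamps, could look like the following. The function name latest_end_per_node is hypothetical, and the output format only imitates the line shown above.

```shell
# Hypothetical helper: read "node endtime" lines on stdin and print, for each
# node, the latest end time seen, as "node -> endtime".
# Assumes ISO-style timestamps, so string comparison matches time order.
latest_end_per_node() {
    awk '{ if (($2 "") > (last[$1] "")) last[$1] = $2 }
         END { for (n in last) printf "%s -> %s\n", n, last[n] }' | sort
}

# Intended use on a cluster (not run here), for single-node jobs only:
#   squeue -h -t r -o "%N %e" | latest_end_per_node
```

For multi-node jobs, squeue prints a compressed node list in the %N column, so each list would first have to be expanded into one line per node, for example with scontrol show hostnames.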

Nonetheless, we also plan maintenance with reservations; we do not drain the partitions.


Best
Marcus


--
Dipl.-Inf. Marcus Wagner

IT Center
Group: Server, Storage, HPC
Department: Systems and Operations
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ

Ole Holm Nielsen

Nov 9, 2021, 8:42:18 AM
to slurm...@lists.schedmd.com
On 11/9/21 13:55, Marcus Wagner wrote:
> I have written a script, which loops through all running jobs to tell me,
> when a job ends on a specific node. This can also be done for all nodes.
> The output would be for the longest job e.g.:
>
> ncm0430              -> 2021-12-04T15:48:35
>
> Nonetheless, we also plan maintenances with reservations, we do not drain
> the partitions.

The pestat script from
https://github.com/OleHolmNielsen/Slurm_tools/blob/master/pestat also
prints job end times on nodes:

$ pestat -E | sort -k 11

You can make all sorts of node selections with the other pestat options.

/Ole

Loris Bennett

Nov 9, 2021, 8:42:36 AM
to Slurm User Community List
Hi Ahmad,
We just set up a reservation at a point in time which is further in the
future than our maximum run-time. There is then no need to drain
anything. Short-running jobs can still run right up to the reservation.

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin Email loris....@fu-berlin.de

Chris Samuel

Nov 9, 2021, 9:30:24 PM
to slurm...@lists.schedmd.com
On 9/11/21 5:42 am, Loris Bennett wrote:

> We just set up a reservation at a point in time which is further in the
> future than our maximum run-time. There is then no need to drain
> anything. Short-running jobs can still run right up to the reservation.

This is the same technique we use too, works well!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Ryan Novosielski

Dec 14, 2021, 4:33:55 PM
to Slurm User Community List
Another useful format string – and again, this is if you mess up and don’t do a reservation early enough (or your environment has no concept of a time limit) – is this one:

squeue -o %u,%i,%L

Will show you username, job id, and remaining time – which is sometimes easier to deal with than end date/time.
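
If the remaining time needs to be sorted or compared, it may help to convert it to seconds first. The following is a minimal sketch, assuming the days-hours:minutes:seconds layout that the squeue man page documents for %L, with leading fields omitted when they are zero. The function name slurm_to_seconds is hypothetical, and anything that does not look like a time value is reported as NA.

```shell
# Hypothetical helper: convert a Slurm time string such as "1-02:03:04",
# "02:03:04" or "5:30" (given as $1) into seconds.
# Values that are not in that numeric layout are printed as "NA".
slurm_to_seconds() {
    printf '%s\n' "$1" | awk '{
        if ($0 !~ /^([0-9]+-)?[0-9]+(:[0-9]+)*$/) { print "NA"; next }
        days = 0; t = $0
        # Split off an optional leading "days-" part.
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) secs = p[1] * 60 + p[2]
        else             secs = p[1]
        print days * 86400 + secs
    }'
}

# Example: one day, two hours, three minutes and four seconds.
slurm_to_seconds 1-02:03:04
```

The third comma-separated field of the squeue -o %u,%i,%L output could be passed through this helper and then sorted numerically to find the job with the most time left.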

--
#BlackLivesMatter
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'