[slurm-users] Releasing stale allocated TRES

Schneider, Gerald

Nov 23, 2023, 4:57:17 AM
to slurm...@lists.schedmd.com
Hi there,

I have a recurring problem with allocated TRES that are not released after all jobs on a node have finished. The TRES are still marked as allocated and no new jobs can be scheduled on that node using those TRES.

$ scontrol show node node2
NodeName=node2 Arch=x86_64 CoresPerSocket=64
CPUAlloc=0 CPUTot=256 CPULoad=0.11
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:tesla:8
NodeAddr=node2 NodeHostName=node2 Version=21.08.5
OS=Linux 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023
RealMemory=1025593 AllocMem=0 FreeMem=1025934 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=AMPERE
BootTime=2023-11-23T09:01:28 SlurmdStartTime=2023-11-23T09:02:09
LastBusyTime=2023-11-23T09:03:19
CfgTRES=cpu=256,mem=1025593M,billing=256,gres/gpu=8,gres/gpu:tesla=8
AllocTRES=gres/gpu=8
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Previously, the allocation was cleared after the server had been turned off for a couple of hours (power conservation), but the issue has occurred again and this time it persists even after the server was off overnight.

Is there any way to release the allocation manually?
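 
Would something like forcing the node through a DOWN/RESUME cycle clear it? A sketch of what I have in mind (untested on our side; the Reason string is arbitrary):

# Untested idea: mark the node DOWN, then resume it, hoping the stale
# gres/gpu allocation is recomputed on the state change.
$ scontrol update NodeName=node2 State=DOWN Reason="stale gres allocation"
$ scontrol update NodeName=node2 State=RESUME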

Regards,
Gerald Schneider

--
Gerald Schneider

Fraunhofer-Institut für Graphische Datenverarbeitung IGD
Joachim-Jungius-Str. 11 | 18059 Rostock | Germany
Tel. +49 6151 155-309 | +49 381 4024-193 | Fax +49 381 4024-199
gerald.s...@igd-r.fraunhofer.de | www.igd.fraunhofer.de


Markus Kötter

Nov 23, 2023, 5:50:49 AM
to slurm...@lists.schedmd.com
Hi,

On 23.11.23 10:56, Schneider, Gerald wrote:
> I have a recurring problem with allocated TRES that are not
> released after all jobs on a node have finished. The TRES are still
> marked as allocated and no new jobs can be scheduled on that node
> using those TRES.

Remove the node from slurm.conf and restart slurmctld, then re-add it and restart again. Remove it from the Partition definitions as well.
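 
Roughly, assuming a systemd-managed install (a sketch; adjust unit names and paths for your site):

# 1. Delete the node from slurm.conf and from the Partition= lines, then:
$ systemctl restart slurmctld
# 2. Put the node and partition entries back, then:
$ systemctl restart slurmctld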


Kind regards
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security

Ole Holm Nielsen

Nov 23, 2023, 6:16:54 AM
to slurm...@lists.schedmd.com
On 11/23/23 11:50, Markus Kötter wrote:
> On 23.11.23 10:56, Schneider, Gerald wrote:
>> I have a recurring problem with allocated TRES that are not
>> released after all jobs on a node have finished. The TRES are still
>> marked as allocated and no new jobs can be scheduled on that node
>> using those TRES.
>
> Remove the node from slurm.conf and restart slurmctld, re-add, restart.
> Remove from Partition definitions as well.

Just my 2 cents: do NOT remove a node from slurm.conf in the way just described!

When adding or removing nodes, both slurmctld and all slurmd daemons must
be restarted! See the SchedMD presentation
https://slurm.schedmd.com/SLUG23/Field-Notes-7.pdf, slides 51-56, for the
recommended procedure.
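 
In short, both the controller and every compute node need a restart, e.g. (a sketch; pdsh and the node list node[1-8] are assumptions for your site):

# Restart the controller, then slurmd on all compute nodes.
$ systemctl restart slurmctld
$ pdsh -w node[1-8] systemctl restart slurmd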

/Ole

Bjørn-Helge Mevik

Nov 23, 2023, 6:56:12 AM
to slurm...@schedmd.com
"Schneider, Gerald" <gerald.s...@igd-r.fraunhofer.de> writes:

> Is there any way to release the allocation manually?

I've only seen this once on our clusters, and that time simply restarting
slurmctld was enough.
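 
On a systemd-managed install that is simply:

$ systemctl restart slurmctld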

If this is a recurring problem, perhaps upgrading Slurm will help. You are
running quite an old version (21.08.5, per your scontrol output).

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo