[slurm-users] Two GPU types on one node: gres/gpu count reported lower than configured (1 < 5)

Gregor Hagelueken

Oct 16, 2023, 10:40:11 AM
to slurm...@lists.schedmd.com
Hi,

We have an Ubuntu server (22.04) with currently 5 GPUs (1 x l40 and 4 x rtx_a5000).
I am trying to configure Slurm so that a user can select either the l40 or the a5000 GPUs for a particular job.
I have configured my slurm.conf and gres.conf files similarly to the ones in this old thread:
I have pasted the contents of the two files below.
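
For reference, a sketch of how a job would then request one type or the other (the type names l40 and a5000 come from the gres.conf below; job.sh is just a placeholder script):

srun --gres=gpu:l40:1 --pty bash
sbatch --gres=gpu:a5000:2 job.sh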
 
Unfortunately, my node is always in “drain” and scontrol shows this error:
Reason=gres/gpu count reported lower than configured (1 < 5)
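
A sketch of commands to compare what slurmd reports with what is configured (assuming they are run on heimdall itself):

slurmd -G                      # prints the GRES that slurmd found on this node
scontrol show node heimdall    # Gres= is the configured count, Reason= the complaint above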

Any idea what I am doing wrong?
Cheers and thanks for your help!
Gregor

Here are my gres.conf and slurm.conf files.

gres.conf:
AutoDetect=off
NodeName=heimdall Name=gpu Type=l40  File=/dev/nvidia0
NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia1
NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia2
NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia3
NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia4
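
As an aside, a minimal gres.conf sketch that instead lets slurmd detect the GPUs itself via NVML (assuming Slurm was built with NVML support; not what is used here):

# gres.conf - autodetect all NVIDIA GPUs; types and device files are then filled in automatically
AutoDetect=nvml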

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmdDebug=debug2
#
ClusterName=heimdall
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
GresTypes=gpu
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
# COMPUTE NODES
NodeName=heimdall CPUs=128 Gres=gpu:l40:1,gpu:a5000:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=773635 State=UNKNOWN
PartitionName=heimdall Nodes=ALL Default=YES MaxTime=INFINITE State=UP DefMemPerCPU=8000 DefCpuPerGPU=16
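
One more sketch, for double-checking that the /dev/nvidiaN files listed in gres.conf really correspond to the types declared in Gres= above (the output lines are illustrative, not taken from this machine):

nvidia-smi -L
# GPU 0: NVIDIA L40 (UUID: GPU-...)
# GPU 1: NVIDIA RTX A5000 (UUID: GPU-...)
# (nvidia-smi ordering and /dev minor numbers can differ, so it is worth checking per UUID)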

Feng Zhang

Oct 16, 2023, 10:53:59 AM
to Slurm User Community List
Try

scontrol update NodeName=heimdall state=DOWN Reason="gpu issue"

and then

scontrol update NodeName=heimdall state=RESUME

to see if it will work. Probably the Slurm daemon is just having a hiccup
after you made the changes.
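
Once it is back you can check the state with something like:

sinfo -N -n heimdall
scontrol show node heimdall | grep -iE 'state|reason'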

Best,

Feng

Gregor Hagelueken

Oct 18, 2023, 12:36:43 AM
to Slurm User Community List
Dear Feng,
That worked! Thank you!
Cheers
Gregor

Sent from my iPhone.

> On 16.10.2023, at 17:05, Feng Zhang <prod...@gmail.com> wrote:
>
> Try