[slurm-users] Two GPU types on one node: gres/gpu count reported lower than configured (1 < 5)

Gregor Hagelueken

Oct 16, 2023, 10:40:11 AM
to slurm...@lists.schedmd.com
Hi,

We have an Ubuntu server (22.04) with currently 5 GPUs (1 x l40 and 4 x rtx_a5000).
I am trying to configure Slurm so that a user can select either the l40 or the a5000 GPUs for a particular job.
I have configured my slurm.conf and gres.conf files similarly to the ones in this old thread:
I have pasted the contents of the two files below.
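
For reference, a sketch of how a job would then request one type or the other (the type names l40 and a5000 come from the gres.conf below; job.sh is just a placeholder script):

srun --gres=gpu:l40:1 --pty bash
sbatch --gres=gpu:a5000:2 job.sh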
 
Unfortunately, my node is always in “drain” and scontrol shows this error:
Reason=gres/gpu count reported lower than configured (1 < 5)
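
A sketch of commands to compare what slurmd reports with what is configured (assuming they are run on heimdall itself):

slurmd -G                      # prints the GRES that slurmd found on this node
scontrol show node heimdall    # Gres= is the configured count, Reason= the complaint above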

Any idea what I am doing wrong?
Cheers and thanks for your help!
Gregor

Here are my gres.conf and slurm.conf files.

gres.conf:
AutoDetect=off
NodeName=heimdall Name=gpu Type=l40  File=/dev/nvidia0
NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia1
NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia2
NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia3
NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia4
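
As an aside, a minimal gres.conf sketch that instead lets slurmd detect the GPUs itself via NVML (assuming Slurm was built with NVML support; not what is used here):

# gres.conf - autodetect all NVIDIA GPUs; types and device files are then filled in automatically
AutoDetect=nvml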

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmdDebug=debug2
#
ClusterName=heimdall
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
GresTypes=gpu
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
# COMPUTE NODES
NodeName=heimdall CPUs=128 Gres=gpu:l40:1,gpu:a5000:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=773635 State=UNKNOWN
PartitionName=heimdall Nodes=ALL Default=YES MaxTime=INFINITE State=UP DefMemPerCPU=8000 DefCpuPerGPU=16
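
One more sketch, for double-checking that the /dev/nvidiaN files listed in gres.conf really correspond to the types declared in Gres= above (the output lines are illustrative, not taken from this machine):

nvidia-smi -L
# GPU 0: NVIDIA L40 (UUID: GPU-...)
# GPU 1: NVIDIA RTX A5000 (UUID: GPU-...)
# (nvidia-smi ordering and /dev minor numbers can differ, so it is worth checking per UUID)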

Feng Zhang

Oct 16, 2023, 10:53:59 AM
to Slurm User Community List
Try

scontrol update NodeName=heimdall state=DOWN Reason="gpu issue"

and then

scontrol update NodeName=heimdall state=RESUME

to see if it will work. Probably the Slurm daemon is just having a hiccup
after you made the changes.
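
Once it is back you can check the state with something like:

sinfo -N -n heimdall
scontrol show node heimdall | grep -iE 'state|reason'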

Best,

Feng

Gregor Hagelueken

Oct 18, 2023, 12:36:43 AM
to Slurm User Community List
Dear Feng,
That worked! Thank you!
Cheers
Gregor

Sent from my iPhone.

> On 16.10.2023, at 17:05, Feng Zhang <prod...@gmail.com> wrote:
>
> Try