[slurm-users] gres/gpu count reported lower than configured

Geleßus, Achim

Oct 21, 2022, 9:39:16 AM
to slurm...@lists.schedmd.com
Hello Slurm Admins,

I have set up Slurm for a GPU cluster. The basic installation without
gres/gpu works well. Now I am trying to add the GPUs to the Slurm configuration.
All attempts have failed so far, and sinfo -R always shows the message

gres/gpu count reported lower than configured ( 0 < 2 )

nvidia-smi finds the GPUs, and running jobs on them works well.
I have tried to get rid of the above error by updating the node state to IDLE with
scontrol. That attempt also failed, with the error message

slurm_update error: Invalid node state specified

I ran slurmd on the GPU node at debug5 level. From slurmd.log I can see that
gres.conf is found and that gres_gpu.so / gpu_generic.so are loaded.

My Slurm configuration is as follows:

slurm.conf:
GresTypes=gpu
NodeName=hpc-node14 CPUs=128 RealMemory=515815 Sockets=2 CoresPerSocket=64 ThreadsPerCore=1 Gres=gpu:2 State=UNKNOWN

gres.conf:
NodeName=hpc-node[01-14] Name=gpu File=/dev/nvidia[0-1]

Does anyone know what is wrong and how to fix that problem?
Thank you.


Best wishes
Achim


Groner, Rob

Oct 21, 2022, 10:27:16 AM
to slurm...@lists.schedmd.com
I've encountered that many times, and for me, it was always related to AutoDetect and the nvidia-ml library.  Does your slurmd log contain a line like "debug:  skipping GRES for NodeName=t-gc-1202  AutoDetect=nvml"?  I see that you didn't specifically set AutoDetect to nvml in gres.conf, but maybe you should set AutoDetect=off just to be sure.

If "sinfo" shows a node as "inval", then setting it to Resume (which is the state to use, rather than Idle) won't work until you figure out why Slurm thinks the node configuration is invalid.
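
A rough sketch of that check, in case it helps. The helper name find_gres_skips is made up for illustration, and the log path in the example is an assumption (the real location is whatever SlurmdLogFile points to in your slurm.conf). The scontrol line uses the node name from the original post and only makes sense once the underlying cause is fixed.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): print slurmd log lines
# that report a GRES entry being skipped because of an AutoDetect setting.
find_gres_skips() {
    # $1: path to the slurmd log file (site-specific; see SlurmdLogFile)
    grep -E 'skipping GRES for NodeName=.*AutoDetect=' "$1"
}

# Example usage (path and node name are assumptions based on this thread):
#   find_gres_skips /var/log/slurm/slurmd.log
#   sinfo -R                                            # show the recorded reason
#   scontrol update NodeName=hpc-node14 State=RESUME    # after fixing gres.conf
```

If the helper prints nothing, the AutoDetect mismatch is probably not the cause and the debug output needs a closer look.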

Geleßus, Achim

Oct 21, 2022, 12:17:37 PM
to Slurm User Community List

Yes, you are right. AutoDetect=off in the gres.conf file solved the
problem! Thank you very much!!
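
For anyone finding this later, the resulting gres.conf would look roughly like the sketch below. The final file is not shown in this thread, so putting AutoDetect=off on a line of its own (where it applies to the whole file) is an assumption; the NodeName line is unchanged from the original post.

```
# gres.conf (sketch; exact placement of AutoDetect=off is an assumption)
AutoDetect=off
NodeName=hpc-node[01-14] Name=gpu File=/dev/nvidia[0-1]
```

slurmd on the affected nodes most likely needs a restart before the change takes effect.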


Best wishes

Achim

