[slurm-users] NVML autodetect "Failed to get supported memory frequencies" error


Joshua Baker-LePain

Mar 4, 2021, 11:36:47 PM
to slurm-users
We're in the midst of transitioning our SGE cluster to slurm 20.02.6, running
on up-to-date CentOS-7. We built RPMs from the standard tarball against CUDA
10.1. These RPMs worked just fine on our first GPU test node (with Tesla K80s)
using "AutoDetect=nvml" in /etc/gres.conf. However, we just tried to add a
second host with GTX 1080s in it. Running "slurmd -G" results in the following
output:

slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: 4 GPU system device(s) detected
slurmd: WARNING: The following autodetected GPUs are being ignored:
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55
slurmd: Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55
slurmd: Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27
slurmd: Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27
slurmd: Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0

My googling has utterly failed me on this. Any help? Thanks!

--
Joshua Baker-LePain
Wynton Cluster Sysadmin
UCSF


Kilian Cavalotti

Mar 5, 2021, 10:52:23 AM
to Slurm User Community List
Hi Joshua,

On Thu, Mar 4, 2021 at 8:38 PM Joshua Baker-LePain <j...@salilab.org> wrote:
> slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
> slurmd: error: for the GPU : Not Supported

> slurmd: 4 GPU system device(s) detected
> slurmd: WARNING: The following autodetected GPUs are being ignored:
> slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55

It's just a warning, letting you know that getting (and setting) memory
frequencies is not supported on GeForce GPUs.
You won't be able to use the srun/salloc/sbatch --gpu-freq option for
those GPUs, but that's about it. Everything else should work normally.

Cheers,
--
Kilian

Joshua Baker-LePain

Mar 5, 2021, 1:27:59 PM
to Slurm User Community List
On Fri, 5 Mar 2021 at 7:51am, Kilian Cavalotti wrote:
Ah, that makes sense. The "GPUs are being ignored" bit was throwing me
off, but that was simply b/c I hadn't added the GRES to the node
configuration yet. Thanks!
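
For anyone hitting the same "GPUs are being ignored" warning: a minimal sketch of what adding the GRES to the node configuration might look like. The node name "gpunode02" is hypothetical, and the other node parameters (CPUs, RealMemory, and so on) are left out; the GPU type and count are taken from the "slurmd -G" output above. Treat this as an illustration under those assumptions, not as the exact configuration used on this cluster.

```
# /etc/slurm/slurm.conf (fragment; path and node name are assumptions)
GresTypes=gpu
# Type and count should match what "slurmd -G" autodetects on the node.
NodeName=gpunode02 Gres=gpu:geforce_gtx_1080:4

# /etc/gres.conf (as described in the original post)
AutoDetect=nvml
```

After a change like this, slurmctld and the node's slurmd typically need to be restarted, and running "slurmd -G" again should show whether the autodetected GPUs are still reported as ignored.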