Joshua Baker-LePain
Mar 4, 2021, 11:36:47 PM
to slurm-users
We're in the midst of transitioning our SGE cluster to Slurm 20.02.6, running
on up-to-date CentOS 7. We built RPMs from the standard tarball against CUDA
10.1. These RPMs worked just fine on our first GPU test node (with Tesla K80s)
using "AutoDetect=nvml" in /etc/gres.conf. However, we just tried to add a
second host with GTX 1080s in it. Running "slurmd -G" on that host results in
the following output:
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: 4 GPU system device(s) detected
slurmd: WARNING: The following autodetected GPUs are being ignored:
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55
slurmd: Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55
slurmd: Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27
slurmd: Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27
slurmd: Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
My googling has utterly failed me on this. Any help? Thanks!
--
Joshua Baker-LePain
Wynton Cluster Sysadmin
UCSF