[slurm-users] AutoDetect=nvml throwing an error message

Cristóbal Navarro

Apr 14, 2021, 5:48:16 PM
to Slurm User Community List
Hi community,
I have set up the configuration files as described in the documentation, but slurmd on the GPU compute node fails with the error shown in the log below.
After reading the Slurm documentation it is still not entirely clear to me how to properly set up GPU autodetection in gres.conf, as it does not say whether NVML detection should just work automatically or not.
I have also read the top Google results, including https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html, but that was a case of a CUDA installation being overwritten (not my case).
This is a DGX A100 node that comes with the NVIDIA driver installed, and nvml is located at /etc/include/nvml.h; I am not sure whether there is a libnvml.so or similar as well.
How can I tell Slurm to look at those paths? Any ideas or shared experience are welcome.
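
In case it helps, this is the kind of check I can run to see whether the NVML shared library is there at all (just a sketch; the path is what I would expect on Ubuntu and may differ):

# is the NVIDIA management library (NVML) visible to the dynamic linker?
ldconfig -p | grep -i nvidia-ml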
best


slurmd.log (GPU node)
[2021-04-14T17:31:42.302] got shutdown request
[2021-04-14T17:31:42.302] all threads complete
[2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
[2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
[2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
[2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
[2021-04-14T17:31:42.304] debug:  gres/gpu: fini: unloading
[2021-04-14T17:31:42.304] debug:  gpu/generic: fini: fini: unloading GPU Generic plugin
[2021-04-14T17:31:42.304] select/cons_tres: common_fini: select/cons_tres shutting down ...
[2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0
[2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature plugin unloaded
[2021-04-14T17:31:42.304] Slurmd shutdown completing
[2021-04-14T17:31:42.321] debug:  Log file re-opened
[2021-04-14T17:31:42.321] debug2: hwloc_topology_init
[2021-04-14T17:31:42.321] debug2: hwloc_topology_load
[2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
[2021-04-14T17:31:42.446] Considering each NUMA node as a socket
[2021-04-14T17:31:42.446] debug:  CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
[2021-04-14T17:31:42.446] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2021-04-14T17:31:42.447] debug2: hwloc_topology_init
[2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
[2021-04-14T17:31:42.448] Considering each NUMA node as a socket
[2021-04-14T17:31:42.448] debug:  CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
[2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
[2021-04-14T17:31:42.449] debug:  gres/gpu: init: loaded
[2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.




gres.conf (just AutoDetect=nvml)
➜  ~ cat /etc/slurm/gres.conf
# GRES configuration for native GPUS
# DGX A100 8x Nvidia A100
# not working, slurm cannot find nvml
AutoDetect=nvml
#Name=gpu File=/dev/nvidia[0-7]
#Name=gpu Type=A100 File=/dev/nvidia[0-7]
#Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
#Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
#Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
#Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
#Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
#Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
#Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
#Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63


slurm.conf
GresTypes=gpu
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres

## We don't want a node to go back in pool without sys admin acknowledgement
ReturnToService=0

## Basic scheduling
#SelectType=select/cons_res
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
SchedulerType=sched/backfill

TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

## Nodes list
## use native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu

## Partitions list
PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE State=UP Nodes=nodeGPU01  Default=YES
PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01

--
Cristóbal A. Navarro

Cristóbal Navarro

Apr 14, 2021, 5:51:51 PM
to Slurm User Community List
Typing error; it should be --> **located at /usr/include/nvml.h**
--
Cristóbal A. Navarro

Michael Di Domenico

Apr 15, 2021, 9:00:31 AM
to Slurm User Community List
The error message sounds like the Slurm build wasn't able to find the NVML development packages when you compiled the source. If you look at where you
installed Slurm, in lib/slurm you should have a gpu_nvml.so. Do you?
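
For example, something along these lines (I'm assuming /usr/lib64/slurm as the plugin directory; adjust to wherever you installed Slurm):

# is the NVML GPU plugin present?
ls /usr/lib64/slurm/ | grep gpu_nvml
# if it is, check that it is actually linked against the NVIDIA management library:
ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml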


Cristóbal Navarro

Apr 15, 2021, 1:47:47 PM
to Slurm User Community List
Hi Michael,
Thanks. Indeed, I don't have it; Slurm must not have detected it at build time.
I double-checked and NVML is installed (libnvidia-ml-dev on Ubuntu).
Here is some output, including the relevant paths for NVML.
Is it possible to tell the Slurm build to check these paths for NVML?
best

NVML PKG CHECK
➜  ~ sudo apt search nvml
Sorting... Done
Full Text Search... Done
cuda-nvml-dev-11-0/unknown 11.0.167-1 amd64
  NVML native dev links, headers

cuda-nvml-dev-11-1/unknown,unknown 11.1.74-1 amd64
  NVML native dev links, headers

cuda-nvml-dev-11-2/unknown,unknown 11.2.152-1 amd64
  NVML native dev links, headers

libnvidia-ml-dev/focal,now 10.1.243-3 amd64 [installed]
  NVIDIA Management Library (NVML) development files
python3-pynvml/focal 7.352.0-3 amd64
  Python3 bindings to the NVIDIA Management Library



NVML Shared library location
➜  ~ find /usr/lib | grep libnvidia-ml
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.102.04
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so



NVML Header
➜  ~ find /usr | grep nvml
/usr/include/nvml.h





SLURM LIBS
➜  ~ ls /usr/lib64/slurm/
accounting_storage_mysql.so*        core_spec_none.so*                  job_submit_pbs.so*                  proctrack_pgid.so*                
accounting_storage_none.so*         cred_munge.so*                      job_submit_require_timelimit.so*    route_default.so*                
accounting_storage_slurmdbd.so*     cred_none.so*                       job_submit_throttle.so*             route_topology.so*                
acct_gather_energy_ibmaem.so*       ext_sensors_none.so*                launch_slurm.so*                    sched_backfill.so*                
acct_gather_energy_ipmi.so*         gpu_generic.so*                     mcs_account.so*                     sched_builtin.so*                
acct_gather_energy_none.so*         gres_gpu.so*                        mcs_group.so*                       sched_hold.so*                    
acct_gather_energy_pm_counters.so*  gres_mic.so*                        mcs_none.so*                        select_cons_res.so*              
acct_gather_energy_rapl.so*         gres_mps.so*                        mcs_user.so*                        select_cons_tres.so*              
acct_gather_energy_xcc.so*          gres_nic.so*                        mpi_none.so*                        select_linear.so*                
acct_gather_filesystem_lustre.so*   jobacct_gather_cgroup.so*           mpi_pmi2.so*                        site_factor_none.so*              
acct_gather_filesystem_none.so*     jobacct_gather_linux.so*            mpi_pmix.so@                        slurmctld_nonstop.so*            
acct_gather_interconnect_none.so*   jobacct_gather_none.so*             mpi_pmix_v2.so*                     src/                              
acct_gather_interconnect_ofed.so*   jobcomp_elasticsearch.so*           node_features_knl_generic.so*       switch_none.so*                  
acct_gather_profile_hdf5.so*        jobcomp_filetxt.so*                 power_none.so*                      task_affinity.so*                
acct_gather_profile_influxdb.so*    jobcomp_lua.so*                     preempt_none.so*                    task_cgroup.so*                  
acct_gather_profile_none.so*        jobcomp_mysql.so*                   preempt_partition_prio.so*          task_none.so*                    
auth_munge.so*                      jobcomp_none.so*                    preempt_qos.so*                     topology_3d_torus.so*            
burst_buffer_generic.so*            jobcomp_script.so*                  prep_script.so*                     topology_hypercube.so*            
cli_filter_lua.so*                  job_container_cncu.so*              priority_basic.so*                  topology_none.so*                
cli_filter_none.so*                 job_container_none.so*              priority_multifactor.so*            topology_tree.so*                
cli_filter_syslog.so*               job_submit_all_partitions.so*       proctrack_cgroup.so*                                                  
cli_filter_user_defaults.so*        job_submit_lua.so*                  proctrack_linuxproc.so* 

--
Cristóbal A. Navarro

Stephan Roth

Apr 16, 2021, 6:18:51 AM
to Slurm User Community List
Hi Cristóbal,

Under Debian Stretch/Buster I had to set
LDFLAGS=-L/usr/lib/x86_64-linux-gnu/nvidia/current for configure to find
the NVML shared library.
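
Roughly like this when re-running configure and rebuilding (only a sketch; the prefix, sysconfdir and library directory are assumptions and depend on your distribution; on your Ubuntu node the find output suggests the library lives in /usr/lib/x86_64-linux-gnu instead):

# point the linker at the directory that contains libnvidia-ml.so
export LDFLAGS="-L/usr/lib/x86_64-linux-gnu/nvidia/current"
./configure --prefix=/usr --sysconfdir=/etc/slurm
make -j$(nproc)
sudo make install
# afterwards the NVML GPU plugin should show up in the plugin directory:
ls /usr/lib64/slurm/ | grep gpu_nvml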

Best,
Stephan

-------------------------------------------------------------------
Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
+4144 632 30 59 | ETF D 104 | Sternwartstrasse 7 | 8092 Zurich
-------------------------------------------------------------------
