[slurm-users] File-less NVIDIA GeForce 4070 Ti being removed from GRES list

Shooktija S N via slurm-users

Apr 2, 2024, 7:10:39 AM
to slurm...@lists.schedmd.com
Hi,

I am trying to set up Slurm (version 22.05) on a 3-node cluster in which each node has an NVIDIA GeForce RTX 4070 Ti GPU.
Following the GRES setup guide on the SchedMD website, I added Gres=gpu:RTX4070TI:1 to the node configuration in /etc/slurm/slurm.conf:

NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1

I do not have a gres.conf.
However, I see this line at the debug log level in /var/log/slurmd.log:

[2024-04-02T15:57:19.022] debug:  Removing file-less GPU gpu:RTX4070TI from final GRES list

What other configuration is necessary for Slurm to recognize my GPU?

More information:
OS: Proxmox VE 8.1.4
Kernel: 6.5.13
CPU: AMD EPYC 7662
Memory: 128636MiB

The /etc/slurm/slurm.conf shared by all 3 nodes, with comment lines omitted:

ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Reed Dier via slurm-users

Apr 2, 2024, 3:53:47 PM
to Shooktija S N, slurm...@lists.schedmd.com
Assuming that you have the CUDA drivers installed correctly (check with nvidia-smi, for instance),
you should create a gres.conf with just this line:

> AutoDetect=nvml
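
If your slurmd build lacks NVML support, a manual gres.conf is an alternative. A minimal sketch, assuming the single GPU on each node is exposed as /dev/nvidia0 (a guess; verify with ls /dev/nvidia*):

> # /etc/slurm/gres.conf
> # File= ties the GRES to a device node; /dev/nvidia0 is an assumption here
> NodeName=server[1-3] Name=gpu Type=RTX4070TI File=/dev/nvidia0

A missing File= association is, as far as I can tell, exactly why slurmd logs the "Removing file-less GPU" message and drops the GRES.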

If that doesn’t automagically begin working, you can increase the verbosity of slurmd with

> SlurmdDebug=debug2

It should then print a number of log lines describing any GPUs that are found.
You may also need to alter the name from RTX4070TI (which is wordy as is);
I'm not sure how lax Slurm's matching is between the configured GRES type and the name reported by the NVML interface.
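
To see what name the driver reports, and to sanity-check detection, something along these lines should work (the node name is just an example):

> # Name as reported by the driver/NVML
> nvidia-smi --query-gpu=name --format=csv,noheader
>
> # Ask slurmd to print the GRES it detects locally, then check the controller's view
> slurmd -G
> scontrol show node server1 | grep -i gres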

Hope that helps,

Reed

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com
