Hello,
I am trying to rewrite my gres.conf file.
Before changes, this file was just like this:
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-3 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-4 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-7
# you can seee that nodes node-gpu-1 and node-gpu-2 have two GPUs each one, whereas nodes node-gpu-3 and node-gpu-4 have only one GPU each one
And my slurmd.conf was this:
[...]
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=node-gpu-1 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
NodeName=node-gpu-2 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
NodeName=node-gpu-3 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=23000 Gres=gpu:GeForceRTX3080:1
NodeName=node-gpu-4 CPUs=8 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7800 Gres=gpu:GeForceRTX3080:1
NodeName=node-worker-[0-22] CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=47000
[...]
With this configuration, all seems works fine, except slurmctld.log reports:
[...]
error: _node_config_validate: gres/gpu: invalid GRES core specification (0-11) on node node-gpu-3
error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) on node node-gpu-1
error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) on node node-gpu-2
error: _node_config_validate: gres/gpu: invalid GRES core specification (0-7) on node node-gpu-4
[...]
However, even these errors, users can submit jobs and request GPUs resources.
Now, I have tried to reconfigure gres.conf and slurmd.conf in this way:
gres.conf:
Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0
Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1
Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0
Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1
Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
# there is no NodeName attribute
slurmd.conf:
[...]
NodeName=node-gpu-1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
NodeName=node-gpu-2 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
NodeName=node-gpu-3 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=23000 Gres=gpu:GeForceRTX3080:1
NodeName=node-gpu-4 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7800 Gres=gpu:GeForceRTX3080:1
NodeName=node-worker-[0-22] SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=47000
# there is no CPUs attribute
[...]
With this new configuration, nodes with GPU start correctly slurmd.service daemon, but nodes without GPU (node-worker-[0-22]) can’t start slurmd.service daemon and returns this error:
[...]
error: Waiting for gres.conf file /dev/nvidia0
fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
[...]
It seems SLURM is waiting that “node-workers” have also an nvidia GPU but not, theses nodes haven’t GPU... So, where is my configuration error?
I have read in https://slurm.schedmd.com/gres.conf.html about syntax and examples but it seems I’m doing some wrong.
Thanks!!
Hi,
my GPU testing system (named “gpu-node”) is a simple computer with one socket and a processor " Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz". Executing "lscpu", I can see there are 4 cores per socket, 2 threads per core and 8 CPUs:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 26
Model name: Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
My “gres.conf” file is:
NodeName=gpu-node Name=gpu Type=GeForce-GTX-TITAN-X File=/dev/nvidia0 CPUs=0-1
NodeName=gpu-node Name=gpu Type=GeForce-GTX-TITAN-Black File=/dev/nvidia1 CPUs=2-3
Running “numactl -H” in “gpu-node” host, reports:
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 7809 MB
node 0 free: 6597 MB
node distances:
node 0
0: 10
CPUs are assigned 0-1 for first GPU and 2-3 for second GPU. However, “lscpu” shows 8 CPUs… If I rewrite “gres.conf” in this way:
NodeName=gpu-node Name=gpu Type=GeForce-GTX-TITAN-X File=/dev/nvidia0 CPUs=0-3
NodeName=gpu-node Name=gpu Type=GeForce-GTX-TITAN-Black File=/dev/nvidia1 CPUs=4-7
when I run “scontrol reconfigure”, slurmctld log reports this error message:
[2024-06-05T11:42:18.558] error: _node_config_validate: gres/gpu: invalid GRES core specification (4-7) on node gpu-node
So I think SLURM only can get physical cores and not threads, so my node only can serve 4 cores (in “lspcu”) but in gres.conf I need to write “CPUs”, not “Cores”… isn’t it?
But if “numactl -H” shows 8 CPUs, why I can use this 8 CPUs in “gres.conf”?
Sorry about this large email.
Thanks.