Hi,
I'm using Slurm on a small 8-node cluster. I've recently added one GPU node with two Nvidia A100 cards, one with 40 GB of memory and one with 80 GB.
As usage of this GPU resource increases, I would like to manage it with GRES to avoid usage conflicts. But at the moment my setup does not work, as I can reach a GPU without reserving it:
srun -n 1 -p tenibre-gpu ./a.out
can use a GPU even though the job request does not ask for this resource (checked by running nvidia-smi on the node).
"tenibre-gpu" is a Slurm partition containing only this GPU node.
Following the documentation I've created a gres.conf file; it has been propagated to all the nodes (9 compute nodes, 1 login node and the management node) and slurmd has been restarted.
gres.conf is:
## GPU setup on tenibre-gpu-0
NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env
NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env
In slurm.conf I have checked that these settings are present:
## Basic scheduling
SelectTypeParameters=CR_Core_Memory
SchedulerType=sched/backfill
SelectType=select/cons_tres
## Generic resources
GresTypes=gpu
## Nodes list
....
Nodename=tenibre-gpu-0 RealMemory=257270 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
....
#partitions
PartitionName=tenibre-gpu MaxTime=48:00:00 DefaultTime=12:00:00 DefMemPerCPU=4096 MaxMemPerCPU=8192 Shared=YES State=UP Nodes=tenibre-gpu-0
...
Maybe I've missed something? I'm running Slurm 20.11.7-1.
Thanks for your advice.
Patrick
Hello Patrick,
On 11/13/24 12:01 PM, Patrick Begou via slurm-users wrote:
> As usage of this GPU resource increases, I would like to manage it with GRES to avoid usage conflicts. But at the moment my setup does not work, as I can reach a GPU without reserving it:
>     srun -n 1 -p tenibre-gpu ./a.out
> can use a GPU even though the job request does not ask for this resource (checked by running nvidia-smi on the node). "tenibre-gpu" is a Slurm partition containing only this GPU node.
I think what you're looking for is the ConstrainDevices parameter in cgroup.conf.
See here:
- https://slurm.schedmd.com/archive/slurm-20.11.7/cgroup.conf.html
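To illustrate (just a sketch, assuming the task/cgroup plugin is already in your TaskPlugin list and leaving your other settings as they are), the relevant part of cgroup.conf could look like:

    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes

With ConstrainDevices=yes, task/cgroup should restrict each job to the GPU device files it actually requested via GRES, so a job submitted without asking for a GPU should no longer be able to open /dev/nvidia0 or /dev/nvidia1.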
Best,
Roberto
Hi Roberto,
thanks for pointing to this parameter. I set it, updated all the nodes and restarted slurmd everywhere, but it does not change the behavior.
However, when looking at the slurmd log on the GPU node I noticed this information:
[2024-11-13T16:41:08.434] debug: CPUs:32 Boards:1 Sockets:8 CoresPerSocket:4 ThreadsPerCore:1
[2024-11-13T16:41:08.434] debug: gres/gpu: init: loaded
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-40 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-80 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2024-11-13T16:41:08.434] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2024-11-13T16:41:08.434] topology/none: init: topology NONE plugin loaded
[2024-11-13T16:41:08.434] route/default: init: route default plugin loaded
[2024-11-13T16:41:08.434] CPU frequency setting not configured for this node
[2024-11-13T16:41:08.434] debug: Resource spec: No specialized cores configured by default on this node
[2024-11-13T16:41:08.434] debug: Resource spec: Reserved system memory limit not configured for this node
[2024-11-13T16:41:08.434] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2024-11-13T16:41:08.434] error: MaxSwapPercent value (0.0%) is not a valid number
[2024-11-13T16:41:08.436] debug: task/cgroup: init: core enforcement enabled
[2024-11-13T16:41:08.437] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257281M allowed:100%(enforced), swap:0%(enforced), max:100%(257281M) max+swap:100%(514562M) min:30M kmem:100%(257281M permissive) min:30M swappiness:0(unset)
[2024-11-13T16:41:08.437] debug: task/cgroup: init: memory enforcement enabled
[2024-11-13T16:41:08.438] debug: task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2024-11-13T16:41:08.438] debug: task/cgroup: init: device enforcement enabled
[2024-11-13T16:41:08.438] debug: task/cgroup: init: task/cgroup: loaded
[2024-11-13T16:41:08.438] debug: auth/munge: init: Munge authentication plugin loaded
So I think something is wrong in my gres.conf file, maybe because I try to configure two different device types on the node?
## GPU setup on tenibre-gpu-0
NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env
NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env
Patrick
Hi Patrick,
You're missing a Gres= entry on your node definition in slurm.conf:
Nodename=tenibre-gpu-0 RealMemory=257270 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:A100-40:1,gpu:A100-80:1
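Once that's in place and slurmctld/slurmd have been reconfigured or restarted, something like this should let you verify it (a rough check, assuming ConstrainDevices=yes from Roberto's suggestion is active and the type names match your gres.conf):

    scontrol show node tenibre-gpu-0 | grep -i gres
    srun -p tenibre-gpu -n 1 nvidia-smi                        # no GRES requested: should not see a GPU
    srun -p tenibre-gpu -n 1 --gres=gpu:A100-40:1 nvidia-smi   # should see only the 40 GB card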
Ben
--
Benjamin Smith <bsm...@ed.ac.uk>
Computing Officer, AT-7.12a
Research and Teaching Unit
School of Informatics, University of Edinburgh