[slurm-users] Issue with Enforcing GPU Usage Limits in Slurm


lyz--- via slurm-users

Apr 14, 2025, 9:30:00 AM
to slurm...@lists.schedmd.com
Hi, I am currently encountering an issue with Slurm's GPU resource limits. I have attempted to restrict the number of GPUs a user can use by running the following command:
sacctmgr modify user lyz set MaxTRES=gres/gpu=2
This command is intended to limit user 'lyz' to a maximum of 2 GPUs. However, when the user submits a job with srun and specifies CUDA devices 0, 1, 2, and 3 in the job script (for example via os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"), the job still uses all 4 GPUs during execution. This indicates that the GPU usage limit is not being enforced as expected. How can I resolve this?
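
For reference, here is a minimal sketch (assuming PyTorch is installed; the exact Slurm variable names can vary between versions) of a check that can be run inside the job to see what Slurm actually hands the step before the script overrides anything:

import os

import torch

# GPU-related environment variables Slurm typically exports for a step.
for var in ("SLURM_JOB_GPUS", "SLURM_STEP_GPUS", "CUDA_VISIBLE_DEVICES"):
    print(f"{var} = {os.environ.get(var)}")

# How many devices CUDA can actually enumerate in this step.
print("torch.cuda.device_count() =", torch.cuda.device_count())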

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Christopher Samuel via slurm-users

Apr 14, 2025, 6:50:42 PM
to slurm...@lists.schedmd.com
On 4/14/25 6:27 am, lyz--- via slurm-users wrote:

> This command is intended to limit user 'lyz' to a maximum of 2 GPUs. However, when the user submits a job with srun and specifies CUDA devices 0, 1, 2, and 3 in the job script (for example via os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"), the job still uses all 4 GPUs during execution. This indicates that the GPU usage limit is not being enforced as expected. How can I resolve this?

You need to make sure you're using cgroups to control access to devices
for tasks; a starting point for reading up on this is here:

https://slurm.schedmd.com/cgroups.html
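
For illustration, once device confinement is active a job can only open the device nodes it was allocated; here is a minimal sketch (assuming NVIDIA device nodes named /dev/nvidia0, /dev/nvidia1, ...) to observe that from inside a job:

import glob
import os

# With cgroup device confinement, device nodes for GPUs that were not
# allocated to this job still exist in /dev but cannot be opened.
for dev in sorted(glob.glob("/dev/nvidia[0-9]*")):
    try:
        os.close(os.open(dev, os.O_RDWR))
        print(f"{dev}: accessible")
    except OSError as err:
        print(f"{dev}: blocked ({err.strerror})")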

Good luck!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

lyz--- via slurm-users

Apr 15, 2025, 6:32:14 AM
to slurm...@lists.schedmd.com
Hi, Christopher. Thank you for your reply.

I have already modified the cgroup.conf configuration file in Slurm as follows:

vim /etc/slurm/cgroup.conf
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#
CgroupAutomount=yes

ConstrainCores=yes
ConstrainRAMSpace=yes

Then I edited slurm.conf:

vim /etc/slurm/slurm.conf
PrologFlags=CONTAIN
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

I restarted both the slurmctld service on the head node and the slurmd service on the compute nodes.
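
To double-check that job steps really land in a Slurm-managed cgroup after these changes, a small sketch (nothing Slurm-specific assumed; the exact path layout differs between cgroup v1 and v2) that can be run inside a job is:

# With task/cgroup and proctrack/cgroup active, the paths printed here
# should include a Slurm job/step hierarchy rather than only the login
# session's cgroup.
with open("/proc/self/cgroup") as f:
    for line in f:
        print(line.rstrip())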

I also set resource limits for the user:
[root@head1 ~]# sacctmgr show assoc format=cluster,account%35,user%35,partition,maxtres%35,GrpCPUs,GrpMem
Cluster Account User Partition MaxTRES GrpCPUs GrpMem
---------- ----------------------------------- ----------------------------------- ---------- ----------------------------------- -------- -------
cluster lyz
cluster lyz lyz gpus=2 80

However, when I specify CUDA device numbers in my .py script, for example:

import os
import time

import torch

# Force all four GPUs to be visible to CUDA, regardless of what Slurm set.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"


def test_gpu():
    if torch.cuda.is_available():
        torch.cuda.set_device(0)  # select the first visible CUDA device
        print("CUDA is available. PyTorch can use GPU.")

        num_gpus = torch.cuda.device_count()
        print(f"Number of GPUs available: {num_gpus}")

        current_device = torch.cuda.current_device()
        print(f"Current GPU device: {current_device}")

        device_name = torch.cuda.get_device_name(current_device)
        print(f"Name of the current GPU device: {device_name}")

        x = torch.rand(5, 5).cuda()
        print("Random tensor on GPU:")
        print(x)
    else:
        print("CUDA is not available. PyTorch will use CPU.")
    time.sleep(1000)


if __name__ == "__main__":
    test_gpu()

When I run this script, it still bypasses the resource restrictions set by cgroups.

Are there any other ways to solve this problem?

Sean Crosby via slurm-users

Apr 15, 2025, 7:49:09 AM
to l...@simplehpc.com, slurm...@lists.schedmd.com
You need to add

ConstrainDevices=yes

to your cgroup.conf and restart slurmd on your nodes. This is the setting that restricts a job to only the GRES it requested.
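
With ConstrainDevices=yes active, overriding CUDA_VISIBLE_DEVICES inside the script should make no difference, because CUDA can only enumerate GPUs whose device nodes the job's cgroup allows. A minimal sketch (assuming PyTorch) of what to expect inside a --gres=gpu:2 job:

import os

# Deliberately list more device indices than the job was granted.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import torch  # imported after the override so CUDA initialises with it

# With device confinement working, only the two allocated GPUs are
# enumerable, so this should print 2 despite the four indices above.
print("visible to CUDA:", torch.cuda.device_count())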

Sean


From: lyz--- via slurm-users <slurm...@lists.schedmd.com>
Sent: Tuesday, April 15, 2025 8:29:41 PM
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: [EXT] [slurm-users] Re: Issue with Enforcing GPU Usage Limits in Slurm

lyz--- via slurm-users

Apr 15, 2025, 9:18:40 AM
to slurm...@lists.schedmd.com
Hi, Sean.
I followed your instructions and added ConstrainDevices=yes to the /etc/slurm/cgroup.conf file on the server node, then restarted the relevant services on both the server and the client nodes.
However, I still can't enforce the restriction in the Python program.

It seems the restriction applies to the physical GPU devices, but it doesn't take effect for CUDA.

Sean Crosby via slurm-users

Apr 15, 2025, 3:58:24 PM
to l...@simplehpc.com, slurm...@lists.schedmd.com
What version of Slurm are you running, and what are the contents of your gres.conf file?

Sean


From: lyz--- via slurm-users <slurm...@lists.schedmd.com>
Sent: Tuesday, April 15, 2025 11:16:40 PM
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: [slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm

Christopher Samuel via slurm-users

Apr 15, 2025, 4:17:14 PM
to slurm...@lists.schedmd.com
On 4/15/25 12:55 pm, Sean Crosby via slurm-users wrote:

> What version of Slurm are you running and what's the contents of your
> gres.conf file?

Also what does this say?

systemctl cat slurmd | fgrep Delegate

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

lyz--- via slurm-users

Apr 15, 2025, 10:00:33 PM
to slurm...@lists.schedmd.com
Hi, Sean. It's the latest slurm version.
[root@head1 ~]# sinfo --version
slurm 22.05.3

And this is the content of the gres.conf on the GPU node.
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=gpu File=/dev/nvidia4
Name=gpu File=/dev/nvidia5
Name=gpu File=/dev/nvidia6
Name=gpu File=/dev/nvidia7
# END AUTOGENERATED SECTION -- DO NOT REMOVE
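
As a quick sanity check, those entries can be compared against the device nodes actually present on the node; a minimal sketch:

import glob

# Each node listed here should have a matching "Name=gpu File=/dev/nvidiaN"
# line in gres.conf.
for dev in sorted(glob.glob("/dev/nvidia[0-9]*")):
    print(dev)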

lyz--- via slurm-users

Apr 15, 2025, 10:05:05 PM
to slurm...@lists.schedmd.com
Hi, Chris. Thank you for continuing to pay attention to this issue.
I followed your instruction. This is the output:

[root@head1 ~]# systemctl cat slurmd | fgrep Delegate
Delegate=yes

lyz

Christopher Samuel via slurm-users

Apr 16, 2025, 12:58:05 AM
to slurm...@lists.schedmd.com
On 4/15/25 6:57 pm, lyz--- via slurm-users wrote:

> Hi, Sean. It's the latest slurm version.
> [root@head1 ~]# sinfo --version
> slurm 22.05.3

That's quite old (and no longer supported); the oldest still-supported
version is 23.11.10, and 24.11.4 came out recently.

What does the cgroup.conf file on one of your compute nodes look like?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Christopher Samuel via slurm-users

Apr 16, 2025, 12:58:07 AM
to slurm...@lists.schedmd.com
Hiya,

On 4/15/25 7:03 pm, lyz--- via slurm-users wrote:

> Hi, Chris. Thank you for continuing to pay attention to this issue.
> I followed your instruction. This is the output:
>
> [root@head1 ~]# systemctl cat slurmd | fgrep Delegate
> Delegate=yes

That looks good to me, thanks for sharing that!

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

lyz--- via slurm-users

Apr 16, 2025, 1:31:57 AM
to slurm...@lists.schedmd.com
Hi, Chris.
The cgroup.conf on my GPU node is the same as on the head node. The contents are as follows:
CgroupAutomount=yes

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes

I'll try a newer version of Slurm.

lyz--- via slurm-users

Apr 16, 2025, 3:58:53 AM
to slurm...@lists.schedmd.com
Hi, Chris.
Thank you again for your instructions.

I've tried version 23.11.10. It does work.

When I ran the script using the following command, it successfully restricted the usage to the specified CUDA devices:
srun -p gpu --gres=gpu:2 --nodelist=node11 python test.py

And when I checked the GPUs using this command, I saw the expected number of GPUs:
srun -p gpu --gres=gpu:2 --nodelist=node11 --pty nvidia-smi

Thank you very much for your guidance.

Best regards,
Lyz

Chris Samuel via slurm-users

Apr 16, 2025, 11:24:18 AM
to slurm...@lists.schedmd.com
Hiya!

On 16/4/25 12:56 am, lyz--- via slurm-users wrote:

> I've tried version 23.11.10. It does work.

Oh that's wonderful, so glad it helped! It did seem quite odd that it
wasn't working for you before then. I wonder if this was a cgroups v1 vs
cgroups v2 thing?
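
For anyone who wants to check which of the two a node is running, a minimal sketch (no Slurm involvement assumed):

import os

# On a cgroup v2 (unified hierarchy) system this file exists at the root of
# the cgroup filesystem; on a pure cgroup v1 setup it does not.
if os.path.exists("/sys/fs/cgroup/cgroup.controllers"):
    print("cgroup v2 (unified hierarchy)")
else:
    print("cgroup v1 (or hybrid)")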

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

lyz--- via slurm-users

Apr 16, 2025, 9:47:43 PM
to slurm...@lists.schedmd.com
Hi Chris!

I didn't modify the cgroup configuration file; I only upgraded the Slurm version.
After that, the limits were enforced as expected.

It's quite odd.

lyz