[slurm-users] GPU / cgroup challenges

R. Paul Wiegand

May 1, 2018, 5:25:43 PM
to slurm...@lists.schedmd.com
Greetings,

I am setting up our new GPU cluster, and I seem to have a problem
configuring things so that the devices are properly walled off via
cgroups. Our nodes each have two GPUs; however, if --gres is unset, or
set to --gres=gpu:0, I can access both GPUs from inside a job.
Moreover, if I ask for just 1 GPU and then unset the CUDA_VISIBLE_DEVICES
environment variable, I can access both GPUs. From my
understanding, this suggests that it is *not* being protected under
cgroups.
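
For concreteness, the sort of check I have been running looks roughly like
this (the partition name here is just a placeholder for our setup):

$ srun -p gpu --gres=gpu:1 --pty bash
$ echo $CUDA_VISIBLE_DEVICES    # 0, as expected
$ unset CUDA_VISIBLE_DEVICES
$ nvidia-smi                    # both GPUs are visible and usable

$ srun -p gpu --pty bash        # no --gres at all
$ nvidia-smi                    # again, both GPUs are visible and usable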

I've read the documentation, and I've read through a number of threads
where people have resolved similar issues. I've tried a lot of
configurations, but to no avail. Below I include some snippets of
relevant (current) parameters; however, I also am attaching most of
our full conf files.

[slurm.conf]
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
JobAcctGatherType=jobacct_gather/linux
AccountingStorageTRES=gres/gpu
GresTypes=gpu

NodeName=evc1 CPUs=32 RealMemory=191917 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN NodeAddr=ivc1 Weight=1 Gres=gpu:2

[gres.conf]
NodeName=evc[1-10] Name=gpu File=/dev/nvidia0 COREs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NodeName=evc[1-10] Name=gpu File=/dev/nvidia1 COREs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31

[cgroup.conf]
ConstrainDevices=yes

[cgroup_allowed_devices_file.conf]
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*

Thanks,
Paul.
[Attachments: cgroup_allowed_devices_file.conf, cgroup.conf, gres.conf, slurm.conf]

Kevin Manalo

May 1, 2018, 7:01:32 PM
to pa...@tesseract.org, Slurm User Community List
Paul,

Having recently set this up, this was my test: when you make a single-GPU request from inside an interactive run (salloc ... --gres=gpu:1 srun --pty bash), you should only see the GPU assigned to you via 'nvidia-smi'.

When gres is unset you should see

nvidia-smi
No devices were found

Otherwise, if you ask for 1 of 2, you should only see 1 device.
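
In other words, something along these lines (treat the salloc options as a
sketch; use whatever you normally pass):

$ salloc --gres=gpu:1 srun --pty bash
$ nvidia-smi        # should list exactly one GPU

$ salloc srun --pty bash
$ nvidia-smi        # should report "No devices were found"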

Also, I recall appending this to the bottom of

[cgroup_allowed_devices_file.conf]
..
Same as yours
...
/dev/nvidia*

There was a SLURM bug issue that made this clear, not so much in the website docs.

-Kevin

Christopher Samuel

May 1, 2018, 7:22:04 PM
to slurm...@lists.schedmd.com
On 02/05/18 09:00, Kevin Manalo wrote:

> Also, I recall appending this to the bottom of
>
> [cgroup_allowed_devices_file.conf]
> ..
> Same as yours
> ...
> /dev/nvidia*
>
> There was a SLURM bug issue that made this clear, not so much in the website docs.

That shouldn't be necessary. All we have for this is the following.

The relevant line from our cgroup.conf:

[...]
# Constrain devices via cgroups (to limits access to GPUs etc)
ConstrainDevices=yes
[...]

Our entire cgroup_allowed_devices_file.conf:

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/ram
/dev/random
/dev/hfi*


This is on RHEL7.

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

R. Paul Wiegand

May 1, 2018, 7:24:13 PM
to Kevin Manalo, Slurm User Community List
Thanks Kevin!

Indeed, nvidia-smi in an interactive job tells me that I can get access to the device when I should not be able to.

I thought including the /dev/nvidia* would whitelist those devices ... which seems to be the opposite of what I want, no?  Or do I misunderstand?

Thanks,
Paul

Christopher Samuel

May 1, 2018, 7:25:41 PM
to slurm...@lists.schedmd.com
On 02/05/18 09:23, R. Paul Wiegand wrote:

> I thought including the /dev/nvidia* would whitelist those devices
> ... which seems to be the opposite of what I want, no? Or do I
> misunderstand?

No, I think you're right there, we don't have them listed and cgroups
constrains it correctly (nvidia-smi says no devices when you don't
request any GPUs).

Which version of Slurm are you on?

cheers,
Chris

R. Paul Wiegand

May 1, 2018, 7:29:09 PM
to Slurm User Community List
Thanks Chris.  I do have the ConstrainDevices turned on.  Are the differences in your cgroup_allowed_devices_file.conf relevant in this case?

R. Paul Wiegand

May 1, 2018, 7:32:24 PM
to Slurm User Community List
Slurm 17.11.0 on CentOS 7.1

Kevin Manalo

May 1, 2018, 7:42:59 PM
to pa...@tesseract.org, Slurm User Community List

Chris,

Thanks for the correction there, that /dev/nvidia* isn’t needed in [cgroup_allowed_devices_file.conf] for constraining GPU devices.

-Kevin

Christopher Samuel

May 1, 2018, 7:55:07 PM
to slurm...@lists.schedmd.com
On 02/05/18 09:31, R. Paul Wiegand wrote:

> Slurm 17.11.0 on CentOS 7.1

That's quite old on both fronts (RHEL 7.1 is from 2015). We started on
that same Slurm release but didn't do the GPU cgroup stuff until a later
version (17.11.3 on RHEL 7.4).

I don't see anything in the NEWS file about relevant cgroup changes
though (there is a cgroup affinity fix but that's unrelated).

You do have identical slurm.conf, cgroup.conf,
cgroup_allowed_devices_file.conf etc on all the compute nodes too?
Slurmd and slurmctld have both been restarted since they were
configured?
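
(A quick way to double-check that, assuming the configs live under
/etc/slurm, is something like:

$ for n in evc{1..10}; do ssh $n 'md5sum /etc/slurm/*.conf'; done

just to confirm every node really has identical copies.)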

All the best,

R. Paul Wiegand

May 1, 2018, 8:16:42 PM
to Slurm User Community List
Yes, I am sure they are all the same.  Typically, I just scontrol reconfig; however, I have also tried restarting all daemons.
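
Concretely, what I have been doing is roughly the following (our daemons run
under systemd and we happen to use pdsh, so treat this as a sketch):

$ scontrol reconfigure

$ systemctl restart slurmctld                     # on the head node
$ pdsh -w evc[1-10] systemctl restart slurmd      # on the compute nodes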

We are moving to 7.4 in a few weeks during our downtime.  We had a QDR -> OFED version constraint -> Lustre client version constraint issue that delayed our upgrade.

Should I just wait and test after the upgrade?

Christopher Samuel

May 1, 2018, 8:29:41 PM
to slurm...@lists.schedmd.com
On 02/05/18 10:15, R. Paul Wiegand wrote:

> Yes, I am sure they are all the same. Typically, I just scontrol
> reconfig; however, I have also tried restarting all daemons.

Understood. Any diagnostics in the slurmd logs when trying to start
a GPU job on the node?
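
(For instance, on the compute node, something like the following, with the
log path depending on your SlurmdLogFile setting:

$ grep -iE 'gres|nvidia|device|cgroup' /var/log/slurm/slurmd.log

and temporarily setting SlurmdDebug=debug2 in slurm.conf if nothing useful
shows up.)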

> We are moving to 7.4 in a few weeks during our downtime. We had a
> QDR -> OFED version constraint -> Lustre client version constraint
> issue that delayed our upgrade.

I feel your pain. BTW, RHEL 7.5 is out now, so you'll want that if
you need current security fixes.

> Should I just wait and test after the upgrade?

Well, 17.11.6 will be out by then, and it will include a fix for a
deadlock that some sites hit occasionally, so that will be worth
throwing into the mix too. Do read the RELEASE_NOTES carefully though,
especially if you're using slurmdbd!

R. Paul Wiegand

May 2, 2018, 9:05:11 AM
to Slurm User Community List
I dug into the logs on both the slurmctld side and the slurmd side.
For the record, I have debug2 set for both and
DebugFlags=CPU_BIND,Gres.
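
That is, roughly these lines in our slurm.conf:

SlurmctldDebug=debug2
SlurmdDebug=debug2
DebugFlags=CPU_BIND,Gres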

I cannot see much that is terribly relevant in the logs. There's a
known parameter error reported with the memory cgroup specifications,
but I don't think that is germane.

When I set "--gres=gpu:1", the slurmd log does have encouraging lines such as:

[2018-05-02T08:47:04.916] [203.0] debug: Allowing access to device /dev/nvidia0 for job
[2018-05-02T08:47:04.916] [203.0] debug: Not allowing access to device /dev/nvidia1 for job

However, I can still "see" both devices from nvidia-smi, and I can
still access both if I manually unset CUDA_VISIBLE_DEVICES.
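
One more data point I can collect: whether a device cgroup is actually
being populated for the job. On our CentOS 7 nodes I believe the path looks
something like this (uid and job id filled in appropriately):

$ cat /sys/fs/cgroup/devices/slurm/uid_<uid>/job_<jobid>/step_0/devices.list

My understanding is that if that still shows "a *:* rwm", the device
whitelist isn't being applied at all.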

When I do *not* specify --gres at all, there is no reference to gres,
gpu, nvidia, or anything similar in any log at all. And, of course, I
have full access to both GPUs.

I am happy to attach the snippets of the relevant logs if someone
more knowledgeable wants to pore through them. I can also set the
debug level higher if you think that would help.


Assuming upgrading will solve our problem, in the meantime: is there
a way to ensure that the *default* request always includes "--gres=gpu:1"?
This situation is doubly bad for us: not only is there *a way* around
the resource management of the device, but the *default* behavior when a
user issues an srun/sbatch without specifying a GRES is to go around the
resource manager entirely.

Fulcomer, Samuel

May 2, 2018, 11:12:59 AM
to Slurm User Community List
This came up around 12/17, I think, and as I recall the fixes were added to the source repo then; however, they weren't added to any of the 17.x releases.

Wiegand, Paul

May 2, 2018, 11:15:57 AM
to Fulcomer, Samuel, Slurm User Community List
So there is a patch?


Chris Samuel

May 5, 2018, 9:05:22 AM
to slurm...@lists.schedmd.com
On Wednesday, 2 May 2018 11:04:34 PM AEST R. Paul Wiegand wrote:

> When I set "--gres=gpu:1", the slurmd log does have encouraging lines such
> as:
>
> [2018-05-02T08:47:04.916] [203.0] debug: Allowing access to device /dev/nvidia0 for job
> [2018-05-02T08:47:04.916] [203.0] debug: Not allowing access to device /dev/nvidia1 for job
>
> However, I can still "see" both devices from nvidia-smi, and I can
> still access both if I manually unset CUDA_VISIBLE_DEVICES.

The only thing I can think of is a bug that's been fixed since 17.11.0 (as I
know it works for us with 17.11.5) or a kernel bug (or missing device
cgroups).

Sorry I can't be more helpful!

R. Paul Wiegand

May 21, 2018, 7:18:10 AM
to Slurm User Community List
I am following up on this to first thank everyone for their suggestions and also to let you know that, indeed, upgrading from 17.11.0 to 17.11.6 solved the problem. Our GPUs are now properly walled off via cgroups with our existing config.

Thanks!

Paul.