[slurm-users] Use gres to handle permissions of /dev/dri/card* and /dev/dri/renderD*?


Martin Pecka

Sep 4, 2020, 3:29:44 PM
to slurm...@lists.schedmd.com
Hello, we want to use the EGL backend for accessing OpenGL without the
need for Xorg. This approach requires access to the devices
/dev/dri/card* and /dev/dri/renderD*. Is there a way to grant access to
these devices along with /dev/nvidia*, which we use for CUDA? Ideally as
a single generic resource that would grant permissions to all three files at once.

Thank you for any tips.

--
Martin Pecka


Mgr. Martin Pecka

Oct 20, 2020, 4:58:54 PM
to slurm...@lists.schedmd.com
Pinging this topic again. Does nobody have an idea how to define
multiple files to be treated as a single gres?

Thank you for your help,

Martin Pecka

On 4 Sep 2020 at 21:29, Martin Pecka wrote:

Daniel Letai

Oct 21, 2020, 12:52:53 PM
to slurm...@lists.schedmd.com

Take a look at https://github.com/SchedMD/slurm/search?q=dri%2F

If the ROCM-SMI API is present, using AutoDetect=rsmi in gres.conf might be enough, if I'm reading this right.
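
For illustration, a minimal gres.conf sketch for that case (assuming the node's GPU count is also declared via Gres= in slurm.conf and the ROCm SMI .so is installed on the node) might be just:

    AutoDetect=rsmi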


Of course, this assumes the cards in question are AMD and not NVIDIA.

Daniel Letai

Oct 21, 2020, 1:02:28 PM
to slurm...@lists.schedmd.com

Just a quick addendum - rsmi_dev_drm_render_minor_get, used in the plugin, references the ROCM-SMI lib from https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/2e8dc4f2a91bfa7661f4ea289736b12153ce23c2/src/rocm_smi.cc#L1689, so the library (as an .so file) has to be installed for this to work.



On 20/10/2020 23:58, Mgr. Martin Pecka wrote:

Martin Pecka

Jan 6, 2022, 12:21:26 PM
to Slurm User Community List

Hello, I'm reviving a bit of an old thread; I just noticed that my January 2021 message doesn't appear in the archives, so I'm sending it again now that the issue has become live again on our side.


To quickly recap: we want to grant permissions not only to the /dev/nvidia* devices based on the requested gres, but also to the corresponding /dev/dri/card* and /dev/dri/renderD* devices - they are all connected to the same GPU, but the additional two allow using the card for rendering instead of CUDA computations etc. I had an idea of how to achieve that without changing the SLURM codebase, and I got something that could almost work; it probably just needs some polishing. Could anybody please comment on whether the proposed solution is a good idea?


The 15 Jan 2021 message:


So I started wondering whether this could be handled by a prolog script and direct cgroup manipulation. I'm no expert in either, so please check my line of thought.


#!/bin/bash

PATH=/usr/bin/:/bin

gpus=${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}  # or CUDA_VISIBLE_DEVICES when run inside the cgroup?
cgroup=$(cat /proc/self/cgroup | grep devices | cut -d: -f3)  # or something else?

# blacklist all DRM character devices (major 226); the type has to be "c" here,
# since an "a" entry would deny access to all devices, not just major 226
cgset -r devices.deny="c 226:* rwm" devices:${cgroup}

for NVIDIA_SMI_ID in ${gpus//,/ }; do
  # find which PCI address this device sits on (nvidia-smi prints an 8-digit
  # PCI domain; strip the extra zeros so it matches the /sys/bus/pci name)
  pci_id=$(nvidia-smi -i $NVIDIA_SMI_ID --query-gpu=pci.bus_id --format=csv,noheader | tail -c+5 | tr '[:upper:]' '[:lower:]')

  # find the DRM devices sitting on the same PCI bus
  card=$(ls /sys/bus/pci/devices/${pci_id}/drm/ | grep '^card')
  render=$(ls /sys/bus/pci/devices/${pci_id}/drm/ | grep '^renderD')

  # allow access to the DRM devices
  [ -n "${card}" ] && cgset -r devices.allow="c $(cat /sys/class/drm/${card}/dev) rw" devices:${cgroup} && echo "Allowed /dev/dri/${card} DRI device access"
  [ -n "${render}" ] && cgset -r devices.allow="c $(cat /sys/class/drm/${render}/dev) rw" devices:${cgroup} && echo "Allowed /dev/dri/${render} render node access"
done

Now I wonder whether this should be Prolog=, TaskProlog= or something else (that would also change whether I look at CUDA_VISIBLE_DEVICES or SLURM_STEP_GPUS, and how I figure out the cgroup name). I guess that if this script were run as the invoking user, nothing would prevent them from granting themselves access to all devices again, so I'd be inclined to treat it as a Prolog= script run by root. How would I get the cgroup path then? Compose it from parts as mentioned in the Slurm cgroups docs (/cgroup/cpuset/slurm/uid_100/job_123/step_0/task_2)? Or is there a more reliable way?
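
To make the root-Prolog idea concrete, here is a minimal sketch, assuming cgroup v1 with the devices controller mounted at /sys/fs/cgroup/devices and Slurm's default uid_*/job_* hierarchy (both assumptions need checking on the actual nodes, and if ConstrainDevices=yes is used, Slurm may rewrite the per-step device lists afterwards anyway):

#!/bin/bash
# Root Prolog sketch: allow the job's devices cgroup to access the DRM
# devices sitting on the same PCI bus as the allocated NVIDIA GPUs.
PATH=/usr/bin:/bin

# slurmd exports SLURM_JOB_UID, SLURM_JOB_ID and SLURM_JOB_GPUS to the Prolog.
cgdir=/sys/fs/cgroup/devices/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}  # assumed layout

for gpu in ${SLURM_JOB_GPUS//,/ }; do
  # PCI address of this GPU, converted to the /sys/bus/pci/devices naming
  pci_id=$(nvidia-smi -i "$gpu" --query-gpu=pci.bus_id --format=csv,noheader | tail -c+5 | tr '[:upper:]' '[:lower:]')
  for drm in /sys/bus/pci/devices/${pci_id}/drm/card* /sys/bus/pci/devices/${pci_id}/drm/renderD*; do
    [ -e "$drm" ] || continue
    # devices.allow expects "<type> <major>:<minor> <access>"
    echo "c $(cat "${drm}/dev") rw" > "${cgdir}/devices.allow"
  done
done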


A related but off-topic idea popped up in my head when thinking about GPUs. Most of them are actually a consolidation of several devices: stream processors, encoders, decoders, ray-tracing units, shaders, memory etc. Could it be possible (in the future) to offer each of these pieces as a different gres? The problem is that most of them do not have any special file which a user could lock to tell the others they are using it. So it would probably require support at the level of the cgroup implementation, which, in turn, would require changing all GPU drivers. And it would require being able to request just chunks of GPU memory (not sure if that's possible right now, but I think I saw some pull request about that).


Thank you for hints!


Martin


On 21 Oct 2020 at 19:09, Martin Pecka wrote:

Or maybe this could be "emulated" by a set of 3 GRES per card that are "linked" together? I.e. rules like "if the user requests the GRES /dev/dri/card0, they automatically also claim /dev/dri/renderD128 and /dev/nvidia0"?


On 21 Oct 2020 at 18:52, Daniel Letai wrote:

Stephan Roth

Jan 6, 2022, 2:28:05 PM
to slurm...@lists.schedmd.com
Hi Martin,

My (quick and unrefined) thoughts about this:

This could only work if you don't have ConstrainDevices=yes in your
cgroup.conf, which I don't think is a good idea, as jobs could then use
GPUs allocated to other jobs.

Let's assume you don't use ConstrainDevices=yes:
The GPUs allocated to a job can only safely be identified in the job's
context (task prologue). I assume you're aware of this, as you're
suggesting to use SLURM_STEP_GPUS or CUDA_VISIBLE_DEVICES.

On a side note: AFAIK, these environment variables are supposed to be
identical to the minor PCI device number (for CUDA_VISIBLE_DEVICES,
provided CUDA_DEVICE_ORDER=PCI_BUS_ID is set). This might change after a
node is rebooted. For your use case this shouldn't matter, though.

Then my question is: how can you safely use cgset in a job's context
(i.e. with its user's privileges) to modify access to /dev/dri/card* etc.?


Is your goal to enable VirtualGL for jobs? If it is, I tried a solution
that packs it, its dependencies, a minimal X11 server and TurboVNC
into a Singularity image which can be used in a job.
This worked as a proof of concept for glxgears, but not for the software
users wanted to run.

Eventually this might work with Vulkan instead of OpenGL. The software in
question would have to be updated, too, and the GPU drivers would have to
support the needed Vulkan features as well.


I'd appreciate any further thoughts and insights about this topic as well.

Best,
Stephan
-------------------------------------------------------------------
Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
+4144 632 30 59 | ETF D 104 | Sternwartstrasse 7 | 8092 Zurich
-------------------------------------------------------------------

Martin Pecka

Jan 7, 2022, 5:30:41 AM
to Slurm User Community List
Maybe I have good news, Stephan (and others). I discovered SLURM 20.11
added a MultipleFiles option to gres.conf, which replaces File=. There
are no docs about it yet, but I found a (possibly) working snippet
making use of this option here:
https://bugs.schedmd.com/show_bug.cgi?id=11091#c13 .

So my guess is that the correct line could be something like

    Name=gpu Type=a100 MultipleFiles=/dev/nvidia0,/dev/dri/card1,/dev/dri/renderD128

(our machines also have an integrated GPU, which creates /dev/dri/card0
but no renderD device; that's why I map card1 to the 0th NVIDIA GPU)

I'll try to make a test setup using this and report how it works. Most
importantly, it would be essential to know whether the card* and
renderD* device names are also assigned in PCI order (I hope so!), and
whether cgroups handle these devices correctly. There would also be the
question of how to report which card* and renderD* devices the user can
use in a job, but if they can be derived from SLURM_STEP_GPUS, it
wouldn't be difficult to provide a userspace script that generates the
list of usable devices (a sketch follows below).
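
Such a script could even double as a TaskProlog= that exports the list into the job environment (anything a TaskProlog prints in the form "export NAME=value" is added to the task's environment). A sketch, reusing the same nvidia-smi/sysfs lookup as in the January script; the variable name SLURM_JOB_DRI_DEVICES is made up:

#!/bin/bash
# TaskProlog sketch: tell the task which /dev/dri nodes belong to its GPUs.
PATH=/usr/bin:/bin
gpus=${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}
devs=""
for gpu in ${gpus//,/ }; do
  pci_id=$(nvidia-smi -i "$gpu" --query-gpu=pci.bus_id --format=csv,noheader | tail -c+5 | tr '[:upper:]' '[:lower:]')
  for drm in /sys/bus/pci/devices/${pci_id}/drm/card* /sys/bus/pci/devices/${pci_id}/drm/renderD*; do
    [ -e "$drm" ] && devs="${devs:+${devs},}/dev/dri/$(basename "$drm")"
  done
done
echo "export SLURM_JOB_DRI_DEVICES=${devs}"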

> Is your goal to enable VirtualGL for jobs? If it is, I tried a solution
> with packing it, its dependencies, a minimal X11 server and turbovnc
> into a singularity image which can be used in a job.
> This worked as a proof of concept for glxgears, but not for the software
> users wanted to run.
Yes, virtualgl+xvfb or virtualgl+turbovnc is exactly the use case I have
in mind. We had this working on a headless non-Slurm server without many
problems, running a robotics simulator with rendering sensors, and
sometimes even with a GUI.
> Eventually this might work with Vulkan instead of OpenGL. Software in
> question would have to be updated, too, GPU drivers would have to
> support the needed Vulkan features as well.
No idea which devices Vulkan uses. Are they also the DRM devices?

Martin


Martin Pecka

Jan 7, 2022, 11:34:31 AM
to Slurm User Community List
Okay, I verified the MultipleFiles approach on a test Slurm installation
with one control machine and two nodes, and it works (with
ConstrainDevices=yes)!

    Name=gpu Type=3090 MultipleFiles=/dev/nvidia0,/dev/dri/card1,/dev/dri/renderD128
    Name=gpu Type=3090 MultipleFiles=/dev/nvidia1,/dev/dri/card2,/dev/dri/renderD129

I could attach with various --gres=gpu:3090:* configurations - one card,
two cards - and I always got access only to the files belonging to the
acquired cards, both in CUDA applications (nvidia*), in EGL (card*) and
in VAAPI-accelerated ffmpeg (renderD*; these should also be what Vulkan
uses, as I found out). With NVIDIA cards it works flawlessly. I had a
problem with the i915 integrated GPU on a test notebook - I could set it
up as a gres (without the nvidia* device) and claim it or use the
renderD* device in ffmpeg, but VirtualGL did not run on the card* device...
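
For reference, a quick way to check the isolation from the submit side might be something like the following, which tries to open every device node read-write and prints the ones the device cgroup lets through (the gres type matches the gres.conf above; everything else about the cluster is assumed):

srun --gres=gpu:3090:1 bash -c '
  for d in /dev/nvidia[0-9]* /dev/dri/card* /dev/dri/renderD*; do
    (exec 3<> "$d") 2>/dev/null && echo "usable: $d"
  done'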

With Slurm 20.11 you get an unpleasant behavior of the environment
variables, though: CUDA_VISIBLE_DEVICES and SLURM_STEP_GPUS contain
garbage. On a 2-GPU machine, CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 and
SLURM_STEP_GPUS=0,1,128,1,2,129. This bug was fixed in 21.08, and I
think a workaround for 20.11 would be to just pick every 3rd value
from these lists in a prolog script (sketched below).
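
For 20.11, a sketch of that "every 3rd value" workaround (assuming the first value of each triplet is always the nvidia minor, as in the example above - that would need verification):

# e.g. SLURM_STEP_GPUS=0,1,128,1,2,129  ->  0,1
fixed=$(echo "${SLURM_STEP_GPUS}" | tr ',' '\n' | awk 'NR % 3 == 1' | paste -s -d, -)
export SLURM_STEP_GPUS="${fixed}"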

One last thing to mention - for the card* and renderD* devices to work
in Slurm, you have to set them to mode 666 on the physical node
machines; cgroups will take care of blacklisting the non-claimed
devices. Also don't forget to add /dev/dri/card* and /dev/dri/renderD*
to /etc/slurm/cgroup_allowed_devices_file.conf (example below).
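
For completeness, one possible way to set the mode persistently is a udev rule, and the additions to the allowed-devices file are just the two globs (the exact file names below are assumptions; keep whatever entries the file already contains):

# /etc/udev/rules.d/99-drm-permissions.rules
SUBSYSTEM=="drm", KERNEL=="card*|renderD*", MODE="0666"

# appended to /etc/slurm/cgroup_allowed_devices_file.conf
/dev/dri/card*
/dev/dri/renderD*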

Now the only thing that remains to verify is that the mapping between
nvidiaX and cardX devices doesn't change across reboots. I wasn't able to
find any documentation about how either of these devices is enumerated.
On all machines I could access, the cardX and renderDY devices follow
the same order, and I'd bet that this is guaranteed (as the render node is
created by the same driver as the cardX device), although you can't
simply say Y=X+128 (see the example from my previous email where card0
doesn't have any renderD). Experimentally, the order is not the PCI bus
ID order (0000:01:00.0 has card2, while 0000:41:00.0 has card1 on one
machine). On all machines I could access, it also seemed to me that the
relative order between nvidiaX and cardX devices stays the same.
However, I know people say the ordering of nvidiaX devices can change
between reboots (or at least I think I saw something like that written
somewhere). Does anyone have a pointer to more information?
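
One way to check would be to dump the nvidiaX <-> PCI <-> cardX/renderDY mapping after each reboot and compare the outputs; for example (the /proc path applies to the proprietary NVIDIA driver):

# /dev/nvidiaN minor -> PCI address (the proc directory name is the PCI address)
grep -H "Device Minor" /proc/driver/nvidia/gpus/*/information
# /dev/dri node -> PCI address (the sysfs "device" link points at the PCI device)
for d in /dev/dri/card* /dev/dri/renderD*; do
  n=$(basename "$d")
  echo "$n -> $(basename "$(readlink -f "/sys/class/drm/${n}/device")")"
done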

Let me know if somebody else succeeds setting this up!

Martin



