[slurm-dev] Wrong device order in CUDA_VISIBLE_DEVICES


Maik Schmidt

Nov 3, 2017, 5:15:34 AM
to slurm-dev
Dear all,

first, let me say that we do not use ConstrainDevices in our setup, so we
have to rely on CUDA_VISIBLE_DEVICES to ensure that user applications
use the GPUs they have actually been allocated on our multi-GPU nodes.
This seemed to work well for quite some time on our homogeneous nodes,
but now that we have a heterogeneous node with three different GPU
architectures present, I have noticed that the way SLURM sets
CUDA_VISIBLE_DEVICES does not conform to how CUDA actually
interprets this variable.

It is my understanding that when ConstrainDevices is not set to "yes",
SLURM uses the so-called "Minor Number" (nvidia-smi -q | grep Minor),
i.e. the number in the device name (/dev/nvidia0 -> ID 0 and so on),
and puts it into the environment variable. This, however, does not
necessarily match the device index in either NVML or the CUDA API, nor
does it correlate with the device IDs that CUDA expects in
CUDA_VISIBLE_DEVICES.
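
For reference, this is roughly how one can compare the two numbering
schemes on a node (output omitted here, the exact values are of course
node-specific):

    # minor numbers, i.e. the X in /dev/nvidiaX, one line per GPU
    nvidia-smi -q | grep "Minor Number"
    # NVML enumeration order, with PCI address and UUID for comparison
    nvidia-smi --query-gpu=index,name,pci.bus_id,uuid --format=csv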

By default, CUDA uses a heuristic called FASTEST_FIRST to determine the
device enumeration order that the IDs in CUDA_VISIBLE_DEVICES refer to:
the fastest GPU becomes device 0, while the order of the remaining
devices is unspecified (see [1]). This behaviour can be overridden by
also setting CUDA_DEVICE_ORDER=PCI_BUS_ID, but even then it is not
guaranteed that the order of the devices under /dev matches the order
of the PCI bus IDs.

Long story short, with the IDs that SLURM puts in CUDA_VISIBLE_DEVICES,
we do not get the right devices selected by CUDA applications, which can
easily be verified with, e.g., deviceQuery from the CUDA samples.
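
For illustration, this is more or less what I do inside a job to check
(illustrative only, the actual device list depends on the allocation):

    # what SLURM exported for this job
    echo $CUDA_VISIBLE_DEVICES
    # force PCI bus order instead of FASTEST_FIRST, then re-check the enumeration
    export CUDA_DEVICE_ORDER=PCI_BUS_ID
    ./deviceQuery | grep '^Device'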

I currently do not see a way to fix this properly without interfacing
with the CUDA runtime, or at least using NVML/nvidia-smi to get the GPU
UUIDs, which can also be used in CUDA_VISIBLE_DEVICES and would make
this entire mess a lot more intuitive. It seems, though, that we would
have to patch the gres plugin in order to achieve this.
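
Something along these lines is what I have in mind (the UUID below is
made up, the real ones come from nvidia-smi):

    # UUIDs as reported by the driver
    nvidia-smi -L
    # select a GPU by its UUID instead of by index
    export CUDA_VISIBLE_DEVICES=GPU-d1234567-89ab-cdef-0123-456789abcdef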

Any thoughts?

[1] http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars

--
Maik Schmidt
HPC Services

Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
Willers-Bau A116
D-01062 Dresden
Telefon: +49 351 463-32836


Kilian Cavalotti

Nov 3, 2017, 12:08:01 PM
to slurm-dev

Hi Maik,

On Fri, Nov 3, 2017 at 2:14 AM, Maik Schmidt <maik.s...@tu-dresden.de> wrote:
> It is my understanding that when ConstrainDevices is not set to "yes", SLURM
> uses the so called "Minor Number" (nvidia-smi -q | grep Minor) that is the
> number in the device name (/dev/nvidia0 -> ID 0 and so on) and puts it in
> the environment variable.

Not exactly. When using ConstrainDevices, Slurm creates a cgroup for
the job where only the GPUs that have been allocated to that job are
visible. Meaning that on a 4-GPU server, when you submit a job with
"--gres gpu:1" and when ContrainDevices is enabled and correctly
configured, "nvidia-smi -L" will only list 1 GPU in that job's
context.

By default, CUDA (the NVML, actually) numbers all the GPUs it has
access to starting from 0. This means that in the previous job, the id
assigned to that GPU by the NVML will be 0. If, while that job is still running,
you submit another 1-GPU job, in the context of that second job, the
GPU id will *also* be 0, as this is the only GPU the job will see. You
can verify that the physical GPUs assigned to each job are indeed
different by looking at either their serial number, PCI address or
UUID.
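
Just as an illustration (UUIDs and GPU model made up), two concurrent
1-GPU jobs on the same node would show something like:

    $ srun --gres=gpu:1 nvidia-smi -L
    GPU 0: Tesla K80 (UUID: GPU-aaaaaaaa-1111-2222-3333-444444444444)
    $ srun --gres=gpu:1 nvidia-smi -L    # second job, first one still running
    GPU 0: Tesla K80 (UUID: GPU-bbbbbbbb-5555-6666-7777-888888888888)

Same index, different physical GPU.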

This "relative" numbering scheme (by opposition to the absolute
numbering scheme that the kernel uses for CPUs, for instance), is a
long-debated historical CUDA idiosyncrasy. I don't think it makes a
lot of sense in modern day multi-GPU systems, but that's how it is.
Some can argue that it simplifies the life of the developer, who can
always assume that there will be a GPU 0 in the environment. But it
most often leads to horrible assumptions in applications code...

> This, however, does not necessarily match the
> device index in neither nvml nor CUDA API, nor does it correlate with the
> device IDs in CUDA_VISIBLE_DEVICES.
>
> By default, CUDA uses a heuristic called FASTEST_FIRST to determine the
> order with respect to CUDA_VISIBLE_DEVICES, making the fastest GPU device 0
> but leaving the rest of the devices unspecified (see [1]). This behaviour
> can be overridden by also setting CUDA_DEVICE_ORDER=PCI_BUS_ID, but even
> then, it is not guaranteed that the order of the devices under /dev match
> the order of the PCI bus IDs.

I think it should, since the driver creates the /dev/ entries using
the PCI order too.

> Long story short, with the IDs that SLURM puts in CUDA_VISIBLE_DEVICES, we
> do not get the right devices selected by CUDA applications which can easily
> be verified with e.g. deviceQuery from the CUDA samples.

I can see that happening indeed, if the NVML numbering scheme doesn't
match the device numbers in /dev. Slurm only knows about the
/dev/nvidiaX devices, and that's what it uses to set the value of
CUDA_VISIBLE_DEVICES when ConstrainDevices is not enabled (cf.
https://bugs.schedmd.com/show_bug.cgi?id=1421 for some historical
context).

GPU numbering is a giant mess. I think that at some point, NVIDIA
should really fix the way GPUs are numbered. It's actually funny to
see that even the NVIDIA developers are forced to develop workarounds
in their own software:
https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation

Since this is quite unlikely to happen, one option for better
integration would be for Slurm to query the GPU UUIDs and use them to
populate CUDA_VISIBLE_DEVICES instead of the current integer indexes.
You may want to submit a feature request at https://bugs.schedmd.com
if you're interested. But in the meantime, your best option is
probably to enable ConstrainDevices to alleviate the issue, or to use
CUDA_DEVICE_ORDER=PCI_BUS_ID.
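
For reference, the relevant bits of configuration would look roughly
like this (node name and GPU count are made up, adjust to your setup):

    # cgroup.conf
    ConstrainDevices=yes

    # gres.conf -- one File= entry (or range) per GPU device node
    NodeName=gpu01 Name=gpu File=/dev/nvidia[0-3]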

Cheers,
--
Kilian