For SingularityCE 3.9 we are working toward using the official `nvidia-container-cli` tooling to set up Nvidia GPUs and their required libraries when running Singularity containers. Using `nvidia-container-cli` is the Nvidia-supported method of configuring GPU containers, and offers several benefits over SingularityCE's current home-grown approach (though the legacy `--nv` GPU binding will be maintained and available for the time being):
* We will be able to support passing through only specific GPUs / MIG instances into a container (see below).
* We will be able to provide support for non-compute GPU capabilities, i.e. the graphics/video/display capabilities of `nvidia-container-cli`.
* Limited CUDA forward compatibility, where 'compat' libraries inside the container provide a newer CUDA version than the host driver supports natively, should work.
* We don't have to track CUDA library changes - Nvidia will update their tool when things change.
A draft PR is now up for discussion at: https://github.com/sylabs/singularity/pull/144
In particular we'd love feedback on how to address the issue of `NVIDIA_VISIBLE_DEVICES` handling to expose only specific GPUs / MIG instances from the host into the container...
Historically Singularity has *always* bound all GPU devices into the container when running with `--nv`. This means applications in the container can access all GPUs, and restrictions to specific GPUs are via application configuration and/or setting `CUDA_VISIBLE_DEVICES`. It's easy to work around those, or set them up incorrectly, ending up with an application using more GPUs than you intend.
In the OCI world, only the GPUs specified by the env var `NVIDIA_VISIBLE_DEVICES` in config are available when the container runs. `nvidia-container-cli` uses this environment variable to bind in the correct GPUs at container setup time.
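As a rough illustration of what that variable can carry, here is a minimal Python sketch that classifies a `NVIDIA_VISIBLE_DEVICES` value along the lines of the NVIDIA container runtime documentation: `all`, `none`, `void` / empty / unset, or a comma-separated list of GPU indices or UUIDs. The helper name `parse_visible_devices` is hypothetical, `none` and `void` are treated alike for simplicity, and this is not how `nvidia-container-cli` or the PR actually parses the variable.

```python
from typing import List, Optional


def parse_visible_devices(value: Optional[str]) -> List[str]:
    """Return the requested device identifiers, ["all"], or [] for no GPUs.

    Hypothetical helper, for illustration only.
    """
    if value is None:
        # Variable not set: nothing was requested.
        return []
    value = value.strip()
    if value in ("", "void", "none"):
        # Simplification: all of these are treated as "no GPUs" here.
        return []
    if value == "all":
        return ["all"]
    # Otherwise assume a comma-separated list of indices or UUIDs.
    return [item.strip() for item in value.split(",") if item.strip()]
```

For example, `parse_visible_devices("0, 1")` yields the two identifiers `"0"` and `"1"`, while an unset variable yields an empty list.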
At present, the PR does the following:
* Makes *all* GPUs available in the container if `--nv` is used without `--contain` or `--containall`. This is the same as current SingularityCE, and is actually due to SingularityCE binding the whole `/dev` tree... not anything nvidia-container-cli is doing.
* Only makes the `NVIDIA_VISIBLE_DEVICES` GPUs available in the container if `--nv` is used with `--contain` or `--containall`. This differs from current SingularityCE. If `NVIDIA_VISIBLE_DEVICES` has not been set, a warning is shown that no GPUs will be available.
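The two cases above can be summarised as a small decision function. This is only a sketch of the behaviour as described in this post, not code from the PR: the function name `gpus_with_nv`, the use of index strings to identify GPUs, and the handling of `all` under `--contain` are assumptions made for illustration.

```python
from typing import List, Optional


def gpus_with_nv(host_gpus: List[str], contained: bool,
                 visible_devices: Optional[str]) -> List[str]:
    """GPUs a container started with --nv would see (hypothetical model).

    contained: True if --contain or --containall was also passed.
    visible_devices: the value of NVIDIA_VISIBLE_DEVICES, or None if unset.
    """
    if not contained:
        # The whole /dev tree is bound in, so every GPU is present.
        return list(host_gpus)
    if not visible_devices:
        # Unset or empty: the PR warns that no GPUs will be available.
        return []
    if visible_devices == "all":
        # Assumption: "all" exposes every host GPU.
        return list(host_gpus)
    requested = {item.strip() for item in visible_devices.split(",")}
    return [gpu for gpu in host_gpus if gpu in requested]


host = ["0", "1", "2", "3"]
print(gpus_with_nv(host, contained=False, visible_devices="1"))
print(gpus_with_nv(host, contained=True, visible_devices="1,3"))
print(gpus_with_nv(host, contained=True, visible_devices=None))
```

Note that in the first call the variable is ignored entirely, which is the part of the current behaviour most likely to surprise people.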
This is a bit confusing, but we are aiming here to:
1) Allow limiting containers to specific GPUs via `NVIDIA_VISIBLE_DEVICES`.
2) Get as close as we can to the way GPUs are used in the OCI world, but still...
3) Respect the historic way in which Singularity favors integration over isolation, has made all GPUs available, and is deployed in existing workflows.
If you have any suggestions on how we might handle `NVIDIA_VISIBLE_DEVICES` and limit containers to specific GPUs, we'd love to hear them.
What would you expect to happen by default? How would you expect this to interact with `--contain` / `--containall` etc?