"nvidia-smi not found" and SINGULARITYENV_PATH having no effect


Keith Ball

Mar 14, 2018, 12:00:12 AM
to singularity
Hi All,

We have a Bright Computing cluster running RHEL 7.4. We are running Bright-packaged singularity 2.4.2 and CUDA 9.0 Toolkit (from which our nvidia-smi comes).
This binary lives in a nonstandard location: /cm/local/apps/cuda-driver/lib/current/bin (likewise, the CUDA libs live under /cm/local/apps/ as well).

When we try to run using "singularity run --nv", either by first building a Singularity image then running it, or running the Docker image "on the fly", we get a "no nvidia-smi" error as shown below:

$ singularity build tensorflow_xxx.img docker://reg.xxxx.com:5000/tensorflow_xxx:1cedc37_2018-01-13

pbt $ singularity run --nv tensorflow_xxx.img
which: no nvidia-smi in (/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin)
WARNING: Could not find the Nvidia SMI binary to bind into container
...

We do bind the path "/cm/local/apps/cuda-driver" into the container using /etc/singularity/singularity.conf. Also, we set SINGULARITYENV_PATH in /etc/singularity/init to be set to include the path to nvidia-smi.
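For reference, the SINGULARITYENV_PATH setting in /etc/singularity/init looks roughly like this (a sketch, not the verbatim file; the cuda bin path is the one that appears in the debug output):

```shell
# Sketch (not the verbatim file) of the SINGULARITYENV_PATH setting in
# /etc/singularity/init; the cuda bin path matches the one shown in the
# debug output.
SINGULARITYENV_PATH="/cm/local/apps/cuda/libs/current/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin"
export SINGULARITYENV_PATH
```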

One can see from the debug output (singularity --debug run --nv) that:
 - the "nvidia-smi not found" warning occurs very early in the output;
 - later in the debug output, one sees:

      DEBUG   [U=35035,P=18620]  singularity_runtime_environment()         Evaluating envar to clean: SINGULARITYENV_PATH=/cm/local/apps/cuda/libs/current/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin

...

     DEBUG   [U=35035,P=18620]  singularity_runtime_environment()         Converting envar 'SINGULARITYENV_PATH' to 'PATH' = '/cm/local/apps/cuda/libs/current/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin'


so it appears that singularity is "trying" to set PATH. However, one can verify (once the container gets to a prompt) that PATH is just set to the standard "/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin".



If I link or copy nvidia-smi to /usr/local/bin/nvidia-smi, then I don't see the problem. Any ideas what to check here? Is there perhaps a bug in singularity when it comes to setting PATH, at least when using the --nv option?
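For concreteness, the link workaround amounts to something like this (a sketch: the source path is the Bright layout from above, and a temp directory stands in for /usr/local/bin here so the commands can be tried without root):

```shell
# Sketch of the nvidia-smi link workaround. In practice DEST would be
# /usr/local/bin (which requires root); a temp dir is used here so the
# commands are safe to try.
SRC=/cm/local/apps/cuda-driver/lib/current/bin/nvidia-smi
DEST=$(mktemp -d)           # in practice: DEST=/usr/local/bin
ln -sf "$SRC" "$DEST/nvidia-smi"
```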

Thanks,
   Keith

Jason Stover

Mar 14, 2018, 1:57:36 AM
to singu...@lbl.gov
Hi Keith,

There's an issue with SINGULARITYENV_PATH and docker images. For the
most recent PR on working around the issue see:

https://github.com/singularityware/singularity/pull/1389

It comes from the docker manifest PATH being set _after_
SINGULARITYENV_PATH has been parsed and applied to PATH, so the PATH
containing the SINGULARITYENV entry gets overwritten. We're hoping the
above PR will fix it, and do so without requiring images to be rebuilt,
as some of the other fixes we looked at would have.
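The ordering described above can be sketched in shell terms (a toy illustration, not Singularity's actual code):

```shell
# Toy illustration of the ordering bug, not Singularity's actual code.
# Step 1: the user's SINGULARITYENV_PATH is parsed and applied...
SINGULARITYENV_PATH="/cm/local/apps/cuda/libs/current/bin:/bin:/usr/bin"
PATH="$SINGULARITYENV_PATH"
# Step 2: ...then the PATH from the docker manifest is applied,
# clobbering the user's value.
PATH="/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin"
echo "$PATH"   # the cuda bin directory is no longer present
```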

-J
> --
> You received this message because you are subscribed to the Google Groups
> "singularity" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to singularity...@lbl.gov.

David Godlove

Mar 14, 2018, 10:51:30 AM
to singu...@lbl.gov
Hi Keith,

Jason is correct. SINGULARITYENV_PATH is currently not working as it should, and there is a PR in to fix it. But I don't think that is actually the cause of your issue. SINGULARITYENV_PATH sets the PATH inside the container; your problem has more to do with the PATH outside the container that is seen while the container is initializing.

Early in the Singularity code the PATH (on the system outside the container) is sanitized. This is an important security feature: since some portions of the Singularity code flow are executed with elevated privileges, we want to make sure that any binaries we call are indeed the safe, root-owned system binaries we expect. However, it also means that Singularity cannot find nvidia-smi if it is installed outside of /bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin. This bug is fixed in PR 1082, which has been merged into development but has not made it into a release yet.
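The effect of that sanitization can be sketched as follows (a rough illustration, not the actual Singularity source):

```shell
# Rough sketch of PATH sanitization, not the actual Singularity code:
# PATH is reset to a fixed list of trusted directories before any
# privileged steps run, so /cm/local/apps is no longer searched.
PATH="/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin"
command -v nvidia-smi || echo "no nvidia-smi in ($PATH)"
```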

The good news is that things like TensorFlow should still work without the nvidia-smi binary being present. All the libraries should still be accessible and functioning. Have you tried running any CUDA code that does exist in your container?

Dave 



