--nv & nvidia-smi


Igor Yakushin

Oct 26, 2017, 2:54:59 PM
to singularity
Hello,

When I use
singularity shell --nv shub://opensciencegrid/osgvo-tensorflow-gpu
on my laptop, nvidia-smi works inside the container. On the cluster, however, the binary is found but it does not work:
========
nvidia-smi
Failed to initialize NVML: Function Not Found
========
What is missing?
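[Editor's sketch of the usual cause: NVML's "Function Not Found" error typically means the nvidia-smi binary and the libnvidia-ml.so it loaded come from different driver versions, so the binary calls an entry point the library does not export. A minimal mismatch check — the version strings below are placeholders, not values from this cluster:]

```shell
# Placeholder versions; on a real host you would obtain them with e.g.
#   strings /bin/nvidia-smi | grep -m1 'NVIDIA-SMI'
#   ls /usr/lib64/libnvidia-ml.so.*
smi_ver="384.90"
lib_ver="384.66"
if [ "$smi_ver" != "$lib_ver" ]; then
    echo "mismatch: nvidia-smi $smi_ver vs libnvidia-ml.so $lib_ver"
fi
```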

On the host:
========
$ nvidia-smi
Thu Oct 26 13:49:42 2017        
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:08:00.0 Off |                    0 |
| N/A   54C    P0   133W / 149W |    277MiB / 11439MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:09:00.0 Off |                    0 |
| N/A   41C    P8    30W / 149W |      1MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:88:00.0 Off |                    0 |
| N/A   54C    P0   147W / 149W |    345MiB / 11439MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:89:00.0 Off |                    0 |
| N/A   70C    P0   135W / 149W |    130MiB / 11439MiB |     87%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16614    C   ...way2/maratandreev/hoomd-install/bin/hoomd   266MiB |
|    2     16389    C   ...way2/maratandreev/hoomd-install/bin/hoomd   332MiB |
|    3     34525    C   ./ChangeTask                                   119MiB |
+-----------------------------------------------------------------------------+

========
I have attached the file with debugging info.

Another question: the image seems to be missing the libcudnn.so.6 library, so TensorFlow cannot run. I worked around it by bind-mounting my own directory containing this library from the host, which fixed the problem. Is this OK to do, or not? I would think it is risky, since that library might link against something absent from the image, which reduces the portability of the image.
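[Editor's note: one way to probe that concern is to run ldd on the bind-mounted library from inside the container; any "not found" lines mean the image lacks something the host library needs. A self-contained sketch — /bin/ls stands in as the target, since libcudnn.so.6 is specific to that host:]

```shell
# Count unresolved shared-library dependencies of a given file.
# Inside the container you would point this at the bind-mounted
# libcudnn.so.6; /bin/ls is used here only so the sketch runs anywhere.
count_missing() {
    # grep -c exits non-zero when the count is 0, hence the || true
    ldd "$1" 2>/dev/null | grep -c 'not found' || true
}
missing=$(count_missing /bin/ls)
echo "unresolved dependencies: $missing"
```

A non-zero count would confirm the portability worry: the library resolves on the host but not inside the image.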


Thank you,
Igor

[Attachment: d.txt]

Mats Rynge

Oct 26, 2017, 3:34:57 PM
to singu...@lbl.gov

> When I use 
> singularity shell --nv shub://opensciencegrid/osgvo-tensorflow-gpu
> on my laptop,

Igor,

That image is used on the Open Science Grid, and we are not using --nv
yet, so I will not promise it will work with --nv. However, I do see it
is pulling in the wrong version of cudnn. I will update that and get
back to you.

--
Mats Rynge
USC/ISI - Pegasus Team <http://pegasus.isi.edu>

David Godlove

Oct 26, 2017, 3:36:30 PM
to singu...@lbl.gov
Hi Igor,

That is indeed curious. I wonder if /bin/nvidia-smi on the cluster is a symlink? Can you check? Or maybe there are some libraries on the cluster that are not configured properly. This issue on Stack Exchange seems related to your problem.


Can you confirm that nvidia-smi works as expected on the cluster?

As for your second question: no, libcudnn.so.6 should not be bind-mounted into the container at runtime. It should be installed within the container along with the rest of the cuDNN libraries. I'm not familiar with the opensciencegrid image you are using; I would just use this one instead:

docker://tensorflow/tensorflow:latest-gpu 

It should have all the batteries included. If you need reliable reproducibility, you could build an image on shub using the Docker image as a base, and then you wouldn't need to worry about it changing.
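[Editor's sketch of such a recipe; the tag 1.3.0-gpu is an example — pinning a versioned tag rather than latest-gpu is what makes the rebuild reproducible:]

```
Bootstrap: docker
From: tensorflow/tensorflow:1.3.0-gpu

%post
    # CUDA and cuDNN ship with the base image; add only your own extras here.
    pip install h5py
```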

--
You received this message because you are subscribed to the Google Groups "singularity" group.
To unsubscribe from this group and stop receiving emails from it, send an email to singularity+unsubscribe@lbl.gov.

Igor Yakushin

Oct 26, 2017, 4:23:37 PM
to singu...@lbl.gov
Hi Mats,
Is there anything special to do while preparing an image to be able to use --nv later?
I thought --nv is only used for 'shell' and 'exec'. No?
Thank you,
Igor





--
Igor Yakushin, Ph.D.
Computational Scientist
Kavli Institute for Cosmological Physics, ERC #413
Research Computing Center, room #2
The University of Chicago

Igor Yakushin

Oct 26, 2017, 4:48:31 PM
to singu...@lbl.gov
Hi David,


> That is indeed curious. I wonder if /bin/nvidia-smi on the cluster is a symlink?

No, it is not a symbolic link:
========
[root@midway2-gpu01 ~]# ls -l /bin/nvidia-smi
-rwxr-xr-x 1 root root 511648 Oct 10 09:47 /bin/nvidia-smi
========
This is on Scientific Linux 7.2.

 

> Can you confirm that nvidia-smi works as expected on the cluster?

Yes, I sent the output from running it on the host in my first message.

The only thing I can think of: according to 'rpm -qf nvidia-smi', it looks like the NVIDIA driver was installed from the binary installer over an older one installed with rpm. Perhaps that somehow confuses Singularity.

 

> As for your second question, no libcudnn.so.6 should not be bind mounted into the container at runtime. It should be installed within the container along with the rest of the cuDNN libs. I'm not familiar with the opensciencegrid image that you are using. I would just use this one instead:
>
> docker://tensorflow/tensorflow:latest-gpu


Hmm, inside this image nvidia-smi does work fine.
How can something in the image confuse it?

Is there anything special one needs to do when building an image to be used with --nv, besides not unpacking the NVIDIA libraries?

Thank you,
Igor

Igor Yakushin

Oct 26, 2017, 5:06:35 PM
to singu...@lbl.gov
Mats,
 

> That image is used on the Open Science Grid, and we are not using --nv
> yet, so I will not promise it will work with --nv.


Do you unpack the NVIDIA driver inside the image? If so, what version?
I guess different versions of the driver libraries inside and outside the container might produce weird effects.

When I was packing drivers inside the container, I provided several versions and let users set LD_LIBRARY_PATH to point to the version they needed.
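[Editor's sketch of that approach; the /opt/nvidia layout is hypothetical, not the actual layout of the image discussed here:]

```shell
# Sketch: the image ships several driver-library trees, e.g.
# /opt/nvidia/<version>/lib64, and the user selects one at runtime.
NV_VERSION="384.66"   # placeholder version
export LD_LIBRARY_PATH="/opt/nvidia/${NV_VERSION}/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# The chosen tree is now first on the library search path:
echo "first search dir: $(printf '%s' "$LD_LIBRARY_PATH" | cut -d: -f1)"
```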

Thank you,
Igor

David Godlove

Oct 26, 2017, 5:19:18 PM
to singu...@lbl.gov
Glad to hear that it is working with the official TensorFlow container. The main thing I would recommend when building a container for use with the --nv option is this: recognize that CUDA, cuDNN, and other libraries are not the same as the NVIDIA driver. This has confused some users. You must install CUDA, cuDNN, and any other libraries you plan to use inside the container, because software such as TensorFlow will only work with certain versions of CUDA; installing those libraries internally is what keeps the container portable. On the other hand, you must not install the NVIDIA driver libraries within the container, as they may conflict with the driver version on the host system.

To get an idea of what kinds of things are bind mounted into the container from the host system, you can see this config file.  NVIDIA is developing an interface called nvidia-container-cli that lists the libraries and binaries on a host system that are needed for a particular type of application (like compute or graphics).  In the next version of Singularity we will provide an option to leverage that tool directly to locate libs and bins in a more intelligent way.  So down the road you will want to install nvidia-container-cli on your host system. 
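[Editor's note: as a rough illustration of what --nv binds in from the host, you can preview the driver libraries the dynamic loader knows about. This is a hedged sketch, not an authoritative list — the actual set comes from the config file mentioned above:]

```shell
# List the NVIDIA driver libraries the dynamic loader knows about; these are
# the kind of files --nv bind-mounts into the container. Prints a fallback
# message on hosts without the NVIDIA driver installed.
nvlibs=$(ldconfig -p 2>/dev/null | grep -iE 'libnvidia|libcuda' || true)
msg=${nvlibs:-"no NVIDIA driver libraries found on this host"}
printf '%s\n' "$msg"
```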
