Hello,
When I use
singularity shell --nv shub://opensciencegrid/osgvo-tensorflow-gpu
on my laptop,
nvidia-smi
works inside the container. On the cluster it fails, even though the binary is found:
========
nvidia-smi
Failed to initialize NVML: Function Not Found
========
What is missing?
On the host:
========
$ nvidia-smi
Thu Oct 26 13:49:42 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:08:00.0 Off |                    0 |
| N/A   54C    P0   133W / 149W |    277MiB / 11439MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:09:00.0 Off |                    0 |
| N/A   41C    P8    30W / 149W |      1MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:88:00.0 Off |                    0 |
| N/A   54C    P0   147W / 149W |    345MiB / 11439MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:89:00.0 Off |                    0 |
| N/A   70C    P0   135W / 149W |    130MiB / 11439MiB |     87%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     16614      C   ...way2/maratandreev/hoomd-install/bin/hoomd 266MiB |
|    2     16389      C   ...way2/maratandreev/hoomd-install/bin/hoomd 332MiB |
|    3     34525      C   ./ChangeTask                                 119MiB |
+-----------------------------------------------------------------------------+
========
I have attached a file with debugging info.
Another question: the image seems to be missing the libcudnn.so.6 library, so TensorFlow cannot run. As a workaround, I mounted my own directory containing this library from the host, and that fixed the problem. Is this OK to do? I would think it is risky, since that library might link against something that is absent from the image, and it also reduces the portability of the image.
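A minimal sketch of how the linking concern could be checked from inside the container: list the bind-mounted library's dependencies with ldd and keep only the ones reported as "not found". The helper names (filter_missing, missing_deps) and the paths /path/on/host/cudnn and /opt/cudnn are hypothetical placeholders for illustration, not the paths actually used.

```shell
# Keep only the unresolved entries from `ldd` output read on stdin,
# printing the library name (first field) of each "not found" line.
filter_missing() {
    awk '/not found/ { print $1 }'
}

# Print the unresolved dependencies of the shared library given as $1.
# Run this inside the container, where the library will actually be loaded.
missing_deps() {
    ldd "$1" 2>/dev/null | filter_missing
}

# Hypothetical usage (placeholder paths, adjust to the real bind mount):
#   singularity shell --nv -B /path/on/host/cudnn:/opt/cudnn \
#       shub://opensciencegrid/osgvo-tensorflow-gpu
#   missing_deps /opt/cudnn/libcudnn.so.6
```

If missing_deps prints nothing, every dependency of the host library resolves inside the image. That says nothing about version compatibility, so it only addresses part of the portability concern.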
Thank you,
Igor