Error running Torch + CUDA under Docker

Traun Leyden

Nov 22, 2015, 4:02:29 PM
to torch7

I'm installing CUDA 6.5 + Torch on an AWS GPU instance using these instructions, but when I run:

th -e "require 'cutorch'; require 'cunn'; print(cutorch)"

I'm getting this error:

/root/torch/install/share/lua/5.1/trepl/init.lua:378: cuda runtime error (38) : no CUDA-capable device is detected at /tmp/luarocks_cutorch-scm-1-4711/cutorch/lib/THC/THCGeneral.c:16

OTOH, if I follow the same instructions but skip the Docker steps and install directly on the host OS, it works.

Anyone have any idea how to debug this or what the "missing link" might be?

Inside the docker container, I can see the kernel module and the devices:

# lsmod | grep -i nvidia
nvidia_uvm             35066  0
nvidia              10540162  1 nvidia_uvm
drm                   303102  1 nvidia


# ls -alh /dev | grep -i nvidia
crw-rw-rw-  1 root root 251,   0 Nov 22 20:06 nvidia-uvm
crw-rw-rw-  1 root root 195,   0 Nov 22 20:06 nvidia0
crw-rw-rw-  1 root root 195, 255 Nov 22 20:06 nvidiactl


soumith

Nov 22, 2015, 5:34:01 PM
to torch7 on behalf of Traun Leyden
With Docker, CUDA requires that the NVIDIA driver installed inside the container match the NVIDIA driver on the host machine. If the versions don't match, this error occurs.
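One way to check for this mismatch is to compare the kernel-module driver version with the version the container's userspace libraries report. The sketch below hardcodes the two versions that show up later in this thread, purely for illustration; the commands in the comments (reading `/proc/driver/nvidia/version` and querying `nvidia-smi`) assume a standard driver install:

```shell
# Compare the host (kernel module) and container (userspace) driver versions.
# On a live system these would come from:
#   host:      grep -oE '[0-9]+\.[0-9]+' /proc/driver/nvidia/version | head -n1
#   container: nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Hardcoded here with the versions from this thread, for illustration:
host_ver="352.63"
container_ver="352.39"

if [ "$host_ver" = "$container_ver" ]; then
    echo "driver versions match: $host_ver"
else
    echo "mismatch: host=$host_ver container=$container_ver"
fi
```

With a mismatch like the one above, the CUDA runtime inside the container fails even though the devices are visible, which is consistent with the "no CUDA-capable device is detected" error.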

--
You received this message because you are subscribed to the Google Groups "torch7" group.
To unsubscribe from this group and stop receiving emails from it, send an email to torch7+un...@googlegroups.com.
To post to this group, send email to tor...@googlegroups.com.
Visit this group at http://groups.google.com/group/torch7.
For more options, visit https://groups.google.com/d/optout.

Traun Leyden

Nov 23, 2015, 1:33:54 PM
to torch7 on behalf of smth chntla
I tried matching them exactly, and now I'm seeing this error in the dmesg output:

[59573.522695] NVRM: API mismatch: the client has the version 352.39, but
[59573.522695] NVRM: this kernel module has the version 352.63.  Please
[59573.522695] NVRM: make sure that this kernel module and all NVIDIA driver
[59573.522695] NVRM: components have the same version.
[59573.522703] NVRM: nvidia_frontend_ioctl: minor 255, module->ioctl failed, error -22

On the host, I'm installing cuda via:


And the docker container I'm running has the exact same version of cuda installed:

I wonder if CUDA 7.5.18 was updated since the Docker image was built. I'll try re-installing CUDA inside the docker container to see if that fixes it.




Traun Leyden

Nov 23, 2015, 2:01:48 PM
to torch7

I'll try re-installing cuda inside the docker container to see if that fixes it.


I was able to work around the problem by re-installing CUDA 7.5 inside the docker container using these commands:


deb
$ sudo apt-get update
$ sudo apt-get upgrade -y
$ sudo apt-get install -y opencl-headers build-essential protobuf-compiler \
    libprotoc-dev libboost-all-dev libleveldb-dev hdf5-tools libhdf5-serial-dev \
    libopencv-core-dev libopencv-highgui-dev libsnappy-dev libsnappy1 \
    libatlas-base-dev cmake libstdc++6-4.8-dbg libgoogle-glog0 libgoogle-glog-dev \
    libgflags-dev liblmdb-dev git python-pip gfortran
$ sudo apt-get clean
$ sudo apt-get install -y linux-image-extra-`uname -r` linux-headers-`uname -r` linux-image-`uname -r`
$ sudo apt-get install -y cuda
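After reinstalling, it's worth confirming that the kernel module and the userspace libraries now agree. A minimal sketch, assuming a standard driver layout; the `/proc` line below is an illustrative sample built from the version in this thread, not output captured from the machine:

```shell
# The first line of /proc/driver/nvidia/version names the kernel module version.
# Illustrative sample line (on a real system, read the actual file):
proc_line='NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63'
kernel_ver=$(printf '%s\n' "$proc_line" | grep -oE '[0-9]+\.[0-9]+' | head -n1)
echo "kernel module driver: $kernel_ver"

# Compare against what the freshly installed userspace reports:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

If the two versions agree, the NVRM "API mismatch" messages should stop appearing in dmesg.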


Now running the same command produces:

th -e "require 'cutorch'; require 'cunn'; print(cutorch)"
{
  getStream : function: 0x4054b760
  getDeviceCount : function: 0x408bca58
  .. etc
}

and nvidia-smi returns info on the gpu rather than an error.

Thanks for the help!
