Error running Torch + CUDA under Docker

Traun Leyden

Nov 22, 2015, 4:02:29 PM
to torch7

I'm installing CUDA 6.5 + Torch on an AWS GPU instance using these instructions, but when I run:

th -e "require 'cutorch'; require 'cunn'; print(cutorch)"

I'm getting this error:

/root/torch/install/share/lua/5.1/trepl/init.lua:378: cuda runtime error (38) : no CUDA-capable device is detected at /tmp/luarocks_cutorch-scm-1-4711/cutorch/lib/THC/THCGeneral.c:16

OTOH, if I follow the same instructions but skip the Docker steps and install directly on the host OS, it works.

Anyone have any idea how to debug this or what the "missing link" might be?

Inside the docker container, I can see the kernel module and the devices:

# lsmod | grep -i nvidia
nvidia_uvm             35066  0
nvidia              10540162  1 nvidia_uvm
drm                   303102  1 nvidia


# ls -alh /dev | grep -i nvidia
crw-rw-rw-  1 root root 251,   0 Nov 22 20:06 nvidia-uvm
crw-rw-rw-  1 root root 195,   0 Nov 22 20:06 nvidia0
crw-rw-rw-  1 root root 195, 255 Nov 22 20:06 nvidiactl


soumith

Nov 22, 2015, 5:34:01 PM
to torch7 on behalf of Traun Leyden
With Docker, CUDA requires that the NVIDIA driver installed inside the container match the NVIDIA driver on the host machine. If the versions don't match, this error occurs.
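One way to check for this mismatch is to compare the kernel-module driver version with the version the container's userspace libraries report. The sketch below hardcodes the two versions that show up later in this thread, purely for illustration; the commands in the comments (reading `/proc/driver/nvidia/version` and querying `nvidia-smi`) assume a standard driver install:

```shell
# Compare the host (kernel module) and container (userspace) driver versions.
# On a live system these would come from:
#   host:      grep -oE '[0-9]+\.[0-9]+' /proc/driver/nvidia/version | head -n1
#   container: nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Hardcoded here with the versions from this thread, for illustration:
host_ver="352.63"
container_ver="352.39"

if [ "$host_ver" = "$container_ver" ]; then
    echo "driver versions match: $host_ver"
else
    echo "mismatch: host=$host_ver container=$container_ver"
fi
```

With a mismatch like the one above, the CUDA runtime inside the container fails even though the devices are visible, which is consistent with the "no CUDA-capable device is detected" error.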

--
You received this message because you are subscribed to the Google Groups "torch7" group.
To unsubscribe from this group and stop receiving emails from it, send an email to torch7+un...@googlegroups.com.
To post to this group, send email to tor...@googlegroups.com.
Visit this group at http://groups.google.com/group/torch7.
For more options, visit https://groups.google.com/d/optout.

Traun Leyden

Nov 23, 2015, 1:33:54 PM
to torch7 on behalf of smth chntla
I tried matching them exactly, and now I'm seeing this error in the dmesg output:

[59573.522695] NVRM: API mismatch: the client has the version 352.39, but
[59573.522695] NVRM: this kernel module has the version 352.63.  Please
[59573.522695] NVRM: make sure that this kernel module and all NVIDIA driver
[59573.522695] NVRM: components have the same version.
[59573.522703] NVRM: nvidia_frontend_ioctl: minor 255, module->ioctl failed, error -22

On the host, I'm installing cuda via:


And the docker container I'm running has the exact same version of cuda installed:

I wonder if CUDA 7.5.18 was updated since the Docker image was built. I'll try re-installing CUDA inside the docker container to see if that fixes it.




Traun Leyden

Nov 23, 2015, 2:01:48 PM
to torch7

I'll try re-installing cuda inside the docker container to see if that fixes it.


I was able to work around the problem by re-installing CUDA 7.5 inside the docker container using these commands:


deb
$ sudo apt-get update
$ sudo apt-get upgrade -y
$ sudo apt-get install -y opencl-headers build-essential protobuf-compiler \
    libprotoc-dev libboost-all-dev libleveldb-dev hdf5-tools libhdf5-serial-dev \
    libopencv-core-dev libopencv-highgui-dev libsnappy-dev libsnappy1 \
    libatlas-base-dev cmake libstdc++6-4.8-dbg libgoogle-glog0 libgoogle-glog-dev \
    libgflags-dev liblmdb-dev git python-pip gfortran
$ sudo apt-get clean
$ sudo apt-get install -y linux-image-extra-`uname -r` linux-headers-`uname -r` linux-image-`uname -r`
$ sudo apt-get install -y cuda
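After reinstalling, it's worth confirming that the kernel module and the userspace libraries now agree. A minimal sketch, assuming a standard driver layout; the `/proc` line below is an illustrative sample built from the version in this thread, not output captured from the machine:

```shell
# The first line of /proc/driver/nvidia/version names the kernel module version.
# Illustrative sample line (on a real system, read the actual file):
proc_line='NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63'
kernel_ver=$(printf '%s\n' "$proc_line" | grep -oE '[0-9]+\.[0-9]+' | head -n1)
echo "kernel module driver: $kernel_ver"

# Compare against what the freshly installed userspace reports:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

If the two versions agree, the NVRM "API mismatch" messages should stop appearing in dmesg.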


Now running the same command produces:

th -e "require 'cutorch'; require 'cunn'; print(cutorch)"
{
  getStream : function: 0x4054b760
  getDeviceCount : function: 0x408bca58
  .. etc
}

and nvidia-smi returns info on the gpu rather than an error.

Thanks for the help!
