training/main.lua crashes with cuda runtime error (8)

Amir Nb

Jan 5, 2016, 5:20:21 PM
to CMU-OpenFace
Hi,


I installed the dependencies by hand from the Dockerfile and also followed the instructions to install fblualib and fbcunn.

I'm trying to train a new NN model using training/main.lua. I run it on a small set of data like this:
training/main.lua -data ~/data/fsc/aligned_96/ -peoplePerBatch 9 -batchSize 20

It builds the nn model and prints the model architecture and criterion, but then it crashes with this message:

 ==> doing epoch on training data:
==> online epoch # 1
/home/ubuntu/torch/install/bin/luajit: ...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:264: 
[thread 1 endcallback] /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: cuda runtime error (8) : invalid device function at /home/ubuntu/torch/extra/cutorch/lib/THC/THCGeneral.c:586
stack traceback:
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:76: in function </home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:72>
[C]: in function 'updateOutput'
/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/home/ubuntu/src/openface/training/train.lua:196: in function </home/ubuntu/src/openface/training/train.lua:175>
[C]: in function 'real_xpcall'
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:85: in function 'xpcall'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:173: in function 'dojob'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:220: in function 'addjob'
/home/ubuntu/src/openface/training/train.lua:117: in function 'train'
training/main.lua:41: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
[thread 2 endcallback] /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: cuda runtime error (11) : invalid argument at /home/ubuntu/torch/extra/cutorch/lib/THC/generic/THCTensor.cu:34
stack traceback:
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:76: in function </home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:72>
[C]: in function 'updateOutput'
/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/home/ubuntu/src/openface/training/train.lua:196: in function </home/ubuntu/src/openface/training/train.lua:175>
[C]: in function 'real_xpcall'
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:85: in function 'xpcall'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:173: in function 'dojob'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:259: in function 'synchronize'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:198: in function 'addjob'
/home/ubuntu/src/openface/training/train.lua:117: in function 'train'
training/main.lua:41: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
[thread 1 endcallback] /home/ubuntu/src/openface/training/train.lua:195: cuda runtime error (11) : invalid argument at /home/ubuntu/torch/extra/cutorch/lib/THC/THCGeneral.c:586
stack traceback:
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:76: in function </home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:72>
[C]: in function 'resize'
/home/ubuntu/src/openface/training/train.lua:195: in function </home/ubuntu/src/openface/training/train.lua:175>
[C]: in function 'real_xpcall'
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:85: in function 'xpcall'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:173: in function 'dojob'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:259: in function 'synchronize'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:198: in function 'addjob'
/home/ubuntu/src/openface/training/train.lua:117: in function 'train'
training/main.lua:41: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
stack traceback:
[C]: in function 'error'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:198: in function 'addjob'
/home/ubuntu/src/openface/training/train.lua:117: in function 'train'
training/main.lua:41: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

I'm using an AWS GPU instance with 4 GPUs and Ubuntu 14.04. It's my first time working with Torch/Lua, so I'd appreciate it if someone could help me debug this.

thanks
Amir

 

Brandon Amos

Jan 6, 2016, 10:04:24 AM
to Amir Nb, CMU-OpenFace
Hi Amir,

> /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: cuda runtime
> error (8) : invalid device function at
> /home/ubuntu/torch/extra/cutorch/lib/THC/THCGeneral.c:586
>
> I'm using an AWS GPU instance with 4 GPUs and ubuntu 14.04. it's my first
> time working with torch/lua so I'd appreciate if someone could help me
> debugging this.

I'm not familiar with the CUDA runtime error you're getting or with
running the OpenFace training code on AWS with multiple GPUs.
I train on a single Tesla GPU, so you'll have to make some
modifications to use all of the GPUs.
I recommend starting by making sure simple use cases of cutorch work
well in a Lua interpreter or in a standalone file and discussing any
issues with the cutorch team.
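
A minimal sketch of such a check, assuming cutorch is installed in the
standard Torch distribution. checkCutorch is a made-up helper name, not
part of cutorch or OpenFace. It runs one trivial kernel per visible GPU
and reports the device name, compute capability and free memory:

```lua
-- Hypothetical helper: returns true if every visible GPU can run a trivial
-- kernel, false otherwise, plus a list of per-device report lines.
local function checkCutorch()
  local loaded, err = pcall(require, 'cutorch')
  if not loaded then
    return false, {'cutorch failed to load: ' .. tostring(err)}
  end
  if cutorch.getDeviceCount() == 0 then
    return false, {'no CUDA devices visible'}
  end

  local allOk, report = true, {}
  for dev = 1, cutorch.getDeviceCount() do
    -- pcall keeps a CUDA error on one device from aborting the other checks.
    local ok, msg = pcall(function()
      cutorch.setDevice(dev)
      local props = cutorch.getDeviceProperties(dev)
      local free, total = cutorch.getMemoryUsage(dev)
      -- Launch simple kernels: fill two tensors, add them, reduce the result.
      local a = torch.CudaTensor(4, 4):fill(1)
      local b = torch.CudaTensor(4, 4):fill(2)
      local sum = a:add(b):sum()
      assert(sum == 48, 'unexpected sum: ' .. tostring(sum))
      return string.format('GPU %d: %s, compute capability %d.%d, %.0f of %.0f MB free',
        dev, props.name, props.major, props.minor, free / 2^20, total / 2^20)
    end)
    allOk = allOk and ok
    report[#report + 1] = ok and msg or ('GPU ' .. dev .. ' failed: ' .. tostring(msg))
  end
  return allOk, report
end

local ok, report = checkCutorch()
for _, line in ipairs(report) do print(line) end
print(ok and 'cutorch looks usable' or 'cutorch check failed')
```

If this fails on its own, the problem is below OpenFace, and the printed
lines are a reasonable starting point for a report to the cutorch team.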

By the way, you might be interested in some of the substantial
training improvements Bartosz has made and released a few days ago
that I haven't yet tested and merged into OpenFace:
https://groups.google.com/d/msg/cmu-openface/dcPh883T1rk/VUcfR19NBQAJ

-Brandon.

Sanjeev J

Feb 20, 2016, 12:19:46 AM
to CMU-OpenFace
Hi Amir, 

How did you solve this problem? I have the same problem. I have a GeForce GTX 650 GPU running Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-77-generic x86_64).

It looks like we need to reset the compute capability of the GPU. Our attempts at doing that have not been successful.

Any help will be greatly appreciated.

Regards
Sanjeev

Amir Nb

Feb 20, 2016, 1:54:29 PM
to CMU-OpenFace
Hi, 
In my case the problem was the GPU memory size. The GPUs on AWS GPU machines have 4GB of memory each, and I had to reduce the size of the data to fit in GPU memory. It seems like your GPU has even less memory (1GB), so you will have to decrease the batch sizes even more.
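
For a rough sense of scale, the size of the input batch can be worked out with simple arithmetic. The sketch below assumes 3-channel 96x96 images (the size suggested by the aligned_96 directory name) stored as 4-byte floats, which is what torch.CudaTensor holds. inputBatchBytes is a made-up helper, not part of OpenFace, and the batch sizes in the loop are arbitrary examples.

```lua
-- Hypothetical helper: bytes needed to hold one input batch on the GPU.
-- Assumes float32 storage (4 bytes per element), as in torch.CudaTensor.
local function inputBatchBytes(nImages, channels, height, width)
  local bytesPerFloat = 4
  return nImages * channels * height * width * bytesPerFloat
end

-- Print the input size for a few example batch sizes.
for _, n in ipairs({20, 100, 500}) do
  local mb = inputBatchBytes(n, 3, 96, 96) / 2^20
  print(string.format('%4d images -> %.1f MB of input', n, mb))
end
```

This is only a lower bound. For 20 images the input comes to roughly 2 MB, so the bulk of GPU memory goes to the network's weights, activations and gradients. The activations grow with the batch size too, which is why a smaller batch helps.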

Amir

Brandon Amos

Feb 21, 2016, 4:08:01 PM
to Amir Nb, CMU-OpenFace
> In my case the problem was the GPU memory size.

It's interesting that memory issues resulted in invalid argument
errors. From your original post:

cuda runtime error (11) : invalid argument

I usually see 'out of memory' in the error messages
when I try to use too much memory.

Sanjeev, if memory isn't an issue, this could be an issue in your
Torch installation. I suggest finding a minimal example using just
Torch and posting to their mailing list.
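
One possible minimal example, sketched under the assumption that the nn,
cutorch and cunn rocks are installed. gpuForwardCheck is a made-up name,
and the tiny network is arbitrary. It goes through the same
nn.Sequential forward path that appears in the stack trace, with no
OpenFace code involved:

```lua
-- Hypothetical helper: runs one small forward pass on the current GPU and
-- returns (true, message) on success or (false, error text) on failure.
local function gpuForwardCheck()
  local loaded, err = pcall(function()
    require 'nn'
    require 'cunn'  -- also loads cutorch
  end)
  if not loaded then
    return false, 'could not load nn/cunn: ' .. tostring(err)
  end

  local ok, result = pcall(function()
    -- Tiny network: one 3x3 convolution (3 -> 8 planes) and a ReLU.
    local net = nn.Sequential()
    net:add(nn.SpatialConvolution(3, 8, 3, 3))
    net:add(nn.ReLU())
    net:cuda()
    -- Random batch of two 3x96x96 images, created directly on the GPU.
    local input = torch.CudaTensor(2, 3, 96, 96):uniform()
    local output = net:forward(input)
    return table.concat(output:size():totable(), 'x')
  end)
  if not ok then
    return false, 'forward pass failed: ' .. tostring(result)
  end
  return true, 'forward pass OK, output size ' .. result
end

print(select(2, gpuForwardCheck()))
```

If this alone reproduces 'invalid device function', the GPU model and its
compute capability are worth including in the post: the CUDA runtime
documentation describes that error as a kernel that does not exist or was
not compiled for the device's architecture.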

-Brandon.