Hi,
I installed the dependencies by hand, following the Dockerfile, and also followed the instructions to install fblualib and fbcunn.
I'm trying to train a new NN model using training/main.lua. I'm running it on a small set of data as follows:
training/main.lua -data ~/data/fsc/aligned_96/ -peoplePerBatch 9 -batchSize 20
It gets through the model setup and prints the model architecture and criterion, but then crashes with this message:
==> doing epoch on training data:
==> online epoch # 1
/home/ubuntu/torch/install/bin/luajit: ...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:264:
[thread 1 endcallback] /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: cuda runtime error (8) : invalid device function at /home/ubuntu/torch/extra/cutorch/lib/THC/THCGeneral.c:586
stack traceback:
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:76: in function </home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:72>
[C]: in function 'updateOutput'
/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/home/ubuntu/src/openface/training/train.lua:196: in function </home/ubuntu/src/openface/training/train.lua:175>
[C]: in function 'real_xpcall'
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:85: in function 'xpcall'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:173: in function 'dojob'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:220: in function 'addjob'
/home/ubuntu/src/openface/training/train.lua:117: in function 'train'
training/main.lua:41: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
[thread 2 endcallback] /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: cuda runtime error (11) : invalid argument at /home/ubuntu/torch/extra/cutorch/lib/THC/generic/THCTensor.cu:34
stack traceback:
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:76: in function </home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:72>
[C]: in function 'updateOutput'
/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/home/ubuntu/src/openface/training/train.lua:196: in function </home/ubuntu/src/openface/training/train.lua:175>
[C]: in function 'real_xpcall'
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:85: in function 'xpcall'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:173: in function 'dojob'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:259: in function 'synchronize'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:198: in function 'addjob'
/home/ubuntu/src/openface/training/train.lua:117: in function 'train'
training/main.lua:41: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
[thread 1 endcallback] /home/ubuntu/src/openface/training/train.lua:195: cuda runtime error (11) : invalid argument at /home/ubuntu/torch/extra/cutorch/lib/THC/THCGeneral.c:586
stack traceback:
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:76: in function </home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:72>
[C]: in function 'resize'
/home/ubuntu/src/openface/training/train.lua:195: in function </home/ubuntu/src/openface/training/train.lua:175>
[C]: in function 'real_xpcall'
/home/ubuntu/torch/install/share/lua/5.1/fb/util/error.lua:85: in function 'xpcall'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:173: in function 'dojob'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:259: in function 'synchronize'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:198: in function 'addjob'
/home/ubuntu/src/openface/training/train.lua:117: in function 'train'
training/main.lua:41: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
stack traceback:
[C]: in function 'error'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:198: in function 'addjob'
/home/ubuntu/src/openface/training/train.lua:117: in function 'train'
training/main.lua:41: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
I'm using an AWS GPU instance with 4 GPUs and Ubuntu 14.04. This is my first time working with Torch/Lua, so I'd appreciate it if someone could help me debug this.
Thanks,
Amir