CIFAR-10 training crashed after a few hours


holgafreak

Jun 14, 2017, 3:34:40 AM
to torch7
Hi all,

I left CIFAR-10 training running overnight, and sometime during the night it crashed with this error:

/home/xxx/torch/install/share/lua/5.1/nn/THNN.lua:110: cublas runtime error : an internal operation failed at /home/xxx/torch/extra/cutorch/lib/THC/THCBlas.cu:246
stack traceback:
[C]: in function 'v'
/home/xxx/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'SpatialConvolutionMM_updateOutput'
...ik/torch/install/share/lua/5.1/nn/SpatialConvolution.lua:79: in function <...ik/torch/install/share/lua/5.1/nn/SpatialConvolution.lua:76>

I'm running Torch on Ubuntu 16.04 on an Asus GL553W gaming laptop with an Nvidia 960 GPU. The fans were running at full speed, and the computer was plugged in, not on battery. Could this be an overheating problem, or is it a bug, or something else? After the error nothing that uses CUDA worked, and only a reboot brought the computer back to its senses.
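One way to test the overheating theory would be to log the GPU temperature while training runs and look at the last readings before a crash. Here is a minimal sketch in Python, assuming `nvidia-smi` is on the PATH and the driver reports `temperature.gpu` for this card. The function names, the log file name, and the 10-second interval are made up for illustration.

```python
import subprocess
import time

# Properties requested from nvidia-smi via --query-gpu.
QUERY = "temperature.gpu,utilization.gpu"


def parse_reading(line):
    """Parse one CSV line such as '67, 98' into a list of ints.

    Fields the driver cannot report (e.g. '[Not Supported]') become None.
    """
    fields = [field.strip() for field in line.split(",")]
    return [int(field) if field.isdigit() else None for field in fields]


def read_gpu(index=0):
    """Ask nvidia-smi for the temperature (C) and utilization (%) of one GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(index),
         "--query-gpu=" + QUERY,
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_reading(out.strip().splitlines()[0])


def log_forever(path="gpu_temp.log", interval_s=10):
    """Append a timestamped reading every interval_s seconds until interrupted."""
    with open(path, "a") as log:
        while True:
            temp, util = read_gpu()
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write("%s temp=%s util=%s\n" % (stamp, temp, util))
            log.flush()  # keep the latest lines on disk if the machine hangs
            time.sleep(interval_s)
```

Calling `log_forever()` in a second terminal before starting the training would leave a record to inspect after a crash. If the temperature stays well below the card's limit right up to the failure, overheating becomes a less likely explanation.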

Thanks

-m

