I left CIFAR-10 training running overnight, and at some point during the night it crashed with this error:
/home/xxx/torch/install/share/lua/5.1/nn/THNN.lua:110: cublas runtime error : an internal operation failed at /home/xxx/torch/extra/cutorch/lib/THC/THCBlas.cu:246
stack traceback:
[C]: in function 'v'
/home/xxx/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'SpatialConvolutionMM_updateOutput'
...ik/torch/install/share/lua/5.1/nn/SpatialConvolution.lua:79: in function <...ik/torch/install/share/lua/5.1/nn/SpatialConvolution.lua:76>
I'm running Torch on Ubuntu 16.04 on an Asus GL553W gaming laptop with an Nvidia 960 GPU. The fans were running at full speed, and the computer was plugged in, not on battery. Could this be an overheating problem, or is it a bug? Nothing that uses CUDA worked after the error; only a reboot got the machine working again.
Thanks
-m