I have multiple Titan Xs in a single machine. I use some of them to train my models and the rest to test. During testing, torch sometimes gives me an out-of-memory error:
THCudaCheck FAIL file=/torch/extra/cutorch/lib/THC/THCGeneral.c line=176 error=2 : out of memory
/torch/install/bin/luajit: /home/xtli/torch/install/share/lua/5.1/trepl/init.lua:384: cuda runtime error (2) : out of memory at /home/xtli/torch/extra/cutorch/lib/THC/THCGeneral.c:176
stack traceback:
[C]: in function 'error'
/torch/install/share/lua/5.1/trepl/init.lua:384: in function 'require'
However, if I wait a few minutes and try again, the error is gone.
The code that caused this error is:
Also, if I run two tests back to back with no time gap between them, torch gives me this error too.
Does anyone know what might cause this weird error and how to solve it?
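For reference, here is roughly what I run between tests to check memory and force cleanup (a sketch; `reportFreeMemory` is just a helper I wrote, and the `output = nil` line stands in for dropping whatever tensors my test script still holds):

```lua
require 'cutorch'

-- Print free vs. total memory on the current GPU.
local function reportFreeMemory()
   local freeBytes, totalBytes = cutorch.getMemoryUsage(cutorch.getDevice())
   print(string.format('GPU memory: %.1f / %.1f MB free',
                       freeBytes / 2^20, totalBytes / 2^20))
end

reportFreeMemory()

-- Drop references to the tensors from the last test run
-- (placeholder name; in my script this is the network output etc.)
output = nil

-- Force Lua's garbage collector so cutorch can release the buffers;
-- a second pass picks up objects freed by finalizers in the first.
collectgarbage()
collectgarbage()
cutorch.synchronize()

reportFreeMemory()
```

Even with this, the second run can still fail unless I wait, which is the part I don't understand.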