torch occasionally gives me out of memory error


X.T. Li

Sep 28, 2016, 10:39:57 AM9/28/16
to torch7

I have multiple Titan Xs in a single machine. I train my models on some of them and use the rest for testing. During testing, torch sometimes gives me an out of memory error:

THCudaCheck FAIL file=/torch/extra/cutorch/lib/THC/THCGeneral.c line=176 error=2 : out of memory

/torch/install/bin/luajit: /home/xtli/torch/install/share/lua/5.1/trepl/init.lua:384: cuda runtime error (2) : out of memory at /home/xtli/torch/extra/cutorch/lib/THC/THCGeneral.c:176
stack traceback:
        [C]: in function 'error'
        /torch/install/share/lua/5.1/trepl/init.lua:384: in function 'require'

But if I wait a few minutes and try again, the error is gone.

The code that caused this error is:

require 'cutorch'


Also, if I run two tests back to back with no time interval between them, torch gives me the same error.
Does anyone know what might cause this weird error and how to solve it?

X.T. Li

Sep 28, 2016, 11:30:39 AM9/28/16
to torch7
Using

CUDA_VISIBLE_DEVICES=4

solved my problem. torch allocates a small amount of memory on every GPU when you require 'cutorch'; since some of my GPUs are already 'full', that allocation fails with an out of memory error. Setting CUDA_VISIBLE_DEVICES=4 makes every GPU other than GPU 4 invisible to torch, so it only touches GPU 4.
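In case it helps anyone else, here is a rough sketch of how I now check the free memory on each GPU that torch can see (the script name check_gpus.lua is just a placeholder; cutorch.getDeviceCount and cutorch.getMemoryUsage are the cutorch calls I use):

-- launched as: CUDA_VISIBLE_DEVICES=4 th check_gpus.lua
require 'cutorch'

-- print free/total memory for every GPU visible to torch
for dev = 1, cutorch.getDeviceCount() do
    local freeMem, totalMem = cutorch.getMemoryUsage(dev)
    print(string.format('GPU %d: %.0f MB free of %.0f MB',
                        dev, freeMem / 2^20, totalMem / 2^20))
end

With CUDA_VISIBLE_DEVICES=4 set, this lists only the one device, so the small per-GPU allocation that require 'cutorch' performs never touches the full GPUs.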


On Wednesday, September 28, 2016 at 10:39:57 PM UTC+8, X.T. Li wrote: