Can we free GPU memory between the training and testing phases?


john1...@gmail.com

Apr 22, 2017, 11:32:04 AM
to Caffe Users
I have a Titan X GPU (~12 GB). My model takes 8 GB in the training phase and 6 GB in the testing phase, and I want to run the testing phase after every 100 iterations. Hence, my solver prototxt is

    train_net: "train.prototxt"
    test_net: "val.prototxt"
    test_iter: 100
    test_interval: 100

The problem is that Caffe takes 8 GB for training and does not free that space while the testing phase runs, so the memory is not enough (8 GB + 6 GB ≈ 14 GB needed, versus ~12 GB available).
Is there any setting in the prototxt to handle this problem? Thanks.
The error looks like

    I0411 16:41:04.669823  6823 solver.cpp:331] Iteration 0, Testing net (#0)
    I0411 16:43:31.625444  6823 solver.cpp:398]     Test net output #0: intermediate_loss = nan (* 1 = nan loss)
    I0411 16:43:31.625897  6823 solver.cpp:398]     Test net output #1: loss = nan (* 1 = nan loss)
    F0411 16:43:33.259964  6823 syncedmem.cpp:71] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
    

Przemek D

Apr 25, 2017, 6:17:14 AM
to Caffe Users
Even if you succeeded with that, I fear you'd face a significant performance hit: think of all the memory-management operations, freeing and re-allocating every time the test phase starts and again when training resumes. I don't think it's a good direction at all. You're better off reducing your batch size IMO (consider reducing the test-phase batch first, since that will not affect your gradient descent).
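
For reference, the test batch size is set in the data layer of val.prototxt. A minimal sketch, assuming an LMDB input (the layer name and source path are placeholders):

    layer {
      name: "data"
      type: "Data"
      top: "data"
      top: "label"
      include { phase: TEST }
      data_param {
        source: "/path/to/val_lmdb"  # placeholder
        batch_size: 1                # smaller test batch -> less GPU memory
        backend: LMDB
      }
    }

If you shrink the batch, remember to raise test_iter in the solver so that (test_iter × batch_size) still covers the whole validation set.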

On a side note, it would be quite useful to be able to train on one GPU (or better yet: some subset of GPUs) and test on another (perhaps in parallel, without blocking the training).

john1...@gmail.com

Apr 26, 2017, 11:33:00 PM
to Caffe Users
Thanks Przemek D.
Actually, I cannot reduce the batch size, because it is already 4, and I have just one GPU.
So I have an idea: every 100 iterations, I will save/snapshot a caffemodel. After the training phase finishes, I will load these caffemodels and run the testing phase on each one. Is this possible?
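
Something like the following in the solver might do it. A sketch, assuming the in-training test phase is removed entirely; snapshot and snapshot_prefix are standard solver options, and the prefix path is a placeholder:

    train_net: "train.prototxt"
    snapshot: 100                          # write a snapshot every 100 iterations
    snapshot_prefix: "snapshots/mymodel"   # yields snapshots/mymodel_iter_100.caffemodel, ...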

Przemek D

Apr 27, 2017, 4:41:20 AM
to Caffe Users
Yes, this is possible, but it's even more inefficient, since you will have to reload the whole network every time. What is your test-network batch size? If you reduced it to 1, would you still run out of memory?

john1...@gmail.com

Apr 27, 2017, 9:08:56 AM
to Caffe Users
Hi, my batch size is 4 for training and 1 for testing. The problem is that the training batch size must be at least 4; I cannot reduce it.
As for the idea, I would just run it in the deploy phase. How can I show the test-phase loss from the caffemodel saved at each 100-iteration snapshot?

john1...@gmail.com

Apr 27, 2017, 9:21:54 AM
to Caffe Users
I think I found the solution:

    ./build/tools/caffe test -model /path/to/*_train_test.prototxt -weights /path/to/trained/caffe/caffemodel -iterations <num_test_images / batch_size>
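
For instance, to evaluate every saved snapshot in turn, a shell loop like this should work (a sketch; the snapshot prefix, prototxt path, iteration count, and GPU id are placeholders, and note that glob expansion orders the files lexicographically, not numerically):

    # Run the test phase on each saved snapshot.
    # -iterations should equal num_test_images / batch_size.
    for w in snapshots/mymodel_iter_*.caffemodel; do
        echo "=== $w ==="
        ./build/tools/caffe test -model val.prototxt -weights "$w" -iterations 1000 -gpu 0
    done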

Thanks