Can we free GPU memory between the training and testing phases?


john1...@gmail.com

Apr 22, 2017, 11:32:04 AM
to Caffe Users
I have a Titan X GPU (~12 GB). My model takes 8 GB in the training phase and 6 GB in the testing phase, and I want to run the testing phase after every 100 iterations. Hence, my solver prototxt is

    train_net: "train.prototxt"
    test_net: "val.prototxt"
    test_iter: 100
    test_interval: 100

The problem is that Caffe takes 8 GB for training and does not free that space while the testing phase runs, so the memory is not enough (8 GB + 6 GB ≈ 14 GB needed, versus ~12 GB available).
Is there any setting in the prototxt to handle this problem? Thanks.
The error looks like

    I0411 16:41:04.669823  6823 solver.cpp:331] Iteration 0, Testing net (#0)
    I0411 16:43:31.625444  6823 solver.cpp:398]     Test net output #0: intermediate_loss = nan (* 1 = nan loss)
    I0411 16:43:31.625897  6823 solver.cpp:398]     Test net output #1: loss = nan (* 1 = nan loss)
    F0411 16:43:33.259964  6823 syncedmem.cpp:71] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
    

Przemek D

Apr 25, 2017, 6:17:14 AM
to Caffe Users
Even if you succeeded with that, I fear you'd face a significant performance hit: think of all the memory-management operations, freeing and re-allocating every time the test phase starts and again when training resumes. I don't think it's a good direction at all. You're better off reducing your batch size IMO (consider reducing the test-phase batch first, since that will not affect your gradient descent).
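
For reference, the test batch size is set in the data layer of val.prototxt. A minimal sketch, assuming an LMDB input (the layer name and source path are placeholders):

    layer {
      name: "data"
      type: "Data"
      top: "data"
      top: "label"
      include { phase: TEST }
      data_param {
        source: "/path/to/val_lmdb"  # placeholder
        batch_size: 1                # smaller test batch -> less GPU memory
        backend: LMDB
      }
    }

If you shrink the batch, remember to raise test_iter in the solver so that (test_iter × batch_size) still covers the whole validation set.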

On a side note, it would be quite useful to be able to train on one GPU (or better yet: some subset of GPUs) and test on another (perhaps in parallel, without blocking the training).

john1...@gmail.com

Apr 26, 2017, 11:33:00 PM
to Caffe Users
Thanks Przemek D.
Actually, I cannot reduce the batch size, because it is already 4, and I have just one GPU.
So I have an idea: every 100 iterations, I will save/snapshot a caffemodel. After the training phase finishes, I will load these caffemodels and run the testing phase on each one. Is this possible?
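
Something like the following in the solver might do it. A sketch, assuming the in-training test phase is removed entirely; snapshot and snapshot_prefix are standard solver options, and the prefix path is a placeholder:

    train_net: "train.prototxt"
    snapshot: 100                          # write a snapshot every 100 iterations
    snapshot_prefix: "snapshots/mymodel"   # yields snapshots/mymodel_iter_100.caffemodel, ...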

Przemek D

Apr 27, 2017, 4:41:20 AM
to Caffe Users
Yes, this is possible, but it's even more inefficient, since you will have to reload the whole network every time. What is your test-network batch size? If you reduced it to 1, would you still run out of memory?

john1...@gmail.com

Apr 27, 2017, 9:08:56 AM
to Caffe Users
Hi, my batch size is 4 for training and 1 for testing. The problem is that the training batch size must be at least 4; I cannot reduce it.
As for the idea, I would just run it in the deploy phase. How can I show the test-phase loss from the caffemodel saved at each 100-iteration snapshot?

john1...@gmail.com

Apr 27, 2017, 9:21:54 AM
to Caffe Users
I think I found the solution:

    ./build/tools/caffe test -model /path/to/*_train_test.prototxt -weights /path/to/trained/caffe/caffemodel -iterations <num_test_images / batch_size>
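
For instance, to evaluate every saved snapshot in turn, a shell loop like this should work (a sketch; the snapshot prefix, prototxt path, iteration count, and GPU id are placeholders, and note that glob expansion orders the files lexicographically, not numerically):

    # Run the test phase on each saved snapshot.
    # -iterations should equal num_test_images / batch_size.
    for w in snapshots/mymodel_iter_*.caffemodel; do
        echo "=== $w ==="
        ./build/tools/caffe test -model val.prototxt -weights "$w" -iterations 1000 -gpu 0
    done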

Thanks