Training loss vs. Testing loss

Kyle Mills
Nov 29, 2016, 1:03:54 PM
to Caffe Users
Hello,

When training a neural network, I consistently see that the training loss and testing loss are about a factor of two different.  The peculiar thing is that the testing loss is lower than the training loss, which goes against all of my intuition.  I'm using GoogLeNet pretty much as-is, which has three Euclidean loss outputs.  Here is part of my output, for example:

...
I1128 08:19:03.594475 26968 solver.cpp:337] Iteration 5000, Testing net (#0)
I1128 08:19:03.594574 26968 net.cpp:693] Ignoring source layer loss_final
I1128 08:20:46.644037 26968 solver.cpp:404]     Test net output #0: loss1/loss1 = 0.0112497 (* 0.3 = 0.0033749 loss)
I1128 08:20:46.644196 26968 solver.cpp:404]     Test net output #1: loss2/loss1 = 0.0120051 (* 0.3 = 0.00360153 loss)
I1128 08:20:46.644218 26968 solver.cpp:404]     Test net output #2: loss_final_test = 0.0158317 (* 1 = 0.0158317 loss)
I1128 08:20:47.640658 26968 solver.cpp:228] Iteration 5000, loss = 0.0445988
I1128 08:20:47.640687 26968 solver.cpp:244]     Train net output #0: loss1/loss1 = 0.028075 (* 0.3 = 0.00842252 loss)
I1128 08:20:47.640693 26968 solver.cpp:244]     Train net output #1: loss2/loss1 = 0.0286013 (* 0.3 = 0.00858038 loss)
I1128 08:20:47.640714 26968 solver.cpp:244]     Train net output #2: loss_final = 0.0275959 (* 1 = 0.0275959 loss)
...


As you can see above, the training loss is 0.0445988, while the testing loss is 0.02280813 (the sum of the three weighted contributions). While this doesn't seem too concerning for this single iteration, the same gap appears at every testing and training iteration (see plot).
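For reference, here is how I'm totalling the two numbers from the log above. It's just plain arithmetic on the logged values, with each output multiplied by its loss weight (0.3, 0.3, 1) before summing; the variable names are mine:

```python
# Total the weighted loss contributions from the log above.
test_outputs = [0.0112497, 0.0120051, 0.0158317]
train_outputs = [0.028075, 0.0286013, 0.0275959]
loss_weights = [0.3, 0.3, 1.0]

test_total = sum(w * l for w, l in zip(loss_weights, test_outputs))
train_total = sum(w * l for w, l in zip(loss_weights, train_outputs))

print("weighted test loss:  %.7f" % test_total)   # ~0.0228081
print("weighted train loss: %.7f" % train_total)  # ~0.0445988, matches the logged 0.0445988
```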



What could be causing this behaviour?  Some information about my setup and hardware is below, if it is useful.
  • Batch size 50
  • 2 GPUs (so the effective batch size is 100)
  • 500 test iterations performed every 5000 training iterations
  • Training info displayed every 50 iterations
  • 50000 training examples, so 500 iterations per epoch
  • The prototxts for training and testing have identical loss definitions
I suspect the problem has to do with how the losses are normalized with respect to the batch size, and with how the loss is averaged over the test iterations; a toy sketch of what I imagine is happening follows below. If someone could help me understand this, I'd be very appreciative.
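To make the hypothesis concrete, here is a toy Python sketch. This is not Caffe code, and the summing-versus-averaging behaviour is purely an assumption on my part: it shows how a training loss summed over the two GPUs' batches, displayed next to a test loss averaged per batch, would produce exactly this kind of factor-of-two gap even when the underlying per-example loss is identical:

```python
import numpy as np

np.random.seed(0)

def euclidean_loss(pred, target):
    # Caffe's EuclideanLoss: sum of squared differences over the batch,
    # divided by 2 * batch_size.
    n = pred.shape[0]
    return np.sum((pred - target) ** 2) / (2.0 * n)

def fake_batch(batch_size):
    # Fake predictions/targets with the same error distribution for
    # "train" and "test", so any gap comes from the bookkeeping alone.
    pred = np.random.randn(batch_size, 10)
    target = pred + 0.1 * np.random.randn(batch_size, 10)
    return pred, target

# ASSUMPTION: the displayed training loss is the SUM of the two GPUs'
# per-batch losses (batch size 50 each)...
gpu_losses = [euclidean_loss(*fake_batch(50)) for _ in range(2)]
displayed_train_loss = sum(gpu_losses)

# ...while the displayed test loss is the MEAN of the per-batch losses
# over the 500 test iterations.
test_losses = [euclidean_loss(*fake_batch(50)) for _ in range(500)]
displayed_test_loss = np.mean(test_losses)

print("train (summed over 2 GPUs): %.4f" % displayed_train_loss)
print("test (averaged per batch):  %.4f" % displayed_test_loss)
# Under this assumption the train number comes out ~2x the test number.
```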

Thank you,
KM