Caffe training iteration loss is -nan

Abhilash Panigrahi

Nov 18, 2015, 7:25:10 AM
to Caffe Users
I'm trying to implement FCN-8s with my own custom data. While training from scratch, by the 20th iteration my loss is -nan. Could someone suggest what's going wrong? The train_val.prototxt is similar to the one in the link. My solver.prototxt is as follows:

net: "/home/ubuntu/CNN/train_val.prototxt"
test_iter: 13

test_interval: 500
display: 20
average_loss: 20
lr_policy: "fixed"

base_lr: 1e-4

momentum: 0.99

iter_size: 1
max_iter: 3000
weight_decay: 0.0005
snapshot: 200
snapshot_prefix: "train"
test_initialization: false

The images and labels are of size 512x640.

Mohit Jain

Nov 18, 2015, 12:59:06 PM
to Caffe Users
That generally happens when your base_lr is too high: the model fails to converge. Try plugging in a smaller base_lr value... say 1e-10 or something of that order.
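For illustration, such a change would only touch the learning-rate lines of the solver.prototxt above (just a sketch, and the exact value is something you'd have to tune):

base_lr: 1e-10      # much lower learning rate; raise it again once training is stable
lr_policy: "fixed"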

PS: I'm no Caffe expert, but this might help :)

Regards,
Mohit 

Yin Li

Nov 20, 2015, 2:38:45 PM
to Caffe Users
Your base_lr is too high. For the FCN network you are using, backpropagation aggregates gradients from all pixels, so you need a much lower learning rate to prevent gradient explosion in your case. Try something like 1e-6; that should work. Adding batch normalization will also help.
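For what it's worth, a minimal sketch of how batch normalization is usually written in a BVLC Caffe train_val.prototxt, assuming a hypothetical convolution output blob named "conv1" (the BatchNorm layer is typically paired with a Scale layer that learns the affine parameters):

layer {
  name: "conv1_bn"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param { use_global_stats: false }   # use per-batch statistics during training
}
layer {
  name: "conv1_scale"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param { bias_term: true }                # learned scale and shift after normalization
}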

Yin

Evan Shelhamer

Nov 20, 2015, 3:51:32 PM
to Yin Li, Caffe Users
"adding batch normalization will help"

Actually batch norm might not help and could even hurt for batch size == 1, as is common for FCNs. The estimates of mean and variance will be noisy. ParseNet [1] shows that L2 normalization helps for skip architecture training.
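For reference, a rough sketch of such an L2 normalization layer, assuming the Normalize layer from the ParseNet/SSD fork of Caffe (it is not in mainline BVLC Caffe) and a hypothetical skip blob named "score_pool4":

layer {
  name: "score_pool4_norm"
  type: "Normalize"                     # L2-normalizes across channels at each spatial location
  bottom: "score_pool4"
  top: "score_pool4_norm"
  norm_param {
    across_spatial: false               # normalize each position independently
    channel_shared: false               # learn a separate scale per channel
    scale_filler { type: "constant" value: 20 }   # initial value of the learned scale
  }
}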


Etienne Perot

Dec 15, 2015, 2:05:27 PM
to Caffe Users, happyh...@gmail.com
Hello there,

"Actually batch norm 
​might not​
 help and could even hurt for batch size == 1"

Why? From the Batch Norm paper:

"For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m′ = |B| = m · p q."
 
So if we put a batch normalization layer between the last convolutional layer (which outputs the class scores) and the SoftmaxWithLoss, shouldn't it properly average over locations, simulating a much larger batch than 1 (actually as large as the deconv output, right)? A sketch of what I mean follows below.
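(A rough sketch of that arrangement, assuming hypothetical blob names "score" for the upsampled class scores and "label" for the ground truth; the Scale layer that usually follows BatchNorm is omitted for brevity:)

layer {
  name: "score_bn"
  type: "BatchNorm"
  bottom: "score"
  top: "score_bn"
  batch_norm_param { use_global_stats: false }   # statistics pooled over all spatial locations in the mini-batch
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "score_bn"
  bottom: "label"
  top: "loss"
}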

Once again, thanks a lot in advance for your answer man

Etienne