When I set debug_info to true in the solver, the log shows that the diff becomes nan for many layers, starting at conv6_2_bn, during the backward pass when I use the test net as well. When I only use the train net, everything works fine.
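For reference, the per-layer data/diff lines below come from the debug_info flag in the solver prototxt; a minimal sketch of the relevant part (the net path is just a placeholder, not my actual file):

net: "path/to/train_val.prototxt"   # placeholder path
debug_info: true                    # log [Forward]/[Backward] data and diff for every layer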
With test_net:
...
I0308 12:18:15.705454 32336 net.cpp:604] [Forward] Layer conv6_3, param blob 1 data: 0.682968
I0308 12:18:15.705965 32336 net.cpp:594] [Forward] Layer loss6, top blob loss6 data: 139479
I0308 12:18:15.706037 32336 net.cpp:618] [Backward] Layer loss6, bottom blob conv6_3 diff: 0.184859
I0308 12:18:15.709100 32336 net.cpp:618] [Backward] Layer conv6_3, bottom blob conv6_2 diff: 0.272764
I0308 12:18:15.709146 32336 net.cpp:627] [Backward] Layer conv6_3, param blob 0 diff: 2517.66
I0308 12:18:15.709172 32336 net.cpp:627] [Backward] Layer conv6_3, param blob 1 diff: 11723.2
I0308 12:18:15.709393 32336 net.cpp:618] [Backward] Layer relu6_2, bottom blob conv6_2 diff: 0.081838
I0308 12:18:15.712673 32336 net.cpp:618] [Backward] Layer conv6_2_bn, bottom blob conv6_2 diff: nan
I0308 12:18:15.712704 32336 net.cpp:627] [Backward] Layer conv6_2_bn, param blob 0 diff: 3900.29
I0308 12:18:15.712730 32336 net.cpp:627] [Backward] Layer conv6_2_bn, param blob 1 diff: 5232.67
I0308 12:18:15.716105 32336 net.cpp:618] [Backward] Layer conv6_2, bottom blob conv6_1 diff: nan
I0308 12:18:15.716148 32336 net.cpp:627] [Backward] Layer conv6_2, param blob 0 diff: nan
I0308 12:18:15.716172 32336 net.cpp:627] [Backward] Layer conv6_2, param blob 1 diff: nan
I0308 12:18:15.716389 32336 net.cpp:618] [Backward] Layer relu6_1, bottom blob conv6_1 diff: nan
I0308 12:18:15.719676 32336 net.cpp:618] [Backward] Layer conv6_1_bn, bottom blob conv6_1 diff: nan
...
without test_net:
...
I0308 12:22:57.532416 355 net.cpp:594] [Forward] Layer loss6, top blob loss6 data: 227830
I0308 12:22:57.532491 355 net.cpp:618] [Backward] Layer loss6, bottom blob conv6_3 diff: 0.184859
I0308 12:22:57.535567 355 net.cpp:618] [Backward] Layer conv6_3, bottom blob conv6_2 diff: 0.272764
I0308 12:22:57.535600 355 net.cpp:627] [Backward] Layer conv6_3, param blob 0 diff: 2517.66
I0308 12:22:57.535636 355 net.cpp:627] [Backward] Layer conv6_3, param blob 1 diff: 11723.4
I0308 12:22:57.535858 355 net.cpp:618] [Backward] Layer relu6_2, bottom blob conv6_2 diff: 0.081838
I0308 12:22:57.539136 355 net.cpp:618] [Backward] Layer conv6_2_bn, bottom blob conv6_2 diff: 0.0686222
I0308 12:22:57.539165 355 net.cpp:627] [Backward] Layer conv6_2_bn, param blob 0 diff: 3900.23
I0308 12:22:57.539189 355 net.cpp:627] [Backward] Layer conv6_2_bn, param blob 1 diff: 5232.67
I0308 12:22:57.542528 355 net.cpp:618] [Backward] Layer conv6_2, bottom blob conv6_1 diff: 0.29655
I0308 12:22:57.542564 355 net.cpp:627] [Backward] Layer conv6_2, param blob 0 diff: 738.306
I0308 12:22:57.542600 355 net.cpp:627] [Backward] Layer conv6_2, param blob 1 diff: 0.000258359
I0308 12:22:57.542820 355 net.cpp:618] [Backward] Layer relu6_1, bottom blob conv6_1 diff: 0.147435
I0308 12:22:57.546093 355 net.cpp:618] [Backward] Layer conv6_1_bn, bottom blob conv6_1 diff: 0.0291008
...
So the problem clearly occurs in the backward-pass computation. Does anybody know what causes this? I should mention that Amulet is based on
Caffe SegNet, which in turn is based on an old version of Caffe.
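One thing I can still try is to step through the net in pycaffe and find the first blob whose diff is non-finite after one forward/backward pass; a rough sketch, assuming the standard pycaffe API (paths are placeholders):

import numpy as np
import caffe

caffe.set_mode_gpu()
# placeholder paths; point these at the actual prototxt and weights
net = caffe.Net('train_val.prototxt', 'init.caffemodel', caffe.TRAIN)

net.forward()
net.backward()

# blobs are stored in net-definition order, so walk them in reverse
for name in reversed(list(net.blobs.keys())):
    diff = net.blobs[name].diff
    if not np.isfinite(diff).all():
        print('first non-finite diff at blob:', name)
        break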