Different loss when using test net

tkra...@gmail.com

Mar 7, 2018, 12:37:23 PM
to Caffe Users
I recently wanted to finetune Amulet (https://arxiv.org/abs/1708.02001) for a more specific task. As a first step, I used a single image as training data and a single image as test data to check the generalization capability. The loss decreases as long as I only specify the train net in the solver.prototxt. But when I add the test net, the loss climbs to a maximum value of 34333668.0 after a few iterations, without changing any other hyperparameter except test_iter and test_interval.

With this solver the loss decreases:

base_lr: 1e-11
display: 100
max_iter: 100
lr_policy: "step"
gamma: 0.5
momentum: 0.9
stepsize: 50000
solver_mode: GPU
random_seed: 42
net: "experiments/17_testNet/train_test.prototxt"
solver_type: NESTEROV
average_loss: 1
iter_size: 1



With this solver, the loss explodes to the maximum value mentioned above (second plot, with blue as the train loss and orange as the test loss):

test_iter: 1
test_interval: 10
base_lr: 1e-11
display: 100
max_iter: 100
lr_policy: "step"
gamma: 0.5
momentum: 0.9
stepsize: 50000
solver_mode: GPU
random_seed: 42
net: "experiments/17_testNet/train_test.prototxt"
solver_type: NESTEROV
average_loss: 1
iter_size: 1
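
For completeness: instead of a single net with phase-specific layers, Caffe's SolverParameter can also point to separate definitions via train_net and test_net. A minimal sketch, with hypothetical file names:

train_net: "experiments/17_testNet/train.prototxt"
test_net: "experiments/17_testNet/test.prototxt"
test_iter: 1
test_interval: 10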

The relevant part of the train_test.prototxt:

layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label_Pseudo_trainval"
  include {
    phase: TRAIN
  }
  transform_param {
    mean_value: 104
    mean_value: 117
    mean_value: 123
  }
  image_data_param {
    source: "/vol/lochfleck/experiments/15_trainAmuletOneImage/training_image.txt"
    batch_size: 1
    new_height: 256
    new_width: 256
  }
}
layer {
  name: "label"
  type: "ImageData"
  top: "label"
  top: "label_Pseudo_gt"
  include {
    phase: TRAIN
  }
  image_data_param {
    source: "/vol/lochfleck/experiments/15_trainAmuletOneImage/training_image_gt.txt"
    is_color:false
    batch_size: 1
    new_height: 256
    new_width: 256
  }
}
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label_Pseudo_trainval"
  include {
    phase: TEST
  }
  transform_param {
    mean_value: 104
    mean_value: 117
    mean_value: 123
  }
  image_data_param {
    source: "experiments/17_testNet/test_images.txt"
    batch_size: 1
    new_height: 256
    new_width: 256
  }
}
layer {
  name: "label"
  type: "ImageData"
  top: "label"
  top: "label_Pseudo_gt"
  include {
    phase: TEST
  }
  image_data_param {
    source: "experiments/17_testNet/test_images_gt.txt"
    is_color:false
    batch_size: 1
    new_height: 256
    new_width: 256
  }
}
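
For reference, the source files above are plain text lists read by the ImageData layer, one line per image consisting of a path and an integer label (the label is a dummy here, since the actual ground truth comes from the second ImageData layer). A sketch with made-up paths:

/vol/lochfleck/data/training_image.jpg 0
/vol/lochfleck/data/training_image_gt.png 0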

Does anyone have an idea what's going wrong here?

tkra...@gmail.com

Mar 8, 2018, 6:45:20 AM
to Caffe Users
When I set debug_info to true in the solver, I see that the diff becomes nan for many layers, starting at conv6_2_bn during the backward pass, when I use the test net as well. When I only use the train net, everything works fine.
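
(The flag I mean is the standard debug_info field in the solver prototxt, i.e.:

debug_info: true
)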

With test_net:
...
I0308 12:18:15.705454 32336 net.cpp:604]     [Forward] Layer conv6_3, param blob 1 data: 0.682968
I0308 12:18:15.705965 32336 net.cpp:594]     [Forward] Layer loss6, top blob loss6 data: 139479
I0308 12:18:15.706037 32336 net.cpp:618]     [Backward] Layer loss6, bottom blob conv6_3 diff: 0.184859
I0308 12:18:15.709100 32336 net.cpp:618]     [Backward] Layer conv6_3, bottom blob conv6_2 diff: 0.272764
I0308 12:18:15.709146 32336 net.cpp:627]     [Backward] Layer conv6_3, param blob 0 diff: 2517.66
I0308 12:18:15.709172 32336 net.cpp:627]     [Backward] Layer conv6_3, param blob 1 diff: 11723.2
I0308 12:18:15.709393 32336 net.cpp:618]     [Backward] Layer relu6_2, bottom blob conv6_2 diff: 0.081838
I0308 12:18:15.712673 32336 net.cpp:618]     [Backward] Layer conv6_2_bn, bottom blob conv6_2 diff: nan
I0308 12:18:15.712704 32336 net.cpp:627]     [Backward] Layer conv6_2_bn, param blob 0 diff: 3900.29
I0308 12:18:15.712730 32336 net.cpp:627]     [Backward] Layer conv6_2_bn, param blob 1 diff: 5232.67
I0308 12:18:15.716105 32336 net.cpp:618]     [Backward] Layer conv6_2, bottom blob conv6_1 diff: nan
I0308 12:18:15.716148 32336 net.cpp:627]     [Backward] Layer conv6_2, param blob 0 diff: nan
I0308 12:18:15.716172 32336 net.cpp:627]     [Backward] Layer conv6_2, param blob 1 diff: nan
I0308 12:18:15.716389 32336 net.cpp:618]     [Backward] Layer relu6_1, bottom blob conv6_1 diff: nan
I0308 12:18:15.719676 32336 net.cpp:618]     [Backward] Layer conv6_1_bn, bottom blob conv6_1 diff: nan
...

without test_net:
...
I0308 12:22:57.532416   355 net.cpp:594]     [Forward] Layer loss6, top blob loss6 data: 227830
I0308 12:22:57.532491   355 net.cpp:618]     [Backward] Layer loss6, bottom blob conv6_3 diff: 0.184859
I0308 12:22:57.535567   355 net.cpp:618]     [Backward] Layer conv6_3, bottom blob conv6_2 diff: 0.272764
I0308 12:22:57.535600   355 net.cpp:627]     [Backward] Layer conv6_3, param blob 0 diff: 2517.66
I0308 12:22:57.535636   355 net.cpp:627]     [Backward] Layer conv6_3, param blob 1 diff: 11723.4
I0308 12:22:57.535858   355 net.cpp:618]     [Backward] Layer relu6_2, bottom blob conv6_2 diff: 0.081838
I0308 12:22:57.539136   355 net.cpp:618]     [Backward] Layer conv6_2_bn, bottom blob conv6_2 diff: 0.0686222
I0308 12:22:57.539165   355 net.cpp:627]     [Backward] Layer conv6_2_bn, param blob 0 diff: 3900.23
I0308 12:22:57.539189   355 net.cpp:627]     [Backward] Layer conv6_2_bn, param blob 1 diff: 5232.67
I0308 12:22:57.542528   355 net.cpp:618]     [Backward] Layer conv6_2, bottom blob conv6_1 diff: 0.29655
I0308 12:22:57.542564   355 net.cpp:627]     [Backward] Layer conv6_2, param blob 0 diff: 738.306
I0308 12:22:57.542600   355 net.cpp:627]     [Backward] Layer conv6_2, param blob 1 diff: 0.000258359
I0308 12:22:57.542820   355 net.cpp:618]     [Backward] Layer relu6_1, bottom blob conv6_1 diff: 0.147435
I0308 12:22:57.546093   355 net.cpp:618]     [Backward] Layer conv6_1_bn, bottom blob conv6_1 diff: 0.0291008
...

So the backward pass is clearly where things go wrong. Does anybody know what the problem is? I should mention that Amulet is based on Caffe SegNet, and Caffe SegNet is based on an old version of Caffe.
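
In case it helps anyone reproduce this, here is a rough pycaffe sketch for listing which blobs end up with nan gradients after a single solver step. The solver path is a placeholder, and this assumes the SegNet fork still exposes the standard pycaffe interface:

import numpy as np
import caffe

caffe.set_mode_gpu()
# Placeholder path: substitute the actual solver file.
solver = caffe.get_solver('experiments/17_testNet/solver.prototxt')
solver.step(1)  # one forward/backward pass

net = solver.net
# Report every blob whose gradient contains nan after the backward pass.
nan_blobs = [name for name, blob in net.blobs.items()
             if np.isnan(blob.diff).any()]
print('blobs with nan diffs:', nan_blobs)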