Caffe for regression predicts extremely wrong values, but low loss?

Bo Moon

Mar 4, 2016, 6:56:08 AM3/4/16
to Caffe Users
I'm trying to use Caffe for regression, but I'm getting strange results and am not sure what is causing them. I train on a small dataset using EuclideanLoss and get loss = -0.000549837 (which I'm already wary of, since I'm not sure how EuclideanLoss can even be negative, and the value seems too large to be a rounding error). All the vector labels are values < 1, by the way. Afterwards, I do a forward pass on one of the training images, which has the vector label [.1, .2, .3, .4, .5], but the net predicts [8098.089, 2752.4197, 1124.7037, 16813.717, 4724.4897]. That is clearly wrong, yet the loss is so low, so I'm very confused. Am I reading the results from the net incorrectly, or perhaps using the wrong layer?

Solver:
net: "models/simplenet.prototext"
test_iter: 1
test_interval: 100
base_lr: 0.000001
momentum: 0.7
weight_decay: 0.0005
lr_policy: "inv"
gamma: 0.0001
power: 0.75
display: 50
max_iter: 4000
snapshot: 200
snapshot_prefix: "models/snapshots/simplenet/"
solver_mode: CPU

My deploy prototext is:
name: "simplenet"
input: "data"
input_dim: 10
input_dim: 3
input_dim: 224
input_dim: 224

layer {
  name: "conv1"
  type: "Convolution"
  param { lr_mult: 1 }
  param { lr_mult: 2 }
  convolution_param {
    num_output: 20 
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }

  bottom: "data" 
  top: "conv1"
}

layer {
  name: "pool1"
  type: "Pooling"
  pooling_param { 
    kernel_size: 2
    stride: 2
    pool: MAX
  }
  bottom: "conv1"
  top: "pool1"
}


layer {
  name: "ip1"
  type: "InnerProduct"
  param { lr_mult: 1 }
  param { lr_mult: 2 }
  inner_product_param {
    num_output: 5
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
  bottom: "pool1"
  top: "ip1"
}

layer {
  name: "prob"
  type: "Softmax"
  bottom: "ip1"
  top: "prob"
}

I included the prob layer because I was following some tutorials online, but I don't read from it. To extract the regression predictions, I look at the ip1 blob and view the values there. I confirm the blob shape is 1x5, which matches the vector label shape in the training data. Is this not the way to view predictions, or am I using a wrong parameter somewhere?
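
For reference, this is roughly how I read the predictions out in pycaffe (the paths, snapshot name, and preprocessing below are just placeholders, not exactly what I ran):

import caffe

caffe.set_mode_cpu()
net = caffe.Net('models/simplenet_deploy.prototext',
                'models/snapshots/simplenet/_iter_2000.caffemodel',
                caffe.TEST)

# load one training image and bring it into the 3x224x224 layout the net expects
img = caffe.io.load_image('image1.jpg')              # HxWx3, float in [0,1]
img = caffe.io.resize_image(img, (224, 224))
net.blobs['data'].reshape(1, 3, 224, 224)
net.blobs['data'].data[0] = img.transpose(2, 0, 1)   # to CxHxW

net.forward()
print(net.blobs['ip1'].data[0])                      # the 5 regression outputs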

Jan

Mar 4, 2016, 7:14:07 AM3/4/16
to Caffe Users
Oh yes, negative loss values definitely indicate something strange going on, as they should not be possible.

A Softmax layer has nothing to do with regression whatsoever; a regression network should not contain softmax. In my regression networks I just have an InnerProduct layer followed by a Euclidean loss. In the deploy net I just drop the loss layer; the output is the top blob of the IP layer. So yes, in your case reading the ip1 blob should do it.

Note that your network compresses information like crazy in a single step: your IP layer goes from 110x110x20 values to just 5. I wouldn't expect great performance...
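
To put numbers on it: a 5x5 convolution with stride 1 takes 224x224 down to 220x220, the 2x2 pooling with stride 2 halves that to 110x110, so ip1 maps 110 * 110 * 20 = 242,000 inputs straight to 5 outputs, which is about 1.2 million weights in that single layer.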

I see neither your training net nor your data, so I don't know where exactly things go wrong, but a negative loss is definitely not normal, so something is going wrong somehow. Need more info.

Jan

Bo Moon

Mar 4, 2016, 9:25:40 AM3/4/16
to Caffe Users
Thanks for the explanation! My goal right now isn't performance, but rather knowing I'm capable of making a regression net, so I'm trying a super simple data set. My data looks like:

image1.jpg .1 .2 .3 .4 .5
image2.jpg .90 .54 .12 .65 .34
...(some more images)...
image9.jpg .98 .12 .44 .33 22
image10.jpg .10 .32 .13 .5 .93

where I parse the images and values in a Python script and store them in an HDF5 file (I confirmed that the parsed and stored values are correct).
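
Roughly, the conversion script does this (the file names and the exact preprocessing are placeholders, sketched from memory):

import h5py
import numpy as np
import caffe

lines = [l.split() for l in open('labels.txt')]          # "imageN.jpg v1 v2 v3 v4 v5"
data = np.zeros((len(lines), 3, 224, 224), dtype=np.float32)
labels = np.zeros((len(lines), 5), dtype=np.float32)
for i, parts in enumerate(lines):
    img = caffe.io.load_image(parts[0])                  # HxWx3, float in [0,1]
    img = caffe.io.resize_image(img, (224, 224))
    data[i] = img.transpose(2, 0, 1)                     # to CxHxW
    labels[i] = [float(v) for v in parts[1:]]

with h5py.File('train.h5', 'w') as f:
    f.create_dataset('data', data=data)
    f.create_dataset('label', data=labels)

with open('models/train_h5_list.txt', 'w') as f:
    f.write('train.h5\n')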

I'm very glad you pointed out the compression issue--that hadn't occurred to me. However, shouldn't I have a really high loss? Could there be a problem with the size of the labels or the number of images perhaps? I'll go ahead and try to fix the compression to see if that changes anything.

My training prototext (the only difference in the deploy net is that I drop the input layer and the EuclideanLoss and add the Softmax, which I don't use and will now remove):
name: "simplenet"

layer {
  name: "input"
  type: "HDF5Data"
  top: "data"
  top: "label"

  hdf5_data_param {
    source: "models/train_h5_list.txt"
    batch_size: 1
  }
}

(conv1, pool1, and ip1 are the same as in the deploy net above)

layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "ip1"
  bottom: "label"
}


Jan

Mar 4, 2016, 9:46:05 AM3/4/16
to Caffe Users
Mhm, all of that looks ok to me. If you say the data is correctly formatted, then something really weird is going on, so I want to make absolutely sure of that. Could you upload the generated hdf5 for the ten images (it shouldn't need too much space)? Yes, you should have a high loss, especially at the start. Your base_lr looks really, really small to me, so the network will probably not learn very well at all, but that is a different question.

Jan

Bo Moon

Mar 4, 2016, 10:00:27 AM3/4/16
to Caffe Users
Thanks! I attached my train.h5 file. About the learning rate, I originally got loss = nan with a rate of 0.001, but I read that lowering the rate helps with that, so I lowered it and the nan went away.

I also just did a run with image sizes of 100x100. I see a loss on the order of 1e-6, but I'm not sure whether it would turn negative if I left it running long enough. The regression predictions are also much more reasonable but still quite off. I'll try manually computing the loss to see if they match up.
train.h5

Bo Moon

Mar 4, 2016, 10:58:36 AM3/4/16
to Caffe Users
What's also worrisome is that I get different losses for the training and testing phases. I did some tweaking, did another training run, and got a final loss of -0.0010292. Afterwards, I ran

caffe train -model=simplenet.prototxt -weights=snapshots/simplenet_iter_2000.caffemodel

which outputs after 50 iterations: loss = 6.91784e-06

In my prototext, I don't use any TRAIN/TEST layers, and the data layer always reads from the same hdf5 source, so the training and testing losses ought to be the same, right? I'm wondering whether maybe I ran something wrong, or what else could cause this.



Jan

Mar 6, 2016, 12:57:23 PM3/6/16
to Caffe Users
Well, your hdf file looks fine to me. That should not be the source of a problem.

However, it seems your problem kind of vanished, right?

I don't really understand your statement about TRAIN/TEST. The command you issued starts training with the weights taken from the given caffemodel file (commonly called "finetuning"), so it is not surprising that your loss improves after that. That has nothing to do with training loss vs test loss. It is curious what caffe does with this command, though; I always thought you need to give a -solver for "caffe train". Does it just continue the training from the snapshot?

Jan

Bo Moon

Mar 6, 2016, 1:59:17 PM3/6/16
to Caffe Users
To clarify, I meant that I don't use any layer parameters such as "include { phase: TRAIN }", so I think the training and testing results ought to be the same. Both issues (negative loss and wildly incorrect regression values) persist. I also just realized I made a typo in my previous post: I meant to say "caffe test ...", not "caffe train ..."--my apologies for any confusion. Basically, I see that the loss during training is negative, but if I run a test using the exact same training data, I get a positive loss, so I'm confused why running a test on the training data outputs a different result.
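
Concretely, what I ran was something along the lines of:

caffe test -model=simplenet.prototxt -weights=snapshots/simplenet_iter_2000.caffemodel -iterations=50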

I calculated the loss manually according to the formula listed under EuclideanLoss on the Caffe site (1/(2N) * the sum of squared differences), and I get a loss on the order of 10^6, whereas the training loss output by the binary is on the order of -10^-3. The huge manual loss makes sense given that the regression predictions are so large in magnitude while the correct labels are in the range [0,1], so it appears the values I'm reading from the net aren't even the same values being used in the loss calculation...
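
For the record, the manual check is just this (using the predictions from my very first post purely to illustrate the formula; N is the batch size):

import numpy as np

pred  = np.array([8098.089, 2752.4197, 1124.7037, 16813.717, 4724.4897])
label = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
N = 1                                         # batch_size is 1 in my train prototext
loss = np.sum((pred - label) ** 2) / (2 * N)  # 1/(2N) * sum of squared differences
print(loss)                                   # ~1.9e+08 here, nowhere near -1e-3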

I also used the GoogLeNet for object classification by importing the model and weights and giving it various images (of dogs, cars, etc.), and it identifies the labels correctly. It looks like I can run a pretrained classification net correctly. Is there a minimalistic toy example of net regression I can try that has an obvious result, e.g. training a net of one layer on a single image?

Jan

Mar 7, 2016, 2:47:55 AM3/7/16
to Caffe Users
There is no minimal example I know of, but you can easily make your own.
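
For instance, something along these lines gives you a regression problem with a known exact answer (the file and array names are just examples):

import h5py
import numpy as np

# tiny random "images" whose labels are a fixed linear function of the pixels,
# so a single InnerProduct layer can in principle fit them exactly
X = np.random.rand(100, 1, 8, 8).astype(np.float32)
W = np.random.rand(5, 64).astype(np.float32)
y = X.reshape(100, -1).dot(W.T).astype(np.float32)

with h5py.File('toy_train.h5', 'w') as f:
    f.create_dataset('data', data=X)
    f.create_dataset('label', data=y)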

The negative loss is really strange; the only cause I can think of is a range overflow of the float values, but I am not sure that is really applicable here. What you could do is put the whole thing in pycaffe, do a training step or a forward pass, and inspect the blobs manually to see in which layer the strange things happen. A toy example should help a lot there. Since there is squaring in the loss layer, negative values simply should not appear...
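
In pycaffe that kind of inspection is roughly this (the solver path is just an example):

import caffe

caffe.set_mode_cpu()
solver = caffe.SGDSolver('models/simplenet_solver.prototxt')
solver.step(1)                                    # one forward/backward pass

# look for NaNs or huge values creeping in somewhere
for name, blob in solver.net.blobs.items():
    print('{} {} {}'.format(name, blob.data.min(), blob.data.max()))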

Jan

Anne

Jul 14, 2016, 10:57:13 AM7/14/16
to Caffe Users
Dear Jan,
You said that his learning rate is very low. I have to set mine to 1e-25 to get rid of NaN errors... Do you have any experience with how to get rid of them without ending up with a model that does not learn anything?

Thanks a lot,
Anne

Youngson Zhao

Oct 31, 2016, 12:09:47 PM10/31/16
to Caffe Users
Hi, I am doing regression using caffe right now and I think I have met the same problem as you. My label is 11 but the net predicts 2290. The losses in train and in test are both very small. Did you solve this problem? Could you please give me some instructions? Thank you very much.


T Nguyen

Nov 27, 2016, 7:52:37 PM11/27/16
to Caffe Users
I got the same problem. The loss drops very fast and then goes negative after a few iterations. However, when I deployed the same code and model on another machine, it worked just fine.

I guess it is some kind of floating-point error, but I don't know where to catch it.

Any idea? 

Ranju Mandal

Jan 20, 2018, 2:59:02 AM1/20/18
to Caffe Users
I have got the same problem. Have you solved it? Please reply.