Fully convolutional neural net


Kien Nguyen Thanh

unread,
Jun 7, 2015, 10:00:23 PM6/7/15
to caffe...@googlegroups.com
Hi all,

Has anyone managed to run the semantic segmentation FCN models on the future branch of Caffe? I have worked with the previous version of Caffe for some time, but now I'm having trouble installing, running, and testing the model provided in the Model Zoo.

1) When installing using the same procedure as before, the make commands (make all, make test, and make pycaffe) return errors.
2) How should image data be prepared for segmentation? Do we use the same Python script, "classify.py", to segment the probe images?

I appreciate any ideas. Thanks in advance.


Christopher Catton

unread,
Jun 7, 2015, 10:23:48 PM6/7/15
to caffe...@googlegroups.com
1) Could you provide more detail on the errors? I'm guessing either you do not have a dependency installed or your graphics driver needs to be updated (the latest drivers from the NVIDIA website should work).
2) You can use a single image if you are just testing the model. Eval.py in the PASCAL-Context models shows how to test the model. If you are looking to train the model on your own dataset, then you are probably about where I am; I'm still having trouble getting that bit going.

Kien Nguyen Thanh

unread,
Jun 8, 2015, 9:05:01 PM6/8/15
to caffe...@googlegroups.com

Thanks for the response Chris.
1) I got an error on the math_functions file: "make: *** [build/src/caffe/util/math_functions.cuo] Error 2". After modifying that file, the make commands work well now.
2) While using Eval.py to segment an image, I got the following error:
"  File "python/eval.py", line 15, in <module>
    net = caffe.Net('examples/FCN/deploy.prototxt', 'examples/FCN/fcn-32s-pascalcontext.caffemodel', caffe.TEST)
AttributeError: 'module' object has no attribute 'TEST'".
Did you have this problem before?
3) By the way, will Eval.py output a segmented image, or will we need to add extra code to present it?

Thanks.

Christopher Catton

unread,
Jun 9, 2015, 1:51:50 AM6/9/15
to caffe...@googlegroups.com
1) I've never needed to modify any of the Caffe code to build it. Are you using ATLAS? I think I might have had a similar error using OpenBLAS.
2) Are you exporting the python path as described in the installation guide?
3) You'll need to add code to the script to present or store the output as you want.

Do you have any problems building the master branch of the Caffe repository (https://github.com/BVLC/caffe)? If that branch builds fine, you may want to run "git branch" and make sure that you have cloned the correct branch.

Kien Nguyen Thanh

unread,
Jun 9, 2015, 9:13:09 PM6/9/15
to caffe...@googlegroups.com
1) I am using ATLAS.
2) I have had no problem installing the master branch so far; everything works perfectly.

I still haven't found a way to work around the "AttributeError: 'module' object has no attribute 'TEST'" issue in eval.py.

Cheers

Kien Nguyen Thanh

unread,
Jun 9, 2015, 9:32:20 PM6/9/15
to caffe...@googlegroups.com

Got it working. Thanks, Chris, for all the helpful discussion.

Carlos Treviño

unread,
Jun 10, 2015, 4:53:57 AM6/10/15
to caffe...@googlegroups.com
The following PR explains a little about how to generate the data for semantic segmentation; nevertheless, I'm still stuck on that part.

https://github.com/BVLC/caffe/issues/1698

eran paz

unread,
Jul 5, 2015, 3:01:06 AM7/5/15
to caffe...@googlegroups.com
Hi
Were you able to run the network?
I'm having some trouble with the label matrix.
I've created an image with 0 as background and 1...K marking pixels belonging to each class.
I've created the lmdb according to PR#1698 for both images and labels.
When I run the net I get this error:
Check failed: outer_num_ * inner_num_ == bottom[1]->count() (187500 vs. 562500) Number of labels must match number of predictions; e.g., if softmax axis == 1 and prediction shape is (N, C, H, W), label count (number of labels) must be N*H*W, with integer values in {0, 1, ..., C-1}.

As far as I can tell, the problem is that my labels are saved with 3 channels and not 1, but I couldn't figure out how to save them with 1 channel.

Any help would be appreciated

THX

Gavin Hackeling

unread,
Jul 7, 2015, 11:53:34 PM7/7/15
to caffe...@googlegroups.com
Yes, the problem appears to be that your labels have three channels instead of one channel. Assuming that your image has the shape (C, H, W) and that the channel containing your integer class labels is c, you can index that channel using "img = img[c, :, :]".
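
If it helps, here is a minimal sketch of writing single-channel integer labels to LMDB (assuming the labels are already (H, W) uint8 class maps; the database path and label list are placeholders):

import lmdb
import numpy as np
import caffe

labels = [np.zeros((500, 375), dtype=np.uint8)]  # replace with your real label maps

db = lmdb.open('labels_lmdb', map_size=int(1e12))  # placeholder database path
with db.begin(write=True) as txn:
    for i, label in enumerate(labels):
        # add a singleton channel axis so the datum shape is (1, H, W)
        datum = caffe.io.array_to_datum(label[np.newaxis, :, :])
        txn.put('{:0>10d}'.format(i), datum.SerializeToString())
db.close()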

Mansi Rankawat

unread,
Jul 18, 2015, 4:01:47 PM7/18/15
to caffe...@googlegroups.com
Hi,

I am training the FCN-32 network using pretrained weights from the ILSVRC 16-layer VGG net and fine-tuning on the PASCAL VOC 11 dataset. Even after 17000 iterations the loss remains constant at 3.04452. Could you help me understand why the loss is not decreasing at all? I am using the code here to create the lmdb files (https://github.com/BVLC/caffe/issues/1698).

Thanks,
Mansi

Ben Gee

unread,
Jul 31, 2015, 5:48:15 AM7/31/15
to Caffe Users, mansira...@gmail.com
Hi Mansi, have you solved your problem? I think you might have a problem with the lmdb data.
Have you tested the model on the PASCAL VOC 11 or PASCAL-Context dataset?
I'm having trouble obtaining the same results as reported.

Youssef Kashef

unread,
Aug 4, 2015, 6:34:06 AM8/4/15
to Caffe Users
Hello everyone,

I've been trying to train FCN-32s (fully convolutional semantic segmentation) on PASCAL-Context but keep getting very high loss values regardless of the number of iterations: "Train net output #0: loss = 767455 (* 1 = 767455 loss)". Sometimes it goes as low as 440K, but then it jumps back up to something higher and oscillates.
Ignoring the high loss and letting it run through 80K iterations, I still end up with a network that produces all-zero output.
I can't tell what's throwing it off like that.

My procedure in detail:
  1. Follow the instructions in future.sh from longjon:future, except that I apply the PR merges to BVLC:master instead of longjon:master. Building off longjon:future results in cuDNN build errors like here. Applying some of the PR merges to BVLC:master is redundant, since they've already been merged into the master branch.
  2. Build Caffe with CUDA 6.5 and cuDNN. I've tried a CPU-only build and got the same high-loss behavior, so I don't think it's related to the GPU or the driver (then again, I only let it run for 5K iterations).
  3. Generate LMDBs for the PASCAL-Context database. The lmdb generation script is built around Evan Shelhamer's Python snippet in this comment in PR#1698.
    1. The images are stored as numpy.uint8 with shape C x H x W, with C=3.
    2. The ground truth is stored as numpy.int64 with shape C x H x W, with C=1.
    3. The order in which the images are stored matches that of the ground truth. One lmdb for the images and one for the labels; I have two pairs of each to reflect the train/val split.
  4. Use net surgery to turn VGG-16 into a fully convolutional model. For this I pretty much followed the net surgery example and used the layer definitions from the FCN-32s trainval.prototxt in the Model Zoo.
    1. Not sure I did this one right, though. The output I get for the cat image is still a single-element 2D matrix.
    2. I've tried using the VGG-16 FCN weights from HyeonwooNoh/DeconvNet/model/get_model.sh but still get the same behavior.
    3. How can I better verify my fully convolutional VGG-16?
  5. Apply the solve.py step for initializing the deconv parameters (see the bilinear-filler sketch after this list). According to Shelhamer's post here, not doing this could leave things stuck at zero.
    1. What's a good way of verifying that the initialization is correct? I'm worried the problem is there.
  6. solver.prototxt and trainval.prototxt are identical to those shared on the Model Zoo; they only differ in the paths to the lmdbs.
  7. When I start training, I get "Train net output #0: loss = 767455 (* 1 = 767455 loss)"; sometimes it goes down by several hundred thousand, but I never see the values < 10.0 that I've seen some report.
I could really use some help figuring out what I'm doing wrong and understanding why the loss is so high. People seem to have figured out how to train these FCNs without a detailed guide, so I'm presumably missing a critical step somewhere.
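
For reference, here's a sketch of the bilinear kernel that solve.py uses to seed the Deconvolution weights, adapted from the FCN surgery snippet (the layer-selection loop and the usage line are my own illustration):

import numpy as np

def upsample_filt(size):
    # bilinear interpolation kernel of the given spatial size
    factor = (size + 1) // 2
    if size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = np.ogrid[:size, :size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

def interp_surgery(net, layers):
    # seed each (m, k, h, w) deconv weight blob with the bilinear kernel
    for l in layers:
        m, k, h, w = net.params[l][0].data.shape
        if m != k or h != w:
            print('skipping %s: unexpected deconv shape' % l)
            continue
        net.params[l][0].data[range(m), range(k), :, :] = upsample_filt(h)

# usage sketch, assuming the deconv layer is named 'upscore':
# interp_surgery(solver.net, ['upscore'])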

Thank you

Etienne Perot

unread,
Aug 4, 2015, 12:42:00 PM8/4/15
to Caffe Users
Thanks, Youssef Kashef, for sharing your detailed procedure.

1. I just built it without cuDNN; not sure why we need to fuse.
3. I found that HDF5 was pretty easy to use; you create the dataset this way (maxSamples, the image/mask dimensions, and the data/masks lists are assumed to be defined):

import h5py
import numpy as np
import caffe

f = h5py.File('train.h5', 'w')
f.create_dataset("Images", (maxSamples, 3, image_height, image_width), dtype='float32')
# in practice you want mask_height to be equal to image_height
f.create_dataset("Mask", (maxSamples, 1, mask_height, mask_width), dtype='float32')

# write your data as numpy arrays (use a transformer for the image)
shape = (1, 3, image_height, image_width)
transformer = caffe.io.Transformer({'data': shape})
transformer.set_mean('data', np.array([100, 109, 113]))
transformer.set_transpose('data', (2, 0, 1))
transformer.set_raw_scale('data', 255.0)

n = 0
for img, mask in zip(data, masks):
    f["Images"][n] = transformer.preprocess('data', img)
    f["Mask"][n] = mask.reshape((1, mask_height, mask_width))
    n = n + 1  # pardon my French


   

4. You do not need this step! You can just fine-tune from your model and replace the "InnerProduct" layers by "Convolution". Of course, by doing so, all weights in the fully connected part will be gone, but they are fast to train.


5. Here I'm not sure we need this: it seems this initialization is possible:


layer {
  type: "Deconvolution"
  ...
  convolution_param {
    ...
    weight_filler {
      type: "bilinear"
    }
  }
}


6. About the solver: it seems to use a very high momentum and a very small base learning rate (1e-10) for the unnormalized softmax; I'm not sure I understand why...

Youssef Kashef

unread,
Aug 4, 2015, 12:51:03 PM8/4/15
to Caffe Users
Some indication of progress:
I think my problem of very high loss was due to all-zero parameters in the fc6 and fc7 layers.
The FCN-32s model is trained by fine-tuning the fully convolutional variant of the VGG-16 model (vgg16fc).
The VGG-16 model is made fully convolutional by following the net_surgery notebook example.
solve.py describes how to set things up for training, specifically how to initialize the weights of layer fc7 and all preceding layers. Then it shows how to initialize the weights of the remaining Deconvolution layer.
The initialization of all layers up to and including fc7 is done by calling solver.net.copy_from(base_weights), where base_weights is the path to vgg16fc.caffemodel.
For some reason that step is not copying the weights of the fc6 and fc7 layers, leaving them all zero. All earlier conv layer weights are copied correctly.
I solved it by copying the remaining weights in Python, similar to the net_surgery notebook example (a sketch follows below).
I still need to verify what's keeping it from copying all layers.
It might be too early to claim victory, but I'm already seeing the loss drop after the first 200 iterations; a significant improvement over the earlier behavior.
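
For anyone hitting the same thing, a minimal sketch of that transplant (the prototxt/caffemodel paths are placeholders for my local files):

import caffe

solver = caffe.SGDSolver('solver.prototxt')
solver.net.copy_from('vgg16fc.caffemodel')  # copies the early conv layers

# manually transplant the two layers that copy_from left at zero
vgg16fc = caffe.Net('vgg16fc_deploy.prototxt', 'vgg16fc.caffemodel', caffe.TEST)
for name in ['fc6', 'fc7']:
    solver.net.params[name][0].data[...] = vgg16fc.params[name][0].data  # weights
    solver.net.params[name][1].data[...] = vgg16fc.params[name][1].data  # biases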

Youssef Kashef

unread,
Aug 4, 2015, 12:59:52 PM8/4/15
to Caffe Users
Hello Etienne,

Thanks for weighing in.
re-4: Not sure I can do without this step; otherwise the weights of my last two convolutional layers (fc6 and fc7) are all zeros and stay zero regardless of how many iterations.
re-6: Maybe the small learning rate is because it assumes you're fine-tuning VGG-16. Maybe this also contributed to the weights not changing in the last two conv layers.

Will let it run some more and see if my earlier assumption was valid.

Thanks

Youssef

Fatemeh Saleh

unread,
Aug 6, 2015, 4:56:47 AM8/6/15
to Caffe Users
Hi Youssef,

I follow exactly the same steps as you, and the loss value is about 623345 after roughly 4000 iterations! I would appreciate it if you could share any successful solution.

Thank you

Youssef Kashef

unread,
Aug 6, 2015, 5:18:22 AM8/6/15
to Caffe Users
Hey Fatemeh,

It seems there's still something I'm doing wrong.
One problem I had was that in solve.py the weights for fc6 and fc7 were not being copied correctly from my fully convolutional variant of the VGG-16 model. They were all zeros.
The weights of the earlier layers were copied correctly.
I ended up copying the weights for fc6 and fc7 using step 9 in the net_surgery notebook example.
That step is just about transplanting a set of parameters from one network to another.
My fc6 and fc7 are no longer zero.
Within 200 iterations of training FCN-32s with that initialization, the loss dropped from 600K+ to the 100K-300K range. I also noticed that the "Train net output #0: loss" and "Iteration X, loss = " values weren't the same anymore. Not sure why.
Unfortunately, I'm now at iteration 44K and the loss is still oscillating in that same range. So there's still something wrong with my setup.

Something I found odd was that the step for initializing the deconvolution layers in solve.py still leaves large blocks of all-zero parameters. I don't know if that's intended. Still waiting on a response from Evan Shelhamer in another thread.

Fatemeh Saleh

unread,
Aug 6, 2015, 5:34:47 AM8/6/15
to Caffe Users
Thank you very much for your complete answer.
So, it seems that I should also wait for the response from Evan Shelhamer. I will also try your solution of copying the weights for fc6 and fc7.

Thanks.

Evan Shelhamer

unread,
Aug 6, 2015, 11:53:43 AM8/6/15
to Caffe Users, Youssef Kashef
You are doing the right thing checking the weights as you go and should find your misstep this way.

The loss does oscillate quite a lot with whole image mini-batches but it should descend. Are you making use of gradient accumulation or high momentum as suggested by our paper and the model zoo models?

Are you starting with FCN-32s or directly training a skip architecture? The skips can be more sensitive to hyperparams if learned all at once than through stages of fine-tuning.


Youssef Kashef

unread,
Aug 6, 2015, 12:16:36 PM8/6/15
to Caffe Users, youssef...@gmail.com
Hello Evan,

That's reassuring. I'm 50K iterations in and see oscillations around 90K-110K (train loss = 36K-145K). The loss on the training set has much stronger variations. I'd like to think that the average loss is decreasing. I wasn't sure what magnitude to expect, because I have read people posting a loss of < 10.0 less than 20K iterations into training. What kind of loss magnitude should I expect?

Maybe it's worth computing the accuracy to compare with values in the paper.

I haven't gotten around to trying other hyperparameters yet. This is pretty much just taking trainval.prototxt from the Model Zoo. Isn't that the definition for FCN-32s (non-fixed) in the paper?
It doesn't use gradient accumulation; it uses a very slow learning rate (base_lr: 1e-10) and a very high momentum (momentum: 0.99).
I haven't looked into the skip architectures yet. I figured it would make more sense to train those by fine-tuning.

I guess I'm still at the point where I want to set a baseline for my experiments. Then I'll start reading the paper more closely regarding the recommended learning parameters, fiddle with the hyperparameters, and then take on the skip architectures. Do you think that's a reasonable order for going about things? I do want to get a feel for the parameters before doing anything crazy.

Thanks,

Youssef

Vladimir Nekrasov

unread,
Aug 7, 2015, 4:43:21 AM8/7/15
to Caffe Users
Hello Youssef,

I have tried to follow the same steps as you in 1), but unfortunately I got a 'merge conflict' message when merging the first PR:
CONFLICT (content): Merge conflict in include/caffe/vision_layers.hpp
Have you encountered the same problem?
If so, how have you solved it?

Vladimir



On Tuesday, August 4, 2015 at 13:34:06 UTC+3, Youssef Kashef wrote:

Youssef Kashef

unread,
Aug 7, 2015, 5:33:01 AM8/7/15
to Caffe Users
Hello Vladimir,

I manually resolved the conflict. The conflict was between the SPPLayer and CropLayer class definitions, so I basically disentangled the definitions. More details in this PR. Still pending feedback.

Youssef

Vladimir Nekrasov

unread,
Aug 7, 2015, 9:54:49 AM8/7/15
to Caffe Users
Youssef,

Thank you very much!
Everything has worked fine with your PR.

Vladimir

On Friday, August 7, 2015 at 12:33:01 UTC+3, Youssef Kashef wrote:

zzz

unread,
Aug 17, 2015, 5:32:01 PM8/17/15
to Caffe Users
Hi Youssef,

Thanks for your details.
For the PASCAL-Context database, there are 5105 training images. May I ask how you split the train/val data?
Thanks in advance for helping!

Zizhao

Youssef Kashef

unread,
Aug 18, 2015, 12:05:04 PM8/18/15
to Caffe Users
Hello Zizhao,

I sort of guessed. My train/val split is based on the 59-category segmentation results reported on the PASCAL-Context webpage. When you scroll down to the "Project Specific Downloads" section, you can download the segmentation results generated by Mottaghi et al.'s CVPR 2014 paper. They generated segmentations for 5105 images; those are the ones I grouped into the validation set. I don't know what their splitting strategy was, but I'm curious to learn what it is.

Here's a link to a text file with those 5105 image names (excluding extension).

Youssef

zzz

unread,
Aug 18, 2015, 1:47:12 PM8/18/15
to Caffe Users
Hi Youssef,

I am kind of confused. You said you split train/val based on the 59-category segmentation results (5105 labeled images in total), and you also said you use the 5105 as validation data. I want to make sure you are not using the fully labeled PASCAL training data with 400+ categories and 10000+ images, right? Did you split these 5105 images into train/val?
Thanks for your help !

Zizhao 

zzz

unread,
Aug 18, 2015, 2:07:08 PM8/18/15
to Caffe Users
Hi Youssef,

My train/val data for training the FCN comes entirely from those 5105 images from the PASCAL-Context webpage.
When I had everything set up and ran solve.py, I got an error in fc6:

I0818 13:46:41.312158  3458 net.cpp:703] Copying source layer relu5_1
I0818 13:46:41.312182  3458 net.cpp:703] Copying source layer conv5_2
I0818 13:46:41.331995  3458 net.cpp:703] Copying source layer relu5_2
I0818 13:46:41.332023  3458 net.cpp:703] Copying source layer conv5_3
I0818 13:46:41.351951  3458 net.cpp:703] Copying source layer relu5_3
I0818 13:46:41.351976  3458 net.cpp:703] Copying source layer pool5
I0818 13:46:41.351981  3458 net.cpp:703] Copying source layer fc6
F0818 13:46:41.351986  3458 blob.cpp:454] Check failed: ShapeEquals(proto) shape mismatch (reshape not set)

This error comes from blob.cpp. It looks like the reshape flag is set to false. Have you met this problem before? I thought it might be caused by how my data is organized in the lmdb, but I have checked that and it is fine.

template <typename Dtype>
void Blob<Dtype>::FromProto(const BlobProto& proto, bool reshape) {
  if (reshape) {
    vector<int> shape;
    if (proto.has_num() || proto.has_channels() ||
        proto.has_height() || proto.has_width()) {
      // Using deprecated 4D Blob dimensions --
      // shape is (num, channels, height, width).
      shape.resize(4);
      shape[0] = proto.num();
      shape[1] = proto.channels();
      shape[2] = proto.height();
      shape[3] = proto.width();
    } else {
      shape.resize(proto.shape().dim_size());
      for (int i = 0; i < proto.shape().dim_size(); ++i) {
        shape[i] = proto.shape().dim(i);
      }
    }
    Reshape(shape);
  } else {
    CHECK(ShapeEquals(proto)) << "shape mismatch (reshape not set)";
  }
}

Thanks for your help

Youssef Kashef

unread,
Aug 18, 2015, 3:11:22 PM8/18/15
to Caffe Users
Hi Zizhao,

Out of the 10,103 annotated images in the PASCAL-Context dataset, I set aside the 5105 images used by the authors to demonstrate their segmentation results as the validation set, keeping the rest for training. It's a near 50-50 split with no overlap between the two subsets.

Does this clarify things? Happy to discuss more. Unfortunately, I haven't gotten around to inspecting the distribution of the labels in both subsets.
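
In code form the split is essentially this (a sketch; the file names are placeholders for my own lists):

# val_names.txt: the 5105 names from the authors' result set
with open('val_names.txt') as f:
    val = set(line.strip() for line in f)
# all_names.txt: all 10,103 annotated image names
with open('all_names.txt') as f:
    train = sorted(set(line.strip() for line in f) - val)
print(len(train), len(val))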

Youssef Kashef

unread,
Aug 18, 2015, 3:18:42 PM8/18/15
to Caffe Users
Hi Zizhao,

What do you mean by "My train/val data is totally from those 5105 images from PASCAL-Context webpage"? Are you training and validating on the same data? If so, I think you need to split your train and val sets so that there's no overlap and you don't get misleading evaluations due to overfitting.

You need to turn the fc6 and fc7 layers of the VGG-16 model from fully connected into convolutional layers. I think the error you're getting in solve.py is because you're loading VGG-16 before making it fully convolutional. In that case, Caffe encounters a mismatch in the number of parameters it expects for layers fc6 and fc7; the trainval.prototxt for the FCN model expects them to be convolutional layers. A condensed sketch of the conversion follows.
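
This follows the net_surgery example (the deploy file names are placeholders, and I'm assuming the convolutional twins are named fc6-conv/fc7-conv as in the notebook):

import caffe

# source: VGG-16 with InnerProduct fc6/fc7; target: a deploy file where they
# are redefined as Convolution layers (fc6-conv/fc7-conv)
net = caffe.Net('vgg16_deploy.prototxt', 'vgg16.caffemodel', caffe.TEST)
net_fc = caffe.Net('vgg16fc_deploy.prototxt', 'vgg16.caffemodel', caffe.TEST)

for fc, conv in [('fc6', 'fc6-conv'), ('fc7', 'fc7-conv')]:
    # the weights are the same, just reshaped into convolution kernels
    net_fc.params[conv][0].data.flat = net.params[fc][0].data.flat
    net_fc.params[conv][1].data[...] = net.params[fc][1].data
net_fc.save('vgg16fc.caffemodel')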

Hope this helps.

zzz

unread,
Aug 19, 2015, 3:58:24 PM8/19/15
to Caffe Users
(Let's move the discussion back to the Google group.)
Hi Youssef,

This time I understood your data preparation method.
I think setting test_iter = 5105 is meant to test all val images in one testing pass.
If test_interval is higher than max_iter, testing should never be carried out. I think you are correct.

I haven't successfully trained the FCN yet. I will discuss with you when I reach the next step.
Thanks so much!!


On Wed, Aug 19, 2015 at 3:55 AM, Youssef Kashef <youssef...@gmail.com> wrote:
Hello Zizhao,

Yes, I'm doing the 59-category scenario where I group all categories outside the 59-set into the background class. So my outputs have dimensions 60xHxW.
Something I still don't understand when it comes to the number 5105:
In solver.prototxt, you see these two lines:

test_iter: 5105
# make test net, but don't invoke it from the solver itself
test_interval: 1000000

According to the solver of caffe's MNIST tutorial:
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500

I don't get it. The batch size in the FCN trainval.prototxt is exactly 1, I'm guessing due to memory restrictions.
This is what I think is going on; please correct me if I'm wrong:
test_iter and test_interval don't have anything to do with the batch size.
Since the batch size is 1, the solver will perform a forward pass on 5105 batches and compute the gradients for each training iteration; one iteration is equal to one image.
test_interval is so high that the solver never carries out testing. But we still see it printing the loss for each iteration, once on the training subset and once on val.

Is this correct?

Thanks,

Youssef

On Tue, Aug 18, 2015 at 10:26 PM, Zizhao Zhang <mr.zizh...@gmail.com> wrote:
Hi Youssef,

I am quite new to the segmentation task, but the meaning is clearer to me now.

I thought you split the 5105 images into train and val (e.g., 3000 for train and 2105 for val). The reason why I ask is that if you follow Evan's FCN training instructions, the output layer is 60xHxW (so 59 categories + 1 background). That's how I inferred you use all 5105 as train/val with a non-overlapping split.
However, the full PASCAL-Context dataset (10,103 segmentation masks) has more than 400 categories. So if you trained on that, the output of your last layer should be (400+)xHxW. Are you training that way, or do you set the labels outside the 59 categories as background?

For the bug, it is totally clear now. Thank you so much for your great help.





-- 
Best Regards,
Zizhao





Fatemeh Saleh

unread,
Aug 20, 2015, 7:33:32 AM8/20/15
to Caffe Users, youssef...@gmail.com
Hi Youssef,

Did you compare your accuracy with the paper? I have obtained the pixel accuracy and mean accuracy for the 5105 test images, and my results are 55.48 and 45.17 for FCN-32s and 58.57 and 47.01 for FCN-16s, which differ from the paper.
I am now trying to fine-tune using PASCAL VOC with 21 classes, with the train and validation sets mentioned in the paper, and the loss is also very big.
I did another experiment using (http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/03-fine-tuning.ipynb) to compare the fine-tuning loss and the from-scratch loss. The fine-tuning loss values are much bigger than the from-scratch loss values, and I'm now wondering if I am doing something wrong!!!
Would you please share any new information you have about training this network?

Thank you very much in advance.

Youssef Kashef

unread,
Aug 20, 2015, 8:00:11 AM8/20/15
to Caffe Users, youssef...@gmail.com
Hello Fatemeh,

I only got as far as training FCN-32s for 80K iterations, with only a qualitative assessment of the generated segmentations. They looked similar to those generated by the pretrained FCN-32s model in the Model Zoo. I haven't performed any quantitative evaluations yet. I did notice that the loss was still in the 10K range after 80K iterations; although lower than what I started with, I was hoping for lower values. The paper doesn't mention loss magnitudes except in Figure 5, and the context there is different, so I don't think those are the values one should expect to get.
You said your loss is very big, but does it decrease after 10K iterations?

BenG

unread,
Aug 20, 2015, 8:57:24 AM8/20/15
to Caffe Users, youssef...@gmail.com
Hi, what data are you using? I mean for the experiments on PASCAL VOC 2011 and 2012; how many training and validation images?
I use the data from the Berkeley SBD in addition to the PASCAL VOC data, about 10K images for training, and got a high loss, around 500K. I don't know why.

Youssef Kashef

unread,
Aug 20, 2015, 9:16:00 AM8/20/15
to Caffe Users, youssef...@gmail.com
Hello Ben,

Currently I'm only using the PASCAL-Context dataset with the 59-category subset. It's basically full-image annotations added to the images from VOC 2010. It has about 10K images, approx. 5x that of the VOC segmentation challenge, and is fully annotated. The train/val split is commonly 50/50.
Are you training from scratch or fine-tuning from another network?
How many iterations did it go through when it reached loss 500K?

Steve Bengi

unread,
Aug 20, 2015, 9:33:11 AM8/20/15
to Youssef Kashef, Caffe Users
Hi, Youssef, I'm fine-tuning on the PASCAL VOC 2011 and 2012 21-category segmentation task.
1. For VOC 2011, I use the data only from PASCAL VOC 2011 with the given split: 1112/1111 train/val.
    The error just keeps oscillating around 500K from the beginning and doesn't decrease much after 10K iterations.
2. For VOC 2012, I use the Semantic Boundaries Dataset and Benchmark plus the VOC 2012 training data for training, about 10K images. The error is high too.

I'm looking for the reason.





Youssef Kashef

unread,
Aug 20, 2015, 9:49:02 AM8/20/15
to Caffe Users, youssef...@gmail.com
Hi Ben,

If you're not getting any decrease in loss for 10K iterations, I suggest you take a few steps back and (see the inspection sketch after this list):
  1. Check the data and labels you're feeding into your network. Load your solver in Python and run a single step; inspect the dimensions of your data and label blobs and display the images. You can also display the labels as images using pyplot's imshow().
  2. Check the initial weights of your network. Are there zero weights where there shouldn't be any? My problem was that I had conv layers with all-zero weights that stayed zero; the network was basically not learning anything. You can display the weights in Python (e.g. print solver.net.params['fc6'][0].data). Please see my earlier post for details. The all-zero problem can also be spotted if your network always produces all-zero predictions (e.g. print solver.net.blobs['score'].data).
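
Here's a short inspection sketch along those lines (assuming the blob names data/label/score from the FCN trainval.prototxt; adjust to your net):

import numpy as np
import caffe
import matplotlib.pyplot as plt

solver = caffe.SGDSolver('solver.prototxt')  # placeholder path
solver.step(1)

# 1. data/label sanity: shapes, plus a quick look at the label image
print(solver.net.blobs['data'].data.shape, solver.net.blobs['label'].data.shape)
plt.imshow(solver.net.blobs['label'].data[0, 0])
plt.show()

# 2. weight sanity: flag layers whose weights are entirely zero
for name, params in solver.net.params.items():
    if not np.any(params[0].data):
        print('all-zero weights in layer', name)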

Fatemeh Saleh

unread,
Aug 20, 2015, 9:38:50 PM8/20/15
to Caffe Users, youssef...@gmail.com
Hi,
I used the SBD training samples mentioned in the paper, which is 8498 images, and the validation set of 736 images mentioned in the footnote of the paper. The loss is around 500K. It decreases after 10K iterations but is still high, with strong variations.

Steve Bengi

unread,
Aug 21, 2015, 2:44:57 AM8/21/15
to Fatemeh Saleh, Caffe Users, Youssef Kashef
It seems we're facing similar problems; I'll post an update if I make any progress.


Zizhao Zhang

unread,
Aug 21, 2015, 9:50:23 AM8/21/15
to Fatemeh Saleh, Caffe Users, Youssef Kashef
Hi Fatemeh,

You mentioned that you think you may have done the net surgery incorrectly. Could you specify how you did it, or post the prototxt of the model architecture of your fully convolutional VGG-16?
Thanks 




--
Best Regards,
Zizhao

zzz

unread,
Aug 24, 2015, 2:40:30 PM8/24/15
to Caffe Users, fateme...@gmail.com, youssef...@gmail.com
Hi,

Has anyone made progress training the FCN? How do you solve the high-loss issue?

Youssef Kashef

unread,
Aug 24, 2015, 2:58:11 PM8/24/15
to Caffe Users, fateme...@gmail.com, youssef...@gmail.com
How high a loss is too high? The paper doesn't seem to say much about loss values for the different datasets.
Is it correct to assume that the Euclidean loss generated is normalized and independent of the number of classes and image dimensions?

Zizhao Zhang

unread,
Aug 24, 2015, 3:48:01 PM8/24/15
to Youssef Kashef, Caffe Users, Fatemeh Saleh
I still have a loss around 10K, with really large variation.
One difference is that for the label lmdb conversion I use numpy.uint8. There are only 60 categories, so that range is enough.
Does it have an influence?
I am going to fine-tune from the trained FCN-32s model provided by Evan and see the loss.






--
Best Regards,
Zizhao

eran paz

unread,
Aug 24, 2015, 4:22:39 PM8/24/15
to Caffe Users
I'm still getting all-zero output. I'm using the bilinear weight_filler, but it still doesn't work (also using group == num_output of the previous layer).
BTW, I'm using my own dataset and training from scratch (not fine-tuning). Should I use a step learning-rate policy or a fixed one? Ideas are welcome...

Steve Bengi

unread,
Aug 24, 2015, 8:36:20 PM8/24/15
to Zizhao Zhang, Youssef Kashef, Caffe Users, Fatemeh Saleh
It's OK if the loss reaches 10K, because the softmax loss is unnormalized. Hope it helps.

Zizhao Zhang

unread,
Aug 24, 2015, 8:50:57 PM8/24/15
to Steve Bengi, Youssef Kashef, Caffe Users, Fatemeh Saleh
Hi Steve,

Thanks for this information.
But in my current situation, after about 90K iterations the loss still oscillates a lot. I thought the loss could be large but should go down over the iterations. Am I right?
--
Best Regards,
Zizhao

Fatemeh Saleh

unread,
Aug 24, 2015, 9:02:23 PM8/24/15
to Zizhao Zhang, Steve Bengi, Youssef Kashef, Caffe Users
Hi,
I have just plotted the train loss. Although it is high and oscillates a lot, the diagram shows that it decreases over the course of training.
Untitled.png
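
In case it helps, this is roughly how I pull the loss values out of the training log for plotting (a sketch; the log path is a placeholder, and the regex matches Caffe's "Iteration N, loss = L" log lines):

import re
import matplotlib.pyplot as plt

iters, losses = [], []
pattern = re.compile(r'Iteration (\d+), loss = ([\d.eE+-]+)')
with open('train.log') as f:  # placeholder log path
    for line in f:
        m = pattern.search(line)
        if m:
            iters.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(iters, losses)
plt.xlabel('iteration')
plt.ylabel('softmax loss')
plt.show()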

Steve Bengi

unread,
Aug 24, 2015, 9:20:38 PM8/24/15
to Fatemeh Saleh, Zizhao Zhang, Youssef Kashef, Caffe Users
Hi, Fatemeh, the loss seems to decrease a lot. What batch size are you using?

Etienne Perot

unread,
Sep 4, 2015, 11:39:20 AM9/4/15
to Caffe Users
Hi everyone!

Something that worked for the deconvolution layer: just set the group number equal to the class number.

layer {
 name: "fc8-conv"
 type: "Convolution"
 bottom: "fc7-conv"
 top: "fc8-conv"
 convolution_param {
   num_output: num_of_classes
   kernel_size: 1
   weight_filler {
     type: "gaussian"
     std: 1.0
   }
   bias_filler {
     type: "constant"
     value: 0.0
   }
 }
}

layer {
  name: "upscore"
  type: "Deconvolution"
  bottom: "fc8-conv"
  top: "upscore"
  param {
    lr_mult: 0
  }
  convolution_param {
    kernel_size: 64
    stride: 32
    pad: 16
    num_output: num_of_classes
    group: num_of_classes
    weight_filler {
      type: "constant"
      value: 1
    }
  }
}

Also, normalizing in SoftmaxWithLoss actually does not hurt at all; you can keep it. And I used a slightly smaller learning rate in my case than mentioned in the paper (1e-5 instead of 1e-3 for AlexNet)...

Also, I know it will probably sound obvious, but if, like me, you are using OpenCV to read images from the hard disk, do not set the transformer with raw_scale and channel_swap:

# transformer init for preprocessing pictures loaded with opencv cv2.imread(...)
shape = (1, 3, imh, imw)
transformer = caffe.io.Transformer({'data': shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_mean('data', np.array([100, 109, 113]))
# transformer.set_raw_scale('data', 255)  # this will do weird things if you let it
# transformer.set_channel_swap('data', (2, 1, 0))  # the reference model (caffenet) has channels in BGR order, and so does opencv; no need for another swap

Finally, I combined conv4 and pool5 using a deconvolution by a factor of 2, added the eltwise sum operation, and did a final deconvolution by a factor of 16 to get a bit finer results. It works on Daimler but not on MS COCO so far (no idea why...).


I got the results below for the Daimler dataset. I trained it for 10K iterations... it could probably be much better with VGG.








yu Magic

unread,
Sep 8, 2015, 11:29:43 PM9/8/15
to Caffe Users
Hello Youssef,
I am new to Caffe and cannot understand something: since the author has provided a caffemodel at http://dl.caffe.berkeleyvision.org/fcn-32s-pascalcontext.caffemodel, why should we train the model ourselves? Recently I have been loading the model with Python; how do I get the final result displayed?

On Tuesday, August 4, 2015 at 6:34:06 PM UTC+8, Youssef Kashef wrote:

vijay john

unread,
Sep 8, 2015, 11:59:40 PM9/8/15
to Caffe Users
Hi Etienne,

I am trying to get the FCN working on the KITTI dataset for road segmentation and haven't been able to do so. I followed your suggestions and changed the train_val.prototxt, but I haven't managed to get it running. I am glad you managed to get the FCN working on the Daimler dataset. Could you kindly share the working train_val and solver prototxt, so I can try them on the KITTI dataset?

Cheers,
Vijay

Youssef Kashef

unread,
Sep 9, 2015, 4:20:11 AM9/9/15
to Caffe Users
Hello Yu,

True, the model is shared for off-the-shelf use. In my case, I'm training the model to understand more about it, in case I want to train it on a different dataset or with a different method.
Regarding displaying the final result:
The eval.py script shows how to load the input image, load the model, perform inference, and apply argmax to the network predictions to produce the final 2D output.
You can treat the output matrix as an image and display it with:
import matplotlib.pyplot as plt; plt.imshow(out)
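
For completeness, a condensed sketch of that flow (the mean values and the data/score blob names follow the PASCAL-Context eval.py; the file paths are placeholders):

import numpy as np
import caffe
import matplotlib.pyplot as plt

net = caffe.Net('deploy.prototxt', 'fcn-32s-pascalcontext.caffemodel', caffe.TEST)

im = caffe.io.load_image('input.jpg')               # HxWx3, RGB, values in [0, 1]
in_ = im[:, :, ::-1] * 255.0                        # to BGR, [0, 255]
in_ -= np.array((104.00699, 116.66877, 122.67892))  # mean subtraction as in eval.py
in_ = in_.transpose((2, 0, 1))                      # to CxHxW

net.blobs['data'].reshape(1, *in_.shape)
net.blobs['data'].data[...] = in_
net.forward()
out = net.blobs['score'].data[0].argmax(axis=0)     # per-pixel class labels
plt.imshow(out)
plt.show()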

vijay john

unread,
Sep 9, 2015, 9:34:48 PM9/9/15
to Caffe Users
Hello everyone,

I managed to get the FCN working on the KITTI dataset. I changed the lr_mult from 0 to 1 in the upsample layer and got it working. I also followed Etienne's suggestions and set num_output and group to the number of classes.

layer {
  name: "upsample-new"
  type: "Deconvolution"
  bottom: "score-fc7-new"
  top: "bigscore"
  param {
    lr_mult: 1
  }
  convolution_param {
    num_output: num_class
    kernel_size: 63
    stride: 32
    pad: num_class
    bias_term: false
  }
}

Cheers,
Vijay

yu Magic

unread,
Sep 10, 2015, 11:18:31 AM9/10/15
to Caffe Users
Hello Youssef,

I am very pleased to receive your reply; I followed your guide and got the final result.
    

On Wednesday, September 9, 2015 at 4:20:11 PM UTC+8, Youssef Kashef wrote:

yu Magic

unread,
Sep 10, 2015, 11:35:12 AM9/10/15
to Caffe Users
Hello Youssef,
I still have some problems. I looked at the train_val.prototxt file. When preparing the training data set, how do I convert the ground truth into the label file? Also, I do not understand the purpose of solve.py; is it for retraining a new caffemodel myself?
I look forward to your reply! Thank you for helping me!

On Wednesday, September 9, 2015 at 4:20:11 PM UTC+8, Youssef Kashef wrote:
Hello Yu,

Youssef Kashef

unread,
Sep 10, 2015, 1:27:58 PM9/10/15