Trouble training FCN-8s in DIGITS

haozhe...@gmail.com

Nov 22, 2016, 3:47:23 AM
to DIGITS Users
Hi, 

I am new to DIGITS and have trained FCN-Alexnet successfully. However, when I tried FCN-8s in DIGITS, the loss and accuracy did not change at all. What could be the reason? I was fine-tuning from VGG-16.

Any suggestions would be greatly appreciated!


Justin

Greg Heinrich

Nov 22, 2016, 4:48:59 AM
to DIGITS Users
FCN-8s works well in DIGITS. Did you perform the necessary net surgery on your pre-trained VGG-16? Have a look at this post for some information on the steps required to "convolutionalize" a model: https://devblogs.nvidia.com/parallelforall/image-segmentation-using-digits-5/
This is not entirely straightforward, so there are many ways to get it wrong.
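For readers following the linked post, the heart of the "convolutionalization" is replacing VGG-16's fully-connected layers (fc6, fc7) with equivalent Convolution layers and then transplanting the flattened InnerProduct weights into them offline, as the post describes. A minimal sketch of what the converted layers can look like in the prototxt (a sketch only; exact pads and crops follow the FCN reference nets):

# fc6 re-expressed as a 7x7 convolution over pool5.
# Note: the original fc6/fc7 weights still have to be reshaped and copied into
# these layers with a net-surgery script; the prototxt change alone is not enough.
layer {
  name: "fc6"
  type: "Convolution"
  bottom: "pool5"
  top: "fc6"
  convolution_param {
    num_output: 4096
    kernel_size: 7
  }
}
# fc7 becomes an equivalent 1x1 convolution.
layer {
  name: "fc7"
  type: "Convolution"
  bottom: "fc6"
  top: "fc7"
  convolution_param {
    num_output: 4096
    kernel_size: 1
  }
}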

haozhe...@gmail.com

Nov 22, 2016, 5:53:08 AM
to DIGITS Users
Thanks for that, I will give it a try.

On Tuesday, November 22, 2016 at 5:48:59 PM UTC+8, Greg Heinrich wrote:

haozhe...@gmail.com

Nov 22, 2016, 8:40:37 AM
to DIGITS Users
Why can't I find any models in the model store? The page looks like this:
[screenshot of the model store page attached]
Is there any additional setting needed?


On Tuesday, November 22, 2016 at 5:48:59 PM UTC+8, Greg Heinrich wrote:
FCN-8s works well in DIGITS. Did you perform the necessary net surgery on your pre-trained VGG-16? Have a look at this post for some information on the steps required to "convolutionalize" a model: https://devblogs.nvidia.com/parallelforall/image-segmentation-using-digits-5/

Greg Heinrich

Nov 22, 2016, 9:15:47 AM
to haozhe...@gmail.com, DIGITS Users
Hello, the DIGITS model store has not been made public yet. In the meantime you can refer to the FCN project page: https://github.com/shelhamer/fcn.berkeleyvision.org

haozhe...@gmail.com

Nov 23, 2016, 2:46:43 AM
to DIGITS Users, haozhe...@gmail.com
I have tried to fine-tune the pretrained FCN-8s downloaded from the FCN project page. I got a parameter-shape mismatch error on score_fr against the pretrained model because I had set a different num_output than the pretrained FCN-8s. However, when I renamed the layer, the loss and accuracy did not change at all. After checking the activation maps of the renamed layers, I found they were completely empty; it seems the net cannot initialize these renamed layers properly. I searched the problem online and found someone saying that when fine-tuning FCN with a plain "caffe train --solver" command, the parameters of the deconvolution layers are not initialized at all, so they recommended driving training from a .py script instead. However, I found that DIGITS can only call Caffe via "caffe train --solver". Does that mean I must keep num_output the same as in the pretrained model?
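For context on the initialization point raised above: the FCN reference implementation fills its Deconvolution (upscore) layers with bilinear upsampling weights from its Python solve script, which is exactly the step a bare "caffe train --solver" run skips. If your Caffe build includes the bilinear weight filler (it has been in BVLC Caffe for a while and should also be in NVIDIA's fork), one way to get a similar effect from the prototxt alone is a sketch like this (layer names follow the FCN-8s convention but are only illustrative):

# 2x upsampling initialized to a fixed bilinear kernel directly from the prototxt,
# so no Python-side initialization is needed for this layer.
# lr_mult: 0 freezes the kernel; group == num_output applies it per channel,
# mirroring what the FCN repo's surgery.interp() sets up.
layer {
  name: "upscore2"
  type: "Deconvolution"
  bottom: "score_fr"
  top: "upscore2"
  param { lr_mult: 0 }
  convolution_param {
    num_output: 21
    group: 21
    bias_term: false
    kernel_size: 4
    stride: 2
    weight_filler { type: "bilinear" }
  }
}

The resulting spatial offset is handled by the Crop layers already present in the FCN-8s network.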
On Tuesday, November 22, 2016 at 10:15:47 PM UTC+8, Greg Heinrich wrote:
Hello, the DIGITS model store has not been made public yet. In the meantime you can refer to the FCN project page: https://github.com/shelhamer/fcn.berkeleyvision.org

Greg Heinrich

Nov 23, 2016, 3:17:15 AM
to DIGITS Users, haozhe...@gmail.com
Hello, the default weight initializer for convolution layers in Caffe sets all weights to zero, which makes it virtually impossible to learn anything if you're starting from scratch (as is the case if you rename a layer). If you have 21 classes or fewer in your dataset, the easiest thing to do is to keep the score_fr layer from the original model and just add an extra layer to reduce the number of classes. For example, if you have 12 classes:

layer {
  name: "score_12classes"
  type: "Convolution"
  bottom: "score"
  top: "score_12classes"
  convolution_param {
    num_output: 12
    pad: 0
    kernel_size: 1
  }
}

I found this method to work well for many datasets. See the attachment for the full prototxt, which you can use directly in DIGITS. You can use the pre-trained model from http://dl.caffe.berkeleyvision.org/fcn8s-heavy-pascal.caffemodel
fcn-8s-digits.protoxt
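To make the extra layer effective, the loss and accuracy layers have to take the new top (score_12classes here) as their bottom instead of the original 21-channel score blob; the attached prototxt presumably wires this up already. A minimal sketch of the loss side:

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  # consume the reduced-class scores, not the original "score" blob
  bottom: "score_12classes"
  bottom: "label"
  top: "loss"
  exclude {
    stage: "deploy"
  }
}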

haozhe...@gmail.com

Nov 23, 2016, 3:43:23 AM
to DIGITS Users, haozhe...@gmail.com
Thanks very much, that really helps a lot. I will try it right away.

On Wednesday, November 23, 2016 at 4:17:15 PM UTC+8, Greg Heinrich wrote:

SakreB

Nov 24, 2016, 10:09:34 PM
to DIGITS Users, haozhe...@gmail.com
Hi,

I tried the fcn-8s-digits.protoxt on the VOC dataset with 21 classes.
I changed this layer

layer {
  name: "score_12classes"
  type: "Convolution"
  bottom: "score"
  top: "score_12classes"
  convolution_param {
    num_output: 12
    pad: 0
    kernel_size: 1
  }
}

to 

layer {
  name: "score_21classes"
  type: "Convolution"
  bottom: "score"
  top: "score_21classes"
  convolution_param {
    num_output: 21
    pad: 0
    kernel_size: 1
  }
}


Then this error came out:

ERROR: error code -11

Top shape: 4 512 36 44 (3244032)
Memory required for data: 3634389504
Creating layer conv5_1
Creating Layer conv5_1
conv5_1 <- pool4_pool4_0_split_0
conv5_1 -> conv5_1
Setting up conv5_1
Top shape: 4 512 36 44 (3244032)
Memory required for data: 3647365632
Creating layer relu5_1
Creating Layer relu5_1
relu5_1 <- conv5_1
relu5_1 -> conv5_1 (in-place)
Setting up relu5_1
Top shape: 4 512 36 44 (3244032)
Memory required for data: 3660341760
Creating layer conv5_2
Creating Layer conv5_2
conv5_2 <- conv5_1
conv5_2 -> conv5_2

Then I set the batch size to 1.

This error came out:

ERROR: Check failed: status == CUBLAS_STATUS_SUCCESS (11 vs. 0) CUBLAS_STATUS_MAPPING_ERROR

Iteration 0, Testing net (#0)
Check failed: status == CUBLAS_STATUS_SUCCESS (11 vs. 0)  CUBLAS_STATUS_MAPPING_ERROR
        


When I train using the DIGITS tutorial's FCN-Alexnet, everything works well.

Is there anything I need to tweak to make FCN-8s work?

Thank you

Greg Heinrich

Nov 25, 2016, 3:47:50 AM
to SakreB, DIGITS Users, haozhe...@gmail.com
SakreB, which GPU, OS, and CUDA version do you have?

SakreB

Nov 25, 2016, 3:57:32 AM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
Thanks for the prompt reply,

It is 

Ubuntu 14.04 LTS
DIGITS 5.1-dev
Caffe 0.15.15 (NVIDIA flavor)
CUDA 8.0 with NVIDIA driver 370

Greg Heinrich

Nov 25, 2016, 4:00:12 AM
to SakreB, DIGITS Users, haozhe...@gmail.com
And which GPU?

SakreB

Nov 25, 2016, 4:16:55 AM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
:( My bad for missing that info.

It is a GTX 1080.
> And which GPU?

Greg Heinrich

Nov 25, 2016, 4:30:55 AM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
You might want to review that thread:
https://groups.google.com/d/msg/digits-users/Ro2oKJ7SYMQ/Z3oWx2wmBQAJ

People have been getting strange errors on Pascal GPUs, and in the end it came down to installation issues. Did you move from CUDA 7.5 to CUDA 8.0? When you do so, you need to recompile a number of dependencies. That might be related to your issue.

SakreB

Nov 27, 2016, 7:06:42 PM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
Thanks for the link.

The installation was from a fresh Ubuntu install and CUDA 8.0.

I added ignore label 255 to the loss and accuracy layers, and it worked successfully.
For the VOC data I got 72% with SGD and 93% with AdaDelta. :) Really happy to see the FCN-8s setup finally working.

Do you have any idea why, without the ignore label 255, I got the CUBLAS mapping error?

Greg Heinrich

Nov 28, 2016, 2:51:55 AM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
Thanks for the update! So perhaps the delineation of object contours was introducing excessively strong gradients? Also, perhaps since there are only 21 output feature maps, class #255 was confusing the network unless ignored?

a.m.l...@gmail.com

Dec 9, 2016, 2:17:07 PM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
Thanks for the update and your solution (ignore label 255, batch size = 1), but when I try to train FCN-8s with PASCAL VOC it gives me the following error:


ERROR: Out of memory: failed to allocate 216115200 bytes on device 0


But I have the same hardware and configuration as you (GTX 1080, CUDA 8.0). Can you help me?

Greg Heinrich

Dec 9, 2016, 2:23:15 PM
to a.m.l...@gmail.com, SakreB, haozhe...@gmail.com, DIGITS Users
You may need to reduce the batch size.


a.m.l...@gmail.com

Dec 9, 2016, 2:29:42 PM
to DIGITS Users, a.m.l...@gmail.com, nikahm...@gmail.com, haozhe...@gmail.com
Thanks! But I already have batch size = 1. Can I use a batch size < 1?

Greg Heinrich

Dec 9, 2016, 2:33:12 PM
to a.m.l...@gmail.com, DIGITS Users, SakreB, haozhe...@gmail.com
You can't set batch size < 1... can you attach the caffe_output.txt log?

a.m.l...@gmail.com

Dec 9, 2016, 2:39:23 PM
to DIGITS Users, a.m.l...@gmail.com, nikahm...@gmail.com, haozhe...@gmail.com
Here it is!
caffe_output (6).log

A Bhardwaj

Jan 14, 2017, 9:36:53 AM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
That was very helpful! Thanks.


On Monday, November 28, 2016 at 1:06:42 AM UTC+1, SakreB wrote:

Jackson Reese

Mar 6, 2017, 10:15:43 PM
to DIGITS Users, haozhe...@gmail.com
This is great advice, Greg... thanks! If, however, I have MORE than 21 classes, can I increase the value of 21 in the multiple locations in the prototxt file? Or is that value tied to the network architecture, so that increasing it to, say, 42 would not work?

Jackson Reese

Mar 7, 2017, 12:13:52 AM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
Can you explain what this means: "I added ignore label 255 to the loss and accuracy layers, and it worked successfully"?

Is this a parameter to be added to the loss and accuracy layers in the prototxt network file?

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "score"
  bottom: "label"
  top: "loss"
  exclude {
    stage: "deploy"
  }
  loss_param {
    normalize: true
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "score"
  bottom: "label"
  top: "accuracy"
  include {
    stage: "val"
  }
}

I'm encountering the same problem...

ERROR: Check failed: status == CUBLAS_STATUS_SUCCESS (11 vs. 0) CUBLAS_STATUS_MAPPING_ERROR


Jackson Reese

Mar 7, 2017, 12:27:33 AM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com


I added the following loss param to the loss layer (as below), but adding a similar parameter to the accuracy layer did not seem to be supported...

    ignore_label: 255


layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "score"
  bottom: "label"
  top: "loss"
  exclude {
    stage: "deploy"
  }
  loss_param {
    normalize: true
    ignore_label: 255
  }
}

Jackson Reese

Mar 7, 2017, 12:31:23 AM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com

I tried to follow this advice: "I added ignore label 255 to the loss and accuracy layers, and it worked successfully."

I added ignore_label: 255 to the loss layer.
However, I could not add the ignore_label: 255 parameter to the accuracy layer - the error message indicated that this is not supported.

Will it work to add this to the loss layer only, and not to the accuracy layer?

Has anyone been able to get the FCN-8s working without adding this parameter?

Thanks!


On Sunday, November 27, 2016 at 5:06:42 PM UTC-7, SakreB wrote:

Greg Heinrich

Mar 7, 2017, 3:04:53 AM
to Jackson Reese, DIGITS Users, SakreB, haozhe...@gmail.com
> I have MORE than 21 classes, can I increase the value of 21 in the multiple locations in the prototxt file?

I expect this should work well too.


> However, I could not add the ignore_label: 255 parameter to the accuracy layer
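For reference, mainline Caffe's Accuracy layer does accept an ignore_label, but it lives under accuracy_param rather than loss_param, which is likely why it was rejected when added the same way as in the loss layer. A minimal sketch, assuming your Caffe build (reasonably recent BVLC Caffe, and it should also be the case for NVIDIA's 0.15 fork) exposes accuracy_param.ignore_label:

layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "score"
  bottom: "label"
  top: "accuracy"
  include {
    stage: "val"
  }
  # for the Accuracy layer, ignore_label goes under accuracy_param
  accuracy_param {
    ignore_label: 255
  }
}

As for the crash itself: label value 255 falls outside the 21 score channels, and out-of-range labels in the GPU loss/accuracy kernels are a plausible cause of the CUBLAS_STATUS_MAPPING_ERROR seen earlier when 255 is not ignored.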


Jason Mecham

Mar 22, 2017, 3:24:02 PM
to DIGITS Users, haozhe...@gmail.com
Is this expected to work just fine on a Maxwell-based Titan X with CUDA 7.5?

I tried it after making all the necessary changes (21 classes, batch size = 1, ignore_label: 255), and it seems to work. But it crashes my entire machine after it gets started (around epoch 0.2 or maybe a bit later).

When I say it crashes my machine, I mean the entire machine turns off and then starts back up a couple of seconds later.

I have had zero issues with this machine in all my other testing in DIGITS. No issues with DetectNet, ImageNet, or FCN-AlexNet.

I'm probably going to try putting a fresh install on another hard drive and installing the latest driver/CUDA to see if that fixes it.

Jason Mecham

Mar 23, 2017, 8:25:21 PM
to DIGITS Users, haozhe...@gmail.com
I did a fresh install of Ubuntu 16.04, CUDA 8.0, and installed DIGITS 5.0 using the deb package, and I'm still getting the spontaneous reboots when I try this with the Maxwell-based Titan X.

I also tried the same card in a different workstation and got the same reboot, so I don't believe it's the computer.

The only way it runs is if I don't use the pretrained model. Of course it doesn't successfully train, but at least it doesn't reboot.

Hesam Moshiri

Apr 4, 2017, 3:30:49 PM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
I have the same issue as you. Did you solve it?

Hesam Moshiri

Apr 9, 2017, 4:57:33 PM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
Hello SakreB,

I am not successful with AdaDelta, but SGD is OK. It's great that you got AdaDelta working, but I waited until epoch 11 of AdaDelta training and the Dice value did not rise above 0.1.

Did you use any specific configuration for that, such as the learning rate, subtract mean, or something else?

SakreB

Apr 10, 2017, 9:47:32 PM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com

Yes, I use subtract image mean.

Hesam Moshiri

Apr 13, 2017, 3:55:02 PM
to DIGITS Users
And what about the learning rate and its policy (fixed, step down, ...)?

Zisian Wong

Apr 19, 2017, 5:26:13 AM
to DIGITS Users, haozhe...@gmail.com
Hi Jason,

Could it be a problem with the power supply? What PSU are you using?

damm...@gmail.com

Apr 28, 2017, 3:50:23 PM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
I have the same problem. How do I fix the out-of-memory error? My Caffe log is attached.

Dammad
caffe_output.log

damm...@gmail.com

Apr 28, 2017, 7:09:33 PM
to DIGITS Users, nikahm...@gmail.com, haozhe...@gmail.com
I'm using a GTX 1060 6GB card with Ubuntu 16. Everything else works fine.

Sai Kiran

Jun 22, 2017, 12:58:21 PM
to DIGITS Users
Hi Greg,

1. I want to use the pretrained FCN-8s model. But how do I set the mean values for B, G, R = (104.00699, 116.66877, 122.67892) in the DIGITS version of fcn-8s_digits.prototxt? Please help me.

2. My num_output is 3 classes. Please correct me if I did anything wrong. I wrote a Python script to shrink the FCN-8s model from 21 classes to N (N in [3, 20)). For N = 3, I kept the parameters of only the first three classes in all score and upscore layers in my new model. I also reduced the number of nodes in the fc6 and fc7 layers from 4096 to 512 in the same manner. When I trained outside DIGITS, I got good performance, almost the same as the original model. Please give your comments on this. My data is medical data.

Thanks
SK06
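For reference, in stand-alone Caffe those per-channel means are usually supplied through transform_param on the data layer, whereas in DIGITS mean subtraction is normally configured on the dataset/model form rather than edited into the prototxt. A minimal sketch of the Caffe-side syntax (the layer name and LMDB path are illustrative only):

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    # per-channel means in Caffe's B, G, R channel order
    mean_value: 104.00699
    mean_value: 116.66877
    mean_value: 122.67892
  }
  data_param {
    source: "train_lmdb"  # illustrative path
    batch_size: 1
    backend: LMDB
  }
}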