fine-tuning caffe FCN 8 CUBLAS failure after iteration 2980


shay....@omniearthinc.com

Jun 13, 2016, 1:07:37 PM
to Caffe Users
Hi. I am definitely new to the world of Caffe and semantic segmentation with FCN. I have been lurking for quite a while, but I have finally reached a point where I need to ask a specific question. I have been fine-tuning an FCN with 6 classes. I have been performing net surgery and things go well for the first ~2980 iterations, but then I hit this error:

F0613 16:53:45.999119 68793 math_functions.cu:121] Check failed: status == CUBLAS_STATUS_SUCCESS (14 vs. 0)  CUBLAS_STATUS_INTERNAL_ERROR


I have gotten this error in the past when I mis-specified the number of classes and/or ran into memory restrictions. What I do not understand is why the iterations run for a while and then crash. Any insight? Thanks in advance.

Ruud

Jun 15, 2016, 10:56:43 AM
to Caffe Users
Same error here, but oh boy, this was 3 years ago. Did you ever find out why it went wrong?

shay....@omniearthinc.com

Jun 15, 2016, 11:11:11 AM
to Caffe Users
No insight yet... I'll just keep iterating and see what happens.

Ruud

Jun 15, 2016, 1:39:20 PM
to Caffe Users
I found the problem. There was a label present in my data that shouldn't have been there, so the set of labels did not match the defined classes for some samples.

There are two options to solve it: clean the data, or ignore the offending label in the loss layer:

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8_pascal"
  bottom: "label_shrink"
  loss_param {
    # pixels with this label value are excluded from the loss and gradient
    ignore_label: 255
  }
}
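For the first option (cleaning the data), a minimal Python sketch along these lines can flag stray label values before training. The directory, file pattern, and valid label set (0-5) are assumptions for illustration, not taken from this thread:

import glob
import numpy as np
from PIL import Image

VALID_LABELS = set(range(6))  # assumed: 6 classes with values 0..5

for path in sorted(glob.glob("labels/*.png")):
    # collect every distinct pixel value in the label image
    values = np.unique(np.array(Image.open(path)))
    unexpected = set(int(v) for v in values) - VALID_LABELS
    if unexpected:
        print("{}: unexpected label values {}".format(path, sorted(unexpected)))

Any file it reports either needs to be fixed or its stray value added to ignore_label as above.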


shay....@omniearthinc.com

Jun 18, 2016, 8:39:31 AM
to Caffe Users
So I struggled with this for a while and finally found part of my problem. I had been converting my labels from TIFF to JPEG, and in the conversion and resizing the pixel values were being changed, which effectively added random classes. So even though I started with a small set of classes (~6), the conversion process created something like ~10 classes, and there was a mismatch between the layer parameters defined in my train_val.prototxt and the label data. I fixed this by converting to PNG instead and forcing the resize to pick only existing pixel values (i.e., nearest-neighbor resampling) rather than interpolating to new ones.
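A rough Python sketch of that conversion fix; the directory names and target size are assumptions for illustration, not the poster's actual script:

import glob
import os
from PIL import Image

TARGET_SIZE = (500, 500)                       # hypothetical output size
SRC_DIR, DST_DIR = "labels_tif", "labels_png"  # hypothetical directories

os.makedirs(DST_DIR, exist_ok=True)
for tif_path in sorted(glob.glob(os.path.join(SRC_DIR, "*.tif"))):
    mask = Image.open(tif_path)
    # NEAREST picks an existing pixel value instead of interpolating,
    # so no new label values are introduced by the resize
    mask = mask.resize(TARGET_SIZE, resample=Image.NEAREST)
    out_name = os.path.splitext(os.path.basename(tif_path))[0] + ".png"
    mask.save(os.path.join(DST_DIR, out_name))

PNG is lossless, so unlike JPEG it will not perturb the label values during saving.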