I got "Segmentation fault (core dumped)" while fine-tuning, how can I know what is the issue


Yasser Souri

Nov 24, 2014, 9:34:03 AM11/24/14
to caffe...@googlegroups.com
Hello,

I first tried fine-tuning with the flickr_style tutorial and everything worked well.

Now that I want to fine-tune on my own dataset, I get this:

I1124 17:58:37.106029 14422 solver.cpp:160] Solving CUBCaffeNet
I1124 17:58:37.106073 14422 solver.cpp:247] Iteration 0, Testing net (#0)
I1124 17:58:51.061630 14422 solver.cpp:298]     Test net output #0: accuracy = 0.0056
I1124 17:58:51.137616 14422 solver.cpp:191] Iteration 0, loss = 3.81322
I1124 17:58:51.137655 14422 solver.cpp:403] Iteration 0, lr = 0.001
Segmentation fault (core dumped)


My question is: how can I find out what caused the problem? Is there a log file or something?

Best regards,

Friendly User

Nov 24, 2014, 9:40:46 AM11/24/14
to caffe...@googlegroups.com
Ha! Same/similar problem here, independent of CPU or GPU.

I enabled debug mode in Makefile.config, but it doesn't help much:

caffe test -model models/finetune_flickr_style/test.prototxt -weights data/flickr_style/finetune_flickr_style.caffemodel


I1124 15:38:45.287902  4025 caffe.cpp:169] Batch 49, prob = 0.051752
I1124 15:38:45.287912  4025 caffe.cpp:174] Loss: 0
I1124 15:38:45.287925  4025 caffe.cpp:186] prob = 0.0314738
I1124 15:38:45.287942  4025 caffe.cpp:186] prob = 0.0567422 (* 4.1037e-41 = 2.32896e-42 loss)
PC: @           0x52eda1 test()
*** SIGSEGV (@0x16237a5d0) received by PID 4025 (TID 0x7f9060a15a40) from PID 1647814096; stack trace: ***
    @     0x7f905684ec30 (unknown)
    @           0x52eda1 test()
    @           0x53044c main
    @     0x7f9056839ec5 (unknown)
Segmentation fault (core dumped)
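
Next I will probably run it under gdb to get a proper source line. A rough sketch of what I have in mind, assuming Caffe was built with DEBUG := 1 and the caffe binary is the one on my PATH:

gdb --args caffe test -model models/finetune_flickr_style/test.prototxt -weights data/flickr_style/finetune_flickr_style.caffemodel
(gdb) run
(gdb) bt        # prints the stack trace once it hits SIGSEGV

Alternatively, running `ulimit -c unlimited` first and then loading the resulting core file (something like `gdb caffe core`) should give the same backtrace from the "core dumped" file.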

Friendly User

Nov 24, 2014, 9:43:14 AM11/24/14
to caffe...@googlegroups.com

Just wondering: in your case it says "Solving CUBCaffeNet", whereas I'm testing "FlickrStyleCaffeNet".

Friendly User

Nov 24, 2014, 9:57:13 AM11/24/14
to caffe...@googlegroups.com

I get this problem for other tests as well:

It always happens after batch 49, and always right after a loss of zero.

/caffe/examples/cifar10$ caffe test -model ./cifar10_full.prototxt -weights cifar10_full_iter_70000.caffemodel -gpu 0

I1124 15:53:43.828536  4194 caffe.cpp:169] Batch 49, prob = 0.0551675
I1124 15:53:43.828547  4194 caffe.cpp:174] Loss: 0
I1124 15:53:43.828567  4194 caffe.cpp:186] prob = 0.0796905
I1124 15:53:43.828580  4194 caffe.cpp:186] prob = 0.0331636 (* 5.60519e-45 = 0 loss)
I1124 15:53:43.828598  4194 caffe.cpp:186] prob = 0.115024 (* -5.31795e+37 = -6.11692e+36 loss)
I1124 15:53:43.828611  4194 caffe.cpp:186] prob = 0.158395 (* 5.60519e-45 = 1.4013e-45 loss)
PC: @           0x52eda1 test()
*** SIGSEGV (@0x9d19510) received by PID 4194 (TID 0x7f9f08c5ba40) from PID 164730128; stack trace: ***
    @     0x7f9efea8ec30 (unknown)
    @           0x52eda1 test()
    @           0x53044c main
    @     0x7f9efea79ec5 (unknown)
    @           0x52da09 (unknown)
Segmentation fault (core dumped)

Andriy Lysak

Nov 24, 2014, 12:24:22 PM11/24/14
to caffe...@googlegroups.com
Not sure if this will help, but I had similar issues with Caffe on two different occasions: the first was resolved by making the batch sizes a lot smaller and adjusting the settings accordingly, and the other was me not properly modeling the network. So my guess is this might be a RAM issue. Depending on how much memory you have (system RAM for CPU mode, VRAM for GPU mode), you should adjust and see what works.

I found that running the Caffe examples ate up 95% of the 4 GB of VRAM on my GPU.
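
If it is memory, batch_size is the first thing to lower; it lives in the data layers of your train_val.prototxt. A rough sketch of the relevant part (the layer name, database path, and exact field names are just placeholders and depend on your Caffe version and data layer type):

layers {
  name: "data"
  type: DATA
  top: "data"
  top: "label"
  data_param {
    source: "path/to/your_train_db"   # hypothetical path
    batch_size: 32                    # lower this if you run out of memory
  }
  # cropping / mean subtraction settings omitted
}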

Just a thought.

Best of luck!!!


Yasser Souri

Nov 25, 2014, 12:43:54 AM11/25/14
to caffe...@googlegroups.com
That is because I've changed the name of the network and also the name of the fc8 layer. Everything else is the same.

Phoenix Bai

Nov 25, 2014, 1:08:35 AM11/25/14
to caffe...@googlegroups.com
I encountered the same issue as yours, and my problem was that I used values like 1001 and 1005 as labels; when I changed them to 0 and 1, the segmentation fault disappeared.
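
In other words, the path/label list file (whether it goes to an IMAGE_DATA layer or to convert_imageset) should use 0-based, contiguous labels, something like this (the paths are made up):

/path/to/images/class_a_0001.jpg 0
/path/to/images/class_a_0002.jpg 0
/path/to/images/class_b_0001.jpg 1
/path/to/images/class_b_0002.jpg 1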

Not sure if it works for you as well, but certainly deserves a try.

thanks 

Yasser Souri

Nov 25, 2014, 1:09:38 AM11/25/14
to caffe...@googlegroups.com
Thanks,

I've decreased the batch size from 50 to 5; it gets a few iterations further, but the segmentation fault still occurs.

I1125 09:38:07.639384 18102 solver.cpp:160] Solving CUBCaffeNet
I1125 09:38:07.639418 18102 solver.cpp:247] Iteration 0, Testing net (#0)
I1125 09:38:09.132885 18102 solver.cpp:298]     Test net output #0: accuracy = 0.01
I1125 09:38:09.151484 18102 solver.cpp:191] Iteration 0, loss = 3.77259
I1125 09:38:09.151518 18102 solver.cpp:403] Iteration 0, lr = 0.001
I1125 09:38:09.863337 18102 solver.cpp:191] Iteration 20, loss = 0
I1125 09:38:09.863363 18102 solver.cpp:403] Iteration 20, lr = 0.001
I1125 09:38:10.564663 18102 solver.cpp:191] Iteration 40, loss = nan
I1125 09:38:10.564692 18102 solver.cpp:403] Iteration 40, lr = 0.001
I1125 09:38:11.266839 18102 solver.cpp:191] Iteration 60, loss = nan
I1125 09:38:11.266866 18102 solver.cpp:403] Iteration 60, lr = 0.001
I1125 09:38:11.968343 18102 solver.cpp:191] Iteration 80, loss = nan
I1125 09:38:11.968369 18102 solver.cpp:403] Iteration 80, lr = 0.001
I1125 09:38:12.669984 18102 solver.cpp:191] Iteration 100, loss = nan
I1125 09:38:12.670012 18102 solver.cpp:403] Iteration 100, lr = 0.001
I1125 09:38:13.371335 18102 solver.cpp:191] Iteration 120, loss = nan
I1125 09:38:13.371362 18102 solver.cpp:403] Iteration 120, lr = 0.001
Segmentation fault (core dumped)


As you can see, it says `loss = nan`. What could be the cause of that?

Yasser Souri

Nov 25, 2014, 1:14:49 AM11/25/14
to caffe...@googlegroups.com
Oh, I have that issue. Let me try it and get back to you!

Yasser Souri

Nov 25, 2014, 1:18:46 AM11/25/14
to caffe...@googlegroups.com
I did fix it, and now my labels run from 0 to 199, but that did not fix my problem!



Yasser Souri

Nov 25, 2014, 1:30:26 AM11/25/14
to caffe...@googlegroups.com
I tried decreasing the base_lr from 0.001 to 0.0001 and also decreased the batch_size to 2. Here is the output:

I1125 09:57:15.725137 20973 solver.cpp:160] Solving CUBCaffeNet
I1125 09:57:15.725178 20973 solver.cpp:247] Iteration 0, Testing net (#0)
I1125 09:57:16.387609 20973 solver.cpp:298]     Test net output #0: accuracy = 0.03
I1125 09:57:16.402441 20973 solver.cpp:191] Iteration 0, loss = 4.45611
I1125 09:57:16.402470 20973 solver.cpp:403] Iteration 0, lr = 0.001
I1125 09:57:16.952252 20973 solver.cpp:191] Iteration 20, loss = 0
I1125 09:57:16.952280 20973 solver.cpp:403] Iteration 20, lr = 0.001
I1125 09:57:17.499045 20973 solver.cpp:191] Iteration 40, loss = 0
I1125 09:57:17.499074 20973 solver.cpp:403] Iteration 40, lr = 0.001
I1125 09:57:18.040643 20973 solver.cpp:191] Iteration 60, loss = 87.3365
I1125 09:57:18.040671 20973 solver.cpp:403] Iteration 60, lr = 0.001
I1125 09:57:18.582733 20973 solver.cpp:191] Iteration 80, loss = 87.3365
I1125 09:57:18.582762 20973 solver.cpp:403] Iteration 80, lr = 0.001
I1125 09:57:19.124579 20973 solver.cpp:191] Iteration 100, loss = 87.3365
I1125 09:57:19.124608 20973 solver.cpp:403] Iteration 100, lr = 0.001
I1125 09:57:19.666509 20973 solver.cpp:191] Iteration 120, loss = 87.3365
I1125 09:57:19.666538 20973 solver.cpp:403] Iteration 120, lr = 0.001
I1125 09:57:20.208319 20973 solver.cpp:191] Iteration 140, loss = 87.3365
I1125 09:57:20.208346 20973 solver.cpp:403] Iteration 140, lr = 0.001
I1125 09:57:20.749861 20973 solver.cpp:191] Iteration 160, loss = 87.3365
I1125 09:57:20.749889 20973 solver.cpp:403] Iteration 160, lr = 0.001
I1125 09:57:21.291721 20973 solver.cpp:191] Iteration 180, loss = 87.3365
I1125 09:57:21.291751 20973 solver.cpp:403] Iteration 180, lr = 0.001
I1125 09:57:21.833292 20973 solver.cpp:191] Iteration 200, loss = 87.3365
I1125 09:57:21.833319 20973 solver.cpp:403] Iteration 200, lr = 0.001
I1125 09:57:22.374850 20973 solver.cpp:191] Iteration 220, loss = 70.6882
I1125 09:57:22.374876 20973 solver.cpp:403] Iteration 220, lr = 0.001
I1125 09:57:22.916332 20973 solver.cpp:191] Iteration 240, loss = 64.3139
I1125 09:57:22.916360 20973 solver.cpp:403] Iteration 240, lr = 0.001
I1125 09:57:23.457355 20973 solver.cpp:191] Iteration 260, loss = 87.3365
I1125 09:57:23.457381 20973 solver.cpp:403] Iteration 260, lr = 0.001
I1125 09:57:23.998904 20973 solver.cpp:191] Iteration 280, loss = 49.689
I1125 09:57:23.998937 20973 solver.cpp:403] Iteration 280, lr = 0.001
I1125 09:57:24.540182 20973 solver.cpp:191] Iteration 300, loss = 45.2114
I1125 09:57:24.540211 20973 solver.cpp:403] Iteration 300, lr = 0.001
I1125 09:57:25.081493 20973 solver.cpp:191] Iteration 320, loss = 41.3736
I1125 09:57:25.081521 20973 solver.cpp:403] Iteration 320, lr = 0.001
I1125 09:57:25.622854 20973 solver.cpp:191] Iteration 340, loss = 45.4241
I1125 09:57:25.622884 20973 solver.cpp:403] Iteration 340, lr = 0.001
Segmentation fault (core dumped)


I've also checked the memory usage of my GPU with `nvidia-smi`, and it tops out at 1.5 GB, while my GPU is a Titan Black with 6 GB of RAM.

Yasser Souri

Nov 25, 2014, 2:09:55 AM11/25/14
to caffe...@googlegroups.com
Sorry guys, noob mistake.

I had to change the num_output of the fc8 layer to my number of classes, which is 200.

The problem was that I had just copied the train_val.prototxt from the Flickr fine-tuning example, and in that example the number of classes is 20!

So to conclude for anyone else that might have this problem:
 - change the num_output of your last layer to your number of classes (see the sketch below)
 - make sure that in your train/test .txt files the class labels run from 0 to C-1, where C is the number of classes
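
For reference, this is roughly what my renamed last layer looks like now (the name fc8_cub and the learning-rate multipliers are just my choices; adjust the exact prototxt syntax to whatever your Caffe version expects):

layers {
  name: "fc8_cub"          # renamed so the pretrained fc8 weights are not copied into it
  type: INNER_PRODUCT
  bottom: "fc7"
  top: "fc8_cub"
  blobs_lr: 10             # let the new layer learn faster than the copied ones
  blobs_lr: 20
  inner_product_param {
    num_output: 200        # number of classes in my dataset
  }
}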

Thanks to everyone who helped.

Friendly User

Nov 25, 2014, 8:39:20 AM11/25/14
to caffe...@googlegroups.com
solved here too, thanks everyone!!


Another possible noob error: make sure to test with the right model definition, for example:

caffe test -model ./cifar10_full_train_test.prototxt -weights cifar10_full_iter_10000.caffemodel -gpu 0

instead of

caffe test -model ./cifar10_full.prototxt -weights cifar10_full_iter_10000.caffemodel -gpu 0 

zhu jiejie

Jun 9, 2015, 4:33:55 PM6/9/15
to caffe...@googlegroups.com
I have the same issue following the PASCAL fine-tuning example. I followed the instructions in this post and made sure that num_output is the number of classes + 1 (0 is the background). I still get the segmentation fault at the second layer (conv2). Any suggestions? Thanks

I0609 16:28:52.562891 13447 net.cpp:66] Creating Layer norm1
I0609 16:28:52.562906 13447 net.cpp:329] norm1 <- pool1
I0609 16:28:52.562922 13447 net.cpp:290] norm1 -> norm1
I0609 16:28:52.562965 13447 net.cpp:83] Top shape: 64 96 27 27 (4478976)
I0609 16:28:52.562983 13447 net.cpp:125] norm1 needs backward computation.
I0609 16:28:52.563010 13447 net.cpp:66] Creating Layer conv2
I0609 16:28:52.563024 13447 net.cpp:329] conv2 <- norm1
I0609 16:28:52.563040 13447 net.cpp:290] conv2 -> conv2
Segmentation fault (core dumped)