Floating point exceptions when training with GPU but not CPU


Joe Lemley

Oct 30, 2017, 2:44:42 PM10/30/17
to Caffe Users
I'm trying to duplicate the results of a paper which provides a Caffe training file. 

When I train it in CPU mode, everything works as I expect and the results are similar to those reported in the paper, but when I use the GPU I get:

Floating point exception (core dumped)

I've attached the full log in out.txt.

I've confirmed this is not a problem with my GPU or CUDA by running example Caffe code. I'm using HDF5 as a data source. I created this data from numpy arrays and used astype to ensure they were float32 before being written to an HDF5 file. 

Caffe reports that the data class is: H5T_FLOAT. 

Since I'm new to Caffe, I suspect my problem is some "beginner mistake" in data preparation, and I was hoping someone here might have an idea about what could cause this. 


I appreciate any ideas or suggestions you may have.
out.txt

Przemek D

Oct 31, 2017, 3:39:39 AM10/31/17
to Caffe Users
Unfortunately, the "Floating point exception" in Caffe does not always mean what you'd think it means. There used to be a bug that threw it when you tried to make a blob with one dimension equal to zero. So, two questions for you: are you using the most recent version of Caffe? And could you also attach your network and solver prototxts?

Joe Lemley

Oct 31, 2017, 6:25:09 AM10/31/17
to Caffe Users
Thank you Przemek,

I've attached my solver and network prototxts.  

The network prototxt came from the paper's author, and the solver came from a Caffe tutorial I found. 

I'm using Caffe version 1.0.0 compiled from source. One thing I forgot to mention before is that the author provided modified accuracy and Euclidean distance source files. They did not initially compile with this version of Caffe, but I updated the #includes to account for the layer location changes mentioned here: and they compiled fine. 

Can you think of any reason why training would work on the CPU but not the GPU?
solver.prototxt
train_test2.prototxt

Joe Lemley

Oct 31, 2017, 10:58:19 AM10/31/17
to Caffe Users
By the way, I've tracked the error down to a division by zero in accuracy_layer.cu. I doubt it affects anyone but me, or anyone else trying to duplicate the results of this paper.
It seems to be related to the author's modified accuracy layer code in some way. 

outer_num_ is zero in the line below, leading to a division-by-zero error:

const int dim = bottom[0]->count() / outer_num_;

The reason this error happens on the GPU but not the CPU is that in CPU mode the author's modified accuracy_layer.cpp is used, but in GPU mode the standard accuracy_layer.cu runs instead. I still haven't fully solved the problem, but this narrows it down substantially. 

Przemek D

Nov 2, 2017, 3:36:40 AM11/2/17
to Caffe Users
This makes sense. The authors modified the accuracy layer, but at the time of their publication Caffe probably didn't have a GPU version of this layer yet (it was only merged last month). So you're given modified CPU code, but for the GPU implementation you're still running the code that Caffe ships.
The simplest solution (i.e. not involving writing GPU code for the modified layer) would be to revert the effects of commit 62e0c85: remove the Forward_gpu and Backward_gpu declarations from the header, and remove accuracy_layer.cu.