cuBLAS error trying to train network on GPU

Christopher Catton

Jun 17, 2015, 12:12:43 AM
to caffe...@googlegroups.com
I've been trying to train a fully convolutional neural network and have succeeded in training it on the CPU, but I've run into a problem training it on the GPU.

I0616 18:37:34.401193 29112 net.cpp:448] Collecting Learning Rate and Weight Decay.
I0616 18:37:34.401206 29112 net.cpp:218] Network initialization done.
I0616 18:37:34.401211 29112 net.cpp:219] Memory required for data: 1321237252
I0616 18:37:34.401331 29112 solver.cpp:42] Solver scaffolding done.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 542361728
[New Thread 0x7fff65985700 (LWP 29132)]
[New Thread 0x7fff65184700 (LWP 29133)]
[Thread 0x7fff65985700 (LWP 29132) exited]
[Thread 0x7fff65184700 (LWP 29133) exited]
F0616 18:37:35.315524 29112 math_functions.cu:123] Check failed: status == CUBLAS_STATUS_SUCCESS (14 vs. 0)  CUBLAS_STATUS_INTERNAL_ERROR

I've looked at the thread https://github.com/BVLC/caffe/issues/2334 and implemented the changes suggested there. I get an error when I run 'make pycaffe', and training on the GPU still fails with the same error.

My machine has a Titan X and I am using ATLAS. I don't have cuDNN installed, if that is relevant. Has anyone run into this problem, or does anyone have ideas on how to resolve it?

Thanks

Christopher Catton

Jun 21, 2015, 4:36:25 PM
to caffe...@googlegroups.com
I should also mention that I am using the future branch https://github.com/longjon/caffe/tree/future
and that I am trying to train a fully convolutional net on the GPU.

Christopher Catton

Jun 25, 2015, 12:40:59 PM
to caffe...@googlegroups.com
I'm thinking that the reason this is failing is a bug within Caffe. I've tried merging with the master branch, but it still fails at the same point. The point of failure seems to be a line in the softmax loss layer's source file, at line 52; I've marked it with "Fails here" in the code below. I'm a bit concerned that no one else seems to have hit this bug, but all my CUDA tests pass, so I don't think it is an issue with my setup, and I don't see any possibilities other than those I've already tried to resolve. Could anyone else check whether this is an issue?

template <typename Dtype>
void SoftmaxWithLossLayer<Dtype>::Forward_gpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  softmax_layer_->Forward(softmax_bottom_vec_, softmax_top_vec_);
  const Dtype* prob_data = prob_.gpu_data();
  const Dtype* label = bottom[1]->gpu_data();
  const int dim = prob_.count() / outer_num_;
  const int nthreads = outer_num_ * inner_num_;
  // Since this memory is not used for anything until it is overwritten
  // on the backward pass, we use it here to avoid having to allocate new GPU
  // memory to accumulate intermediate results in the kernel.
  Dtype* loss_data = bottom[0]->mutable_gpu_diff();
  // Similarly, this memory is never used elsewhere, and thus we can use it
  // to avoid having to allocate additional GPU memory.
  Dtype* counts = prob_.mutable_gpu_diff();
  // NOLINT_NEXT_LINE(whitespace/operators)
  SoftmaxLossForwardGPU<Dtype><<<CAFFE_GET_BLOCKS(nthreads),
      CAFFE_CUDA_NUM_THREADS>>>(nthreads, prob_data, label, loss_data,
      outer_num_, dim, inner_num_, has_ignore_label_, ignore_label_, counts);
  Dtype loss;
  caffe_gpu_asum(nthreads, loss_data, &loss);  // Fails here
  if (normalize_) {
    Dtype count;
    caffe_gpu_asum(nthreads, counts, &count);
    loss /= count;
  } else {
    loss /= outer_num_;
  }
  top[0]->mutable_cpu_data()[0] = loss;
  if (top.size() == 2) {
    top[1]->ShareData(prob_);
  }
}

Saihui Hou

Jul 6, 2015, 4:48:10 AM
to caffe...@googlegroups.com
I encountered the same problem and haven't solved it yet. Hope someone can fix it as soon as possible. Thanks a lot.

On Wednesday, June 17, 2015 at 12:12:43 PM UTC+8, Christopher Catton wrote:

eran paz

Jul 6, 2015, 7:09:09 AM
to caffe...@googlegroups.com
Hi Christopher, Saihui
I'm having the exact same problem.
Trying to run a fully convolutional net with longjon's future branch.

Couldn't figure out what the issue is yet.
If either of you solves this, I'd appreciate it if you could share the solution.

THX

Gavin Hackeling

Jul 6, 2015, 8:49:37 AM
to eran paz, caffe...@googlegroups.com

This is inconsistent with your description, but I encountered the same CUBLAS_STATUS_INTERNAL_ERROR when the number of labels mismatched the "num_output" in my train_val.prototxt. E.g., with eight classes and background, "num_output" should be nine.
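(The caffe_gpu_asum call is usually just where the error surfaces: the SoftmaxLossForwardGPU launch is asynchronous, so an illegal access caused by an out-of-range label tends to get reported at the next synchronizing cuBLAS call.) One quick way to rule this out is a temporary CPU-side bounds check on the label blob before training. This is a rough, untested sketch, not anything in Caffe; the helper name is made up, and it assumes the label layout from the Forward_gpu code above plus the glog CHECK macros Caffe already uses:

// Hypothetical diagnostic, not part of Caffe: verify on the CPU that every
// label lies in [0, channels) before the GPU loss kernel touches it.
// "channels" should equal the num_output of the layer feeding the softmax.
template <typename Dtype>
void CheckLabelRange(const caffe::Blob<Dtype>& labels, int channels,
                     bool has_ignore_label, int ignore_label) {
  const Dtype* label_data = labels.cpu_data();
  for (int i = 0; i < labels.count(); ++i) {
    const int value = static_cast<int>(label_data[i]);
    if (has_ignore_label && value == ignore_label) {
      continue;  // ignored pixels may legitimately fall outside the range
    }
    CHECK_GE(value, 0) << "negative label at index " << i;
    CHECK_LT(value, channels) << "label " << value << " at index " << i
                              << " is >= num_output (" << channels << ")";
  }
}

If any label trips the check, either that label value or the "num_output" is the thing to fix.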


Christopher Catton

Jul 6, 2015, 1:09:09 PM
to caffe...@googlegroups.com, era...@gmail.com
Hello Everyone,

I actually got around this issue by disabling the GPU methods of the SoftmaxWithLoss layer and fixing some array indexing errors in the CPU layer.
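Concretely, "disabling the GPU methods" just means making them delegate to the CPU path so the rest of the net can stay on the GPU. Something along these lines in src/caffe/layers/softmax_loss_layer.cu (written from memory, so treat it as a sketch of the idea rather than my exact change):

template <typename Dtype>
void SoftmaxWithLossLayer<Dtype>::Forward_gpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  Forward_cpu(bottom, top);  // fall back to the (fixed) CPU implementation
}

template <typename Dtype>
void SoftmaxWithLossLayer<Dtype>::Backward_gpu(
    const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  Backward_cpu(top, propagate_down, bottom);  // same fallback for the gradient
}

INSTANTIATE_LAYER_GPU_FUNCS(SoftmaxWithLossLayer);

The blobs' SyncedMemory handles the device/host copies, so this is slower than the real GPU path, but numerically it is just the CPU layer.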

Saihui Hou

Jul 6, 2015, 8:37:01 PM
to caffe...@googlegroups.com, era...@gmail.com
Glad to hear that. Could you give some details?

On Tuesday, July 7, 2015 at 1:09:09 AM UTC+8, Christopher Catton wrote:

Saihui Hou

Jul 6, 2015, 8:40:44 PM
to caffe...@googlegroups.com, era...@gmail.com
It seems that it's not this bug that causes the problem. Is there any other possible reason?

On Monday, July 6, 2015 at 8:49:37 PM UTC+8, Gavin Hackeling wrote:

eran paz

Jul 7, 2015, 12:22:56 AM
to caffe...@googlegroups.com, era...@gmail.com
Gavin
Thanks, that was it. Not sure how, but I had higher class values than I intended, so my "num_output" was indeed too low.
Didn't even think to look there...

Thanks again.

Saihui Hou

Jul 7, 2015, 3:15:31 AM
to caffe...@googlegroups.com
Have you solved the problem? I'm eager to know that...

On Tuesday, July 7, 2015 at 12:22:56 PM UTC+8, eran paz wrote:

eran paz

Jul 7, 2015, 6:09:40 AM
to caffe...@googlegroups.com
Yes, as I mentioned, the problem (at least my problem) was that the number of classes I had was larger than the "num_output" in my deconvolution layer.
I've just fixed the number of classes.
Now I'm running into memory issues, but that's a whole different story :)

Saihui Hou

Jul 8, 2015, 10:06:18 PM
to caffe...@googlegroups.com
I got this problem solved too. Thanks a lot. In my case I changed the "label_value" in "softmax_loss_layer.cu" to fit my "num_output". But I've run into another problem: the loss doesn't go down whatever the learning rate is. Could anyone give some hints about what causes this?

On Tuesday, July 7, 2015 at 6:09:40 PM UTC+8, eran paz wrote:

Krishna Teja

Sep 2, 2015, 1:36:34 PM
to Caffe Users
Hi Saihui Hou,

I have the same problem. Could you please help me resolve it?
If my num_output = 5, should I have the classes 0, 1, 2, 3, 4 or any 5 values between 0 and 255?

eran paz

Sep 2, 2015, 2:17:59 PM
to Caffe Users
0-4
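That is, with num_output = 5 the label image should contain only the values 0-4. If your ground truth uses arbitrary gray values, remap them to contiguous class indices before building the LMDB. A quick untested sketch with OpenCV (the function is just for illustration, not something from Caffe); share one lut across the whole dataset so a given gray value always maps to the same class index:

#include <map>
#include <opencv2/core/core.hpp>

// Sketch: map the gray values of a single-channel label image onto contiguous
// class indices 0..K-1. The lut is passed in so it can be reused for every
// image in the dataset.
cv::Mat RemapLabels(const cv::Mat& gray_label,
                    std::map<unsigned char, unsigned char>* lut) {
  cv::Mat remapped(gray_label.size(), CV_8UC1);
  for (int r = 0; r < gray_label.rows; ++r) {
    for (int c = 0; c < gray_label.cols; ++c) {
      const unsigned char v = gray_label.at<unsigned char>(r, c);
      if (lut->find(v) == lut->end()) {
        const unsigned char next_index =
            static_cast<unsigned char>(lut->size());
        (*lut)[v] = next_index;  // first time this gray value appears
      }
      remapped.at<unsigned char>(r, c) = (*lut)[v];
    }
  }
  return remapped;  // values now run from 0 to (number of classes - 1)
}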

Krishna Teja

Sep 3, 2015, 6:49:06 AM
to Caffe Users
Hi Eran Paz,

Thank you very much for the quick reply. The problem is solved.
Now I also have memory issues on the GPU: it is using the entire GPU memory.
Do you have a solution for this as well?

Krishna Teja

Sep 3, 2015, 7:34:37 AM
to Caffe Users
Hi Eran Paz,

I have seen your other post regarding this problem and merged PR #2016.
Now the memory usage is reduced a lot.
Thanks. :)

Teja

Majid Azimi

Mar 9, 2016, 8:47:00 PM
to Caffe Users
Hi Eran,

I am experiencing the same problem. I don't know much about the internal structure of Caffe. Basically, I am going to do segmentation regression, not segmentation classification. I have 255 classes (gray values); should I change the output of the last deconvolutional layer to match this number? Right now I am using the train.prototxt for FCN32.

Thanks

Majid Azimi

Mar 9, 2016, 8:59:05 PM
to Caffe Users
I am having the same problem. How do I tell the network how many classes we have? When I built the LMDB file for the labels, I never specified the number of classes anywhere. In this case I have 255 classes to begin with, to see if it works. Then I will change the softmax loss to a Euclidean loss to do regression, but first I have to get the net running. Any help is appreciated.


On Thursday, July 9, 2015 at 4:06:18 AM UTC+2, Saihui Hou wrote:

eran paz

Mar 10, 2016, 3:26:58 AM
to Caffe Users
Hi Majid
The number of classes is defined by the "num_output" of the deconvolution layer.
BTW, I'm assuming you have 256 classes (0-255), which means your "num_output" should be 256.

S. Majid Azimi

Mar 10, 2016, 6:26:22 AM
to eran paz, Caffe Users
Hi Eran, thanks a lot for your help. I just did it, but because it is a regression problem I changed the loss to a Euclidean loss. Then Caffe complained about the 255 outputs, for obvious reasons. I changed it to 1 instead of 255 to do regression, and it is training, although with a very high training loss. Is that correct?

thanks for your advice.

李傲

May 19, 2016, 3:12:03 AM
to Caffe Users, era...@gmail.com
Thank you very much! It worked! There are 2 classes for me, i.e. foreground and background. I modified the ground truth to a gray image which only contains the values 0 and 1, set num_output to 2, and finally it's fixed!


On Monday, July 6, 2015 at 8:49:37 PM UTC+8, Gavin Hackeling wrote:

This is inconsistent with your description, but I encountered the same CUBLAS_STATUS_INTERNAL_ERROR when the number of labels mismatched the "num_output" in my train_val.prototxt. E.g., with eight classes and background, "num_output" should be nine.

FELIPE PETROSKI SUCH

Jun 29, 2016, 2:59:25 PM
to Caffe Users
I'm running into this issue too, but only on the GPU; the CPU runs fine.
I've checked multiple times, and I think I only have 0 and 1 labels with num_output = 2.
My concern is that I use a reshape to do multi-class classification.
Are there any other issues that could cause the same error besides out-of-bounds labels?

15535...@qq.com

Sep 12, 2016, 7:26:02 AM
to Caffe Users
Hi Majid,
I am also experiencing the same problem. I use Caffe for a regression task.
The inputs are images, and the outputs are also images.
My data has no labels.
Can you help me? Thank you so much!