Training network - works fine for batch size < 20, crashes and pops CUBLAS_STATUS_SUCCESS(14 vs. 0)


sri

Jul 17, 2015, 12:24:54 AM7/17/15
to caffe...@googlegroups.com
Hi,

I'm training a network that performs two tasks. I use a custom-written data layer to read the image name, label 1 (a vector), and label 2 (a single value). Training works fine with batch size < 20; with anything larger, it pops the error "Check failed: status == CUBLAS_STATUS_SUCCESS (14 vs. 0)  CUBLAS_STATUS_INTERNAL_ERROR" and crashes. I'm using ./build/tools/caffe train; I'm not training in Python.

Interestingly, I also tried training another network that does a simple fine-tuning classification task (just the one task), and that training works fine for all batch sizes. Has anyone encountered this issue before, or can someone help me figure out where I might have gone wrong?

More details: I use a custom-written layer, modified from Caffe's ImageData layer, to read image names and labels from a text file. I have a 400-dimensional vector as label 1 and a single value as label 2. I use softmax loss for both tasks, with 400 individual softmaxes (i.e., one per element of the 400-dimensional label) for the first task. I also use Caffe's Reshape layer in my network. I've verified that my labels are all integers >= 0 (since I'm using a classification-like approach with a softmax).
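In case it helps, here is a rough prototxt sketch of how I understand the two-task loss wiring described above. All layer and blob names are illustrative (not from my actual net), and the reshape target shape is an assumption about how the 400 per-element softmaxes might be arranged:

```
# Hypothetical wiring for the two-task losses (names are placeholders).
layer {
  name: "reshape_pred1"
  type: "Reshape"
  bottom: "fc_task1"       # flat prediction for task 1, e.g. 400*K values
  top: "pred1_reshaped"
  # Reshape to N x K x 400 so a softmax over axis 1 gives one
  # K-way softmax per each of the 400 label positions.
  reshape_param { shape { dim: 0 dim: -1 dim: 400 } }
}
layer {
  name: "loss_task1"
  type: "SoftmaxWithLoss"
  bottom: "pred1_reshaped"
  bottom: "label1"         # the 400-dimensional label vector
  top: "loss1"
  softmax_param { axis: 1 }
}
layer {
  name: "loss_task2"
  type: "SoftmaxWithLoss"
  bottom: "fc_task2"       # prediction for task 2
  bottom: "label2"         # the single-value label
  top: "loss2"
}
```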

If there is an error somewhere in my code, why would training work fine for smaller batch sizes and crash only for larger ones?

Thanks!