Training network - works fine for batch size < 20, crashes and pops CUBLAS_STATUS_SUCCESS(14 vs. 0)


sri

Jul 17, 2015, 12:24:54 AM7/17/15
to caffe...@googlegroups.com
Hi,

I'm training a network that performs two tasks. I use a custom-written data layer to read the image name, label 1 (a vector), and label 2 (a single value). Training works fine with batch size < 20; with anything larger, it pops the error "Check failed: status == CUBLAS_STATUS_SUCCESS (14 vs. 0)  CUBLAS_STATUS_INTERNAL_ERROR" and crashes. I'm using ./build/tools/caffe train; I'm not training in Python.

Interestingly, I also tried training another network that does a simple fine-tuning classification task (just the one task), and that training works fine for all batch sizes. Has anyone encountered this issue before, or can someone help me figure out where I might have gone wrong?

More details: I use a custom-written layer, modified from Caffe's ImageData layer, to read image names and labels from a text file. I have a 400-dimensional vector as label 1 and a single value as label 2. I use softmax loss for both tasks, with 400 individual softmaxes (i.e., one per element of the 400-dimensional label) for the first task. I also use Caffe's Reshape layer in my network. I've verified that my labels are all integers >= 0 (since I'm using a classification-like approach with a softmax).
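In case it helps, here is a rough prototxt sketch of how I understand the two-task loss wiring described above. All layer and blob names are illustrative (not from my actual net), and the reshape target shape is an assumption about how the 400 per-element softmaxes might be arranged:

```
# Hypothetical wiring for the two-task losses (names are placeholders).
layer {
  name: "reshape_pred1"
  type: "Reshape"
  bottom: "fc_task1"       # flat prediction for task 1, e.g. 400*K values
  top: "pred1_reshaped"
  # Reshape to N x K x 400 so a softmax over axis 1 gives one
  # K-way softmax per each of the 400 label positions.
  reshape_param { shape { dim: 0 dim: -1 dim: 400 } }
}
layer {
  name: "loss_task1"
  type: "SoftmaxWithLoss"
  bottom: "pred1_reshaped"
  bottom: "label1"         # the 400-dimensional label vector
  top: "loss1"
  softmax_param { axis: 1 }
}
layer {
  name: "loss_task2"
  type: "SoftmaxWithLoss"
  bottom: "fc_task2"       # prediction for task 2
  bottom: "label2"         # the single-value label
  top: "loss2"
}
```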

If there is an error somewhere in my code, why would training work fine for smaller batch sizes and crash only for larger ones?

Thanks!