Hi,
I'm training a network that does two tasks. I'm using a custom-written data layer to read the image name, label 1 (a vector), and label 2 (a single value). Training works fine with batch sizes under 20; with anything larger it crashes with "Check failed: status == CUBLAS_STATUS_SUCCESS (14 vs. 0) CUBLAS_STATUS_INTERNAL_ERROR". I'm training with ./build/tools/caffe train, not from Python.
Interestingly, I tried training another network that does a simple fine-tuning classification task (just the one task), and that training works fine for all batch sizes. Has anyone encountered this issue before, or can someone help me figure out where I might have gone wrong?
More details: my custom layer, modified from Caffe's ImageData layer, reads image names and labels from a text file. Label 1 is a 400-dimensional vector and label 2 is a single value. I use softmax loss for both tasks, with 400 independent (or rather, 400 separate?) softmaxes for the first task. The network also uses Caffe's Reshape layer. I've verified that all my labels are integers >= 0 (since I'm taking a classification-style approach with a softmax).
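To clarify the task-1 layout: the prediction blob is reshaped so the softmax runs independently at each of the 400 positions. A minimal numpy sketch of what I mean (N, C, and the blob names are placeholders, not my actual net; in Caffe this corresponds to a Reshape layer followed by SoftmaxWithLoss along the class axis):

```python
import numpy as np

N, C, K = 2, 5, 400  # batch size, classes per position, positions (K = 400 in my net)

# fc output of shape N x (C*K), reshaped to N x C x K,
# mirroring what I do with Caffe's Reshape layer
fc = np.random.randn(N, C * K).astype(np.float32)
pred = fc.reshape(N, C, K)

# softmax along the class axis (axis 1), i.e. 400 independent softmaxes
e = np.exp(pred - pred.max(axis=1, keepdims=True))  # max-subtracted for stability
prob = e / e.sum(axis=1, keepdims=True)

# each of the 400 positions now sums to 1 for every sample
print(np.allclose(prob.sum(axis=1), 1.0))  # True
```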
If there is an error somewhere in my code, I'm curious why it would work fine for smaller batch sizes and crash only for larger ones.
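For reference, the label sanity check I mention above is roughly the following (the exact line format here is a placeholder; mine is similar, with the image name followed by the 400 task-1 labels and the single task-2 label):

```python
# Sanity check: every label token in the list file must be a non-negative integer.
def labels_ok(line, num_labels=401):
    parts = line.split()
    if len(parts) != 1 + num_labels:  # image name + 400 + 1 labels
        return False
    # str.isdigit() is False for '-1', '1.5', and '', so this enforces ints >= 0
    return all(tok.isdigit() for tok in parts[1:])

good = "img_0001.jpg " + " ".join("1" for _ in range(401))
bad = "img_0002.jpg -1 " + " ".join("0" for _ in range(400))
print(labels_ok(good), labels_ok(bad))  # True False
```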
Thanks!