Training works fine, deploy throws configuration error cudaSuccess (9 vs. 0)


p.Paul

May 3, 2017, 12:51:00 PM
to Caffe Users
 
Caffe error: cudaSuccess (9 vs. 0)  invalid configuration argument

Training with a batch size of 10 works fine for this net. What could be the reason for the error?

Deploying on 1 image gives the problem. I don't think it's a memory or hardware problem, since training is fine!

 Please help!

p.Paul

May 4, 2017, 11:59:17 AM
to Caffe Users


From other discussions I figured out that this can be a problem with the image size.
So I tried another image of a different size, and now I get a new error:


Check failed: K_ == new_K (256 vs. 512) Input size incompatible with inner product parameters.
*** Check failure stack trace: ***


What could be the reason? This happens only at deploy time!
One more piece of information that I think is important: I am combining a large multistage network with AlexNet (as the last part).

The error varies with different images:

Check failed: K_ == new_K (256 vs. 0) Input size incompatible with inner product parameters.
*** Check failure stack trace: ***

I cannot figure out why. The first part of the network (before combining it with AlexNet) used to work with all image sizes, and the FC layer that throws this error is part of AlexNet.

Przemek D

May 5, 2017, 7:55:17 AM
to Caffe Users
Consider attaching your train_val and deploy prototexts.

p.Paul

May 5, 2017, 8:52:31 AM
to Caffe Users
Thank you very much for your kind reply.
I am stuck on this error! I attach my prototxts:
pose_deploy.prototxt
pose_train_test.prototxt

p.Paul

May 5, 2017, 8:59:43 AM
to Caffe Users
Some more information:

I have combined network 1 (a multistage VGG) with network 2 (AlexNet): I modified network 1 by adding AlexNet to it for a regression.

With any input image at deploy time, the original network (network 1) works, but the combined network throws this error.
What I don't understand is that the input image is given to network 1 and is not directly connected to the AlexNet layers. The output of the first network is the input of the second network (AlexNet). All the blob shapes match, and training works fine.

Przemek D

May 8, 2017, 4:04:44 AM
to Caffe Users
I managed to load your pose_deploy.prototxt in python using
caffe.Net('pose_deploy.prototxt', caffe.TEST)
just fine, both on CPU and GPU.
Is it possible that there is some inconsistency between deploy and training networks? For some reason I was unable to render either net using draw_net.py, which would be a relatively simple way of verifying that (more convenient than reading through 3-3.5k lines of proto anyway).

p.Paul

May 8, 2017, 4:23:44 AM
to Caffe Users
Thank you very much. I will try that, but I have already checked the shapes of the blobs. I attach the outputs of training and testing, which I compared using Beyond Compare. Could you please have a look at them too? As far as I can see, the shapes between the layers match.
deploy_test_net .txt
output_trainin.txt
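
A minimal sketch of how two such logs could be compared programmatically instead of by eye. It assumes the logs contain the "Setting up <layer>" and "Top shape: ..." lines that Caffe usually prints while initializing a net. The log excerpts below are made up for illustration (they are not taken from the attached files), and parse_top_shapes and diff_shapes are hypothetical helper names. The batch dimension is ignored, because training used a batch of 10 and deploy a single image.

```python
import re

# Patterns for the lines Caffe typically prints while setting up a net
# (assumed format; adjust them if your logs look different).
SETUP_RE = re.compile(r"Setting up (\S+)")
SHAPE_RE = re.compile(r"Top shape: ((?:\d+ )+)\(\d+\)")


def parse_top_shapes(log_text):
    """Map each layer name to the list of its top blob shapes."""
    shapes = {}
    current = None
    for line in log_text.splitlines():
        match = SETUP_RE.search(line)
        if match:
            current = match.group(1)
            shapes[current] = []
            continue
        match = SHAPE_RE.search(line)
        if match and current is not None:
            dims = tuple(int(d) for d in match.group(1).split())
            shapes[current].append(dims)
    return shapes


def diff_shapes(train, deploy):
    """List layers present in both nets whose shapes differ beyond the batch axis."""
    mismatches = []
    for name in train:
        if name not in deploy:
            continue
        train_dims = [shape[1:] for shape in train[name]]
        deploy_dims = [shape[1:] for shape in deploy[name]]
        if train_dims != deploy_dims:
            mismatches.append((name, train[name], deploy[name]))
    return mismatches


# Made-up log excerpts, only to show the idea.
TRAIN_LOG = """\
net.cpp:150] Setting up conv1
net.cpp:157] Top shape: 10 96 55 55 (2904000)
net.cpp:150] Setting up pool5
net.cpp:157] Top shape: 10 256 6 6 (92160)
"""
DEPLOY_LOG = """\
net.cpp:150] Setting up conv1
net.cpp:157] Top shape: 1 96 55 55 (290400)
net.cpp:150] Setting up pool5
net.cpp:157] Top shape: 1 256 7 7 (12544)
"""

mismatches = diff_shapes(parse_top_shapes(TRAIN_LOG), parse_top_shapes(DEPLOY_LOG))
for name, train_shape, deploy_shape in mismatches:
    print(name, train_shape, deploy_shape)
```

Note that this only compares the shapes reported when the nets are initialized. If a deploy script reshapes the input blob afterwards, the shapes actually used in the forward pass can differ from the ones in the log.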

p.Paul

May 8, 2017, 9:55:17 AM
to Caffe Users

I drew the networks and attach the images here.

deploy1.png and deploy.png show the same network with the order of the pooling layer reversed, which should not be a problem when transferring weights. Moreover, I tried both of these deploy networks and received the same error.

images.zip

Przemek D

May 9, 2017, 2:38:47 AM
to Caffe Users
CUDA error 9 (cudaErrorInvalidConfiguration) means that a kernel launch was attempted with a configuration that is either unsupported by your device or impossible altogether. If I had been unable to run your net as well, this would indicate a problem with Caffe, but for me it loaded okay and I could call forward() with no problems. This means that both Caffe and your network are okay.
You should investigate your deploy script as the next step, as you're most likely making some mistake there.
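
One way such an invalid launch configuration can arise, sketched in Python rather than CUDA: the number of CUDA blocks is usually derived from the number of elements to process, so a blob with zero elements leads to a launch with zero blocks, which CUDA rejects. The constant 512 and the rounding formula below are meant to mirror CAFFE_CUDA_NUM_THREADS and CAFFE_GET_BLOCKS from Caffe's common.hpp, to the best of my knowledge; treat both as assumptions and check your own source tree. Whether this is what actually happened in this particular net is also an assumption.

```python
# Assumed to mirror CAFFE_CUDA_NUM_THREADS in Caffe's common.hpp.
THREADS_PER_BLOCK = 512


def get_blocks(n):
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK


def blob_count(shape):
    """Total number of elements in a blob of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


def launch_is_valid(shape):
    """A CUDA grid needs at least one block; zero blocks is an invalid configuration."""
    return get_blocks(blob_count(shape)) >= 1


print(launch_is_valid((1, 256, 6, 6)))  # an ordinary blob
print(launch_is_valid((1, 256, 0, 0)))  # a blob that collapsed to zero size
```

A blob can collapse to zero size when the input is too small for the stack of strided layers, which would also be consistent with a check such as "K_ == new_K (256 vs. 0)" on the CPU side.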

p.Paul

May 9, 2017, 3:37:43 AM
to Caffe Users
Okay, thank you very much. I suppose that you mean the input data format to the deploy network.

p.Paul

May 10, 2017, 4:14:05 PM
to Caffe Users


Regarding the error, I got the best explanation from Przemek D of why there is no error for some input images, and I copy it here:

"
This is quite complex to explain, but I'll try. Imagine a 100x100 input and a 7x7 conv filter with a stride of 4, no padding. The output shape, following the simple formula, would be (100-7)/4+1 = 24.25, but this is impossible (the shape must be an integer). Instead, the layer fits only as many convolutions as it can, so in this case the output will be of shape 24x24. Now, if we convolve that with a 3x3 filter with stride 2, we would get (24-3)/2+1 = 11.5, which is again impossible, so the output is rounded down to the nearest integer, in this case 11x11.
Look what happens if we reshape this input to 102x102, leaving the convolutions as they were. The first layer will try to output 24.75, which will be cropped to 24x24 like in the previous case. So in the end you'll get 11x11 as before. Reshaping down is also interesting: consider a 95x95 input. It will be convolved cleanly into a 23x23 blob, which after the second convolution is 11x11 again! See, 95x95 and 102x102 are pretty much the same: they will both produce an 11x11 blob, which can be supplied to the same FC layer (with 121 inputs). The deeper your network and the larger the strides you use, the more noticeable this effect becomes (the more you can vary your input with no influence on the final shape). I can imagine your net allowed you to go as far as from 368 to 552 (though that's 50%, so quite a lot), but after crossing some threshold (656 must've been above it) the last conv layer also reshapes, which causes your FC input mismatch (K_ == new_K). In our case, if we used 103x103, the last conv would output 12x12, which is 144 elements. If we had an FC layer after that which was trained on 95x95 images, and hence expected 121 inputs, it would throw this error.
I hope I made this clear enough.

"
I really appreciate all the hard work you've done to help me.
Thank you for your guidance and support.
Thanks a lot, Przemek D.
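
The arithmetic in the quoted explanation can be checked with a few lines of Python. This is only a sketch of the toy two-layer example from the quote (7x7 filter with stride 4, then 3x3 filter with stride 2, no padding), not of the actual pose network; conv_out and final_side are hypothetical helper names.

```python
def conv_out(size, kernel, stride, pad=0):
    """Output side length of a convolution; the division is rounded down."""
    return (size + 2 * pad - kernel) // stride + 1


def final_side(size):
    """Toy net from the explanation: 7x7 stride 4, then 3x3 stride 2."""
    return conv_out(conv_out(size, kernel=7, stride=4), kernel=3, stride=2)


for side in (95, 100, 102, 103):
    out = final_side(side)
    print(side, "->", out, "x", out, "=", out * out, "FC inputs")
```

Inputs of 95, 100 and 102 all end in an 11x11 blob (121 FC inputs), while 103 ends in 12x12 (144 FC inputs), which is the point where the inner product check would fail.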

Przemek D

May 11, 2017, 1:32:02 AM
to Caffe Users
I will only add that we managed to determine that the problem was related to reshaping the network after loading an image larger than the ones it was trained on. For some images the network could be reshaped without changing the FC layer input, but beyond a certain threshold this caused the shape mismatch described in the message above, which I tried to explain in the quote.
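
A minimal sketch of how a deploy script could guard against this, again using the toy two-layer net from the quote as a stand-in (the LAYERS list, compatible_sizes and check_input are hypothetical, not part of Caffe or of the real network). It finds which input sizes still produce the FC input the net was trained with, and raises a readable error before any reshape is attempted.

```python
def conv_out(size, kernel, stride, pad=0):
    """Output side length of a convolution; the division is rounded down."""
    return (size + 2 * pad - kernel) // stride + 1


# Hypothetical toy net as (kernel, stride) pairs: 7x7 stride 4, then 3x3 stride 2.
LAYERS = [(7, 4), (3, 2)]


def final_side(size, layers=LAYERS):
    """Side length of the last conv output for a square input."""
    for kernel, stride in layers:
        size = conv_out(size, kernel, stride)
    return size


def compatible_sizes(trained_size, candidates, layers=LAYERS):
    """Input sizes whose final blob matches the one seen during training."""
    target = final_side(trained_size, layers)
    return [s for s in candidates if final_side(s, layers) == target]


def check_input(size, trained_size, layers=LAYERS):
    """Raise before reshaping if the FC layer would see a different input size."""
    expected = final_side(trained_size, layers)
    actual = final_side(size, layers)
    if actual != expected:
        raise ValueError(
            "input %d gives a %dx%d blob, but the FC layer was trained on %dx%d; "
            "resize the image first" % (size, actual, actual, expected, expected)
        )


ok = compatible_sizes(95, range(80, 121))
print("compatible input sizes:", min(ok), "to", max(ok))
check_input(102, trained_size=95)  # passes silently
```

For the real network the same idea applies, but the layer list would have to be taken from the deploy prototxt, and pooling layers may round differently from convolutions.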