I am looking for some guidance on a performance problem I am seeing with a Caffe deep CNN for image classification on the Jetson TK1.
My Caffe model takes approximately 4 seconds on the ARM CPU and approximately 7 seconds on the Jetson's GPU for a forward pass during the prediction phase, with a batch of 60K images of dimensions 20*5*5. In other words, the GPU is slower than the CPU for the forward pass. However, I get about a 1.2x speedup with the same code on a GeForce desktop GPU with 48 cores.
What could be the reason behind this strange behaviour on the Jetson TK1?
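For anyone trying to reproduce the comparison, here is a minimal sketch of how the forward pass could be timed so that one-time startup cost (such as CUDA context creation and memory allocation on the first GPU call) is kept out of the measurement. It is not the benchmark code behind the numbers in this post. The helper names `time_forward` and `batch_bytes` are hypothetical. The sketch also assumes that 20*5*5 means channels x height x width and that each value is a 4-byte float.

```python
import time


def time_forward(forward, warmup=2, runs=5):
    """Return mean wall-clock seconds per call of `forward`.

    `forward` is any zero-argument callable. The first `warmup` calls
    are executed but not timed, so one-time setup cost is excluded.
    """
    for _ in range(warmup):
        forward()
    start = time.perf_counter()
    for _ in range(runs):
        forward()
    return (time.perf_counter() - start) / runs


def batch_bytes(n_images, channels, height, width, bytes_per_value=4):
    """Size in bytes of one input batch (assumes 4-byte floats by default)."""
    return n_images * channels * height * width * bytes_per_value


if __name__ == "__main__":
    # Assumed layout: 60K inputs, each 20 channels x 5 x 5.
    size = batch_bytes(60000, 20, 5, 5)
    print(size / 1e6, "MB per input batch")

    # Stand-in workload so the sketch runs without Caffe installed.
    # With pycaffe this could be `net.forward`, called once after
    # caffe.set_mode_cpu() and once after caffe.set_mode_gpu().
    mean_s = time_forward(lambda: sum(range(100000)))
    print("mean seconds per call:", mean_s)
```

Under those assumptions the input batch alone is about 120 MB, so it may be worth checking how much of the 7 seconds is spent moving data rather than computing.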