I have an NVIDIA Pascal GPU running Caffe on Windows 10. When I profile with Nsight in Visual Studio, forward propagation in testing mode shows only 4.3% GPU utilization, with less than 1% use of the 16 kernel calls.
I'm working on a real-time system, so I'm trying to make forward propagation run as quickly as possible.
If I increase the kernel size, I will have to rerun my training, which is a very time-consuming process. Besides increasing the kernel size, what other tweaks can I make to Caffe or CUDA to speed up forward propagation at test time?