I'm benchmarking a slightly modified version of the C++ classification example, and I'm getting some odd results. Converting a 1920x1200 image and loading it onto the GPU takes 0.06s, and running the network forward takes 0.02s. However, copying the output off the GPU takes around 0.10s. For reference, the output layer being copied is a 12x36x58 array, so roughly 100 KB of data.
According to the little bandwidth tester included in the CUDA samples, my 970 consistently does over 12 GB/s on device-to-host transfers. So why is Caffe only moving data at roughly 1 MB/s? It's a huge performance constraint, and I can't figure out where it's coming from. If anyone can offer some clarity on this issue I'd appreciate it.
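For concreteness, here is how the effective bandwidth works out from the numbers above (this assumes the output blob is float32, which is Caffe's default):

```python
# Effective device-to-host bandwidth for the 12x36x58 output layer,
# assuming 4-byte float32 elements (Caffe's default blob type).
elems = 12 * 36 * 58           # 25,056 elements
size_bytes = elems * 4         # ~100 KB
copy_time_s = 0.10             # measured copy time from the benchmark
bandwidth = size_bytes / copy_time_s
print(f"{size_bytes / 1024:.0f} KB -> {bandwidth / 1e6:.2f} MB/s")
# -> 98 KB -> 1.00 MB/s
```

That's roughly four orders of magnitude below the ~12 GB/s the bandwidth tester reports for the same card.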