Thanks a lot,
I used to write in C/C++ a lot but it was many years ago :)
I'll check it; however, I thought the data copying was necessary to get the correct data representation, and that the CPU-GPU transfer would be the activity taking most of the CPU time, right?
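One thing that sometimes reduces that copy overhead (just a numpy sketch of the idea, not anything Caffe-specific) is converting the data once, on the CPU, into the dtype and memory layout the GPU transfer expects, so the framework doesn't have to do a hidden per-batch conversion:

```python
import numpy as np

# Hypothetical raw data: float64 and non-contiguous (a transposed view).
raw = np.random.rand(10, 100).T  # shape (100, 10), not C-contiguous

# One explicit conversion to contiguous float32 up front,
# instead of an implicit copy on every transfer.
prepared = np.ascontiguousarray(raw, dtype=np.float32)

print(prepared.flags['C_CONTIGUOUS'])  # True
print(prepared.dtype)                  # float32
```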
And actually this is only one tiny piece of the puzzle.
Currently, for low-dimensional versions (~100x10), the CPU is 120% busy while the reported GPU load is only 6%.
That probably means it could run ~15x faster if we eliminated the need for manipulations on the CPU.
This can easily be reproduced with MNIST (GPU mode, large batches) -- even then it doesn't speed up to more than 10x over CPU, although for images the difference can be up to 50x.
I would like to improve that, but I don't know how.
I got very good results with just a CNN, but I'd like to be able to do RNN processing -- which should be very easy by itself... but as I understand it, that's not quite easy with Caffe at the moment...
I considered the following pipeline:
data -> sliding window -> normalizing to [0, 1] interval & adding noise -> several layers of neurons -> RNN
where consecutive data chunks are terminated with zeros (or marked in some other way), so that the RNN knows to reset its state.
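The preprocessing part of that pipeline could be sketched like this (pure numpy, with made-up window sizes and noise scale, independent of Caffe):

```python
import numpy as np

rng = np.random.default_rng(0)

def sliding_windows(signal, width, step):
    """Cut a 1-D signal into overlapping windows of `width`, shifted by `step`."""
    n = (len(signal) - width) // step + 1
    return np.stack([signal[i * step : i * step + width] for i in range(n)])

def normalize_01(x, eps=1e-8):
    """Scale each window to the [0, 1] interval independently."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    return (x - lo) / (hi - lo + eps)

signal = rng.standard_normal(1000)
windows = sliding_windows(signal, width=100, step=10)
windows = normalize_01(windows)
windows += rng.normal(scale=0.01, size=windows.shape)  # light augmentation noise

print(windows.shape)  # (91, 100)
```

The windowed batches would then go into the neuron layers / RNN part of the pipeline.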
But then there is the batching issue: how do I run these in parallel?
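On the batching question, one common trick (again just a numpy sketch of the general idea, not a Caffe API) is to zero-pad variable-length sequences to a common length and keep a mask, so each batch row knows where its real data ends and the RNN state can be reset past that point:

```python
import numpy as np

def batch_sequences(seqs, pad_value=0.0):
    """Pad variable-length sequences into one (batch, max_len) array,
    plus a boolean mask marking the valid (non-padding) positions."""
    max_len = max(len(s) for s in seqs)
    batch = np.full((len(seqs), max_len), pad_value, dtype=np.float32)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True
    return batch, mask

seqs = [np.ones(3), np.ones(5), np.ones(2)]
batch, mask = batch_sequences(seqs)
print(batch.shape)       # (3, 5)
print(mask.sum(axis=1))  # [3 5 2]
```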
I would consider running these from Python with net.forward / net.backward; that way I'd avoid any additional C/C++ code and could control timing and other details...
But then I'd be worried about speed much more than when running ./caffe directly! Or shouldn't I be?
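If you go the Python route, it's cheap to measure whether the interpreter overhead actually matters before worrying about it. A minimal timing harness (the `dummy` callable is a stand-in of my own; in practice you'd pass a closure that calls net.forward / net.backward):

```python
import time

def time_step(step, n_iters=100, warmup=10):
    """Average wall-clock seconds per call of `step`, after a short warmup."""
    for _ in range(warmup):
        step()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        step()
    return (time.perf_counter() - t0) / n_iters

# Dummy stand-in for net.forward(); replace with the real call.
dummy = lambda: sum(i * i for i in range(1000))
per_call = time_step(dummy)
print(per_call > 0)  # True
```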
Multimodal networks are my passion... but how do I develop them while using the computing power efficiently?
This is my biggest question at the moment.