I would like to reproduce the Two-Stream Convolutional Networks for Action Recognition in Videos paper.
But it feels like I have hit a wall when it comes to giving multi-frame input to Caffe.
The single-frame network gives 50% accuracy, but when I give an input of 30*227*227 via an LMDB (10 frames, each with 3 channels), the accuracy barely reaches 4%.
This leads me to believe that the input I'm giving to Caffe is not in the required format, or (less likely) that the model is wrong.
In all cases I assume the network architecture (arrangement and number of layers) and the learning parameters (LR/decay/regularization/etc.) are held constant.
For example, I could choose to give my input to the network as one of the following (see the sketch after this list):
1) batch_size x (no_of_imgs * no_of_channels) x height x width {3-dimensional input per sample}
2) batch_size x no_of_imgs x no_of_channels x height x width {4-dimensional input per sample}
3) batch_size x no_of_channels x no_of_imgs x height x width {4-dimensional input per sample}
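As far as I can tell, Caffe's Datum only stores a 3-dimensional array (channels x height x width), so layout 1 is the only one an LMDB can hold directly; layouts 2 and 3 would seem to need 5-dimensional blobs, e.g. via an HDF5Data layer or a custom data layer. For reference, here is a minimal sketch of how I build the LMDB for layout 1 (assuming pycaffe, python-lmdb, and numpy are available; the helper name write_stacked_clips and the fixed 10x3x227x227 clip shape are just for illustration):

```python
import lmdb
import numpy as np
from caffe.proto import caffe_pb2

def write_stacked_clips(db_path, clips, labels):
    """Write clips shaped (no_of_imgs, no_of_channels, H, W) as single
    Datums with channels = no_of_imgs * no_of_channels (layout 1)."""
    env = lmdb.open(db_path, map_size=1 << 40)  # generous map size; adjust as needed
    with env.begin(write=True) as txn:
        for i, (clip, label) in enumerate(zip(clips, labels)):
            # e.g. clip.shape == (10, 3, 227, 227) -> (30, 227, 227)
            stacked = np.ascontiguousarray(clip, dtype=np.uint8)
            stacked = stacked.reshape(-1, clip.shape[2], clip.shape[3])
            datum = caffe_pb2.Datum()
            datum.channels, datum.height, datum.width = stacked.shape
            datum.data = stacked.tobytes()  # raw uint8 bytes, CHW order
            datum.label = int(label)
            txn.put('{:08d}'.format(i).encode('ascii'), datum.SerializeToString())
    env.close()
```

If I understand the paper correctly, its temporal stream stacks the optical-flow fields along the channel dimension, which would correspond to layout 1 above.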
How would the input shape influence the accuracy of the network?