I would like to reproduce the Two-Stream Convolutional Networks for Action Recognition in Videos paper.
But it feels like I have hit a wall when it comes to giving multi-frame input to Caffe.
The single-frame network gives 50% accuracy, but when I give an input of 30*227*227 via an LMDB (10 frames, each with 3 channels), the accuracy barely reaches 4%.
This leads me to believe that the input I'm giving to Caffe is not in the required format, or (less likely) that the model is wrong.
In all cases I assume the network architecture (arrangement and number of layers) and the learning parameters (LR/decay/regularization/etc.) are held constant.
For example, I could choose to give my input to the network as one of the following (see the sketch after this list):
1) batch_size x (no_of_imgs * no_of_channels) x height x width {3-dimensional input per sample}
2) batch_size x no_of_imgs x no_of_channels x height x width {4-dimensional input per sample}
3) batch_size x no_of_channels x no_of_imgs x height x width {4-dimensional input per sample}
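As far as I can tell, Caffe's Datum only stores a 3-dimensional array (channels x height x width), so layout 1 is the only one an LMDB can hold directly; layouts 2 and 3 would seem to need 5-dimensional blobs, e.g. via an HDF5Data layer or a custom data layer. For reference, here is a minimal sketch of how I build the LMDB for layout 1 (assuming pycaffe, python-lmdb, and numpy are available; the helper name write_stacked_clips and the fixed 10x3x227x227 clip shape are just for illustration):

```python
import lmdb
import numpy as np
from caffe.proto import caffe_pb2

def write_stacked_clips(db_path, clips, labels):
    """Write clips shaped (no_of_imgs, no_of_channels, H, W) as single
    Datums with channels = no_of_imgs * no_of_channels (layout 1)."""
    env = lmdb.open(db_path, map_size=1 << 40)  # generous map size; adjust as needed
    with env.begin(write=True) as txn:
        for i, (clip, label) in enumerate(zip(clips, labels)):
            # e.g. clip.shape == (10, 3, 227, 227) -> (30, 227, 227)
            stacked = np.ascontiguousarray(clip, dtype=np.uint8)
            stacked = stacked.reshape(-1, clip.shape[2], clip.shape[3])
            datum = caffe_pb2.Datum()
            datum.channels, datum.height, datum.width = stacked.shape
            datum.data = stacked.tobytes()  # raw uint8 bytes, CHW order
            datum.label = int(label)
            txn.put('{:08d}'.format(i).encode('ascii'), datum.SerializeToString())
    env.close()
```

If I understand the paper correctly, its temporal stream stacks the optical-flow fields along the channel dimension, which would correspond to layout 1 above.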
How would the input shape influence the accuracy of the network?