Understanding Faster-RCNN training input size

Hermann Hesse

Sep 29, 2016, 5:53:56 AM
to Caffe Users
Hi all,

As mentioned in their paper (Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks) in Section 3.3 "Implementation Details":

We train and test both region proposal and object detection networks on images of a single scale [1], [2]. We re-scale the images such that their shorter side is s = 600 pixels [2]. 
[...]. 
On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ∼10 pixels on a typical PASCAL image before resizing (∼500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.

The [2] citation is their previous paper (Fast R-CNN), which explains:

All single-scale experiments use s = 600 pixels; s may be less than 600 for some images as we cap the longest image side at 1000 pixels and maintain the image’s aspect ratio. These values were selected so that VGG16 fits in GPU memory during fine-tuning. The smaller models are not memory bound and can benefit from larger values of s; however, optimizing s for each model is not our main concern. We note that PASCAL images are 384 × 473 pixels on average and thus the single-scale setting typically upsamples images by a factor of 1.6. The average effective stride at the RoI pooling layer is thus ≈ 10 pixels.
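
If I understand the quoted rule correctly, the re-scaling boils down to something like this (my own rough sketch in Python, not their actual code; the function name and defaults are just for illustration):

def rescale_factor(height, width, target_short=600, max_long=1000):
    # scale so the shorter side becomes target_short pixels
    scale = float(target_short) / min(height, width)
    # but cap the longer side at max_long pixels, keeping the aspect ratio
    if scale * max(height, width) > max_long:
        scale = float(max_long) / max(height, width)
    return scale

print(rescale_factor(375, 500))  # -> 1.6 for a typical ~500x375 PASCAL image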


So it means that they train the network with non-square inputs. Is this possible in Caffe (the framework they use)?

And what do the 16 pixels and ∼10 pixels mean?

Thanks a lot!

Przemek D

Sep 29, 2016, 6:24:36 AM
to Caffe Users
Using non-square inputs is possible in Caffe. You just have twice as many numbers to watch out for, namely the correct blob sizes (Andrej Karpathy's notes for CS231n explain this very well; check the summary of the Convolutional Layer section, particularly the equations describing the conv output size). You can even use non-square convolution kernels: simply use kernel_h and kernel_w instead of kernel_size in your conv layer definition. The same applies to pooling layers and stride sizes; see the Caffe layer catalogue for details.
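
As a rough illustration (my own sketch in Python, not Caffe code), here is the output-size equation from those notes applied separately to height and width:

def conv_output_size(in_h, in_w, kernel_h, kernel_w,
                     stride_h=1, stride_w=1, pad_h=0, pad_w=0):
    # output size per dimension: (input - kernel + 2*pad) // stride + 1
    out_h = (in_h + 2 * pad_h - kernel_h) // stride_h + 1
    out_w = (in_w + 2 * pad_w - kernel_w) // stride_w + 1
    return out_h, out_w

# e.g. a non-square 600x1000 input through a 3x3 conv, pad 1, stride 1
print(conv_output_size(600, 1000, 3, 3, pad_h=1, pad_w=1))  # -> (600, 1000)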

By 16 and ∼10 pixels they refer to a stride, that is, by how many pixels you shift your kernel between two locations on the data (the links I posted above explain the idea better). 16 pixels is the stride on a re-scaled image whose shorter side was upscaled from 375 px to 600 px, i.e. by a factor of 1.6; before resizing this corresponds to 16/1.6 = 10 pixels.
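
The arithmetic above as a quick sketch (my own numbers, following the paper's description):

shorter_side_original = 375.0  # typical PASCAL image (~500x375)
shorter_side_rescaled = 600.0  # s = 600 from the paper
scale = shorter_side_rescaled / shorter_side_original  # 1.6
total_stride_rescaled = 16     # total stride at the last conv layer
print(total_stride_rescaled / scale)  # -> 10.0 px on the original image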

Hermann Hesse

Sep 29, 2016, 8:14:21 AM
to Caffe Users
Oh! I already knew about the course, but thanks for reminding me. That part is clear now.

In reference to the second question, the concept is clear to me. But how is it possible for a network of greater depth (VGG >> ZF) to present the same final stride? On the other hand, wouldn't VGGNet with its five pooling layers (stride 2) cause a final stride of 32 (input_size/2/2/2/2/2)? What am I missing?
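
Here is the arithmetic behind my question, as a quick sketch (assuming five 2x2 pooling layers with stride 2 and stride-1 convolutions everywhere else):

pool_strides = [2, 2, 2, 2, 2]  # five stride-2 pooling layers
total_stride = 1
for s in pool_strides:
    total_stride *= s
print(total_stride)  # -> 32, not the 16 stated in the paper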

I appreciate your response.

Abhijit Balaji

Sep 13, 2017, 11:43:29 AM
to Caffe Users
I think the number 16 is the stride length, and it is arbitrarily chosen for a (600 x 600 x 3) re-scaled image. Refer to Section 3.3 of the Faster R-CNN paper (end of the 1st paragraph).
On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ∼10 pixels on a typical PASCAL image before resizing (∼500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride
  
 
The last line says that "accuracy may be further improved with a smaller stride", so that means 16 is arbitrarily chosen (a hyperparameter). The other part is straightforward: if the stride is 16 for 600 pixels, then it is ~10 for 375 pixels.

This is my understanding from reading the paper.