Why is image aspect ratio always 1:1 (square)?


wolge...@gmail.com

Jan 23, 2017, 8:12:16 AM
to Caffe Users
Why is the image aspect ratio always 1:1 (square) when cameras are usually 16:9? And why always 256 x 256 or so for image classification or semantic segmentation?

For that 1:1 aspect ratio I have to crop all images. That doesn't make sense to me.

Why not 256 x 121 or 512 x 288?


Przemek D

Jan 23, 2017, 8:17:47 AM
to Caffe Users
But it's not always 256x256, not even "or so". Take a look at the PASCAL VOC segmentation dataset - there are all kinds of resolutions in there.

Why not 256x121? Well, why not? Watch Andrej Karpathy's lectures and think of an answer. Protip: they're on YouTube, just search for "CS231n lectures". 9 out of 10 newcomer questions are answered there before you even ask them :)

wolge...@gmail.com

Jan 25, 2017, 4:57:21 AM
to Caffe Users
Thanks Przemek,

I followed your advice and searched Google for "PASCAL VOC segmentation dataset". On two pages of hits I didn't find a single piece of information about what image sizes they used in their datasets.

Why not 256x121? Well, it might not be an efficient use of GPU memory, or it might be useless in some other sense. This is what questions are for.

I learned in school early: There are no stupid questions, only stupid answers.




wolge...@gmail.com

Jan 28, 2017, 1:41:05 PM
to Caffe Users
I started to study this topic (and more, of course). I can confirm the high value of the YouTube videos of the "CS231n lectures" - great job. Thanks to Przemek.

The 1:1 ratio and the size have good reasons.
1. For training you will use image augmentation in most cases. For that you rotate and flip the image - that way a square format works even if the original was rectangular. Otherwise you need to crop and you will lose data.
2. I was under the impression that smaller images are faster to compute. This is correct only for training; for classification the size and number of images barely matters - the response time is about the same.
3. But the size matters for the GPU memory you will need and have available. State of the art is 16 GB - maybe multiplied by 4 (4 cards on one board), so 64 GB. H x W x 3 x #images is roughly what you have to handle during training. Anything larger than 256x256 needs a good reason. (I just wonder how the number of hidden layers enters this calculation?)
I will continue my studies and write down here what I find.
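The rule of thumb in point 3 can be sketched numerically. A minimal sketch, with my own assumptions rather than anything from the thread: float32 values at 4 bytes each, and only the input batch counted - in practice each layer's activations consume additional memory on top, which is where the number of layers comes in.

```python
# Back-of-envelope memory estimate for the input batch alone, following the
# H x W x 3 x #images rule of thumb. Assumption (not from the thread):
# float32 storage, i.e. 4 bytes per value; weights and per-layer
# activations are NOT included, so real training needs considerably more.

def input_batch_bytes(h, w, batch, channels=3, bytes_per_value=4):
    """Bytes needed just to hold one batch of input images."""
    return h * w * channels * batch * bytes_per_value

# A batch of 256 images at 256x256x3 in float32:
gb = input_batch_bytes(256, 256, 256) / 2**30
print(f"256x256x3 float32, batch of 256: {gb:.2f} GiB")  # prints 0.19 GiB
```

Even a large batch of 256x256 inputs is small next to a 16 GB card; it is the per-layer activations kept around for backpropagation that dominate the budget.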





ath...@ualberta.ca

Jan 30, 2017, 7:41:15 PM
to Caffe Users
Standard CNN architectures expect a fixed-size input image - there is no requirement that it be square. For these networks one must either resize or crop the images.

For large datasets, where images can be of varying aspect ratios (landscape or portrait), picking a square image seems reasonable. In other words, if you are to resize or crop out an image, and there is no specific reason to pick any other aspect ratio, then square is the one you go with. But I stress that other aspect ratios might work as well (or better) depending on the data and task at hand.

Regarding resizing vs. cropping...

Generally, if one has information about some level of consistency within a dataset (say, images of eyeballs all at the same resolution), then it would make sense to crop images to the same size, since this data 'registration' can only help the network achieve its goal.

If, on the other hand, you have a diverse dataset with thousands of classes, images of arbitrary resolution, and objects found anywhere in each image, then it often makes sense to resize everything to the same size.

The reason is that in order to crop, one would have to know what size to crop, and without more information this question is difficult to answer. But say you find a good crop size somehow; the next question becomes: where in the image to crop? If you crop in the wrong place you could completely crop out the object you are trying to classify. And what to do if some images are smaller than the input size of the network - how would cropping work then? To get around all these issues, resizing is often chosen over cropping for such datasets.

Yes, resizing will distort the images, but they have generally been found to still work well, since the representations learned by the CNN operate in this distorted space. It just works.
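The resize-vs-crop trade-off above can be illustrated with a small sketch. This is my own illustration, not from the thread: a nearest-neighbour resize on a plain list-of-lists "image" stands in for a real image library, and a 288x512 (16:9) frame stands in for a camera image.

```python
# Two ways to get a fixed-size square input from a 16:9 frame:
# resize keeps the whole frame but distorts geometry; center crop keeps
# geometry but discards border pixels. (Hypothetical helper names.)

def resize(img, out_h, out_w):
    """Nearest-neighbour resize; distorts aspect ratio if shapes differ."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

def center_crop(img, out_h, out_w):
    """Take the central region; keeps aspect ratio, loses the borders."""
    in_h, in_w = len(img), len(img[0])
    top, left = (in_h - out_h) // 2, (in_w - out_w) // 2
    return [row[left:left + out_w] for row in img[top:top + out_h]]

# A 288x512 (16:9) dummy frame; each "pixel" records its coordinates.
frame = [[(r, c) for c in range(512)] for r in range(288)]

squished = resize(frame, 256, 256)      # whole frame kept, geometry distorted
cropped = center_crop(frame, 256, 256)  # geometry kept, half the width gone
```

Both outputs are 256x256, but `cropped` starts at pixel (16, 128) of the original - everything outside that window, possibly including the object of interest, is simply gone.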

See this paper that resizes arbitrary words to a non-square, fixed input size:

Reading Text in the Wild with Convolutional Neural Networks



wolge...@gmail.com

Feb 5, 2017, 2:52:55 AM
to Caffe Users

Thanks ath...@ualberta.ca. After all my studies I agree with that too. I will close this topic.

But please let me continue with another question related to dataset training, and I ask you to follow me there:

background color in object classification  https://groups.google.com/forum/#!topic/caffe-users/w-SrqcjLsVQ

