Using images that go together during training and classification

Caleb Belth

Jan 27, 2017, 2:18:10 PM
to Caffe Users
I have a dataset with groups of images that go together: for example, images of the same object rotated to various angles. Since they all correspond to the same data point, I want to train on them together. Likewise, when I classify new data, I want to classify over all the images from a single data point. Is there a way to do this in Caffe?

Patrick McNeil

Jan 30, 2017, 9:19:52 AM
to Caffe Users
There are a couple of ways I can think of doing this. The easiest is to ensure the images all have the same label and just train as normal. The rotations and different angles will help the network learn (and it will probably perform better overall) than if all of the images were from the same view and angle; in fact, one common form of data augmentation is to rotate or mirror the source images.
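
As a concrete illustration of the same-label approach (not from the original post; the file names and sizes are made up), here is a minimal pycaffe NetSpec sketch: every view of an object gets the same label in the ImageData list file, and mirroring is enabled as simple on-the-fly augmentation.

import caffe
from caffe import layers as L

n = caffe.NetSpec()
# train_views.txt lists one image per line as "<path> <label>";
# every rotated view of the same object is given the same label.
n.data, n.label = L.ImageData(
    source='train_views.txt',
    batch_size=32,
    transform_param=dict(mirror=True, crop_size=227),  # simple on-the-fly augmentation
    ntop=2)

with open('same_label_data.prototxt', 'w') as f:
    f.write(str(n.to_proto()))

The rest of the network then follows the data layer as usual; Caffe itself just treats the views as independent training examples that happen to share a label.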

The other method would be to use multiple input data layers. If all of your data points contain the same number of images (for the sake of discussion, let's say you have three images in each set), you could create an architecture where you input three different images into a single network. The implementation of an efficient architecture is a long discussion and involves a good understanding of different trade-offs (I am using this topic for my dissertation). For some guidance, you can see the work of Ngiam et al. (2011). They have a couple of examples of reference architectures (for two modalities) you can use as a baseline. My recommendation is to extract features from each input and then combine the extracted features for further processing. If the data is all images, the shallow architecture from the paper would also work well, provided the initial layer is a convolutional layer.
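
For the fixed-size case, a rough sketch (my own illustration, not from the paper; the file, blob, and parameter names are hypothetical) could look like the following: three ImageData inputs, one convolutional feature extractor applied to each view, weights shared across the branches via identical param names (sharing is an assumption here, not a requirement), and a Concat layer fusing the features before the classifier.

import caffe
from caffe import layers as L, params as P

def view_branch(n, bottom, idx):
    # Conv + pooling branch; reusing the same param names shares the weights
    # across all three branches (siamese-style).
    conv = L.Convolution(bottom, kernel_size=5, num_output=64,
                         param=[dict(name='conv1_w', lr_mult=1),
                                dict(name='conv1_b', lr_mult=2)],
                         weight_filler=dict(type='xavier'))
    setattr(n, 'conv1_v%d' % idx, conv)
    pool = L.Pooling(conv, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    setattr(n, 'pool1_v%d' % idx, pool)
    return pool

n = caffe.NetSpec()
# The three list files must keep the views of each data point at the same line
# index so the label taken from the first list matches all three inputs.
n.img1, n.label = L.ImageData(source='views_1.txt', batch_size=16, ntop=2)
n.img2, n.label2 = L.ImageData(source='views_2.txt', batch_size=16, ntop=2)
n.img3, n.label3 = L.ImageData(source='views_3.txt', batch_size=16, ntop=2)

feats = [view_branch(n, img, i) for i, img in enumerate([n.img1, n.img2, n.img3], 1)]

n.fused = L.Concat(*feats, axis=1)  # concatenate along the channel axis
n.fc = L.InnerProduct(n.fused, num_output=10, weight_filler=dict(type='xavier'))
n.loss = L.SoftmaxWithLoss(n.fc, n.label)

with open('multiview_train.prototxt', 'w') as f:
    f.write(str(n.to_proto()))

Whether the branches should actually share weights, and how deep each branch should be before the Concat, are exactly the kinds of trade-offs mentioned above.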

If your data points have a different number of images (some have three and others have five, for example), you could take the feature extraction module and run it programmatically on each input image (the same feature extraction module for every image), then create a method (a Concat layer followed by a convolution layer, for example) to combine the extracted features into a single network going forward. This would require some trial and error to get working correctly.
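
A minimal deploy-time sketch of that idea (the file and blob names are hypothetical, and it assumes an already-trained feature-extraction net): run the same network over however many views a data point has and pool the per-view feature vectors. Averaging is used here only as a simple stand-in for the Concat-plus-convolution fusion described above.

import numpy as np
import caffe

net = caffe.Net('feature_extractor.prototxt',
                'feature_extractor.caffemodel', caffe.TEST)

def datapoint_features(image_paths, blob_name='fc7'):
    # Run the same extractor over each view and pool into one descriptor.
    # Mean subtraction / scaling is omitted for brevity.
    feats = []
    for path in image_paths:
        img = caffe.io.load_image(path)                  # H x W x C, float in [0, 1]
        h, w = net.blobs['data'].data.shape[2:]
        chw = caffe.io.resize_image(img, (h, w)).transpose(2, 0, 1)
        net.blobs['data'].data[...] = chw                # broadcast over the batch dim
        net.forward()
        feats.append(net.blobs[blob_name].data[0].copy())
    return np.mean(feats, axis=0)

# Works whether a data point has three views or five.
descriptor = datapoint_features(['obj1_0deg.jpg', 'obj1_45deg.jpg', 'obj1_90deg.jpg'])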

Patrick

Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. Proceedings of the 28th International Conference on Machine Learning (ICML-11), 689-696.

