Understanding caffe.io.Transformer


Fabio Maria Carlucci

Apr 17, 2015, 6:25:57 AM
to caffe...@googlegroups.com
I'm following the Filter Visualization example and I'm trying to understand exactly what the Transformer does. 
Below each line of code I'll write my understanding of it as a comment.

# input preprocessing: 'data' is the name of the input blob == net.inputs[0]
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
Informing the transformer of the necessary input shape
 
transformer.set_transpose('data', (2,0,1))
Defining the order of the channels in input data (doesn't matter for grayscale images) 
 
transformer.set_channel_swap('data', (2,1,0))  # the reference model has channels in BGR order instead of RGB
Instructions on how to swap the channels (doesn't matter for grayscale images)  
 
transformer.set_mean('data', np.load(caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy').mean(1).mean(1)) # mean pixel
We set the mean to normalize the data - but do we use the mean of the images used to train the network, or the mean of the dataset I want to test?
transformer.set_raw_scale('data', 255)  # the reference model operates on images in [0,255] range instead of [0,1]
The code comment explains it  
 
net.blobs['data'].reshape(1,3,227,227) 
This I do not really understand - why set the shape explicitly? Do we do it only once, or for every image?
 
net.blobs['data'].data[...] = transformer.preprocess('data', caffe.io.load_image(caffe_root + 'examples/images/cat.jpg'))
Executing the transformation.

Those are the points I have the most doubts about.
What do you think? Any comments?
Thanks
 
 

Dixon Dick

Sep 7, 2015, 1:58:28 AM
to Caffe Users
This is a terrific, well-formed question on a topic that is not well documented or understood.

Fabio, thank you. 

It would be great if there were a response to this.

dcd

Talat Shaikh

Sep 7, 2015, 9:33:09 AM
to Caffe Users
The line

transformer.set_mean('data', np.load(caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy').mean(1).mean(1))

loads the precomputed mean of the images used for training the network (the ILSVRC 2012 mean image) and reduces it to one mean value per channel.

& the line

net.blobs['data'].reshape(1,3,227,227)

reshapes the data blob to the layout (batch size, channels, height, width). The batch size is the number of images (or other inputs) passed through the net at once, and it can be changed at runtime as needed.
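For example (a minimal sketch, reusing the net object from this thread):

# Reshape the input blob for a batch of 5 RGB 227x227 images; the net
# adapts to the new batch size on the next forward pass.
net.blobs['data'].reshape(5, 3, 227, 227)

# Reshape back down to a single image later if needed.
net.blobs['data'].reshape(1, 3, 227, 227)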

What I did not understand is:

net.blobs['data'].data[...]

What is data[...]?


thecro...@gmail.com

Sep 9, 2015, 8:59:10 AM
to Caffe Users
First of all, keep in mind that the Transformer is only required when using a deploy.prototxt-like network definition, i.e. one without a Data layer. When using a Data layer, preprocessing is specified in the prototxt and things are easier to understand.


On Friday, April 17, 2015 at 07:25:57 UTC-3, Fabio Maria Carlucci wrote:
# input preprocessing: 'data' is the name of the input blob == net.inputs[0]
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
Informing the transformer of the necessary input shape

The input shape is built using the four input_dim fields. By default, using CaffeNet, your net.blobs['data'].data.shape == (10, 3, 227, 227). This is because 10 crops (the four corners, the center, and their mirrored versions) are supposed to be extracted from a 256x256 image and passed through the net.
 
transformer.set_mean('data', np.load(caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy').mean(1).mean(1)) # mean pixel
We set the mean to normalize the data - but do we use the mean of the images used to train the network, or the mean of the dataset I want to test?

In theory, you should use the mean of the ILSVRC dataset, as the pretrained CaffeNet/GoogLeNet/VGG models were trained on those images. This corresponds to the ilsvrc_2012_mean.npy file or, even simpler, the per-channel array [104, 117, 123].
This is because you need to "respect" the standardization used during training. Nevertheless, the mean of any dataset composed of natural images should be close to [104, 117, 123].

Of course, if one trains a network from scratch on a dataset different from ILSVRC, she needs to use that dataset's mean.
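To make this concrete, here is a minimal sketch of both options (assuming the caffe_root and transformer variables from the original post):

import numpy as np

# Option 1: load the saved ILSVRC 2012 mean image (3 x 256 x 256) and
# average over height and width to get one mean value per channel.
mu = np.load(caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy').mean(1).mean(1)

# Option 2: use the commonly quoted per-channel BGR values directly.
# mu = np.array([104.0, 117.0, 123.0])

transformer.set_mean('data', mu)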

If you are finetuning an ILSVRC-pretrained model on your own data... this is an issue I'm interested in too.

net.blobs['data'].reshape(1,3,227,227) 
This I do not really understand - why set the shape explicitly? Do we do it only once, or for every image?

 This is simply because in the example there is no need to pass 10 images through the Net.
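So you can reshape once and then reuse the blob for every image, e.g. (a sketch; image_filenames is a hypothetical list of paths):

net.blobs['data'].reshape(1, 3, 227, 227)  # done once

for fname in image_filenames:  # hypothetical list of image paths
    img = caffe.io.load_image(fname)
    net.blobs['data'].data[...] = transformer.preprocess('data', img)
    out = net.forward()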

At the end of the answer, I realize that this was a really old question. The new example notebook is very clear and maybe this long answer could be just a link ;)
http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/00-classification.ipynb

Jinchao Lin

Dec 11, 2015, 7:46:07 PM
to Caffe Users
Hi everyone, 

This is a follow-up question on caffe.io.Transformer. I have read through the given example in the 00-classification IPython notebook but am still confused.

I am wondering how to use the following functions? 
transformer.set_transpose('data', (2,0,1))
transformer.set_raw_scale('data', 255)

In CaffeNet's train_val.prototxt, there is no transform_param.scale. In this case, why should we call set_raw_scale? Moreover, how was the magic value 255 decided?

And what about the set_transpose() function? I have no idea what this function does.

Thanks for any comments!!

Muneeb Shahid

Jan 1, 2016, 9:50:40 AM
to Caffe Users
  • set_transpose simply swaps your image dimensions. Normally when an image library loads an image, the dimensions of the loaded array are H x W x C (where H is height, W is width and C is the number of channels), but since Caffe expects input in C x H x W, we transpose the data. set_transpose('data', (2, 0, 1)) means: reorder the axes so that the new 0th dimension is the old 2nd (channels), the new 1st is the old 0th (height), and the new 2nd is the old 1st (width).
  • As for set_raw_scale, it has to do with how caffe.io.load_image loads data. caffe.io.load_image loads images in normalized form (values in [0, 1]), whereas the model used in the example was trained on raw image values in the [0, 255] range. Passing 255 tells the transformer to rescale the values back to [0, 255]; see the sketch after this list.
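A small NumPy sketch of both operations (the shapes are the CaffeNet ones from this thread):

import numpy as np

img = np.random.rand(227, 227, 3)  # H x W x C in [0, 1], like caffe.io.load_image returns

chw = img.transpose((2, 0, 1))     # what set_transpose('data', (2, 0, 1)) does
print(chw.shape)                   # (3, 227, 227), i.e. C x H x W

scaled = chw * 255.0               # what set_raw_scale('data', 255) does:
                                   # back to the [0, 255] range the model was trained on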
Hope it helps.

Katja

Jan 14, 2016, 5:47:49 AM
to Caffe Users
 
# input preprocessing: 'data' is the name of the input blob == net.inputs[0]
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
Informing the transformer of the necessary input shape

Can you confirm that this part is setting up the Transformer for the net-specific input dimensions?

And is transformer.preprocess() then executing the image resize?

net.blobs['data'].data[...] = transformer.preprocess('data', caffe.io.load_image(os.getcwd() + "/" + img_filename))


I am extracting GoogLeNet features and predicted classes from a directory of images of various sizes, not only 256x256. Although this runs through without problems, I am wondering whether I need to run $ convert to resize them first.

ada...@ucr.edu

Jan 18, 2016, 2:46:25 PM
to Caffe Users
Hi thecro...@gmail.com

Your post cleared up most of my doubts. Thanks for that. However, I was wondering: is there good documentation for the Transformer class? I'm running into some other points of confusion too, listed below.
1. What is the role of the delimiter : in the code - transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape}) ? Is it some sort of slicing delimiter?
2. I'm also not comfortable with transformer.set_transpose('data', (2,0,1)). I looked into the comments in the code on GitHub. It just says -
Set the input channel order for e.g. RGB to BGR conversion as needed for the reference ImageNet model.
What does (2,0,1) signify here? I can also see that a couple of lines later there is another channel-swapping call - transformer.set_channel_swap('data', (2,1,0)). Going to the description of set_channel_swap on GitHub, I found the description is the same as for set_transpose. Why does the channel swap need to be performed twice?

You can probably understand that I'm new to both python and caffe. Will be grateful if anybody can clear the confusions.

Abir


ada...@ucr.edu

Jan 18, 2016, 3:31:58 PM
to Caffe Users
I got some answers by debugging and Googling. I'm modifying the post below accordingly; the modified text is appended after each question.


On Monday, January 18, 2016 at 2:46:25 PM UTC-5, ada...@ucr.edu wrote:
Hi thecro...@gmail.com

Your post cleared up most of my doubts. Thanks for that. However, I was wondering: is there good documentation for the Transformer class? I'm running into some other points of confusion too, listed below.
1. What is the role of the delimiter : in the code - transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape}) ? Is it some sort of slicing delimiter? -- I figured out that the argument is a dictionary and : is used to separate the key from the value. Specifically, 'data' is the key and the tuple returned by net.blobs['data'].data.shape (whose value for this particular example is (10, 3, 227, 227)) is the value. So basically the 'inputs' field of the transformer object is initialized with the key 'data' and the value (10, 3, 227, 227).
2. I'm also not comfortable with transformer.set_transpose('data', (2,0,1)). I looked into the comments in the code on GitHub. It just says -
Set the input channel order for e.g. RGB to BGR conversion as needed for the reference ImageNet model.
What does (2,0,1) signify here? -- This code snippet also sets a field inside the transformer object with a dictionary. The field is 'transpose' and the key-value pair is 'data', (2,0,1). Will see what these fields do and update the post accordingly.

ada...@ucr.edu

Jan 18, 2016, 8:28:42 PM
to Caffe Users
Hi All,

This is just to let you know that things are clearer now. Exploring the file 'python/caffe/io.py' helped a lot. I'm editing the content below based on my understanding, appended after each question. Hope it will help someone who is a newbie like me.

Many thanks!


On Monday, January 18, 2016 at 3:31:58 PM UTC-5, ada...@ucr.edu wrote:
2. I'm also not comfortable with transformer.set_transpose('data', (2,0,1)). I looked into the comments in the code on GitHub. It just says -
Set the input channel order for e.g. RGB to BGR conversion as needed for the reference ImageNet model.
What does (2,0,1) signify here? -- This code snippet also sets a field inside the transformer object with a dictionary. The field is 'transpose' and the key-value pair is 'data', (2,0,1). Will see what these fields do and update the post accordingly. -- Say the image is an array of size H x W x K; then the transposing operation (which is applied in the line "net.blobs['data'].data[...] = transformer.preprocess('data', caffe.io.load_image(caffe_root + 'examples/images/cat.jpg'))") generates a K x H x W array by just swapping the array axes/dimensions.
I can also see that a couple of lines later there is another channel-swapping call - transformer.set_channel_swap('data', (2,1,0)). Going to the description of set_channel_swap on GitHub, I found the description is the same as for set_transpose. Why does the channel swap need to be performed twice? -- This is used in the transformer.preprocess method to convert the channels from RGB to BGR.
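To see the difference between the two operations, a minimal NumPy sketch:

import numpy as np

img = np.random.rand(227, 227, 3)  # H x W x C, channels in RGB order

# set_transpose('data', (2, 0, 1)): move the channel axis to the front.
chw = img.transpose((2, 0, 1))     # shape (3, 227, 227), channels still RGB

# set_channel_swap('data', (2, 1, 0)): reorder values along the channel axis.
bgr = chw[[2, 1, 0], :, :]         # shape unchanged, channels now BGR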

Corey Nolet

Jun 28, 2016, 12:34:28 PM
to Caffe Users
I have a completely different dataset (greyscale images) and I'm implementing an end-to-end system from Max Jaderberg's text spotting dissertation. Max mentions in his dissertation that he not only normalizes by subtracting the mean but also scales to unit variance (by dividing the mean-subtracted values by the standard deviation).

My question is related to this thread. In the dissertation, it seems like Max is not using the dataset mean/stddev but just the mean/stddev of each image in isolation. What's the net effect of this vs. using the dataset stats? It seems to me like the images in the dataset used to train the classifier are going to have a different average mean than an image used for classification, right? I know you mentioned most natural-scene images should have averages close to specific values, but wouldn't that depend heavily on the camera used, gamma, alpha, etc.? It seems to me like the 'average' pixel intensity across any set of randomly selected images should be 128, right?
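For reference, here is a minimal sketch of the two schemes I'm comparing (the function names are mine, not from any library):

import numpy as np

def per_image_standardize(img):
    # Each image is normalized by its own statistics, so differences in
    # camera, gamma, exposure etc. are largely factored out.
    return (img - img.mean()) / (img.std() + 1e-8)

def dataset_normalize(img, dataset_mean, dataset_std):
    # Every image is shifted/scaled by the same dataset-wide statistics,
    # so absolute brightness differences between images are preserved.
    return (img - dataset_mean) / dataset_std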

Anyways, I'm not challenging your response, I'm just trying to get my own insight so that I can mean normalize my dataset to increase accuracy in my own application.

Thanks in advance!