Training a sliding window detector


JohannesB

Nov 26, 2014, 4:47:32 PM
to caffe...@googlegroups.com
Hello,

I am trying to train the LeNet network for sliding-window detection. I have already converted the network into a fully convolutional one using the knowledge from the "net surgery" article on the Caffe homepage.

Is there any elegant way to provide ground truth images to the network in just one LevelDB database? Or do I have to provide two databases -- one for the actual images and one for ground truth images? (Hopefully the database access is kept in sync by Caffe.) The images and ground truth images differ in size.

Best regards,
Johannes

Evan Shelhamer

Nov 26, 2014, 4:57:48 PM
to JohannesB, caffe...@googlegroups.com
With the current data pipeline you have to define two data layers: one for the actual input image and one for the ground truth "image" that defines the output labels. This means you have to generate two input DBs with the inputs and corresponding truths in the same order. Exactly how to do this is on our list for documentation and examples, but hasn't quite materialized yet. However, once the fully convolutional model and the data DBs are defined, training is a breeze, since many Caffe losses are happy to take vector / matrix predictions and ground truths.
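For concreteness, the paired data layers in the train prototxt might look like the following sketch. The DB paths, layer names, and batch size are placeholders, and the 2014-era V1 layer syntax is assumed:

```
layers {
  name: "data"
  type: DATA
  top: "data"
  data_param { source: "image-lmdb" backend: LMDB batch_size: 1 }
}
layers {
  name: "label"
  type: DATA
  top: "label"
  data_param { source: "label-lmdb" backend: LMDB batch_size: 1 }
}
```

Both layers advance through their DBs in lockstep, so as long as the two DBs were written in the same order the data and labels stay aligned.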

To help you along, check out this code sample for generating an LMDB in Python with custom data:

import caffe
import lmdb

# `inputs` is assumed to be your list of image file paths
in_db = lmdb.open('image-lmdb', map_size=int(1e12))
with in_db.begin(write=True) as in_txn:
    for in_idx, in_ in enumerate(inputs):
        # load as H x W x C in [0, 1], then reorder to Caffe's C x H x W
        im = caffe.io.load_image(in_)
        im_dat = caffe.io.array_to_datum(im.transpose((2, 0, 1)))
        # zero-pad the key so LMDB's lexicographic order matches insertion order
        in_txn.put('{:0>10d}'.format(in_idx), im_dat.SerializeToString())
in_db.close()

While this code makes an image DB, you can likewise make the ground truth DB by forming the array of window labels and calling `caffe.io.array_to_datum`. Note that the indices are zero-padded to preserve their order: LMDB sorts keys lexicographically, so bare integers as strings come back disordered.
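The ordering point is easy to check in plain Python, independent of Caffe: lexicographic sorting scrambles bare integer strings but preserves zero-padded ones.

```python
# LMDB iterates keys in lexicographic (byte) order, so bare integer
# strings come back out of order while zero-padded ones stay sorted.
bare = sorted(str(i) for i in range(12))
padded = sorted('{:0>10d}'.format(i) for i in range(12))

print(bare[:4])  # ['0', '1', '10', '11'] -- 10 and 11 jump ahead of 2
print([int(k) for k in padded] == list(range(12)))  # True
```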

If you are able to get this working on your own, contributing back a LeNet detector example done in the fully convolutional way would be a great help. Good luck!

Evan Shelhamer

--
You received this message because you are subscribed to the Google Groups "Caffe Users" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/caffe-users/3a317b13-805d-49d6-b115-d11e16fea2c5%40googlegroups.com.

JohannesB

Nov 26, 2014, 5:41:09 PM
to caffe...@googlegroups.com, shel...@eecs.berkeley.edu
Thanks for the very fast and comprehensive response! I will try to create two parallel databases with the images and their ground truth data kept in sync.

Do you have any experience with that kind of ground truth image data? My idea was to create a 1 x N x H x W blob for each image, where N is the number of classes and H, W are the height and width of the network output map. In that blob, which is initialized with zeros, I would only set positions to "1" where I expect the network to fire. (My bounding box's top-left corner would thus be translated into a "1" pixel in the ground truth image.) Probably it would be better to place a Gaussian distribution at the expected positions rather than a single peak.
Btw: Is there any method in the Caffe framework which translates from input coordinates to output coordinates, considering all the shifts and scalings by the layers automatically?

Johannes

Evan Shelhamer

Nov 27, 2014, 10:30:03 AM
to JohannesB, caffe...@googlegroups.com
The loss determines how to define the ground truth map. For instance, for the softmax loss (SOFTMAX_LOSS) the ground truth blob should be 1 x 1 x H x W int where each value is the class index. It's analogous to classifier training where the output is a vector and the ground truth is a scalar class index -- here the output is a K x H x W map where K is the number of classes so the ground truth is 1 x H x W. This works for sliding window detection, scene parsing / semantic segmentation, and so on.
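As a concrete sketch of that layout (a NumPy stand-in; the class count, map size, and detection locations below are made up for illustration):

```python
import numpy as np

K = 3        # number of classes, with 0 as background (assumption)
H, W = 8, 8  # spatial size of the network's output map

# softmax ground truth for one image: 1 x 1 x H x W of integer class indices
gt = np.zeros((1, 1, H, W), dtype=np.uint8)
gt[0, 0, 2, 3] = 1  # a class-1 detection whose box corner maps to (2, 3)
gt[0, 0, 5, 6] = 2  # a class-2 detection at (5, 6)

# the matching network output would be 1 x K x H x W of class scores
scores_shape = (1, K, H, W)
```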

Probably it would be better to place a Gaussian distribution at the expected positions rather than a single peak. 

You could do this with the sigmoid cross-entropy loss (SIGMOID_CROSS_ENTROPY_LOSS), which regresses to probabilities. In this case the ground truth blob has the same dimensions as the network output, 1 x K x H x W for each instance, where each cell is the probability of a detection of class k at location (h, w).
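A soft target of that kind could be built like this rough NumPy sketch (the sigma and peak location are arbitrary choices, not anything Caffe prescribes):

```python
import numpy as np

H, W = 8, 8
sigma = 1.0
cy, cx = 2, 3  # expected detection location on the output map

# unnormalized 2-D Gaussian bump: 1.0 at the detection, decaying around it,
# so every cell stays a valid probability in [0, 1]
ys, xs = np.mgrid[0:H, 0:W]
target = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
```

For multiple detections of the same class, an elementwise maximum over the per-detection bumps keeps every cell in [0, 1].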

Is there any method in the Caffe framework which translates from input coordinates to output coordinates, considering all the shifts and scalings by the layers automatically?

No, but that is indeed a useful utility function that should likely be rolled into util or the Python / MATLAB interfaces. It isn't such tricky indexing once you're used to it, but a general indexing method to map coordinates layer-to-layer is helpful.
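Until such a utility exists, the bookkeeping can be done by hand: a layer with kernel k, stride s, and pad p centers output unit i at input coordinate i*s + (k-1)/2 - p, and layers compose by applying this from last to first. A hypothetical helper (not part of Caffe; the layer list below is a LeNet-style guess):

```python
def output_to_input(coord, layers):
    """Map an output-map coordinate to the input-pixel coordinate of its
    receptive-field center. `layers` is a list of (kernel, stride, pad)
    tuples ordered from input to output; walk them in reverse."""
    for k, s, p in reversed(layers):
        coord = coord * s + (k - 1) / 2.0 - p
    return coord

# conv5x5/s1, pool2x2/s2, conv5x5/s1, pool2x2/s2
lenet = [(5, 1, 0), (2, 2, 0), (5, 1, 0), (2, 2, 0)]
print(output_to_input(0, lenet))  # 7.5: center of output unit 0 in input pixels
print(output_to_input(1, lenet) - output_to_input(0, lenet))  # 4.0: total stride
```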

Evan Shelhamer
