Hi,
I'm currently working on a network combining RGB and depth images. The depth data is pre-processed to fill three channels, which makes it possible to fine-tune a pre-trained model.
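(For context, filling the three channels can be as simple as replicating the single depth channel. In my case this happens offline during pre-processing, but expressed in-net it would look roughly like the sketch below; the blob name 'depth' is just a placeholder.)

layer {
  name: 'depth_tile'
  type: 'Tile'
  bottom: 'depth'     # assumed single-channel depth input blob
  top: 'depth_3ch'    # replicated to 3 channels so pre-trained RGB weights apply
  tile_param {
    axis: 1    # replicate along the channel axis
    tiles: 3
  }
}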
The network starts with separate convolutional layers for each modality. At some point, I need to fuse the two feature maps without changing the dimensions. My current approach is to simply concatenate the feature maps along the channel axis and apply another convolutional layer to reduce the number of channels back to the original count:
# ... some layers resulting in conv5_1 and conv5_2 (each 256x39x64)
layer {
  name: 'conv5_concat'
  type: 'Concat'
  bottom: 'conv5_1'
  bottom: 'conv5_2'
  top: 'conv5_concat'
  concat_param {
    axis: 1
  }
}
# conv5_concat is now 512x39x64
layer {
  name: 'conv5_fusion'
  type: 'Convolution'
  bottom: 'conv5_concat'
  top: 'conv5_fusion'
  convolution_param {
    num_output: 256
    kernel_size: 3
    pad: 1
    stride: 1
  }
}
# conv5_fusion is again 256x39x64
While this basically works, I think it would be beneficial to apply some kind of normalization to the concatenated features. What's the best way to achieve this? My first guess would be an LRN layer with norm_region: ACROSS_CHANNELS, but that only normalizes nearby channels, whereas the corresponding channels from the two modalities end up far apart after concatenation. Is there another way to do this?
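For reference, that first guess would look something like this (just a sketch; local_size, alpha, and beta are untuned placeholder values):

layer {
  name: 'conv5_concat_lrn'
  type: 'LRN'
  bottom: 'conv5_concat'
  top: 'conv5_concat_lrn'
  lrn_param {
    norm_region: ACROSS_CHANNELS
    local_size: 5    # placeholder; the window only spans 5 neighboring channels
    alpha: 0.0001
    beta: 0.75
  }
}

The limitation is visible right in the parameters: with local_size: 5, channel i of conv5_1 is never normalized together with channel i of conv5_2, which sits 256 channels away in conv5_concat.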