Sabako,
I am doing something similar in my research.
When you have multiple inputs into a single network (three in your case), you need to make sure the images are in the same order in each LMDB. For example, if you have three images of an object (say front, side, and top views), they need to be stored at the same entry number in each database. Then you can use the label from just one of the input sources (since they should all have the same label) to calculate the error through the network. Otherwise you end up having to pass all three labels through to the end of the network (I have not tried that method, but it may work).
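Here is a minimal Python sketch of that bookkeeping (the function names are just illustrative, and plain dicts stand in for the three LMDBs; real code would write serialized Datum protos through the lmdb package, but the key convention is the important part):

```python
# Sketch: keep the three LMDBs aligned by writing each view of sample i
# under the same zero-padded key in its respective database.

def make_key(index):
    # LMDB iterates keys in lexicographic order, so zero-pad the index
    # to keep entry order identical across all three databases.
    return "{:08d}".format(index).encode("ascii")

def write_aligned(samples):
    """samples: list of (front, side, top, label) tuples.
    Returns three dicts standing in for the three LMDBs."""
    db_front, db_side, db_top = {}, {}, {}
    for i, (front, side, top, label) in enumerate(samples):
        key = make_key(i)
        # In real code each value would be a serialized caffe Datum
        # holding the image plus the (shared) label.
        db_front[key] = (front, label)
        db_side[key] = (side, label)
        db_top[key] = (top, label)
    return db_front, db_side, db_top
```

Because entry i carries the same key and the same label in all three databases, the single "label" top from the first Data layer is enough to train the whole network.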
Here is the basic method I have used:
layer {
  name: "MOD1_data"
  type: "Data"
  top: "MOD1_data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mean_file: "/data/models/image_mean.binaryproto"
  }
  data_param {
    source: "/ssd2/final/output/dataset/MOD3/training/MOD3-clean-Array1"
    batch_size: 4
    backend: LMDB
  }
}
<Rest of the network for the first input>
layer {
  name: "MOD2_data"
  type: "Data"
  top: "MOD2_data"
  include {
    phase: TRAIN
  }
  transform_param {
    mean_file: "/data/models/image_mean.binaryproto"
  }
  data_param {
    source: "/ssd2/final/output/dataset/MOD3/training/MOD3-clean-Array2"
    batch_size: 4
    backend: LMDB
  }
}
<Rest of the network for the second input>
layer {
  name: "MOD3_data"
  type: "Data"
  top: "MOD3_data"
  include {
    phase: TRAIN
  }
  transform_param {
    mean_file: "/data/models/image_mean.binaryproto"
  }
  data_param {
    source: "/ssd2/final/output/dataset/MOD3/training/MOD3-clean-Array3"
    batch_size: 4
    backend: LMDB
  }
}
<Rest of the network for the third input>
layer {
  name: "FINAL_CONCAT"
  type: "Concat"
  bottom: "MOD1_output"
  bottom: "MOD2_output"
  bottom: "MOD3_output"
  top: "FINAL_CONCAT"
}
<Put your loss calculations here>
I would recommend looking at your network architecture and considering whether concatenation at the output is really the best choice. From an efficiency standpoint, you want to extract the relevant features from each of your inputs and then use those extracted features to determine the final output. If you take a standard model (GoogLeNet, AlexNet, etc.) and just concatenate the final outputs, you may not capture the correlations between your inputs. Look at where the relevant features are extracted, combine them at that earlier point, and continue your architecture from there.
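As a sketch of that early-fusion idea (the layer names here are hypothetical, not from my network): concatenate the per-view feature maps after the convolutional stages of each branch, then run shared fully connected layers on the fused features.

```
layer {
  name: "FEATURE_CONCAT"
  type: "Concat"
  bottom: "MOD1_conv_features"   # e.g. the pool5 output of the first branch
  bottom: "MOD2_conv_features"
  bottom: "MOD3_conv_features"
  top: "FEATURE_CONCAT"
  concat_param {
    axis: 1   # concatenate along the channel axis
  }
}
# <Shared fully connected layers and the loss go here>
```

The layers after FEATURE_CONCAT then learn weights over all three views jointly, which is where the cross-view correlation comes from.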
Here is a good reference of what I am trying to describe:
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. Proceedings of the 28th International Conference on Machine Learning (ICML-11), 689-696.
Keeping everything else the same, I got almost 10% better recognition performance by combining the extracted features instead of combining the final outputs.
Patrick