You need data (images & labels).
The label depends on the task, i.e. on whatever output you want. If you want a yes/no output, your label is yes/no; if you want a segmentation map, your label is a segmentation map, which I guess is the case for FCN.
To train the network you need a data layer that can read your data (images & labels). You can store your data any way you want, as long as there's a suitable data layer to read it, or you write one yourself.
For example, say your input is a 200x200 image and you want to classify each pixel as sea/non-sea. Your label is then a 200x200 segmentation map array. You can put this array into, say, a text file, and then write a data layer that reads it.
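To make the text file idea concrete, here is a rough sketch in plain Python. The file layout (one image row per line, values separated by spaces, 1 = sea and 0 = non-sea) and the names write_label_map, read_label_map and label_0001.txt are my own assumptions for illustration, not anything caffe prescribes. A custom data layer would need a reader along the lines of read_label_map.

```python
import os
import tempfile

HEIGHT, WIDTH = 200, 200  # image size from the example above


def write_label_map(path, label_map):
    """Write a 2D list of integer labels, one image row per text line."""
    with open(path, "w") as f:
        for row in label_map:
            f.write(" ".join(str(v) for v in row) + "\n")


def read_label_map(path):
    """Read the text file back into a 2D list of ints."""
    with open(path) as f:
        return [[int(v) for v in line.split()] for line in f if line.strip()]


# Made-up label map: top half of the image is sea (1), bottom half is non-sea (0).
label_map = [[1 if y < HEIGHT // 2 else 0 for _ in range(WIDTH)]
             for y in range(HEIGHT)]

# Hypothetical file name; written to the temp directory so the sketch runs anywhere.
path = os.path.join(tempfile.gettempdir(), "label_0001.txt")
write_label_map(path, label_map)
restored = read_label_map(path)
```

A text file is slow and bulky for large datasets, but it is easy to inspect by eye, which helps while you are still debugging the layer that reads it.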
Caffe comes with some data layers, and if you use them you have to follow their input format. For example, if you want to use the lmdb data layer, you have to learn lmdb first, then write a program to convert your text file data to lmdb. Most likely you'll have to look at the caffe code to see how caffe accesses lmdb, and organize your data accordingly.
I've never used the lmdb layer, so I don't know the specifics. If you are not writing your own layer, I suggest using the image data layer.
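As far as I know, the image data layer reads a plain text list with one image path and one integer label per line. That fits whole-image labels like yes/no, but not a per-pixel map. Below is a rough sketch of generating such a list; the image paths, the name write_image_list and the file name train_list.txt are made up for illustration.

```python
import os
import tempfile


def write_image_list(list_path, samples):
    """Write one '<image path> <integer label>' pair per line."""
    with open(list_path, "w") as f:
        for image_path, label in samples:
            f.write("%s %d\n" % (image_path, label))


# Hypothetical samples: 1 = yes (sea), 0 = no (non-sea).
samples = [
    ("images/beach_001.jpg", 1),
    ("images/city_001.jpg", 0),
]

# Hypothetical list file; written to the temp directory so the sketch runs anywhere.
list_path = os.path.join(tempfile.gettempdir(), "train_list.txt")
write_image_list(list_path, samples)

with open(list_path) as f:
    lines = f.read().splitlines()
```

You would then point the layer's source setting at this list file in your network definition.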
PS: it seems that this FCN is not like a typical caffe task, so if you want to train it you'll have to do a lot of coding & tinkering.