Caffe stores all blob data internally as a flat, contiguous array. Blobs typically have dimensions N x C x H x W, laid out in row-major order so that the last axis varies fastest: as you keep incrementing your offset by 1, you travel along W first, then H, then C, and finally N (i.e. to get to the same pixel in the next channel, you need to add H*W to the offset).
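To make the offset arithmetic concrete, here is a minimal sketch in plain Python. The helper name `nchw_offset` and the blob shape are made up for illustration; this is not Caffe's API, just the row-major index formula for an N x C x H x W array.

```python
def nchw_offset(n, c, h, w, shape):
    """Flat index of element (n, c, h, w) in a row-major N x C x H x W array."""
    N, C, H, W = shape
    return ((n * C + c) * H + h) * W + w

shape = (2, 3, 4, 5)  # hypothetical blob: N=2, C=3, H=4, W=5
base = nchw_offset(0, 0, 1, 2, shape)

# How far the flat offset moves when one index is incremented by 1
step_w = nchw_offset(0, 0, 1, 3, shape) - base  # 1
step_h = nchw_offset(0, 0, 2, 2, shape) - base  # W
step_c = nchw_offset(0, 1, 1, 2, shape) - base  # H * W
step_n = nchw_offset(1, 0, 1, 2, shape) - base  # C * H * W

print(step_w, step_h, step_c, step_n)  # 1 5 20 60
```

So with H=4 and W=5, the same pixel in the next channel sits 20 elements further along, which matches the H*W rule above.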
It's hard to give a precise answer to your question since we don't know what your network really looks like (why flatten the image in an FCN?), but I hope this gives you some direction at least.