Do you know what S is? Is this space in any way interpretable? Can you provide a label s in space S for each pair (x,y) belonging to X and Y? My point is that S is meaningless to work with - as an output it provides no information (so why would you want to obtain it), and it's not possible to come up with a loss function relating (X,Y) and S.
I have a feeling that you want to do two things here. The first is to regress Y from X - this should be easily solved by a network with a structure similar to:
X->(conv)->(fc)->Y
Notice that this network does learn correspondences between elements of Y - why shouldn't it? After all, you're backpropagating through a dense layer, which transforms the entire vector, so every element of Y is computed from (and sends gradients back to) the same shared features.
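To make the point about the dense layer concrete, here is a minimal pure-Python sketch of the X->(conv)->(fc)->Y pipeline. It is not Caffe code: the 1-D convolution, the layer sizes, the random weights and the names (`conv1d`, `fc`, `grad_wrt_features`) are all assumptions chosen for illustration. It only shows that the gradient reaching the shared features mixes the errors of every element of Y.

```python
import random

def conv1d(x, kernel):
    """Valid 1-D cross-correlation followed by ReLU (toy stand-in for a conv layer)."""
    k = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(k)))
            for i in range(len(x) - k + 1)]

def fc(features, weights, bias):
    """Dense layer: every output is a weighted sum of ALL features."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def mse_loss(pred, target):
    """Euclidean loss averaged over the output elements."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / (2 * len(pred))

def grad_wrt_features(pred, target, weights):
    """dLoss/dfeatures: each entry sums the errors of all outputs through the fc weights."""
    n = len(pred)
    errors = [(p - t) / n for p, t in zip(pred, target)]
    return [sum(e * row[j] for e, row in zip(errors, weights))
            for j in range(len(weights[0]))]

random.seed(0)
x = [random.random() for _ in range(8)]    # hypothetical flattened "image"
kernel = [0.5, -0.25, 0.75]                # hypothetical conv filter
features = conv1d(x, kernel)               # 6 shared features
weights = [[random.uniform(-1, 1) for _ in features] for _ in range(3)]
bias = [0.0, 0.0, 0.0]
y_true = [0.2, 0.5, 0.9]                   # hypothetical 3-element target vector Y

y_pred = fc(features, weights, bias)
loss = mse_loss(y_pred, y_true)
grad = grad_wrt_features(y_pred, y_true, weights)
```

Each entry of `grad` is a sum over all three output errors, so the features below the dense layer are shaped by all elements of Y at once.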
The other thing you seem to be doing is similarity learning. This sounds like a variation on siamese networks, where you input two images and expect the net to tell you, for example, whether they show the same person. Only in your case you would show a plant image and a climate vector and ask whether they "match". Networks like that can be trained using ContrastiveLoss (also see this example on image-image nets). Your image X and vector Y become input data, but you still need to provide an appropriate label - that is, a similarity value for each (X,Y) pair. In the Caffe implementation it is a scalar that is 1 for a matching pair and 0 for a non-matching one.
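For reference, the usual contrastive loss can be sketched in a few lines of plain Python. This is the standard textbook formulation, not a copy of Caffe's layer, which may differ in details. It assumes the image and the climate vector have already been mapped by their own branches into embeddings of equal length; the function name, the `margin` default and the sample values are hypothetical.

```python
import math

def contrastive_loss(pairs, labels, margin=1.0):
    """Average contrastive loss over (embedding_a, embedding_b) pairs.

    labels[i] is 1 for a matching pair and 0 for a non-matching one.
    """
    total = 0.0
    for (a, b), y in zip(pairs, labels):
        # Euclidean distance between the two embeddings
        d = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        if y == 1:
            total += d ** 2                       # pull matching pairs together
        else:
            total += max(margin - d, 0.0) ** 2    # push non-matching pairs past the margin
    return total / (2 * len(pairs))

# Hypothetical embeddings: (image branch output, climate-vector branch output)
pairs = [([0.0, 0.0], [0.3, 0.4]),   # distance 0.5, labelled as matching
         ([0.0, 0.0], [0.3, 0.4])]   # same distance, labelled as non-matching
labels = [1, 0]
loss_value = contrastive_loss(pairs, labels, margin=1.0)
```

Matching pairs are penalised for being far apart, while non-matching pairs are penalised only when they are closer than `margin`, which is why the label for each (X,Y) pair is required.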
I don't know whether those two tasks can be trained jointly, but in my opinion neither of them requires you to use the space S directly.