Hi Peter,
You are correct: for prediction it uses the learned mean and variance, as long as use_global_stats is set to true (or if you leave it unset, caffe will automatically set it to true in the TEST phase).
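In case it helps, here is a minimal pycaffe sketch of how I pull out the mean and variance that the TEST phase ends up using (the layer name 'bn1' and the file paths are just placeholders for your own model). Caffe keeps running sums in the first two parameter blobs and the accumulated scale factor in the third, so you divide by that scale factor to recover the actual statistics:

```python
import caffe

# Placeholder paths and layer name -- substitute your own model files.
net = caffe.Net('deploy.prototxt', 'model.caffemodel', caffe.TEST)

bn = net.params['bn1']           # BatchNorm layer parameter blobs
scale = bn[2].data[0]            # accumulated scale factor (third blob)

# Statistics actually used when use_global_stats is true.
global_mean = bn[0].data / scale
global_var = bn[1].data / scale

print(global_mean[:5])
print(global_var[:5])
```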
I've been using batchnorm in caffe successfully with U-Net and other architectures for segmentation and classification. But on a new problem I'm working on, I've run into the same issue as you: training and test performance differ, even on the same data. In my case, though, it's a classification problem.
The only thing I can think of is that the learned mean and variance are somehow not representative of my dataset. When I look at the blobs before and after BN while training and testing on the same data, it is only after BN that they diverge. So the per-batch normalisation in training must be doing something quite different from the normalisation using the "global" mean and variance. I do notice that this divergence tends to lessen with more training iterations, but I need more observations to be sure. That is the only idea I have so far, but if you come up with anything else, I would love to hear it.
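For reference, this is roughly how I've been comparing the two normalisations on the same data (again, 'conv1', 'bn1' and the paths are placeholders for my own names): the TRAIN-phase BN normalises with the statistics of the current batch, while the TEST-phase BN uses the stored running sums divided by the scale factor, so you can compute both from one forward pass and see how far apart they are:

```python
import caffe
import numpy as np

# Placeholder paths and layer/blob names -- substitute your own.
net = caffe.Net('train_val.prototxt', 'model.caffemodel', caffe.TEST)
net.forward()

x = net.blobs['conv1'].data          # input blob feeding the BatchNorm layer
batch_mean = x.mean(axis=(0, 2, 3))  # per-channel statistics of this batch
batch_var = x.var(axis=(0, 2, 3))    # (what TRAIN-phase BN would use)

bn = net.params['bn1']
scale = bn[2].data[0]
global_mean = bn[0].data / scale     # what TEST-phase BN uses
global_var = bn[1].data / scale

print('max mean gap:', np.abs(batch_mean - global_mean).max())
print('max var gap: ', np.abs(batch_var - global_var).max())
```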
Finally, I also notice that the third parameter blob in the BN layer is always 999.98236084. I find that strange, since I believe that blob is supposed to collect a scaling factor that approximates the iteration count, weighted appropriately by the moving_average_fraction. Why it always tends to be that number for every model and dataset I use is a giant mystery to me. So I must be missing something there.
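For what it's worth, here is the recurrence as I understand it, assuming the default moving_average_fraction of 0.999: on each TRAIN-phase forward pass the scale blob is multiplied by the fraction and incremented by 1, so it climbs geometrically towards 1 / (1 - 0.999) = 1000 (the blob itself is stored in single precision, which may matter once the per-step increment gets tiny):

```python
# Sketch of the scale-factor accumulation as I understand it,
# assuming the default moving_average_fraction = 0.999.
moving_average_fraction = 0.999

s = 0.0
for it in range(20000):              # hypothetical number of training iterations
    s = s * moving_average_fraction + 1.0

print(s)  # approaches 1 / (1 - 0.999) = 1000 for large iteration counts
```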