Hi, George.
Recently I've encountered exactly the same strange behavior.
I was working on a simple regression task with a CNN. After I added several BatchNorm layers, training speed and the training loss improved drastically, but the catch was that inference results on the very dataset I had trained on looked almost random rather than like the output of a trained network (the loss was much higher).

According to the original paper by Ioffe & Szegedy, the TRAIN and TEST stages of the algorithm differ: at TRAIN time the mean and variance are computed independently for each batch, while at TEST time their accumulated unbiased averages are used instead (controlled by the use_global_stats parameter) - correct me if I'm wrong. So I kept only three images in the dataset and set batch_size: 3 accordingly, so that all images are processed in a single batch. I expected the TRAIN and TEST mean and variance to be identical and to produce the same result on the same dataset, but that is not the case.

If I print the test loss on the same dataset while training (as George does, according to his solver.prototxt), I see that the TEST loss does decrease along with the TRAIN loss, but it stays orders of magnitude higher (TRAIN loss around 10^-7, TEST loss around 0.1, for example) and its convergence slope is much shallower.
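To illustrate why the two phases can still disagree even when the whole dataset fits in one batch, here is a minimal NumPy sketch (an assumption-laden simplification, not Caffe's actual bookkeeping, which accumulates scaled sums controlled by moving_average_fraction): in TRAIN mode the layer normalizes with the current batch statistics, while in TEST mode (use_global_stats: true) it normalizes with running averages that only approach the batch statistics after many updates, so early in training the two outputs differ noticeably.

```python
import numpy as np

def batchnorm(x, running_mean, running_var, use_global_stats,
              eps=1e-5, momentum=0.999):
    """Normalize an (N, C) batch per channel; a rough sketch of the idea."""
    if use_global_stats:
        # TEST phase: normalize with the accumulated (global) statistics.
        mean, var = running_mean, running_var
    else:
        # TRAIN phase: normalize with this batch's own statistics,
        # then fold them into the moving averages.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        running_mean[:] = momentum * running_mean + (1 - momentum) * mean
        running_var[:] = momentum * running_var + (1 - momentum) * var
    return (x - mean) / np.sqrt(var + eps)

# A single batch of 3 samples: after one update the running averages are
# still dominated by their initial values, so the TEST-phase output (and
# hence the test loss) differs from the TRAIN-phase output.
x = np.random.randn(3, 4).astype(np.float32)
rm, rv = np.zeros(4, np.float32), np.ones(4, np.float32)
y_train = batchnorm(x, rm, rv, use_global_stats=False)
y_test = batchnorm(x, rm, rv, use_global_stats=True)
print(np.abs(y_train - y_test).max())  # clearly non-zero early in training
```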
I'm wondering whether anyone has an explanation by now.