I've been using triplet networks in Keras lately, and I have a question about using BatchNormalization within my base network architecture. The code raises an exception when I try to use batch normalization mode 0, the mode where running averages are computed during training and then used during testing.

Everything runs fine if I use mode 2 instead, but that is a pain: if I want to use the base network on its own to extract the feature representation at the final layer, I have to provide mini-batches with the same distribution as the training data (I cannot provide one example at a time). This is undesirable from a representation learning perspective, since the representation of an input should be independent of the other inputs it happens to be batched with (which is why mode 0 is preferable).

What prevents running averages from being computed when a layer is shared across multiple data flows? There is probably a good explanation that I am simply unaware of, but is it just that this hasn't been implemented yet, or is there a stronger reason?
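To make the setup concrete, here is a minimal sketch of the kind of architecture I mean (the layer sizes and 128-dimensional input are placeholders, not my actual network):

```python
from keras.layers import Input, Dense
from keras.layers.normalization import BatchNormalization
from keras.models import Model

# Base network shared by all three branches of the triplet.
base_input = Input(shape=(128,))
x = Dense(64, activation='relu')(base_input)
# mode=0 keeps running averages for test time; mode=2 always
# uses the current batch statistics and has no internal state.
x = BatchNormalization(mode=0)(x)
embedding = Dense(32)(x)
base_network = Model(input=base_input, output=embedding)

# Triplet inputs: anchor, positive, negative.
anchor = Input(shape=(128,))
positive = Input(shape=(128,))
negative = Input(shape=(128,))

# Reusing the base network across the three data flows is what
# raises the exception under mode=0; switching the layer above
# to mode=2 lets this run.
emb_a = base_network(anchor)
emb_p = base_network(positive)
emb_n = base_network(negative)
```

Thanks!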
- Mason