I had some questions about the custom training with tf.distribute.Strategy tutorial: https://www.tensorflow.org/tutorials/distribute/custom_training.
Why are regularization losses treated differently from other types of losses?
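(For context, here is a minimal sketch, based on my reading of the tutorial, of how the two kinds of losses end up scaled differently; labels, predictions, model_losses and global_batch_size are placeholders:)

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

def compute_loss(labels, predictions, model_losses, global_batch_size):
    # The prediction loss is per *example*, so it is divided by the
    # global batch size (the total number of examples across replicas).
    per_example_loss = loss_object(labels, predictions)
    loss = tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=global_batch_size)
    # The regularization loss depends only on the weights, so each
    # replica computes the same full value; summing across replicas
    # would multiply it by the replica count, hence it is divided by
    # num_replicas_in_sync instead of the batch size.
    if model_losses:
        loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))
    return loss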
How does the per-replica batch size vary from step to step? Also, I am not exactly sure why we would not use tf.reduce_mean if the batch size changes from step to step, instead of dividing by the global batch size.
per_example_loss /= tf.cast(tf.reduce_prod(tf.shape(labels)[1:]), tf.float32)

Caution: Verify the shape of your loss. Loss functions in tf.losses/tf.keras.losses typically return the average over the last dimension of the input. The loss classes wrap these functions. Passing reduction=Reduction.NONE when creating an instance of a loss class means "no additional reduction". For categorical losses with an example input shape of [batch, W, H, n_classes], the n_classes dimension is reduced. For pointwise losses like losses.mean_squared_error or losses.binary_crossentropy, include a dummy axis so that [batch, W, H, 1] is reduced to [batch, W, H]. Without the dummy axis, [batch, W, H] will be incorrectly reduced to [batch, W].
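To make that caution concrete, here is a minimal sketch of the dummy-axis point for a pointwise loss (the shapes are made up: batch=2, W=H=3):

import tensorflow as tf

labels = tf.random.uniform([2, 3, 3])
predictions = tf.random.uniform([2, 3, 3])

# Without a dummy axis, binary_crossentropy treats H as the axis to
# average over and reduces [batch, W, H] to [batch, W] -- usually not
# what is intended.
wrong = tf.keras.losses.binary_crossentropy(labels, predictions)
print(wrong.shape)   # (2, 3)

# With a trailing dummy axis, the size-1 axis is the one reduced, so
# the result keeps one loss value per pixel: [batch, W, H].
right = tf.keras.losses.binary_crossentropy(labels[..., tf.newaxis],
                                            predictions[..., tf.newaxis])
print(right.shape)   # (2, 3, 3)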
DS (the distribution strategy) doesn't override the behavior of reduce_mean (or other TF operations), so it doesn't do what you expect: instead of computing the global mean across all replicas, you get the per-replica mean value. There's explicit code in the optimizers/losses to perform global reductions. The default losses are already "hacked" to handle global reductions :/
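A toy illustration of that last point (the numbers are made up): if the final global batch has 6 examples split 4/2 across two replicas, a per-replica reduce_mean followed by a cross-replica SUM does not recover the global mean, while dividing each replica's sum by the global batch size does:

import tensorflow as tf

replica_losses = [tf.constant([1., 1., 1., 1.]), tf.constant([4., 4.])]
GLOBAL_BATCH_SIZE = 6

# The value you actually want: the mean over all 6 examples.
true_mean = tf.reduce_mean(tf.concat(replica_losses, 0))            # 2.0

# reduce_mean inside each replica gives the *per-replica* mean; the SUM
# across replicas then over-weights the smaller replica.
summed_means = tf.add_n([tf.reduce_mean(l) for l in replica_losses]) # 5.0 (wrong)

# Dividing each replica's *sum* by the global batch size makes the
# cross-replica SUM equal the true mean, however the batch was split.
scaled = tf.add_n([tf.reduce_sum(l) / GLOBAL_BATCH_SIZE
                   for l in replica_losses])                         # 2.0 (correct)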