Custom loss scaling with distributed training

Sayak Paul

Feb 27, 2021, 9:58:02 AM
to Discuss
Hi folks,

I am following this guide to familiarize myself with the different loss-scaling strategies to use during distributed training: https://www.tensorflow.org/tutorials/distribute/custom_training

I would be grateful if someone could verify whether my calculations are correct:

* Let's say I have implemented a custom loss function that is a weighted sum of a cross-entropy loss (with no reduction) and an MSE-based pixel loss.
* Now, in order to use this loss inside distributed training, here's what I am doing (a minimal sketch follows this list):
  * Calculate the loss on each replica.
  * Scale the loss with tf.nn.compute_average_loss, passing the global batch size.
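
For reference, the basic scaling pattern I have in mind looks roughly like this (GLOBAL_BATCH_SIZE and the cross-entropy loss object are placeholders for illustration, not my exact setup):

import tensorflow as tf

GLOBAL_BATCH_SIZE = 64  # placeholder value

loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction=tf.keras.losses.Reduction.NONE,  # keep per-example losses
)

def compute_loss(labels, predictions):
    # Per-example losses, shape (replica_batch_size,).
    per_example_loss = loss_obj(labels, predictions)
    # Divide by the *global* batch size, not the per-replica one, so that
    # summing gradients across replicas yields the correct average.
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE
    )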

Notes:
* I am implementing my training logic by overriding train_step (refer here). 
* The labels and the predictions that go into the cross-entropy loss are multi-dimensional. Hence, I am following what is suggested in the last point of the section "How to do this in TensorFlow?" of this guide. The output of the cross-entropy loss therefore has shape (replica_batch_size, 16, 16).
* The MSE-based pixel loss returns an output of shape (replica_batch_size, 256, 256).
* In order to make the addition compatible, I am taking per-example means of both loss terms (reducing over the spatial axes so the batch axis is kept), adding them, and then scaling the result with tf.nn.compute_average_loss. In code, it looks like so:

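Roughly (ALPHA/BETA, the loss objects, and the assumed tensor shapes below are placeholders, not my exact code):

import tensorflow as tf

GLOBAL_BATCH_SIZE = 64  # placeholder
ALPHA, BETA = 1.0, 1.0  # hypothetical weights for the weighted sum

ce_obj = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE
)
mse_obj = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE
)

def compute_loss(labels, logits, pixel_targets, pixel_preds):
    # Cross-entropy with no reduction. With labels of shape
    # (replica_batch_size, 16, 16) and logits of shape
    # (replica_batch_size, 16, 16, num_classes), this yields
    # (replica_batch_size, 16, 16).
    ce = ce_obj(labels, logits)

    # Pixel-wise MSE with no reduction. Assuming pixel tensors of shape
    # (replica_batch_size, 256, 256, channels), the loss reduces over the
    # last (channel) axis, yielding (replica_batch_size, 256, 256).
    mse = mse_obj(pixel_targets, pixel_preds)

    # Reduce each term over its spatial axes only, so both keep a
    # per-example shape of (replica_batch_size,).
    per_example_loss = (
        ALPHA * tf.reduce_mean(ce, axis=[1, 2])
        + BETA * tf.reduce_mean(mse, axis=[1, 2])
    )

    # Scale by the global batch size so that summing gradients across
    # replicas produces the correct average.
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE
    )

The reduction over only the spatial axes keeps the batch axis intact, which is what tf.nn.compute_average_loss expects before it divides by the global batch size.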

Thanks in advance for your time. 

Sayak Paul | sayak.dev
