TPU vs GPU: loss is higher on TPU


Gil Motta

Nov 16, 2020, 1:25:18 AM
to TPU Users
Hello,

I'm training an SSD model, and I'm testing it on Colab GPU and TPU at the same time.

The same dataset is on Google Storage for TPU and on Google Drive for GPU.

I noticed that at 62000 steps the loss on GPU is better than the loss on TPU.

The average loss is 0.17 on GPU and 0.27 on TPU.

The TPU is faster: my average per-step time is 0.10s, against 0.62s on the GPU.

So why is the TPU showing a worse loss?

Thanks,
Gil

Russell Power

Nov 16, 2020, 11:56:48 AM
to Gil Motta, TPU Users
Changes in loss/precision tend to be model dependent. The TPU might have lower precision for a portion of the network and this is causing the increase in loss. For object detection models like SSD, you will want to make sure you aren't using bfloat16 for the final regression layer, as that can cause precision problems.
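
To illustrate the precision point, here is a minimal sketch of how much a regression output can move when it is stored as bfloat16. It assumes NumPy is available and emulates bfloat16 in software by rounding float32 values to their top 16 bits; it does not run on a TPU. The helper name `to_bfloat16` and the sample values are hypothetical and are not taken from any SSD implementation.

```python
import numpy as np

def to_bfloat16(values):
    """Emulate bfloat16: round float32 to its top 16 bits (round to nearest, ties to even)."""
    bits = np.array(values, dtype=np.float32).view(np.uint32)
    # Bias of 0x7FFF plus the lowest kept bit gives round-half-to-even.
    rounding_bias = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    rounded = (bits + rounding_bias) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)

# Hypothetical regression outputs, e.g. box offsets or pixel coordinates.
outputs = np.array([0.1234567, 1.0, 300.7], dtype=np.float32)
as_bf16 = to_bfloat16(outputs)
relative_error = np.abs(as_bf16 - outputs) / np.abs(outputs)

for original, rounded, err in zip(outputs, as_bf16, relative_error):
    print(f"float32={original:.7f}  bfloat16={rounded:.7f}  relative error={err:.5f}")
```

bfloat16 keeps only 8 significant bits, so representable values between 256 and 512 are spaced 2 apart. A coordinate-like value can therefore shift by a noticeable fraction of a pixel or more, which is one way a regression loss could end up higher.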


Gil Motta

Nov 16, 2020, 8:50:38 PM
to TPU Users, Russell Power, Gil Motta
Hi Russell,

Thanks for the suggestion, but I have no idea where to look for bfloat16 in the final regression layer. Would you be able to provide some guidance?

My training on TPU seems to have a high loss. After 200K steps my results were as follows:
I1116 10:50:06.300898 140649977771904 model_lib_v2.py:645] Step 200000 per-step time 0.116s loss=0.139  

My goal is to lower the loss below 0.07 but I don't know how to achieve that.

Thanks,
Gil
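
One possible starting point for the bfloat16 question, assuming the run is driven by a text-format pipeline config file: search that file for any mention of bfloat16. The `model_lib_v2.py` line in the log suggests the TensorFlow Object Detection API, where, as far as I know, the `train_config` block can carry a `use_bfloat16` flag, but treat that as an assumption to check against your own config. The sketch below is plain Python; `find_bfloat16_lines` and the sample config text are hypothetical and are not taken from this thread.

```python
def find_bfloat16_lines(config_text):
    """Return (line_number, stripped_line) pairs for lines that mention bfloat16."""
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), start=1)
        if "bfloat16" in line.lower()
    ]

# Hypothetical config excerpt, for illustration only.
sample_config = """train_config {
  batch_size: 64
  use_bfloat16: true
}
"""

for number, line in find_bfloat16_lines(sample_config):
    print(f"line {number}: {line}")
```

To check a real file, read it with `open(path).read()` and pass the text to the same function. A flag like this usually applies to the whole model, so it only shows whether bfloat16 is enabled at all, not which layers use it.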