Could be a bug in a CPU kernel. If you could reduce your model to as few computations as possible while still experiencing this behavior, we may be able to identify the source.
I'm seeing different behavior between GPUs and CPUs in terms of when NaNs are generated. I know this has been raised elsewhere, for example here, but I don't see a resolution anywhere. When I train my models on GPUs (Titan X), I have not encountered a single NaN despite hundreds of different initializations and configurations. Recently I started training on CPUs, and after less than 10% of the total training time nearly 30% of my runs have crashed due to NaNs. Is this to be expected? Why such a massive difference, given that in principle both are using 32-bit floats?
--
You received this message because you are subscribed to the Google Groups "Discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to discuss+u...@tensorflow.org.
To post to this group, send email to dis...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/discuss/22b438a7-b4de-43c3-a50a-215411fabdbe%40tensorflow.org.
tensorflow.python.framework.errors.InvalidArgumentError: Nan in summary histogram for: HistogramSummary [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag,BiRNN_FW/LSTMCell/W_0/read)]]
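The error above is raised by the histogram summary for BiRNN_FW/LSTMCell/W_0, which suggests the weight itself already contained a NaN by the time it was summarized. One way to narrow this down, as a rough sketch only: fetch the variable values after each step and report the first one with a non-finite entry. The helper `first_non_finite`, the dict layout, and the values below are all hypothetical and written in plain Python; this is not TensorFlow API.

```python
import math

def first_non_finite(named_values):
    """Return (name, index, value) of the first NaN/Inf entry, or None.

    named_values: dict mapping a variable name to a flat list of floats
    (a hypothetical layout, e.g. fetched weights flattened to a list).
    """
    for name, values in named_values.items():
        for i, v in enumerate(values):
            if not math.isfinite(v):
                return name, i, v
    return None

# Illustrative values only, not taken from the failing run.
# "some/other/variable" is a made-up name for the example.
snapshot = {
    "some/other/variable": [0.0, 0.1, -0.2],
    "BiRNN_FW/LSTMCell/W_0": [0.5, float("nan"), 1.5],
}
hit = first_non_finite(snapshot)
print(hit)
```

Running a check like this every step, rather than only when summaries are written, would show at which step the first non-finite value appears, which helps when reducing the model to a smaller repro.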
Why would random differences consistently lead to NaNs though? I have not had any issues with NaNs on the GPU, but about 30% of my runs fail because of NaNs on the CPU.
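For context on where small CPU/GPU differences can come from even with 32-bit floats on both: floating-point addition is not associative, so two kernels that sum the same numbers in a different order can return different results. The sketch below emulates float32 addition with the standard `struct` module; the helper names `f32` and `add32` are made up for illustration. It only shows that bit-identical results are not guaranteed. It does not explain why the CPU runs would hit NaNs so much more often.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def add32(a, b):
    """Emulate float32 addition: add in double, then round to float32."""
    return f32(f32(a) + f32(b))

a, b, c = 1e8, -1e8, 1.0

left = add32(add32(a, b), c)   # (a + b) + c
right = add32(a, add32(b, c))  # a + (b + c): the 1.0 is lost when added to -1e8

print(left, right)
```

The two orderings disagree because float32 values near 1e8 are spaced 8 apart, so adding 1.0 to -1e8 rounds back to -1e8.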