NaN behavior different depending on hardware platform?

Mohammed AlQuraishi

May 22, 2016, 11:41:08 AM
to Discuss
I'm seeing different behavior between GPUs and CPUs in terms of when NaNs are generated. I know this has been raised elsewhere, for example here, but I don't see a resolution anywhere. When I train my models on GPUs (Titan X), I have not encountered a single NaN despite hundreds of different initializations and configurations. Recently I started training on CPUs, and after less than 10% of the total training time, nearly 30% of my runs have crashed due to NaNs. Is this to be expected? Why such a massive difference, given that in principle both are using 32-bit floats?

Eugene Brevdo

May 22, 2016, 12:12:09 PM
to Mohammed AlQuraishi, Discuss

Could be a bug in a CPU kernel. If you could reduce your model to as few computations as possible while still seeing this behavior, we may be able to identify the source.


Mohammed AlQuraishi

May 23, 2016, 8:35:50 AM
to Discuss
It's a big model, but I'll try add_check_numerics_ops() and see if I can localize the nodes that are generating the NaNs. I initially caught it because the histogram summaries were throwing errors, and those were mostly localized to the W matrix of LSTMCell. They may have propagated from elsewhere, however.
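
Roughly, I plan to wire it in like this (a minimal sketch against a toy graph, not my actual model):

import tensorflow as tf

# Toy graph standing in for the real (much larger) model.
x = tf.placeholder(tf.float32, [None, 4])
w = tf.Variable(tf.random_normal([4, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# Adds a CheckNumerics op for every floating-point tensor in the graph
# and returns them grouped into a single op.
check_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for _ in range(100):
        # Running check_op alongside train_op raises InvalidArgumentError
        # at the first checked tensor containing a NaN or Inf.
        sess.run([train_op, check_op], feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]})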

Mohammed AlQuraishi

May 24, 2016, 7:33:50 AM
to Discuss
I added add_check_numerics_ops() to the model, which I thought would help me catch the NaN when it first arises, but the errors I'm getting are the same as before, occurring in the histogram summaries:

tensorflow.python.framework.errors.InvalidArgumentError: Nan in summary histogram for: HistogramSummary
         [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag,BiRNN_FW/LSTMCell/W_0/read)]]


Shouldn't it throw an error earlier, before the NaN makes it to the histogram summaries?



Martin Wicke

May 24, 2016, 7:55:46 AM
to Mohammed AlQuraishi, Discuss
The check in the histogram op simply checks all summarized entries for NaNs, so you must have missed it earlier, or the NaN was just created. You may be summarizing a variable: the NaN may just have been written to it, and the summary is simply the first op to read it. A check_numerics op right before the histogram op should definitely catch it (i.e. summarize the output of the check_numerics op, not the variable, as you are probably doing now).
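
Concretely, something like this sketch (W stands in for your LSTMCell weight variable, and the tag is arbitrary):

import tensorflow as tf

# W: the variable you're summarizing (e.g. the LSTMCell weight matrix),
# assumed to be defined elsewhere in your graph.
checked_w = tf.check_numerics(W, message="NaN/Inf in BiRNN_FW/LSTMCell/W_0")

# Summarize the checked output rather than the variable itself, so the
# error is raised here with this message instead of by the histogram op.
tf.histogram_summary("BiRNN_FW/LSTMCell/W_0", checked_w)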

Mohammed AlQuraishi

May 24, 2016, 9:18:07 AM
to Discuss
But I used tf.add_check_numerics_ops(), which I thought adds a check_numerics op to essentially every op in the graph, so that a NaN is caught as soon as it is generated. Is that not the case?

Yaroslav Bulatov

May 24, 2016, 12:55:10 PM
to Mohammed AlQuraishi, Discuss
It should be the case, but right now it doesn't add check_numerics ops to ops created by optimizers: https://github.com/tensorflow/tensorflow/issues/2288


Mohammed AlQuraishi

May 24, 2016, 1:19:10 PM
to Discuss
Ah, that might explain it then. Is there a simple workaround? The docs seem to indicate that tf.check_numerics (as opposed to tf.add_check_numerics_ops) only works on tensors, not ops, so it's not clear how to apply it to the optimizer. I'm using Adam, with apply_gradients called on the clipped gradients and variables. Is there an output tensor from one of the optimizer ops that I can intercept?

It also seems by implication that the NaNs are being generated by the Adam optimizer. I have epsilon set to a small value, 1e-08. Could that explain the difference in behavior between GPUs and CPUs?
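
Something like this is what I have in mind for intercepting the gradients (a sketch; loss, tvars, and the clip norm are placeholders for my actual values), though it still wouldn't cover Adam's internal update ops:

import tensorflow as tf

# loss and tvars stand in for the actual loss tensor and trainable
# variables; 5.0 is an arbitrary clip norm used here for illustration.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-8)
grads_and_vars = optimizer.compute_gradients(loss, tvars)

clipped = [(tf.clip_by_norm(g, 5.0), v)
           for g, v in grads_and_vars if g is not None]

# Wrap each clipped gradient in check_numerics so the run fails with a
# per-variable message as soon as a gradient contains a NaN or Inf.
checked = [(tf.check_numerics(g, "bad gradient for " + v.op.name), v)
           for g, v in clipped]

train_op = optimizer.apply_gradients(checked)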

Yaroslav Bulatov

May 24, 2016, 1:50:44 PM
to Mohammed AlQuraishi, Discuss, stein...@gunderson.no
+cc: steinar

That seems feasible. I've seen differences in numerics between CPU and GPU in functions like tanh/exp, and also when summing large vectors. I've also seen a significant numeric deviation (10e-5) on CPU when compiling with AVX2. And even on the same CPU, different entries of a vector may be handled by different algorithms, so you can end up with different results even when all entries are identical, as was discussed here: https://github.com/tensorflow/tensorflow/issues/2234#issuecomment-217184455
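
If you want to see the gap directly, a quick sketch like this works (assuming a machine where TF sees a GPU; exact magnitudes will vary):

import numpy as np
import tensorflow as tf

x = np.random.randn(1000000).astype(np.float32) * 10.0

# Same op pinned to each device.
with tf.device("/cpu:0"):
    y_cpu = tf.tanh(tf.constant(x))
with tf.device("/gpu:0"):
    y_gpu = tf.tanh(tf.constant(x))

with tf.Session() as sess:
    cpu_vals, gpu_vals = sess.run([y_cpu, y_gpu])

# Typically nonzero in the last few bits, even though both are float32.
print(np.max(np.abs(cpu_vals - gpu_vals)))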


Mohammed AlQuraishi

May 24, 2016, 2:02:59 PM
to Discuss
Why would random differences consistently lead to NaNs though? I have not had any issues with NaNs on the GPU, but about 30% of my runs fail because of NaNs on the CPU.

Yaroslav Bulatov

May 24, 2016, 3:41:40 PM
to Mohammed AlQuraishi, Discuss, Geoffrey Irving
It doesn't need to be random; it could be a systematic bias. For instance, GPU and CPU probably differ in how they treat denormal numbers, which would have an impact if you have underflow in your calculations.
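
To illustrate the failure mode (NumPy doesn't normally flush denormals, so this just shows what flush-to-zero would do once an intermediate underflows):

import numpy as np

tiny = np.float32(1e-20)
product = tiny * tiny   # ~1e-40: below float32's normal range, so a denormal

# With denormals kept, log(product) is finite (about -92.1). If the product
# had been flushed to zero instead, log(0) gives -inf, and downstream
# arithmetic (e.g. 0 * -inf) then turns that into NaN.
print(product, np.log(product))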

+cc geoffreyi because I think he disabled denormals on CPU


Geoffrey Irving

May 24, 2016, 9:38:59 PM
to Yaroslav Bulatov, Mohammed AlQuraishi, Discuss
-ir...@naml.us

Denormals are a possibility, but it's hard to say without knowing what the source of the NaNs is on the CPU. If you want to test, you could change ScopedFlushDenormal to disable the denormal flushing. Otherwise, I would recommend diagnosing why you are getting NaNs.

Mohammed AlQuraishi

May 24, 2016, 9:51:28 PM
to Discuss
Any recommendations on how to pin down the source of NaNs, given that they appear to be caused by the Adam optimizer, and add_check_numerics_ops() doesn't currently add any ops to check the optimizer?

Steinar H. Gunderson

May 25, 2016, 5:40:32 AM
to Yaroslav Bulatov, Mohammed AlQuraishi, Discuss, se...@google.com
On Tue, May 24, 2016 at 10:50:42AM -0700, Yaroslav Bulatov wrote:
> +cc: steinar

Cc: My work address instead of my home address :)

> That seems feasible. I've seen differences in numerics between CPU and GPU
> in functions like tanh/exp, and also when summing up large vectors. Also
> I've seen a significant numeric deviation (10e-5) on CPU when compiling
> with avx2. Also, even on the same CPU, different entries of the vector may
> use different algorithms, so you could end up with different results even
> if all entries are identical as was discussed here
> <https://github.com/tensorflow/tensorflow/issues/2234#issuecomment-217184455>

Exactly what you said. tanh/exp are typically not correct down to the last
bit, on either CPU or GPU, and sum accuracy will depend on ordering. As an
obvious example, summing N floating-point numbers the obvious way will
introduce errors on the order of O(sqrt(N)). If using AVX, you have eight
different sums (combined into a single one at the end), so errors go down by
a factor of sqrt(8) ~= 2.83 (and then you get a tiny additive constant for
the last sum-of-sums). For a GPU, you will have lots and lots of partial
sums, so you get much the same effect.
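
A quick NumPy sketch of the ordering effect (block size 8 mirrors the AVX case; the exact error values depend on the machine and the data):

import numpy as np

np.random.seed(0)
x = np.random.randn(100000).astype(np.float32)

# Plain left-to-right float32 accumulation.
sequential = np.float32(0.0)
for v in x:
    sequential += v

# Eight interleaved partial sums combined at the end, roughly what
# per-lane AVX accumulation does.
blocked = x.reshape(-1, 8).sum(axis=0).sum()

exact = x.astype(np.float64).sum()           # double-precision reference
print(sequential - exact, blocked - exact)   # the two errors generally differ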

>>> It should be the case, but right now it doesn't add check numerics ops to
>>> ops created by optimizers
>>>
>>> https://github.com/tensorflow/tensorflow/issues/2288

Note that the bug is about checking _outputs_ of optimizers, not _ops_ of
optimizers. And I'm not 100% sure my analysis is correct; it's mostly a
drive-by bug report to make sure the issue wasn't forgotten.

/* Steinar */
--
Software Engineer, Google Switzerland