You can read through the trace to try to understand what operation in your training function it relates to. The beginning is:
GpuElemwise{Composite{((i0 * (i1 / i2) * i3) * i4)},no_inplace} [id A] '' |
|CudaNdarrayConstant{[-1.]} [id B] |
|GpuDimShuffle{0} [id C] '' |
| |GpuElemwise{Composite{((i0 / i1) / i2)},no_inplace} [id D] '' |
|Assert{msg='Theano Assert failed!'} [id R] ''
...
|GpuDimShuffle{0} [id IU] '' |
| |GpuElemwise{ScalarSigmoid}[(0, 0)] [id V] '' |
|GpuDimShuffle{0} [id S] ''
So the operation in which the problem occurs computes i0 * (i1 / i2) * i3 * i4, where
- i0 is CudaNdarrayConstant{[-1.]} [id B]
- i1 is GpuDimShuffle{0} [id C] ''
- i2 is Assert{msg='Theano Assert failed!'} [id R] ''
- i3 is GpuDimShuffle{0} [id IU] ''
- i4 is GpuDimShuffle{0} [id S] ''
If a node has nothing printed underneath it, it is either an input variable or it has already been expanded earlier in the trace -- e.g., i4 carries "[id S]", which appears earlier, so its definition is listed there.
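As a quick numeric stand-in (plain Python floats rather than Theano variables, names hypothetical), the Composite at the root of the trace computes:

```python
# Numeric stand-in for the root Composite: i0 * (i1 / i2) * i3 * i4.
# In the trace, i0 is the constant [-1.] and i2 is the Assert-wrapped
# denominator; here they are plain floats for illustration.
def composite(i0, i1, i2, i3, i4):
    return i0 * (i1 / i2) * i3 * i4

print(composite(-1.0, 2.0, 0.5, 3.0, 4.0))  # -48.0 -- finite while i2 != 0

# With i2 == 0.0, IEEE float array math (as on the GPU) silently yields
# inf/NaN; plain Python floats would raise ZeroDivisionError instead.
```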
Now, about the only thing that can go wrong in i0 * (i1 / i2) * i3 * i4 is a division by zero. Any other way to produce a NaN would require one of the inputs to be NaN or inf already, and that would have been caught earlier in the graph (make sure you configured the NaN guard mode to flag inf and -inf as well, not just NaN). The beginning of i2 shows that it computes 1 - sigmoid(something + b), where "something" involves a lot of scalar_softplus and dot products -- probably a neural network.
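To see how that denominator can become exactly zero without any input being NaN or inf, here is a plain-Python sketch (float64; on the GPU the same effect kicks in even sooner in float32) of the sigmoid saturating:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# For a moderately large pre-activation, exp(-x) drops below the
# floating-point epsilon, so sigmoid(x) rounds to exactly 1.0 and
# 1 - sigmoid(x) underflows to exactly 0.0 -- a zero denominator
# for the i1 / i2 division above.
x = 40.0
print(sigmoid(x))        # 1.0
print(1.0 - sigmoid(x))  # 0.0
```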
Try to figure out where you divide by one minus the output of a sigmoid layer, and make sure that output cannot saturate: once sigmoid(x) rounds to exactly 1.0 in floating point, 1 - sigmoid(x) becomes 0 and the division produces inf or NaN.
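One way to keep the denominator away from zero (shown here as a plain-Python sketch; in Theano you could express the same rewrite symbolically, or clip the denominator with a small epsilon) is to compute 1 - sigmoid(x) directly as sigmoid(-x), which avoids the catastrophic cancellation:

```python
import math

def one_minus_sigmoid_naive(x):
    # Cancels catastrophically: rounds to exactly 0.0 once sigmoid(x)
    # saturates to 1.0.
    return 1.0 - 1.0 / (1.0 + math.exp(-x))

def one_minus_sigmoid_stable(x):
    # Mathematically identical: 1 - sigmoid(x) == sigmoid(-x),
    # but computed without subtracting two nearly-equal numbers.
    return 1.0 / (1.0 + math.exp(x))

x = 40.0
print(one_minus_sigmoid_naive(x))   # 0.0
print(one_minus_sigmoid_stable(x))  # ~4.25e-18, still a usable denominator
```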