You should coordinate with Alfredo Canziani. We've seen such behavior reported because of kernel/CUDA driver issues (an issue much deeper than Torch itself). Also, do you see this happen with something other than LogSoftMax/ClassNLL as the last layer/cost?
Hi,
in my case these NaNs were produced when the gradients got too high.
What I did was make the network less deep and wider, but I'm sure there are better solutions (for example, clipping the gradient norm when it gets too high).
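In case it helps, here is a rough sketch of what I mean by clipping the gradient norm, inside a plain Torch training step. The model, data and hyper-parameters below are just placeholders I made up for illustration:

require 'nn'

-- toy setup just for the example
local model = nn.Sequential()
   :add(nn.Linear(10, 20)):add(nn.ReLU())
   :add(nn.Linear(20, 3)):add(nn.LogSoftMax())
local criterion = nn.ClassNLLCriterion()
local input, target = torch.randn(10), 2
local learningRate, maxNorm = 0.01, 5

local params, gradParams = model:getParameters()

gradParams:zero()
local output = model:forward(input)
local loss = criterion:forward(output, target)
model:backward(input, criterion:backward(output, target))

-- rescale the whole gradient vector if its L2 norm exceeds maxNorm
local gnorm = gradParams:norm()
if gnorm > maxNorm then
   gradParams:mul(maxNorm / gnorm)
end

params:add(-learningRate, gradParams)   -- plain SGD step

The same renormalisation works if you use optim as well; you would just do the clipping inside your feval closure, right after the backward pass.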
The tip I gave is from this blog post (go to the "practical advice" part):
http://yyue.blogspot.ca/2015/01/a-brief-overview-of-deep-learning.html?m=1
He suggests gradient clipping when training LSTMs and RNNs, but it might be relevant to other cases too.
Regarding the architecture:
when you have a deeper or wider network, it is more likely to produce large outputs and gradients, and very large gradients can eventually turn into NaN/Inf values.
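On the NaN/Inf point: a quick way I know of to check the flattened gradients for bad values (assuming gradParams comes from model:getParameters() as in the earlier snippet; NaN is the only value that is not equal to itself, which is what the ne() trick relies on):

local function hasNanOrInf(t)
   -- NaN ~= NaN, so t:ne(t) flags exactly the NaN entries
   local nan = t:ne(t):sum() > 0
   local inf = t:eq(math.huge):sum() > 0 or t:eq(-math.huge):sum() > 0
   return nan or inf
end

if hasNanOrInf(gradParams) then
   print('warning: NaN/Inf gradients at this iteration')
end

Running that every iteration (or every few iterations) makes it easier to see whether the gradients blow up before or after the loss itself becomes NaN.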
Jose, I had many problems with NaNs and I 'solved' them by using a different model architecture. I believe it was due to an instability of the network itself, associated with the floating-point approximation GPUs use.
This view vs. reshape thing is a totally new piece of information for me. At the time I was playing with that stuff there was no view, so I don't know what to say.
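For what it's worth, my rough understanding of the difference (please correct me if this is wrong) is that view shares the storage of a contiguous tensor, while reshape returns a copy:

local t = torch.range(1, 6)   -- 1 2 3 4 5 6

local v = t:view(2, 3)        -- same storage, no copy
local r = t:reshape(2, 3)     -- new tensor with copied data

t[1] = 100
print(v[1][1])                -- 100: the view sees the change
print(r[1][1])                -- 1:   the reshaped copy does not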
Perhaps Soumith has some ideas...