NaN or Inf inside Parameters


Hugo Alberto Perlin

Mar 27, 2014, 3:10:57 PM
to tor...@googlegroups.com
Hi,
I'm training a CNN using CUDA.
Some information about the training setup:
 - Criterion: ClassNLLCriterion
 - SGD with mini-batches of 128
 - SGD config:
     Learning Rate: nil
     Learning Rate Decay: 1e-07
     Weight Decay: 1e-05
     Momentum: 0.5

During training, NaN or Inf values randomly appear inside the network parameter vector.
Trying to find the cause, I managed to track down a numerical explosion that happens after a backward pass through the model.
These are some statistics from the model's gradParameters vector:
Max: 1.8667173810953e+27
Min: -1.719263848851e+27
Std: inf

After this, the network outputs become inconsistent.
The explosion seems to be random: sometimes it occurs and sometimes it doesn't (using the same image dataset but a different random seed).
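For reference, a minimal sketch of the kind of check that produces the statistics above, assuming the flattened vectors from model:getParameters() (the variable names are placeholders from my training loop):

-- after model:backward(...) in the training loop
local params, gradParams = model:getParameters()
print(string.format('Max: %g  Min: %g  Std: %g',
      gradParams:max(), gradParams:min(), gradParams:std()))
-- NaN is the only value that is not equal to itself,
-- and eq(math.huge) catches +/-Inf
local bad = gradParams:ne(gradParams):sum()
          + gradParams:eq(math.huge):sum()
          + gradParams:eq(-math.huge):sum()
if bad > 0 then
   print('NaN or Inf detected in gradParameters')
end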

Has anyone else seen something like this?

Thanks.
 



soumith

Mar 27, 2014, 3:17:29 PM
to torch7 on behalf of Hugo Alberto Perlin
You should coordinate with Alfredo Canziani.
We've seen this kind of behavior reported before due to kernel/CUDA driver issues (an issue much deeper than Torch itself).

Also, do you see this happen with something other than LogSoftMax/ClassNLL as the last layer/cost?



Hugo Alberto Perlin

Mar 27, 2014, 9:49:09 PM
to tor...@googlegroups.com
Yes,
this also occurs sometimes (again, randomly) with Tanh/MSE as the last layer/cost.

By the way, my execution environment is:
 - Ubuntu 12.04
 - Linux kernel 3.11.6-031106-generic
 - NVIDIA driver 331.20
 - CUDA 5.5

Thanks.





Hugo Alberto Perlin

Mar 28, 2014, 10:17:08 AM
to tor...@googlegroups.com
Running some additional tests: with a small learning rate the explosion doesn't occur.
I still don't know whether this is a Torch problem or a CUDA driver problem.

How could I determine which it is?

Thanks.

Alfredo Canziani

Jul 16, 2014, 10:07:41 AM
to tor...@googlegroups.com
Can you define "small learning rate"?

Yossi Biton

Nov 9, 2014, 11:45:31 AM
to tor...@googlegroups.com
Hello guys,

Unfortunately this error has been occurring for me too since a few days ago.
I'm using LogSoftMax/ClassNLL as the last layer/cost.
My learning rate is 0.01, and I'm afraid a smaller one would hurt training too much...

Are there any magic solutions I can use?
It's really surprising that it's happening out of nowhere after using Torch for the last 2 months.
Until now I was using models with nn.Sequential as the main container, where each layer was nn.Linear or convolutional (from ccn2).
The change I just made was adding a layer of type nn.Concat, and I see that the NaN outputs are coming from this layer (a quick check for this is sketched after the model printout below).
This is the model I'm using:

nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> output]
  (1): nn.Transpose
  (2): ccn2.SpatialConvolution
  (3): nn.ReLU
  (4): ccn2.SpatialMaxPooling
  (5): ccn2.SpatialConvolution
  (6): nn.ReLU
  (7): ccn2.SpatialMaxPooling
  (8): ccn2.SpatialConvolutionLocal
  (9): nn.ReLU
  (10): ccn2.SpatialMaxPooling
  (11): nn.Concat {
    input
      |`-> (1): nn.Sequential {
      |      [input -> (1) -> (2) -> (3) -> output]
      |      (1): nn.Transpose
      |      (2): nn.Reshape
      |      (3): nn.Linear
      |    }
      |`-> (2): nn.Sequential {
      |      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> output]
      |      (1): ccn2.SpatialConvolutionLocal
      |      (2): nn.ReLU
      |      (3): nn.Transpose
      |      (4): nn.Reshape
      |      (5): nn.Linear
      |    }
       ... -> output
  }
  (12): nn.Linear
}
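A quick way to check which module's output first contains NaNs after a forward pass (a sketch using the standard nn container fields; `input` is one of my mini-batches):

-- sketch: find the first module in the container whose output holds a NaN
local function hasNaN(t)
   return torch.isTensor(t) and t:nElement() > 0 and t:ne(t):sum() > 0
end

model:forward(input)
for i, m in ipairs(model.modules) do
   if hasNaN(m.output) then
      print(('first NaN output at module %d: %s'):format(i, torch.type(m)))
      break
   end
end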

smth chntla

Nov 9, 2014, 11:51:30 AM
to tor...@googlegroups.com
Yossi,

First thing to do: replace nn.Reshape with nn.View. The reasoning is that if you slightly screw up the math with nn.Reshape you might actually be hitting uninitialized memory, while if you screw up the math with nn.View it will give you a clear error.
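In code the swap is a one-liner (a sketch; the size is a placeholder for whatever your model actually flattens to):

-- before: a wrong size here can silently end up touching uninitialized memory
-- model:add(nn.Reshape(64*5*5))
-- after: nn.View throws a clear error if the element count does not match
model:add(nn.View(64*5*5))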

If that is not the problem, then you are clearly running into numerical instability for some reason. Second, check ccn2.SpatialConvolutionLocal's weight initialization; I believe none of us has checked that before. Third, decrease the learning rate, or do not update when there is a NaN/Inf gradient.
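For the third point, a minimal sketch of "not updating" inside an optim.sgd-style loop; model, criterion, inputs, targets and sgdConfig stand for whatever your training script already defines:

require 'optim'

local params, gradParams = model:getParameters()

local function feval(x)
   if x ~= params then params:copy(x) end
   gradParams:zero()
   local output = model:forward(inputs)
   local loss = criterion:forward(output, targets)
   model:backward(inputs, criterion:backward(output, targets))
   -- NaN ~= NaN; eq(math.huge) catches +/-Inf
   local bad = gradParams:ne(gradParams):sum()
             + gradParams:eq(math.huge):sum()
             + gradParams:eq(-math.huge):sum()
   if bad > 0 then
      print('NaN/Inf in gradients, skipping this mini-batch')
      gradParams:zero()   -- the gradient step becomes a no-op (weight decay aside)
   end
   return loss, gradParams
end

optim.sgd(feval, params, sgdConfig)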
--
S

Yossi Biton

Nov 9, 2014, 12:04:01 PM
to tor...@googlegroups.com
Thanks for the quick reply!
Regarding your comment about initialization:
I'm checking the validity of my model after each step.
At the beginning it's fine (no NaNs at all), and the NaNs start to show up only after some backprop iterations.

Anoop Katti

Mar 19, 2015, 12:11:18 AM
to tor...@googlegroups.com
Hi all,

Was this issue ever resolved? I am training a fully convolutional network with multiple stages of conv, ReLU and max-pooling, followed by an MSE cost, and my network produces NaNs after about 60 iterations. My learning rate is fairly high too (0.05).

Yossi Biton

Mar 19, 2015, 3:39:42 AM
to torch7 on behalf of Anoop Katti

Hi,

In my case these NaNs appeared after the gradients got too large.
What I did was make the network less deep and wider, but I'm sure there are better solutions (for example, trimming the gradient norm when it gets too large; see the sketch below).
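Something like this, assuming the flattened gradients from model:getParameters() (the threshold is arbitrary and needs tuning):

-- sketch: trim the gradient L2 norm right after the backward pass
local params, gradParams = model:getParameters()
local maxNorm = 10                       -- arbitrary cap, tune per model
-- ... inside the training loop, after model:backward(...) ...
local gradNorm = gradParams:norm()
if gradNorm > maxNorm then
   gradParams:mul(maxNorm / gradNorm)    -- rescale in place before the update
end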



Anoop Katti

Mar 19, 2015, 6:50:06 AM
to tor...@googlegroups.com
Yossi,

Thanks for replying. Why do you think the appearance of NaNs depends on the network architecture?
And deciding what counts as a "high" norm seems tricky :(



Yossi Biton

Mar 19, 2015, 8:09:58 AM
to torch7 on behalf of Anoop Katti

The tip I gave comes from this blog post (go to the practical advice part):
http://yyue.blogspot.ca/2015/01/a-brief-overview-of-deep-learning.html?m=1
He suggests shrinking the gradient when training LSTMs & RNNs, but it might be relevant to other cases too.

Regarding the architecture:
With a deeper or wider network you are more likely to get larger outputs and gradients, and very large gradients can eventually turn into NaN/Inf values.

Alexey Chernyavskiy

Apr 29, 2015, 11:40:24 AM
to tor...@googlegroups.com
I seem to be experiencing the same kind of problem. My training accuracy grows until it suddenly drops to almost zero, at the point where NaNs appear in my net parameters and gradients. I print the min/max values of the gradients and parameters at every SGD step, and there is no indication that overflow is the issue: prior to this sudden event all the values are within their normal range, from -0.01 to +0.01.

By the way, the crash happens every time on slightly different epochs/iterations, despite the fact that I manually set the same random seed for repeatability of results.

I run my convnet on a CUDA GPU under Ubuntu. Several of my modules come from cuda-convnet2.torch, and I have an nn.Concat as well.
Are there any ideas about what else I should check or repair in my code?






Alfredo Canziani

Apr 29, 2015, 12:58:29 PM
to torch7 on behalf of Alexey Chernyavskiy
Post here the output of what I suggested by email.

Alfredo Canziani

Alexey Chernyavskiy

May 6, 2015, 9:24:17 AM
to tor...@googlegroups.com
Hi Alfredo,
It seems I ran into some video hardware (or maybe driver-related) glitch. When I got back to the computer the next morning, all the NaNs had disappeared. Currently all my layers' outputs are within decent bounds, from -5e-2 to +5e-2, with no apparent outliers. But if they do arise, batch normalization might help; I think I will experiment with that at some point.
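For the record, a sketch of what that experiment could look like with plain nn modules (layer sizes are placeholders; the ccn2 layers in this thread use a different data layout, hence the nn.Transpose modules, so a generic nn example is shown here):

require 'nn'

-- sketch: insert batch normalization between a convolution and its nonlinearity
local model = nn.Sequential()
model:add(nn.SpatialConvolution(3, 64, 5, 5))      -- 3 input planes, 64 feature maps
model:add(nn.SpatialBatchNormalization(64))        -- normalizes each of the 64 maps
model:add(nn.ReLU())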

Jose Part

Jul 24, 2015, 9:15:36 AM
to torch7
Hi,

I am having a similar issue when trying to train a convnet on the GPU. What puzzles me is that when I train the same network on the CPU the learned parameters are fine, but once I switch to the GPU (using cunn) I get NaNs for some of the parameters. Hence I guessed it may be an issue with the cunn library rather than with Torch itself. Has anybody else experienced this inconsistency between training on the CPU and on the GPU?
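One sanity check I can think of (a sketch, assuming cunn is installed; `model` and `batch` are placeholders for your own network and mini-batch) is to run the same data through a float copy and a :cuda() copy of the network and compare the outputs:

require 'cunn'

-- sketch: compare CPU and GPU forward passes on the same mini-batch
local cpuModel = model:clone():float()
local gpuModel = model:clone():cuda()
local cpuOut = cpuModel:forward(batch:float())
local gpuOut = gpuModel:forward(batch:cuda()):float()
print('max |cpu - gpu| = ' .. (cpuOut - gpuOut):abs():max())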

Kind regards,

Jose

Jose Part

Jul 24, 2015, 9:44:14 AM
to torch7
Apparently, using nn.View instead of nn.Reshape does solve the issue, but I don't fully understand why... The operations should be the same (and in the same order), shouldn't they? I assume the parallelization should be transparent. Sorry if this is a silly question; I am just beginning to use Torch and to experiment with the GPU...

Cheers,

Jose

Alfredo Canziani

Jul 31, 2015, 11:21:07 AM
to torch7 on behalf of The Chemist

Jose, I had many problems with NaNs and I 'solved' them by using a different model architecture. I believe it was due to an instability of the network itself, associated with the floating-point approximation GPUs use.
This View vs. Reshape thing is a totally new piece of information for me. At the time I was playing with that stuff there was no View, so I don't know what to say.
Perhaps Soumith has some ideas...
