Loss and accuracy go to NaN and 0.

Gunnar Aastrand Grimnes

May 28, 2015, 10:02:49 AM
to keras...@googlegroups.com
I'm trying to train an RNN/LSTM network on some text input; after a handful of epochs the training loss goes to NaN and the accuracy to 0 ...

My dataset is quite skewed - there are about 10x more +1 examples than -1 - so maybe the weights simply grow without limit...

Has anyone else had this happen?

Using gpu device 0: GeForce GTX 760
Loading data...
(105545, 100)
Build model...
/usr/local/lib/python2.7/dist-packages/Theano-0.6.0-py2.7.egg/theano/sandbox/rng_mrg.py:774: UserWarning: MRG_RandomStreams Can't determine #streams from size (Shape.0), guessing 60*256
  nstreams = self.n_streams(size)
/usr/local/lib/python2.7/dist-packages/Theano-0.6.0-py2.7.egg/theano/scan_module/scan_perform_ext.py:85: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  from scan_perform.scan_perform import *
Train...
Train on 94990 samples, validate on 10555 samples
Epoch 0
WARNING: unused streams above 1024 (Tune GPU_mrg get_n_streams)
94990/94990 [==============================] - 569s - loss: 0.2330 - acc.: 0.9206 - val. loss: 0.2549 - val. acc.: 0.8398
Epoch 1
94990/94990 [==============================] - 572s - loss: 0.1611 - acc.: 0.9441 - val. loss: 0.2438 - val. acc.: 0.8614
Epoch 2
94990/94990 [==============================] - 573s - loss: 0.1012 - acc.: 0.9640 - val. loss: 0.2386 - val. acc.: 0.8551
Epoch 3
94990/94990 [==============================] - 563s - loss: 0.0746 - acc.: 0.9743 - val. loss: 0.2094 - val. acc.: 0.8787
Epoch 4
94990/94990 [==============================] - 557s - loss: 0.0554 - acc.: 0.9810 - val. loss: 0.2862 - val. acc.: 0.8534
Epoch 5
94990/94990 [==============================] - 566s - loss: 0.0416 - acc.: 0.9864 - val. loss: 0.3430 - val. acc.: 0.8556
Epoch 6
94990/94990 [==============================] - 565s - loss: nan - acc.: 0.7591 - val. loss: nan - val. acc.: 0.0000
Epoch 7
26224/94990 [=======>......................] - ETA: 406s - loss: nan - acc.: 0.0000



François Chollet

May 28, 2015, 1:12:34 PM
to Gunnar Aastrand Grimnes, keras...@googlegroups.com
You were already overfitting at that point ;-)

Chances are that some quantity in the optimizer or in a recurrent layer is going to zero and triggering a divide-by-zero error on the Theano side. You can try clipping your gradients (it might help, or it might not) with the optimizer parameter clipnorm=0.1, or switching to a different optimizer...
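For example, something along these lines (the values are just illustrative):

from keras.optimizers import SGD

# clipnorm rescales any gradient whose L2 norm exceeds the given value
sgd = SGD(lr=0.01, momentum=0.9, clipnorm=0.1)
model.compile(loss='binary_crossentropy', optimizer=sgd)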

Are you using BatchNormalization? If not, you should try it; it might help.

You can also try different values of the truncate_gradient parameter of your recurrent layers. More info about this parameter in the Theano doc: http://deeplearning.net/software/theano/library/scan.html
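e.g., if I recall the constructor signature correctly (the layer sizes are placeholders):

from keras.layers.recurrent import LSTM

# truncate_gradient limits how many timesteps gradients are backpropagated through (-1 = no truncation)
model.add(LSTM(128, 64, truncate_gradient=25))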

p.nec...@gmail.com

May 28, 2015, 2:30:04 PM
to keras...@googlegroups.com
Maybe your learning rate is too high and saturates all those hard sigmoids, and then learning dies. Try replacing them with normal sigmoids, or just make them linear.
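With the stock Keras LSTM that would be the inner_activation argument, if I remember the name right (the sizes below are placeholders):

from keras.layers.recurrent import LSTM

# the gates use hard_sigmoid by default; swap in a plain sigmoid to test this theory
model.add(LSTM(128, 64, activation='tanh', inner_activation='sigmoid'))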

On Thursday, May 28, 2015 at 4:02:49 PM UTC+2, Gunnar Aastrand Grimnes wrote:

Gunnar Aastrand Grimnes

May 29, 2015, 7:42:12 AM
to keras...@googlegroups.com
Thanks for your input! I've fixed it now - as usual, it was a stupid mistake.

First I played a bit more with this - clipnorm, batch normalization, varying the optimizer (adagrad, adam, rmsprop ...) and the batch size all made no difference. Then I realised it only happens when running on the GPU.

Then I realised I had not rebooted since upgrading the nvidia driver; I had modprobe'd the new module, but maybe something wasn't quite right. I also realised my numpy/scipy/theano stack was slightly out of date.

A reboot and pip upgrade later, it appears to work.

Sorry to have wasted your time - hopefully the next person with this problem will see this and try rebooting first :)

Cheers, 
- Gunnar

Eduardo Franco

Jun 2, 2015, 12:19:39 PM
to keras...@googlegroups.com
Hi All,
I'm also having an issue with the loss going to NaN, but using only a single-layer net with 512 hidden nodes. I've tried clipnorm, batch normalization, various optimizers, and updates to numpy/scipy/theano, yet the loss goes to NaN quickly - by the 3rd epoch at the latest, sometimes in the 1st. I have roughly 2.7m training samples, each with 90 dimensions. Here's the model I'm using:

from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.layers.normalization import BatchNormalization
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(90, 512, init='uniform', activation='relu'))
model.add(Dropout(0.5))
model.add(BatchNormalization((512,), epsilon=1e-6, weights=None))
model.add(Dense(512, 1, init='uniform', activation='sigmoid'))
sgd = SGD(lr=1e-2, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=0.1)
model.compile(loss='binary_crossentropy', optimizer=sgd)

I've tried sigmoid and tanh activations instead of relu in the hidden layer, and different learning rates (which just delays when the NaN loss happens). Looking at the weights at every layer, they are all NaN. Any other suggestions that might help remedy the situation?

Thanks

Dan Becker

Jun 8, 2015, 1:51:28 PM
to keras...@googlegroups.com
I recently observed the same phenomenon, and it disappeared when I removed dropout. I still need to think more deeply about what was going on there, but I'd be curious to hear whether you stop getting the NaNs if you remove dropout.

XJ Wang

Jun 14, 2015, 2:22:05 AM
to keras...@googlegroups.com
Same here with the NaN issue, with a simple single-layer network. No luck after removing dropout...

Marcin Elantkowski

Jun 14, 2015, 9:33:07 AM
to keras...@googlegroups.com
Go to your keras directory, find objectives.py, and change line 7 from:
epsilon = 1.0e-15
to:
epsilon = 1.0e-7

Rebuild and reload the package. My guess is this should help, although I haven't tested it.
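The likely reason this helps (my own back-of-the-envelope check, not verified against the Keras source): the loss clips predictions to [epsilon, 1 - epsilon], but in float32 that upper bound does nothing when epsilon = 1.0e-15:

import numpy as np

# The spacing just below 1.0 in float32 is about 6e-8, so 1 - 1e-15 rounds
# back to exactly 1.0 and a prediction of 1.0 still hits log(1 - y_pred) = log(0).
print(np.float32(1.0) - np.float32(1e-15) == np.float32(1.0))  # True
print(np.float32(1.0) - np.float32(1e-7) == np.float32(1.0))   # False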

François Chollet

Jun 14, 2015, 1:21:41 PM
to Marcin Elantkowski, keras...@googlegroups.com
To note: the epsilon parameter is only used in binary_crossentropy and categorical_crossentropy. 

It would be useful to know what loss and optimizer you were using when you experienced this problem. 

Gunnar Aastrand Grimnes

Jun 16, 2015, 7:26:38 AM
to keras...@googlegroups.com, marcin.el...@gmail.com
Despite what I said above, I keep encountering this every so often.

Two things fix it:

 * Not running on the GPU
 * Setting floatX=float32 in the Theano options (see the snippet below)

(I've not tried the epsilon fix)
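In case it's useful, one way to set that flag is in ~/.theanorc (it can also go in the THEANO_FLAGS environment variable):

[global]
floatX = float32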

- Gunnar 

XJ Wang

Jun 19, 2015, 8:20:38 PM
to keras...@googlegroups.com, marcin.el...@gmail.com
In my case, the problem only happens in GPU mode (with float32), for binary/categorical cross-entropy loss, regardless of which optimizer. CPU is okay. The epsilon fix sounds relevant; I haven't tried it either.

Eduardo Franco

Jun 24, 2015, 2:22:42 PM
to keras...@googlegroups.com, marcin.el...@gmail.com
Changing the loss from binary/categorical cross-entropy to mean_squared_error fixed this for me and still yielded similar results.

Dan Becker

Jun 25, 2015, 12:06:16 AM
to keras...@googlegroups.com, marcin.el...@gmail.com
Sorry for just getting around to this now. After pulling the latest version of keras, I found I could not replicate the NaN for the training loss (but I was still getting it for the val loss).

Nevertheless, I tried Marcin's suggestion to change epsilon, and that resolved the NaNs I was seeing in the val loss. In case it is useful, my model is:

    model = Sequential()
    model.add(Dense(283, 1500, init='glorot_uniform', activation='relu'))
    model.add(Dropout(0.5))
    for i in range(4):
        model.add(BatchNormalization((1500,)))
        model.add(PReLU((1500,)))
        model.add(Dropout(0.5))
    model.add(Dense(hidden_size, 1, init='glorot_uniform', activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='rmsprop', class_mode="binary")

After looking at objectives.py, my intuition is that changing epsilon would resolve NaN issues for the training loss too, but I wasn't able to replicate those to test this.

rex.y...@gmail.com

Jul 27, 2015, 5:58:39 AM
to Keras-users, grom...@gmail.com, francois...@gmail.com
I tried the different things listed in this thread, and the clipnorm setting works, but can you give me a brief explanation of why it works, or point me to some related reading material?

Many thanks,

Anuj Gupta

Feb 2, 2016, 7:19:58 AM
to Keras-users
I am facing a somewhat similar problem: in my case the loss and validation loss are NaN from the 1st epoch; however, unlike the problem stated by some people, my accuracy and validation accuracy are 1.0.

Train on 3962 samples, validate on 992 samples
Epoch 1/20
0s - loss: NaN - acc: 1.0000 - val_loss: NaN - val_acc: 1.0000
Epoch 2/20
0s - loss: NaN - acc: 1.0000 - val_loss: NaN - val_acc: 1.0000
Epoch 3/20
0s - loss: NaN - acc: 1.0000 - val_loss: NaN - val_acc: 1.0000
Epoch 4/20
0s - loss: NaN - acc: 1.0000 - val_loss: NaN - val_acc: 1.0000
Epoch 5/20
0s - loss: NaN - acc: 1.0000 - val_loss: NaN - val_acc: 1.0000

As suggested by some people, I changed my loss function:
#model.compile(loss='categorical_crossentropy', optimizer=rms)
model.compile(loss='mean_squared_error', optimizer=rms)

For now, it works.

danlanc...@gmail.com

Apr 8, 2016, 8:52:21 PM
to Keras-users
Why did you change it to mean squared error? If it is a classification problem, shouldn't you use categorical cross-entropy?

daniel...@datarobot.com

Apr 9, 2016, 7:30:40 AM
to Keras-users, danlanc...@gmail.com
Though mean squared error is probably a very uncommon (and most people would say inappropriate) metric for classification, it should resolve the NaN loss, since MSE over the range of probabilities that might come from the model will be between 0 and 1 (whereas the cross-entropy error could take the log of a very small number, or even 0).
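A quick numeric illustration of that point (my own toy numbers, nothing to do with the model above):

import numpy as np

y_true, y_pred = np.float32(1.0), np.float32(1e-10)  # a confidently wrong prediction

print((y_true - y_pred) ** 2)     # the squared-error term stays bounded, ~1.0
print(-np.log(y_pred))            # the cross-entropy term blows up, ~23
print(-np.log(np.float32(0.0)))   # and is inf for an exact zero (numpy warns here)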

Though gradient clipping seems to be a common (and effective) approach to addressing NaN losses, these notes on debugging neural networks give some background on resolving this with a lower learning rate.

jjm...@uah.edu

Sep 9, 2016, 4:04:02 PM
to Keras-users
I was having a similar issue while trying to fine-tune the VGG16 model with a global average pooling layer added after the final Convolution2D layer. I removed the optimizer = fast_compile line from my .theanorc and it seems to be working correctly now.
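For anyone else hitting this, the offending line in ~/.theanorc should look something like this (under [global], if I remember the Theano config layout right); deleting it falls back to the default fast_run optimizer:

[global]
optimizer = fast_compile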