Okay, let's go through the points:
0. I set export PYTHONHASHSEED="0" before launching Python to ensure that set iteration order is the same across runs (just in case).
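For anyone who wants to verify this point: PYTHONHASHSEED is read at interpreter startup, so it has no effect if set from inside an already-running process. A minimal sketch that checks the hash-based set ordering really is reproducible under a fixed seed (the helper name run_with_seed is mine, not from any library):

```python
import os
import subprocess
import sys

# The snippet we run in fresh interpreters: string hashes (and therefore
# hash-based orderings) depend on PYTHONHASHSEED.
code = "print(sorted({'a', 'b', 'c'}, key=hash))"

def run_with_seed(seed):
    # Spawn a new interpreter so the seed is picked up at startup.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run([sys.executable, "-c", code],
                            env=env, capture_output=True, text=True)
    return result.stdout

# With a fixed seed, two independent runs produce identical orderings.
assert run_with_seed("0") == run_with_seed("0")
print("hash-based ordering is reproducible with PYTHONHASHSEED=0")
```

Note that setting the variable inside the training script via os.environ is too late; it has to be exported in the shell (or the script has to re-exec itself).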
1. Yes, it is: I tested with and without batch normalization, and in both cases the input data is exactly the same.
2. No, still not the same accuracy in two consecutive runs. (When I removed the batch_norm_update line, the accuracies of two consecutive runs were 40% and 35% after the second epoch. With the line added, they were 14% and 15%, so still not identical. Also, as expected, the accuracy dropped significantly, so I think I performed the test correctly.)
3. No, same as above. (For these runs I used "deterministic=False, batch_norm_update_averages=False" for the training function and "deterministic=True, batch_norm_use_averages=False" for the test function.)
4. I used everything above plus the new Theano flags, and... it worked!
Even after removing the additional batch_norm_use_averages parameter, adding the Theano flags kept my runs reproducible. Awesome!
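For anyone landing here later: I'm not repeating the exact flags from earlier in the thread, but a common source of run-to-run nondeterminism on GPU is the non-deterministic cuDNN backward convolution algorithms, which Theano can pin down via THEANO_FLAGS. A sketch, assuming those are the flags in question (check the Theano config docs for your version):

```shell
# Assumption: the relevant flags are the deterministic cuDNN backward
# convolution algorithms; adjust if the thread referred to different ones.
export THEANO_FLAGS="dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic"
```

These trade some speed for bit-for-bit reproducible gradients, which matches the behavior I observed.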
So, is there a good reason why these algorithms are non-deterministic by default? I feel like a lot of people might fall into this trap and waste a lot of time trying to find the root cause. Maybe this could be avoided with a hint in the documentation, a warning, or a different default?
Thank you very much for your help!
Best regards!