Okay, let's go through the points:
0. I set export PYTHONHASHSEED="0" before launching Python to ensure that set iteration order is the same across runs (just in case).
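For anyone who wants to verify this point: PYTHONHASHSEED is read at interpreter startup, so it has no effect if set from inside an already-running process. A minimal sketch that checks the hash-based set ordering really is reproducible under a fixed seed (the helper name run_with_seed is mine, not from any library):

```python
import os
import subprocess
import sys

# The snippet we run in fresh interpreters: string hashes (and therefore
# hash-based orderings) depend on PYTHONHASHSEED.
code = "print(sorted({'a', 'b', 'c'}, key=hash))"

def run_with_seed(seed):
    # Spawn a new interpreter so the seed is picked up at startup.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run([sys.executable, "-c", code],
                            env=env, capture_output=True, text=True)
    return result.stdout

# With a fixed seed, two independent runs produce identical orderings.
assert run_with_seed("0") == run_with_seed("0")
print("hash-based ordering is reproducible with PYTHONHASHSEED=0")
```

Note that setting the variable inside the training script via os.environ is too late; it has to be exported in the shell (or the script has to re-exec itself).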
1. Yes, it is: I tested with and without batch normalization, and in both cases the input data is exactly the same.
2. No, still not the same accuracy in two consecutive runs. (When I removed the batch_norm_update line, the accuracies of two consecutive runs were 40% and 35% after the second epoch. With the line added, they were 14% and 15%, so still not identical. Also, as expected, the accuracy dropped significantly, so I think I performed the test correctly.)
3. No, same as above. (For these runs I used "deterministic=False, batch_norm_update_averages=False" for the training function and "deterministic=True, batch_norm_use_averages=False" for the test function.)
4. I used everything above plus the new Theano flags, and... it worked!
Even after removing the additional batch_norm_use_averages parameter, adding the Theano flags kept my runs reproducible. Awesome!
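For anyone landing here later: I'm not repeating the exact flags from earlier in the thread, but a common source of run-to-run nondeterminism on GPU is the non-deterministic cuDNN backward convolution algorithms, which Theano can pin down via THEANO_FLAGS. A sketch, assuming those are the flags in question (check the Theano config docs for your version):

```shell
# Assumption: the relevant flags are the deterministic cuDNN backward
# convolution algorithms; adjust if the thread referred to different ones.
export THEANO_FLAGS="dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic"
```

These trade some speed for bit-for-bit reproducible gradients, which matches the behavior I observed.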
So, is there a good reason why these algorithms are non-deterministic by default? I feel like a lot of people might fall into this trap and waste a lot of time trying to find the root cause. Maybe this could be avoided with a hint in the documentation, a warning, or a different default?
Thank you very much for your help!
Best regards!