Discrepancy in training between two machines


aksha...@gmail.com

Jul 7, 2016, 11:54:07 AM
to lasagne-users
Hi

I am training a GAN (following the Improved GAN paper by Salimans et al.). The network trains fine on my machine, but when I run the same code on another machine it gives me NaNs in the first or second epoch. Both machines use a Titan X with the same versions of Theano (0.8.2) and Lasagne (latest from GitHub).

To check whether this was an effect of randomness, I tried setting seeds, but it doesn't help:
    rng = np.random.RandomState(123)
    theano_rng = MRG_RandomStreams(rng.randint(2 ** 15))
    lasagne.random.set_rng(np.random.RandomState(rng.randint(2 ** 15)))

I have tried many restarts with and without the seeds, but the issue persists every time. What could be the reason? Is it due to batch normalization on the generator layers (I am using deterministic=False while training)?

def GAN_g(Z_DIM, X_DIM, inp=None):
    model = {}
    model['input'] = L.InputLayer(input_var=inp, shape=(None, Z_DIM))
    model['dense1'] = L.batch_norm(L.DenseLayer(model['input'], num_units=400,
                        nonlinearity=lasagne.nonlinearities.softplus, W=lasagne.init.GlorotNormal()))
    model['dense2'] = L.batch_norm(L.DenseLayer(model['dense1'], num_units=400,
                        nonlinearity=lasagne.nonlinearities.softplus, W=lasagne.init.GlorotNormal()))
    model['out'] = L.batch_norm(L.DenseLayer(model['dense2'], num_units=X_DIM**2,
                        nonlinearity=lasagne.nonlinearities.sigmoid, W=lasagne.init.GlorotNormal()))
    return model


def GAN_d(X_DIM, inp=None):
    model = {}
    model['input'] = L.InputLayer(shape=(None, X_DIM**2), input_var=inp)
    model['dense1'] = L.DenseLayer(model['input'], num_units=400,
                        nonlinearity=lasagne.nonlinearities.rectify, W=lasagne.init.GlorotNormal())
    model['drop1'] = L.GaussianNoiseLayer(model['dense1'], sigma=0.2)
    model['dense2'] = L.DenseLayer(model['drop1'], num_units=400,
                        nonlinearity=lasagne.nonlinearities.rectify, W=lasagne.init.GlorotNormal())
    model['drop2'] = L.GaussianNoiseLayer(model['dense2'], sigma=0.4)
    model['dense3'] = L.DenseLayer(model['drop2'], num_units=200,
                        nonlinearity=lasagne.nonlinearities.rectify, W=lasagne.init.GlorotNormal())
    model['drop3'] = L.GaussianNoiseLayer(model['dense3'], sigma=0.4)
    model['out'] = L.DenseLayer(model['drop3'], num_units=1, nonlinearity=lasagne.nonlinearities.sigmoid)
    return model
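
For context, here is roughly how I wire these up (a simplified sketch, not my exact script; Z_DIM, X_DIM, inp_z and inp_x are placeholders):

import theano.tensor as T
import lasagne
import lasagne.layers as L

Z_DIM, X_DIM = 100, 28

# symbolic inputs: noise vectors for the generator, flattened images for the discriminator
inp_z = T.matrix('z')
inp_x = T.matrix('x')

g = GAN_g(Z_DIM, X_DIM, inp=inp_z)
d = GAN_d(X_DIM, inp=inp_x)

# deterministic=False during training, so batch norm uses batch statistics
# and the Gaussian noise layers are active
g_out = L.get_output(g['out'], deterministic=False)
d_fake = L.get_output(d['out'], {d['input']: g_out}, deterministic=False)
d_real = L.get_output(d['out'], deterministic=False)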


Any suggestions on how I can identify the problem?

Thanks

aksha...@gmail.com

Jul 7, 2016, 5:28:16 PM
to lasagne-users
I ran the code with NanGuardMode and got the following log: https://goo.gl/nUwyVT
The function where the stack trace occurs is the one that trains the discriminator. These are the Theano expressions it uses:

d_dist_fake = L.get_output(d['dist'], {d['input']: g_out})
d_dist_real = L.get_output(d['dist'])
value_d_dist = T.mean(T.log(d_dist_real + TINY) + T.log(1 - d_dist_fake + TINY))

acc_dist_fake = T.mean(T.eq(T.gt(d_dist_real, 0.5), np.ones((BATCH_SIZE, 1)).astype(np.int8)), dtype=theano.config.floatX)
acc_dist_real = T.mean(T.eq(T.le(d_dist_fake, 0.5), np.ones((BATCH_SIZE, 1)).astype(np.int8)), dtype=theano.config.floatX)
acc_dist = (acc_dist_real + acc_dist_fake) / 2
updates_d_dist = lasagne.updates.adam(-value_d_dist, params_d_dist, learning_rate=lr_d, beta1=0.5)

train_d_dist = theano.function([inp_z, inp_y, inp_x, lr_d], [value_d_dist, acc_dist], updates=updates_d_dist)
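
For reference, this is roughly how I enabled NanGuardMode (a sketch of my setup, not the exact code):

from theano.compile.nanguardmode import NanGuardMode

# compile the training function under NanGuardMode so the first NaN/Inf
# aborts with a stack trace instead of silently propagating
guard = NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True)
train_d_dist = theano.function([inp_z, inp_y, inp_x, lr_d],
                               [value_d_dist, acc_dist],
                               updates=updates_d_dist,
                               mode=guard)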



Can somebody point out where the issue might be occurring?

Thanks

Jan Schlüter

Jul 8, 2016, 5:47:10 AM
to lasagne-users
Can somebody point out where the issue might be occurring?

The only direct source of potential trouble I see is the T.log -- how large is TINY? Can you try making it a little larger?
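
For example, just an illustration with your expression from above (1e-6 is an arbitrary choice):

# a larger TINY bounds both the argument of T.log and the 1/(x + TINY)
# factor that shows up in its gradient
TINY = 1e-6
value_d_dist = T.mean(T.log(d_dist_real + TINY) + T.log(1 - d_dist_fake + TINY))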

What's troublesome is that it only runs into problems on one of the two machines. To make it fully reproducible (up to rounding differences in the hardware), in addition to setting the random seed, you'll need to use:
THEANO_FLAGS=dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic
This will make it slower, though. If possible, also try rebooting the machine; I've seen untrackable problems disappear after doing that.

Good luck, and let us know what you find!
Best, Jan

aksha...@gmail.com

Jul 8, 2016, 6:53:57 AM
to lasagne-users

So I ran my code with optimizer=None on the faulty machine, and it seems to run fine, just like on the first machine. TINY is currently 1e-8; I'll try increasing it. FWIW, the first machine is not using TINY.
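
This is roughly how I disabled the optimizer (a sketch; in practice I set the flag on the command line, which replaces any flags already in the environment):

import os
# THEANO_FLAGS is read when Theano is imported, so it must be set first;
# optimizer=None disables graph optimizations (much slower, but useful for debugging)
os.environ['THEANO_FLAGS'] = 'optimizer=None'
import theano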

Also, my code isn't using any conv layers - will setting the flags you proposed make any difference?

Is there a better way to tell from the trace which op might be causing this issue?

Thanks

Jan Schlüter

Jul 8, 2016, 9:08:55 AM
to lasagne-users, aksha...@gmail.com
Is there a better way to tell from the trace which op might be causing this issue?

You can read through the trace to try to understand what operation in your training function it relates to. The beginning is:

GpuElemwise{Composite{((i0 * (i1 / i2) * i3) * i4)},no_inplace} [id A] ''
|CudaNdarrayConstant{[-1.]} [id B]
|GpuDimShuffle{0} [id C] ''
| |GpuElemwise{Composite{((i0 / i1) / i2)},no_inplace} [id D] ''
...
|Assert{msg='Theano Assert failed!'} [id R] ''
...
|GpuDimShuffle{0} [id IU] ''
| |GpuElemwise{ScalarSigmoid}[(0, 0)] [id V] ''
|GpuDimShuffle{0} [id S] ''

So the operation the problem occurs in computes i0 * (i1 / i2) * i3 * i4, where
- i0 is CudaNdarrayConstant{[-1.]} [id B]
- i1 is GpuDimShuffle{0} [id C] ''
- i2 is Assert{msg='Theano Assert failed!'} [id R] ''
- i3 is GpuDimShuffle{0} [id IU] ''
- i4 is GpuDimShuffle{0} [id S] ''
If a node doesn't have anything underneath, it's either an input variable or it has been explained before -- e.g., i4 has "[id S]" which is listed earlier in the trace.

Now, about the only thing that can go wrong in i0 * (i1 / i2) * i3 * i4 is a division by zero. Any other way to produce a NaN would require one of the inputs to be NaN or inf already, and that would have been caught earlier (make sure you configured NanGuardMode to catch inf and -inf as well). The beginning of i2 shows that it computes (1 - sigmoid(something + b)), where "something" involves a lot of scalar_softplus and dot products - probably a neural network.

Try to figure out where you divide by one minus the output of a sigmoid layer, and ensure this output cannot get too large.
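
For instance, a rough sketch using the names from your earlier snippet (the clip bound is arbitrary):

# keep the discriminator outputs strictly inside (0, 1) so that log(d) and
# log(1 - d) - and the divisions appearing in their gradients - stay finite
eps = 1e-6
d_dist_fake_safe = T.clip(d_dist_fake, eps, 1.0 - eps)
d_dist_real_safe = T.clip(d_dist_real, eps, 1.0 - eps)
value_d_dist = T.mean(T.log(d_dist_real_safe) + T.log(1.0 - d_dist_fake_safe))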


Also, my code isn't using any conv layers - will setting the flags you proposed make any difference?

No, in this case it won't.

FWIW the first machine is not using TINY.

Then this probably isn't the way to fix it. It's mysterious. Are you sure everything else is the same (driver, CUDA version, Theano version)?

Best, Jan