Not understanding the behaviour of Torch on GPU


Shuzhi Yu
Jan 25, 2018, 5:14:34 PM
to torch7
Hello everyone, I am trying to show experimentally that two nets A and B have the same loss and gradients during training, given the same sequence of training samples. This can be shown in theory, but I also need to show it empirically. I trained the two nets A and B on CPU with the same RNG seed; the losses of the two nets differed on the order of 10^(-13) at each training iteration, which can be explained as numerical error.
However, when I trained them on GPU, the difference was 'huge' (on the order of 10^(-1)); this exceeds what numerical error can explain. I suspect that the training sequence is not guaranteed to be the same on GPU, since the work may run on different GPU devices. I tried to synchronise and set the same RNG seed for all GPU devices with cutorch.synchronizeAll() and cutorch.manualSeedAll(seed), but this did not work.
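For reference, a minimal sketch of how one might seed every generator a Torch7 training run typically touches before building each net (this is not the poster's actual script; `seed` and `buildNet` are hypothetical placeholders):

require 'torch'
require 'cutorch'
require 'cunn'                     -- needed so nn modules can be moved to the GPU

local seed = 1234                  -- hypothetical value

local function reseedAll()
  torch.manualSeed(seed)           -- CPU generator: weight init, torch.rand, shuffles
  cutorch.manualSeedAll(seed)      -- one generator per visible GPU device
  math.randomseed(seed)            -- plain Lua math.random, if the data loader uses it
end

-- Build net A and net B starting from identical RNG state.
-- Note: cutorch.synchronizeAll() only waits for pending GPU work to finish;
-- it does not reset any RNG state.
reseedAll()
local netA = buildNet():cuda()     -- buildNet is assumed to exist in the poster's code
reseedAll()
local netB = buildNet():cuda()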

Do you think my suspicion is right? If it is, do you have any suggestions on how to fix the training sequence?

Thank you!

Ronak Kosti
Jan 26, 2018, 10:22:15 AM
to torch7
Given that you fix the RNG seed and the training sequence, it is difficult to see why your training should give such a huge error.
There seems to be a problem with the RNG seed in cutorch. When I set a seed and try to generate random values (torch.rand(10)), I get different values each time I re-run (after restarting the iTorch notebook).
This means that your GPUs are not using the same initialised weights on both devices! Maybe this helps you understand the problem.

Whereas, when I use torch.manualSeed(), it works fine, giving the same random values each time I re-run.
Should there be an issue opened about this, or am I missing something?
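One way to check which seed controls which generator (a sketch, assuming cutorch and a CUDA device are available; not code from the thread): torch.rand draws from the CPU generator, which torch.manualSeed reseeds, while random fills of CUDA tensors draw from the per-device generators that cutorch.manualSeed / cutorch.manualSeedAll control.

require 'torch'
require 'cutorch'

torch.manualSeed(42)
local cpuA = torch.rand(10)
torch.manualSeed(42)
local cpuB = torch.rand(10)
print((cpuA - cpuB):abs():max())            -- expected 0: same CPU seed, same draws

cutorch.manualSeed(42)
local gpuA = torch.CudaTensor(10):uniform()
cutorch.manualSeed(42)
local gpuB = torch.CudaTensor(10):uniform()
print((gpuA - gpuB):abs():max())            -- expected 0: same GPU seed, same draws

cutorch.manualSeed(42)
print(torch.rand(10))                       -- still varies: the GPU seed alone does not fix torch.rand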