Debugging difference between DyNet and PyTorch


Daniel Deutsch

Jan 3, 2019, 10:41:01 PM
to DyNet Users
Hi all,

I have a basic seq2seq with attention model implemented in both DyNet and PyTorch. For some reason they behave very differently during training, and I can't figure out why. Are there any known differences between DyNet and PyTorch for equivalent functions?

The model I've implemented is a seq2seq model with attention trained with teacher forcing. The attention scores are computed using a dot product between the encoder and decoder states. The attention context vector is concatenated to the decoder hidden state, projected back to the decoder hidden size, and then projected again to the vocabulary size.
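For concreteness, the attention step described above can be sketched in plain Python (names and shapes are illustrative, not the actual repo code):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention_context(decoder_state, encoder_states):
    # Scores: dot product between the decoder state and each encoder state.
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    # Context: attention-weighted sum of the encoder states.
    dim = len(decoder_state)
    return [sum(w * h[i] for w, h in zip(weights, encoder_states))
            for i in range(dim)]

ctx = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the model above, `ctx` would then be concatenated to the decoder hidden state and passed through the two projection layers.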

As far as I can tell, both the PyTorch and DyNet models are computing the same function. However, with SGD and a learning rate of 0.1, the PyTorch code very quickly suffers an exploding gradient, but the DyNet code is fine. I'm more experienced with PyTorch, so it's possible that I've done something dumb with DyNet. I've simplified the code as much as possible by using a batch size of 1 and an Elman RNN.


If you want to reproduce what I'm seeing, the readme in the repo has the setup instructions.

Thanks!

Dan


Daniel Deutsch

Jan 3, 2019, 10:43:43 PM
to DyNet Users
I'm using DyNet version 2.0 and PyTorch version 1.0.0.

Daniel Deutsch

Jan 4, 2019, 10:45:53 AM
to DyNet Users
I simplified the code even further by removing attention altogether and using a unidirectional encoder, and the PyTorch issues persist. There are now two variables that cause issues with PyTorch: summing the loss instead of averaging it, and including the extra linear projection layer. DyNet behaves nicely regardless of the configuration.
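The sum-vs-mean difference matters more than it might seem: with a summed loss, the gradient magnitude grows with sequence length, so the effective step size does too. A toy illustration (numbers are hypothetical):

```python
# Why summing the per-token loss (instead of averaging) can destabilize
# training: the gradient of a summed loss scales with sequence length T.

per_token_grad = 0.5   # hypothetical gradient contribution of one token
T = 40                 # sequence length

grad_sum = per_token_grad * T   # loss = sum of per-token losses
grad_mean = per_token_grad      # loss = mean of per-token losses

print(grad_sum / grad_mean)     # 40.0: the summed loss takes a 40x larger step
```

With a fixed learning rate of 0.1, that scaling alone can push long sequences past the point where gradients blow up.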

I think at this point, it's very likely an issue with PyTorch, and I will make a post on the PyTorch forums later today. That being said, I would appreciate someone with more DyNet experience taking a quick look at the DyNet code to make sure I'm not doing anything dumb.

Jonathan K

Jan 4, 2019, 11:10:23 AM
to Daniel Deutsch, DyNet Users
Hi,
One possibility is gradient clipping, which is on by default in DyNet, but not in PyTorch.
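For reference, DyNet's default is global-norm clipping (threshold 5.0, if I recall correctly), while in PyTorch you must opt in with `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` before the optimizer step. A framework-free sketch of the mechanism:

```python
import math

# Global-norm gradient clipping: if the norm of the full gradient vector
# exceeds the threshold, rescale all gradients so the norm equals it.
def clip_by_global_norm(grads, threshold=5.0):
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads

clipped = clip_by_global_norm([3.0, 4.0, 12.0], threshold=5.0)
# total norm was 13.0; after clipping the norm is exactly 5.0
```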

I also found it tricky to pin down the differences between frameworks, so I put together a comparison page for an example task (tagging) that you might find interesting: http://jkk.name/neural-tagger-tutorial/

Good luck!

Jonathan


Daniel Deutsch

Jan 4, 2019, 11:39:07 AM
to DyNet Users
I think you are right. Removing the clipping threshold in DyNet made the gradient explode similarly to PyTorch, and when I clipped the PyTorch gradients, the results were much more similar. I need to do more tests with the full code that I'm trying to port, but I believe this is the issue. Your comparison page was very helpful.

Thanks!