Hi all,
I have a basic seq2seq model with attention implemented in both DyNet and PyTorch. The two implementations behave very differently, and I can't figure out why. Are there any known differences between DyNet and PyTorch for equivalent operations?
The model is trained with teacher forcing. Attention scores are computed as a dot product between the encoder and decoder hidden states. The attention context vector is concatenated to the decoder hidden state, projected back down to the decoder hidden size, and then projected again to the vocabulary size.
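Concretely, the attention step looks roughly like this (a simplified sketch, not the exact code from the repo; the module and variable names are illustrative, and it assumes batch size 1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionOutput(nn.Module):
    """Dot-product attention plus the two output projections (batch size 1)."""

    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        # [context; decoder state] -> decoder hidden size
        self.hidden_proj = nn.Linear(2 * hidden_size, hidden_size)
        # decoder hidden size -> vocabulary size
        self.vocab_proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, enc_states, dec_state):
        # enc_states: (src_len, hidden_size); dec_state: (hidden_size,)
        scores = enc_states @ dec_state          # dot-product attention scores, (src_len,)
        weights = F.softmax(scores, dim=0)       # distribution over source positions
        context = weights @ enc_states           # context vector, (hidden_size,)
        combined = torch.cat([context, dec_state])  # concatenate context and decoder state
        hidden = self.hidden_proj(combined)      # project back to the decoder hidden size
        return self.vocab_proj(hidden)           # project to the vocabulary size

# Quick smoke test:
attn = AttentionOutput(hidden_size=4, vocab_size=10)
logits = attn(torch.randn(5, 4), torch.randn(4))
print(logits.shape)  # torch.Size([10])
```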
As far as I can tell, both implementations compute the same function. However, with SGD and a learning rate of 0.1, the PyTorch model very quickly hits exploding gradients, while the DyNet model trains fine. I'm more experienced with PyTorch, so it's possible I've done something dumb on the DyNet side. To narrow things down, I've simplified the code as much as possible: batch size 1 and an Elman RNN.
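For context, the failing configuration is nothing exotic; the PyTorch training step has the shape below (a toy module stands in for the real model here, and the gradient-norm print is just one way to watch the blow-up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(8, 8)  # toy stand-in for the real seq2seq model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # the settings where PyTorch diverges

src, tgt = torch.randn(1, 8), torch.randn(1, 8)  # batch size 1

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(src), tgt)  # the real loss is cross-entropy over target tokens
loss.backward()
# Report the total gradient norm each step; in the real model this explodes
# almost immediately under PyTorch but stays bounded under DyNet.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf"))
print(f"loss={loss.item():.4f}  grad_norm={grad_norm.item():.4f}")
optimizer.step()
```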
If you want to reproduce what I'm seeing, the README in the repo has setup instructions.
Thanks!
Dan