It depends on the batch size you can afford (which is limited by your GPU memory)
and several other hyper-parameters (big vs. base model, etc.).
Using the WMT training data (4.5M sentence pairs), a batch size > 2k and 4 GPUs,
I think at least 300k steps are needed.
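For concreteness, a typical invocation might look like the sketch below; the paths, problem name and flag values are just placeholders (and flag names differ slightly between t2t versions), so adjust them to your setup:

    t2t-trainer \
      --data_dir=$DATA_DIR \
      --output_dir=$TRAIN_DIR \
      --problem=translate_ende_wmt32k \
      --model=transformer \
      --hparams_set=transformer_base \
      --hparams='batch_size=4096' \
      --worker_gpu=4 \
      --train_steps=500000  # intentionally high; rely on early stopping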
It is always better to set up t2t-trainer with many more steps, plot the dev-set BLEU learning curve (e.g. in TensorBoard)
and do early stopping (kill the training) when the dev-set BLEU starts worsening or is good enough for your purposes.
I mean, it may not be worth the money to run 4 GPUs for another week just to get an extra 0.1 BLEU improvement.
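If you want to automate that stopping decision, here is a minimal sketch in Python that reads the dev-set BLEU curve from the eval event files and reports when it stops improving. The eval directory layout and the "approx_bleu_score" tag are assumptions about what your eval job logs, so verify them first:

    import glob
    import tensorflow as tf  # TF 1.x, as used by tensor2tensor

    # ASSUMPTION: the eval job writes event files to $TRAIN_DIR/eval and
    # logs the dev-set metric under a tag containing "approx_bleu_score".
    EVAL_DIR = 'train_dir/eval'
    PATIENCE = 5  # number of evals without improvement before we give up

    # Collect (global_step, BLEU) pairs from all eval event files.
    curve = []
    for event_file in sorted(glob.glob(EVAL_DIR + '/events.out.tfevents.*')):
        for event in tf.train.summary_iterator(event_file):
            for value in event.summary.value:
                if 'approx_bleu_score' in value.tag:
                    curve.append((event.step, value.simple_value))
    curve.sort()

    best = max(curve, key=lambda pair: pair[1])
    print('best dev BLEU %.4f at step %d' % (best[1], best[0]))

    # If the best checkpoint is more than PATIENCE evals old, the curve
    # has flattened out and further training is probably wasted money.
    if curve.index(best) < len(curve) - PATIENCE:
        print('dev BLEU stopped improving -- kill the training')

Note that approx_bleu is computed on the internal subwords, so for the final number you should still run real BLEU (e.g. t2t-bleu) on decoded outputs.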