training big transformer with mixed precision

oravec...@gmail.com

Apr 11, 2023, 6:56:20 AM
to marian-nmt
We've been training a few big-transformer models with Marian 1.12.0 on about 25M segments, using 4 A100 GPUs and roughly the following settings:

--workspace 34000
--type transformer
--dim-vocabs 36000 36000
--enc-depth 6 --dec-depth 6
--max-length 150
--mini-batch-fit
--sync-sgd
--learn-rate .0002
--label-smoothing 0.1
--clip-norm 5
--tied-embeddings-all --lr-warmup 8000 --lr-decay-inv-sqrt 8000
--optimizer-params 0.9 0.998 1e-09
--transformer-dropout 0.1 --exponential-smoothing
--dim-emb 1024 --transformer-dim-ffn 4096
--transformer-heads 16
--transformer-postprocess dan
--transformer-ffn-activation relu --optimizer-delay 1
--transformer-dropout-attention 0.1 --transformer-dropout-ffn 0.1
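
For reference, these flags all go into a single training invocation roughly like the sketch below; the binary path, data paths, vocab files and model name are placeholders rather than our actual setup:

/path/to/marian \
    --devices 0 1 2 3 \
    --model model.npz \
    --train-sets corpus.src corpus.trg \
    --vocabs vocab.src.spm vocab.trg.spm \
    [plus the flags listed above]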

These trainings run fine. Using the exact same settings with mixed-precision training (--fp16) gives a speedup of about 2x, but we run into numerical instability after around 20 epochs ('normal' trainings run up to about 40 epochs). Decreasing the learning rate to 0.0001 makes mixed-precision training stable, but convergence gets a lot slower, so in the end what we gain in speed we lose in overall training time.
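
Concretely, the mixed-precision runs differ from the baseline above only in the flags below, with the .0001 learn rate being the variant that is stable but converges slowly. If I read the options correctly, --fp16 is a shortcut for --precision float16 float32 plus default cost-scaling, so the optimizer state should still be kept in float32.

--fp16
--learn-rate .0001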

Does anyone happen to have suggestions for a set of mixed-precision hyperparameters that would be worth experimenting with and could yield fast and stable trainings?

Many thanks,
csaba
