Increased GPU VRAM and workspace improving training performance?

Adrian Yalda

Jul 6, 2024, 2:22:14 PM
to marian-nmt
Hello,

I recently set up a new HP 380a Gen 11 server with 4x Nvidia L40S GPUs. This let us run a training task with a much larger workspace than our earlier training on 4x Tesla T4s (the workspace went from 10 GB to 40 GB when moving from the T4s to the L40S GPUs).

We weren't expecting this, but the increased workspace seems to have improved performance considerably. The BLEU score at the end of training on the T4s was around 38; I am in the middle of the second epoch on the L40S GPUs and the BLEU score is already above 45. We are using `--mini-batch-fit`, so I assume this comes from the larger mini-batch size (the parameters are updated from more data at each step, so they generalize better?). The mini-batch size on the T4s was around 500 sentences (125 per T4), and it is now around 2000 sentences (500 per L40S). Training also moves through our data set much faster, so it may end up running for more epochs than the previous run did.
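
For reference, here is a rough sketch of the kind of invocation I mean; the data paths, vocab files, and model settings are placeholders rather than our actual configuration:

```
# Sketch only: model path, training data, and vocabs below are placeholders.
marian \
  --model model/model.npz \
  --train-sets data/corpus.src data/corpus.tgt \
  --vocabs model/vocab.spm model/vocab.spm \
  --devices 0 1 2 3 \
  --workspace 40000 \
  --mini-batch-fit
# --workspace is given in MB per device (40000 on the L40S run, 10000 on the T4 run).
# --mini-batch-fit lets Marian pick the largest mini-batch that fits the workspace,
# which is why the batch grew from ~125 to ~500 sentences per GPU.
```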

My question is: is it expected behavior for an increased workspace to improve training performance? If so, would I theoretically get the best performance just by getting the biggest GPUs possible?

Also, if I can't add more GPU capacity, could I increase the mini-batch size even further to get better training performance?
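
To make that last question concrete: if I understand the options correctly, something like `--optimizer-delay` accumulates gradients over several mini-batches before each update, which would give a larger effective batch without needing more workspace. A hypothetical variant of the command above:

```
# Hypothetical: accumulate gradients over 2 mini-batches per optimizer
# update (2x the effective batch size) without increasing --workspace.
marian \
  --model model/model.npz \
  --train-sets data/corpus.src data/corpus.tgt \
  --vocabs model/vocab.spm model/vocab.spm \
  --devices 0 1 2 3 \
  --workspace 40000 \
  --mini-batch-fit \
  --optimizer-delay 2
```

Would that be the right knob to turn if I wanted to push the effective batch size further?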

Thanks for any insights!