Why do num_chunk_per_minibatch specs in chain recipes allow going down to 64 or 32, but not 1?


Kirill Katsnelson

Mar 28, 2019, 8:36:54 PM
to kaldi-help
Speaking of the nnet3 chain TDNN case, why are there specs for num_chunk_per_minibatch such as "128,64"? My understanding is that this means: prefer 128 chunks per minibatch, but if you have fewer left at the end of an iteration, take 64 and drop the rest. Would it make more sense to use just "1:128", to use up the rest at the end? What is the difference?
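[To make the two spec forms concrete, here is a toy partitioner under that reading of the semantics. This is an illustration only, not Kaldi's actual implementation.]

```python
def partition(num_chunks, spec):
    # Toy model of how a num_chunk_per_minibatch spec might carve an
    # iteration's chunks into minibatches -- my reading of the semantics
    # discussed above, NOT Kaldi's actual code.
    if ':' in spec:
        # Range form, e.g. "1:128": any size from lo to hi is acceptable.
        lo, hi = map(int, spec.split(':'))
        allowed = list(range(lo, hi + 1))
    else:
        # List form, e.g. "128,64": only the listed sizes are acceptable.
        allowed = [int(s) for s in spec.split(',')]
    sizes = []
    remaining = num_chunks
    for size in sorted(allowed, reverse=True):  # prefer the largest size
        while remaining >= size:
            sizes.append(size)
            remaining -= size
    return sizes, remaining  # leftover chunks are dropped this iteration
```

Under this reading, "128,64" on 300 chunks gives two 128-chunk minibatches and drops 44, while "1:128" mops up the remainder as one final 44-chunk minibatch.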

If there is really none, does it hold true for recurrent models?

 -kkm

Daniel Povey

Mar 28, 2019, 10:49:10 PM
to kaldi-help
The reason we don't try to use up the last few elements in a training run is that the compilation overhead is substantial, and processing small numbers of elements in a batch is slow anyway. Because we randomize the order with different random seeds on different iterations, we don't lose those chunks permanently; they will be used on other iterations.
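[The re-shuffling point can be sketched in a few lines: with a fresh seed each iteration, a different subset of chunks lands in the dropped tail. The names `batch` and `seed` here are made up for the illustration, not Kaldi options.]

```python
import random

def dropped_tail(chunks, batch, seed):
    # Which chunks land in the dropped tail when this iteration's chunks
    # are shuffled with the given seed? Illustration only.
    order = list(chunks)
    random.Random(seed).shuffle(order)
    tail = len(order) % batch
    return set(order[-tail:]) if tail else set()

# With a fresh seed each iteration, a different subset of chunks is
# dropped, so over many iterations essentially everything gets seen.
```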


Kirill Katsnelson

Mar 29, 2019, 12:57:53 AM
to kaldi-help
I see, thanks. So this is kind of a sweet spot for efficiency. In fact, with '1:32', I saw

178={2->1,32->54,d=0},187={5->1,32->29,d=0},226={11->1,32->545,d=0}

which is pretty much the same splitting as I was getting with '32,64':

178={32->1,64->27,d=31},187={32->1,64->14,d=30},226={32->1,64->279,d=4}

if I understand what this diagnostic means.
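[For reference, that diagnostic can be read mechanically. A throwaway parser, under my reading of the format as `eg-type={minibatch-size->count,...,d=discarded}` -- this interpretation is an assumption, not something confirmed by the Kaldi docs in this thread:]

```python
import re

def parse_merge_stats(line):
    # Parse nnet3 egs-merging diagnostics like '178={32->1,64->27,d=31}'
    # into {eg_type: ({minibatch_size: count}, discarded)}. My reading:
    # minibatch-size -> number of minibatches, d = examples discarded.
    stats = {}
    for m in re.finditer(r'(\d+)=\{([^}]*)\}', line):
        counts, discarded = {}, 0
        for item in m.group(2).split(','):
            if item.startswith('d='):
                discarded = int(item[2:])
            else:
                size, n = map(int, item.split('->'))
                counts[size] = n
        stats[int(m.group(1))] = (counts, discarded)
    return stats
```

Run on the '32,64' line above, this would say: for eg type 178, one minibatch of 32, twenty-seven of 64, and 31 examples discarded.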

Along the same lines, my common wisdom for minibatched SGD has been: if you have no idea, go with 32. I did a couple of runs with 32-sized minibatches, but toyed with other parameters as well. I cannot positively say I really saw an improvement, as I'd need an apples-to-apples comparison for that, but by some coincidence these experiments turned out the best (this is a mid-sized ~100-hour uncorrelated, 4-way processed dataset, with a tdnnf-based model). I did not notice much inefficiency either. But stock recipes most often go with 128 or even 256; my thinking is that with a multi-GPU split and averaging there is likely less advantage in frequent model updates, so it does not make sense to make the minibatch size small? You have probably done over 9000 times more runs than I have; I am wondering how important this setting is, and what the general idea would be.

I am asking because at times I am feeling like a child in a plane's cockpit: there are so many bright-colored knobs to play with!

 -kkm


Daniel Povey

Mar 29, 2019, 12:13:14 PM
to kaldi-help
I have never really done systematic experiments on the speed or WER effect of changes to the minibatch size. My assumption has been that as large a minibatch size as memory will allow will generally be the most efficient, but this may be wrong.

Incidentally, it will interact slightly with things like learning rates and max-change and num-epochs and so on, so I wouldn't necessarily trust any WER changes unless some effort was made to tune other things. 

Dan


Kirill Katsnelson

Mar 29, 2019, 9:40:48 PM
to kaldi-help
Thanks. I'll do some comparisons when I have machine time; it's interesting. I only have 2 GPUs, and am currently in an iteration of improving our production model, but otherwise they may sit idle for weeks while I'm on something else. That will be a good time for no-rush experimentation.

 -kkm

