I see, thanks. So this is kind of a sweet spot for efficiency. In fact, with '1:32', I saw
178={2->1,32->54,d=0},187={5->1,32->29,d=0},226={11->1,32->545,d=0}
and pretty much none of the discarded remainder I was getting with '32,64':
178={32->1,64->27,d=31},187={32->1,64->14,d=30},226={32->1,64->279,d=4}
if I understand what this diagnostic means.
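To check my own reading of those lines, here is a toy bucketing sketch (my guess at the idea, not Kaldi's actual merging code): a range spec like '1:32' allows any minibatch size in 1..32, so the remainder can always be absorbed into one small final minibatch, while a discrete list like '32,64' can leave examples that fill no allowed size (the d= part). The numbers below are made up to echo the diagnostics above.

```python
# Toy illustration of range vs. discrete minibatch-size specs.
# This is my mental model only, NOT Kaldi's real algorithm.
def allowed_sizes(spec):
    """'a:b' -> the inclusive range a..b; 'a,b,...' -> a discrete set."""
    if ':' in spec:
        lo, hi = map(int, spec.split(':'))
        return set(range(lo, hi + 1))
    return set(map(int, spec.split(',')))

def bucket(n, spec):
    """Greedily pack n same-shape examples into allowed minibatch sizes.
    Returns ({size: count}, n_left_over)."""
    used = {}
    for s in sorted(allowed_sizes(spec), reverse=True):
        k, n = divmod(n, s)
        if k:
            used[s] = k
    return used, n  # leftover examples fit no allowed size

# 1791 hypothetical examples of one shape:
print(bucket(1791, '1:32'))   # ({32: 55, 31: 1}, 0)   -- d=0
print(bucket(1791, '32,64'))  # ({64: 27, 32: 1}, 31)  -- d=31
```

The second line reproduces the {32->1,64->27,d=31} pattern above, which is what makes me think d= is the count of examples that could not be merged.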
Along the same lines, my common wisdom for minibatch SGD has been: if you have no idea, go with 32. I did a couple of runs with 32-sized minibatches, but I toyed with other parameters as well, so I cannot positively say I saw an improvement; I would need an apples-to-apples comparison for that. Still, by some coincidence these experiments turned out the best (this is a mid-sized ~100 hr uncorrelated, 4-way processed dataset, tdnnf-based model), and I did not notice much inefficiency either. But stock recipes most often go with 128 or even 256; my thinking is that with the multi-GPU split and averaging, there is likely less advantage in frequent model updates, so it does not make sense to make the minibatch size small? You probably ran over 9000 times more runs than I did; I am wondering how important this setting is, and what the general guidance would be.
I am asking because at times I feel like a child in an airplane cockpit: there are so many brightly colored knobs to play with!
-kkm