Where was "hundreds of epochs" mentioned? Surely not for a dataset of thousands of hours?
A few of the reasons we use relatively few epochs in Kaldi are as follows:
- We actually count epochs *after* augmentation, and with a system that has frame-subsampling-factor of 3 we separately train on the data shifted by -1, 0 and 1 and count that all as one epoch. So for 3-fold augmentation and frame-subsampling-factor=3, each "epoch" actually ends up seeing the data 9 times.
- Kaldi uses natural gradient, which has better convergence properties than plain SGD and lets you train with larger learning rates; that alone might let you reduce the num-epochs by at least a factor of 1.5 or 2 versus what you'd use with normal SGD. (A toy sketch of the preconditioning idea follows this list.)
- We do model averaging at the end-- averaging over the models from the last few iterations of training (an iteration is an interval of usually a couple of minutes' training time). This lets us keep relatively high learning rates at the end without worrying too much about the added noise, which further decreases the training time. It wouldn't work without the natural gradient, which stops the model from moving too far in the more important directions within parameter space. (See the averaging sketch after this list.)
- We start with alignments learned from a GMM system, so the nnet doesn't have to do all the work of figuring out the alignments-- i.e. it's not training from a completely uninformed start.
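
For the natural-gradient point: Kaldi's actual NG-SGD keeps online low-rank Fisher-matrix estimates per weight matrix, which is more involved than I can sketch here. As a rough, hypothetical illustration of the general preconditioning idea only-- a diagonal Fisher approximation, closer in spirit to RMSProp than to Kaldi's method, with all names being mine:

```python
import numpy as np

def toy_preconditioned_step(params, grad, fisher_diag, lr,
                            damping=1e-4, decay=0.95):
    # Running estimate of E[grad^2], a crude diagonal Fisher proxy
    # (NOT Kaldi's low-rank online estimate; illustration only).
    fisher_diag = decay * fisher_diag + (1.0 - decay) * grad ** 2
    # Scale down directions with consistently large gradients; this is
    # what makes a larger global learning rate safe to use.
    precond_grad = grad / (fisher_diag + damping)
    return params - lr * precond_grad, fisher_diag
```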
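And for the model averaging, a minimal sketch, assuming each checkpoint is stored as a dict mapping parameter name to a numpy array (that storage format is my assumption, not Kaldi's):

```python
import numpy as np

def average_checkpoints(checkpoints):
    # checkpoints: list of {param_name: np.ndarray}, e.g. the models
    # written at the last few training iterations (assumed format).
    # Averaging cancels much of the noise from the high final learning
    # rate, provided the models stay close in parameter space.
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}
```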
So when we say we are using 5 epochs, we are really seeing the data more like 50 times (5 epochs times the 9 effective passes above). If we didn't have those tricks (natural gradient, model averaging) that might have to be more like 100 or 150 epochs, and without knowing the alignments, maybe 200 or 300 epochs. Also it's likely that attention-based models take longer to train than the more standard models that we use.
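
To make that arithmetic concrete (the function and parameter names below are mine for illustration, not Kaldi options):

```python
def effective_data_passes(num_epochs, num_augmentations=3, num_frame_shifts=3):
    # With 3-fold augmentation and frame-subsampling-factor=3 (shifts of
    # -1, 0 and +1 all counted inside one "epoch"), a nominal epoch is
    # 9 passes over the raw data.
    return num_epochs * num_augmentations * num_frame_shifts

print(effective_data_passes(5))  # 5 * 3 * 3 = 45, i.e. roughly 50 passes
```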