subset data size for GMM-HMM training


Anjul Sharma

Mar 30, 2022, 7:03:32 AM
to kaldi-help
Hi,

I want to train a TDNN model with a large amount of data. The dataset is very large, approximately 30k hours after a couple of custom augmentations (the augmentations add very high noise and strong speech distortion to make the ASR robust to all kinds of ambient conditions).

To train a GMM-HMM model I have chosen the following configuration (a rough command sketch is below the list):
1. Subset data size: 500k utterances, approximately 800 hours of randomly selected data.
   But I'm not sure 800 hours of random data would be sufficient to train a good GMM-HMM model for alignment.
2. Deltas training (train_deltas.sh):
    leaves: 11500
    gauss: 400000
3. LDA+MLLT training:
    leaves: 11500
    gauss: 800000
4. SAT training:
    leaves: 11500
    gauss: 1600000
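
For context, this is a minimal sketch of how such a pipeline might be chained in a standard Kaldi recipe. The data/lang/exp directory names, the nj value, and the monophone system used for the first alignment are placeholder assumptions, not something from this thread:

# Sketch only: assumes data/train (full set), data/lang, and an existing
# monophone system exp/mono for the first alignment; all paths are placeholders.

# 1. Random ~500k-utterance subset (~800 hr) of the full training data.
utils/subset_data_dir.sh data/train 500000 data/train_800h

# 2. Align with the previous system, then delta + delta-delta training.
steps/align_si.sh --nj 40 --cmd "run.pl" \
  data/train_800h data/lang exp/mono exp/mono_ali
steps/train_deltas.sh --cmd "run.pl" 11500 400000 \
  data/train_800h data/lang exp/mono_ali exp/tri1

# 3. LDA+MLLT on top of new alignments.
steps/align_si.sh --nj 40 --cmd "run.pl" \
  data/train_800h data/lang exp/tri1 exp/tri1_ali
steps/train_lda_mllt.sh --cmd "run.pl" 11500 800000 \
  data/train_800h data/lang exp/tri1_ali exp/tri2

# 4. SAT (fMLLR) training.
steps/align_si.sh --nj 40 --cmd "run.pl" \
  data/train_800h data/lang exp/tri2 exp/tri2_ali
steps/train_sat.sh --cmd "run.pl" 11500 1600000 \
  data/train_800h data/lang exp/tri2_ali exp/tri3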

Can you please guide me if I'm making any mistakes?

Thanks

Daniel Povey

Mar 30, 2022, 7:36:48 AM
to kaldi-help
That is enough, the final system is not very sensitive to the GMM system's quality anyway.


Anjul Sharma

Mar 30, 2022, 7:44:52 AM
to kaldi-help

Thanks, Dan, for the quick reply.

sis...@gmail.com

Sep 16, 2025, 1:10:49 AM
to kaldi-help
Hi Anjul,

We are training on the People's Speech dataset, around 26k hours (with some parts removed). Our server setup is:

RAM: 765 GB
CPUs: 128
GPUs: 3 * 48GB

But during the fMLLR alignment step we are getting core dump issues (we keep nj at 50).

My question is: what was your setup, and how many jobs (nj) should we use? I am also wondering whether we are doing anything wrong.

We also locked these:
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
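
For reference, this is roughly how the fMLLR alignment stage is invoked with an explicit job count in a Kaldi recipe; the directory names and the nj value here are placeholder assumptions, not taken from this thread:

# Sketch only: placeholder paths; exp/tri3 is assumed to be the SAT (fMLLR) system.
# With run.pl all nj jobs run on the local machine, so memory use grows with nj;
# lowering it is one way to reduce per-machine memory pressure.
nj=32
steps/align_fmllr.sh --nj $nj --cmd "run.pl" \
  data/train data/lang exp/tri3 exp/tri3_ali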
