subset data size for GMM-HMM training


Anjul Sharma

Mar 30, 2022, 7:03:32 AM
to kaldi-help
Hi,

I want to train a TDNN model with a large amount of data. The dataset is very large, approximately 30k hours after a couple of custom augmentations (the augmentations add very high noise and strong speech distortion to make the ASR robust to all kinds of ambient conditions).

To train a GMM-HMM model I have chosen the following configuration (a rough command sketch is below the list):
1. Subset data size: 500k utterances, approximately 800 hours of randomly selected data.
   But I'm not sure 800 hours of random data would be sufficient to train a good GMM-HMM model for alignment.
2. Deltas training (train_deltas.sh):
    leaves: 11500
    gauss: 400000
3. LDA+MLLT training:
    leaves: 11500
    gauss: 800000
4. SAT training:
    leaves: 11500
    gauss: 1600000
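
For context, this is a minimal sketch of how such a pipeline might be chained in a standard Kaldi recipe. The data/lang/exp directory names, the nj value, and the monophone system used for the first alignment are placeholder assumptions, not something from this thread:

# Sketch only: assumes data/train (full set), data/lang, and an existing
# monophone system exp/mono for the first alignment; all paths are placeholders.

# 1. Random ~500k-utterance subset (~800 hr) of the full training data.
utils/subset_data_dir.sh data/train 500000 data/train_800h

# 2. Align with the previous system, then delta + delta-delta training.
steps/align_si.sh --nj 40 --cmd "run.pl" \
  data/train_800h data/lang exp/mono exp/mono_ali
steps/train_deltas.sh --cmd "run.pl" 11500 400000 \
  data/train_800h data/lang exp/mono_ali exp/tri1

# 3. LDA+MLLT on top of new alignments.
steps/align_si.sh --nj 40 --cmd "run.pl" \
  data/train_800h data/lang exp/tri1 exp/tri1_ali
steps/train_lda_mllt.sh --cmd "run.pl" 11500 800000 \
  data/train_800h data/lang exp/tri1_ali exp/tri2

# 4. SAT (fMLLR) training.
steps/align_si.sh --nj 40 --cmd "run.pl" \
  data/train_800h data/lang exp/tri2 exp/tri2_ali
steps/train_sat.sh --cmd "run.pl" 11500 1600000 \
  data/train_800h data/lang exp/tri2_ali exp/tri3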

Can you please guide me if I'm making any mistakes?

Thanks

Daniel Povey

Mar 30, 2022, 7:36:48 AM
to kaldi-help
That is enough, the final system is not very sensitive to the GMM system's quality anyway.


Anjul Sharma

Mar 30, 2022, 7:44:52 AM
to kaldi-help

Thanks, Dan, for the quick reply.

sis...@gmail.com

Sep 16, 2025, 1:10:49 AM
to kaldi-help
Hi Anjul,

We are training on the People's Speech dataset, around 26k hours (with some parts removed). Our server setup is:

RAM: 765 GB
CPUs: 128
GPUs: 3 * 48GB

But during the fMLLR alignment step we are getting core dump issues (we keep nj at 50).

My question is: what was your setup, and how many jobs (nj) should we use? I am also wondering whether we are doing anything wrong.

We also locked these:
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
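
For reference, this is roughly how the fMLLR alignment stage is invoked with an explicit job count in a Kaldi recipe; the directory names and the nj value here are placeholder assumptions, not taken from this thread:

# Sketch only: placeholder paths; exp/tri3 is assumed to be the SAT (fMLLR) system.
# With run.pl all nj jobs run on the local machine, so memory use grows with nj;
# lowering it is one way to reduce per-machine memory pressure.
nj=32
steps/align_fmllr.sh --nj $nj --cmd "run.pl" \
  data/train data/lang exp/tri3 exp/tri3_ali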
