Training TDNN chain model on 25,000 hours of data

Shruthi BS

Sep 15, 2025, 6:50:56 PM
to kaldi-help
Dear All,

We are training a TDNN chain model on around 25k hours of People's Speech English data. Our machine has 3 L40S GPUs with 46 GB each, 756 GB of RAM, and 128 cores.
We are running out of memory during fMLLR alignment in the chain/run_ivector_common.sh script; the terminal just shows "Aborted". We have reduced nj to 50, but the issue persists.
What would be the ideal way forward? Please suggest.
Also, are there ways to monitor memory usage for big datasets?

Thanks,
Shruthi
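
On the memory-monitoring question, here is a minimal sketch, assuming a Linux machine where /proc/meminfo is readable. The helper names (parse_meminfo, available_gib, log_memory) are made up for illustration and are not part of Kaldi. The idea is to sample the kernel's MemAvailable figure while the alignment jobs run, so you can see how close the machine gets to exhausting RAM.

```python
import time
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(info):
    """Return MemAvailable converted from kB to GiB, or None if missing."""
    kib = info.get("MemAvailable")
    return None if kib is None else kib / (1024 ** 2)


def log_memory(samples=6, interval_s=10.0, path="/proc/meminfo"):
    """Print a timestamped MemAvailable reading `samples` times."""
    for _ in range(samples):
        gib = available_gib(parse_meminfo(Path(path).read_text()))
        shown = "n/a" if gib is None else f"{gib:.1f} GiB"
        print(f"{time.strftime('%H:%M:%S')} MemAvailable: {shown}", flush=True)
        time.sleep(interval_s)
```

Calling log_memory(samples=360, interval_s=10) from a second terminal would log a reading every 10 seconds for about an hour; a steady slide towards zero just before the "Aborted" message would point to memory exhaustion rather than some other failure.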

Jan Yenda Trmal

Sep 17, 2025, 3:39:07 AM
to kaldi...@googlegroups.com
Hi,
Are you using run.pl? I suggest setting up Slurm, even on a single machine. run.pl does not do any resource management.
You will also have to check the logs for specific errors.
y.
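
For the log check, a rough sketch: it walks a log directory and collects lines matching a few strings that commonly show up when a process dies from memory exhaustion. The pattern list, the function name scan_logs, and the example path in the note below are all assumptions; adjust them to whatever your logs actually contain.

```python
import re
from pathlib import Path

# Hypothetical patterns; extend or trim to match your own logs.
PATTERNS = re.compile(r"bad_alloc|Aborted|Killed|out of memory|ERROR",
                      re.IGNORECASE)


def scan_logs(log_dir):
    """Return (path, line_number, line) for every matching line in *.log files."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        with path.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if PATTERNS.search(line):
                    hits.append((str(path), lineno, line.rstrip()))
    return hits
```

For example, `for hit in scan_logs("exp/tri3_ali/log"): print(*hit)` would list the suspicious lines per job, where exp/tri3_ali/log is only a placeholder for the log directory of the failing alignment stage.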

Shruthi BS

Sep 17, 2025, 4:23:12 AM
to kaldi...@googlegroups.com
Thank you for the reply. Yes, we are using run.pl, and it is on a single machine. Is just using slurm.pl in cmd.sh sufficient?

Regards,
Shruthi

Jan Yenda Trmal

Sep 17, 2025, 4:30:22 AM
to kaldi...@googlegroups.com
Unfortunately, no. Slurm is a separate software package that you have to install and configure (https://slurm.schedmd.com/).
y.
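
To put rough numbers on the resource question: with the figures from the original post, 756 GB of RAM shared by 50 jobs leaves about 15 GB per job on average, if all 50 run at once. The sketch below does that arithmetic and the inverse (how many jobs fit for a given per-job peak). The function names are illustrative, and the 20 GB peak and 56 GB reserve in the note are made-up example values; the real peak has to be measured on your data.

```python
def ram_per_job_gb(total_ram_gb, nj):
    """Average RAM available per job if all nj jobs run at the same time."""
    if nj <= 0:
        raise ValueError("nj must be positive")
    return total_ram_gb / nj


def max_concurrent_jobs(total_ram_gb, peak_per_job_gb, reserve_gb=0.0):
    """Number of jobs that fit in RAM, keeping reserve_gb free for the OS."""
    if peak_per_job_gb <= 0:
        raise ValueError("peak_per_job_gb must be positive")
    usable = total_ram_gb - reserve_gb
    return max(0, int(usable // peak_per_job_gb))
```

Here ram_per_job_gb(756, 50) gives 15.12, and max_concurrent_jobs(756, 20, reserve_gb=56) gives 35. A scheduler can enforce this kind of limit for you, which is the resource management that run.pl lacks.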
