Recipe for RNNLM training for large data (600GB)


aymsagul ablimit

Aug 2, 2019, 8:46:37 AM
to kaldi-help
Hi, I am going to train an RNNLM for a spontaneous-speech corpus. I have ca. 600 GB of text data for training the language model. Is there a recipe for such a large amount of data?

Daniel Povey

Aug 2, 2019, 1:04:33 PM
to kaldi-help
That's a lot of data.
There isn't a specific example.  You may have to reduce the num-epochs to 1 to avoid it taking forever, and just use a somewhat larger-than-normal system.  (You may not want to make it super large, to stop it being too slow to use.)
My feeling is that, in that instance, the quality of the data is going to be more of an issue, i.e. there is probably a lot of junk in there and a lot of data that severely mismatches the style of spoken utterances.  Data cleaning and data selection can be a lot of work, if you want to get into that.
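Dan's suggestion maps onto the training script roughly like this (a sketch only: the option names follow scripts/rnnlm/train_rnnlm.sh as used in the wsj recipe, but the directory name exp/rnnlm_large and the job counts are illustrative guesses, not recommendations):

```shell
# Sketch, not tested here: a single epoch over 600 GB is already a very
# large number of parameter updates, so --num-epochs 1 may be enough.
# exp/rnnlm_large is a hypothetical RNNLM directory prepared beforehand.
rnnlm/train_rnnlm.sh \
  --num-epochs 1 \
  --num-jobs-initial 1 \
  --num-jobs-final 3 \
  --cmd "$train_cmd" \
  exp/rnnlm_large
```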

Dan




--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/15690e30-b82a-486a-a4ae-4c51613651be%40googlegroups.com.

aymsagul ablimit

Aug 6, 2019, 9:18:10 AM
to kaldi-help

Hi Dan,

thank you for the quick reply. Actually, I thought that with more data I could get a "better" language model. Before starting with the 600 GB of training data, I tried with 7 GB of training data (ca. 1 billion word tokens), of which 12 MB was in-domain data and the rest was from other domains. My dev set was also in-domain data (2 MB). I followed the recipe in wsj/s5/local/rnnlm/run_rnnlm.sh. After training, the best iteration gave perplexities of 60.0/156.0 (train/dev). Does the high perplexity on the dev set indicate that the trained model is not suitable for this domain? Do you have any advice on how I can get a "suitable" RNNLM for my domain? I have only 12 MB of in-domain training data. Thank you in advance.

Daniel Povey

Aug 6, 2019, 1:55:00 PM
to kaldi-help
The Kaldi RNNLM setup allows you to scale different subsets of the data differently, by using the "multiplicity" and "scale" fields in one of the config files; you'll see it if you look for it.  If you have a lot more out-of-domain than in-domain data, you want to scale up your in-domain data, e.g. set "multiplicity" to 10 for the in-domain data if you have 100 times more out-of-domain data.  That's just a guess; you have to tune it.
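Concretely, a minimal sketch of the config file Dan describes: in the Kaldi RNNLM recipes the per-subset weights live in a data_weights.txt under the RNNLM config directory, one line per training subset in the form <subset-name> <multiplicity> <scale>. The subset names "indomain" and "web" below are hypothetical; they must match the text file names in your RNNLM data directory (indomain.txt, web.txt), and the numbers need tuning as Dan says.

```shell
# Write a data_weights.txt that repeats the in-domain text 10x per epoch.
# Format per line: <subset-name> <multiplicity> <scale>
mkdir -p exp/rnnlm/config
cat > exp/rnnlm/config/data_weights.txt <<EOF
indomain 10 1.0
web 1 1.0
EOF
cat exp/rnnlm/config/data_weights.txt
```

"multiplicity" controls how many times each subset is repeated per epoch, while "scale" weights its contribution to the objective, so upweighting in-domain data can be done by either knob.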



aymsagul ablimit

Aug 7, 2019, 3:21:27 AM
to kaldi-help
OK, I'll try it. Thank you very much, Dan.