prune language model or reduce text corpus?


mili lali

Aug 19, 2019, 4:48:21 PM
to kaldi-help
Hi
I have a text corpus of about 500M words (about 5 GB in UTF-8 encoding).
For training a language model, should I prune the language model or reduce the text corpus?

Also, my lexicon is quite small, about 20k words. What do you suggest for extending a lexicon in the Pashto language?

Daniel Povey

Aug 19, 2019, 5:00:59 PM
to kaldi-help

Hi
I have a text corpus of about 500M words (about 5 GB in UTF-8 encoding).
For training a language model, should I prune the language model or reduce the text corpus?

Prune the language model. 
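
For example, with SRILM, pruning an existing LM looks roughly like this (the threshold is just a placeholder to tune on a dev set):

# drop n-grams whose removal changes perplexity by less than the threshold
ngram -order 3 -lm full_lm.gz -prune 1e-7 -write-lm pruned_lm.gz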

Also, my lexicon is quite small, about 20k words. What do you suggest for extending a lexicon in the Pashto language?

I'd suggest just using a grapheme-based lexicon so you don't have to rely on human annotations.
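
A grapheme lexicon just maps each word to its sequence of characters. As a rough sketch (file names here are hypothetical; assumes GNU sed and a UTF-8 locale so sed splits characters rather than bytes):

# words.txt: one word per line
# lexicon.txt: word followed by its space-separated graphemes
export LC_ALL=en_US.UTF-8
paste -d' ' words.txt <(sed 's/./& /g; s/ $//' words.txt) > lexicon.txt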


Dan
 


mili lali

Aug 19, 2019, 5:10:19 PM
to kaldi-help

I'd suggest just using a grapheme-based lexicon so you don't have to rely on human annotations.

Is there any example of a grapheme-based model? (wsj/s5/local/chain/e2e/run_tdnnf_flatstart_char.sh)
Also, I think these models are a bit worse than phone models?
What lexicon size do you suggest for these models?

Daniel Povey

Aug 19, 2019, 5:13:10 PM
to kaldi-help


I'd suggest just using a grapheme-based lexicon so you don't have to rely on human annotations.

Is there any example of a grapheme-based model? (wsj/s5/local/chain/e2e/run_tdnnf_flatstart_char.sh)

Or look at gale_arabic/s5, which I think is grapheme-based; s5b is a BPE word-piece model where the phonetic units are still graphemes.

 
Also, I think these models are a bit worse than phone models?

It depends on the language.
 
What lexicon size do you suggest for these models?

Probably quite large, like 100k to 200k words, or use a word-piece model like in gale_arabic/s5b (the LM is at the word-piece level).
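
If you go the word-piece route, the BPE segmentation itself can be learned with the subword-nmt tool, which uses the same @@ continuation marker as the subword scripts; roughly (the merge count is a placeholder to tune):

# learn a BPE model with 5000 merge operations, then segment the corpus
subword-nmt learn-bpe -s 5000 < corpus.txt > bpe.codes
subword-nmt apply-bpe -c bpe.codes < corpus.txt > corpus.bpe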

Dan 


mili lali

Aug 19, 2019, 5:28:22 PM
to kaldi-help

Or look at gale_arabic/s5, which I think is grapheme-based; s5b is a BPE word-piece model where the phonetic units are still graphemes.

OK, thanks, I will try.
I think Pashto and Arabic are quite similar.

But is it possible to share some of the lang preparation for it (lexicon, phones, ...)? I don't have access to the LDC data, and I would like to get some ideas from it. (From a quick review of the chain training code, I don't think there are big differences between these scripts and the others, although WSJ uses e2e training for the chain model.)

mili lali

Aug 23, 2019, 4:05:49 PM
to kaldi-help
Hi
Is it possible to interpolate a big LM with a small text corpus, I mean adapt the big LM to the small corpus?
I checked some scripts: wsj/s5/local/wsj_train_lms.sh and the babel recipe. These scripts use SRILM; they call ngram-count with -interpolate.
But I don't understand how to specify the text to interpolate with, or how to interpolate an LM.

Ho Yin Chan

Aug 25, 2019, 11:03:04 PM
to kaldi-help

mili lali wrote on Saturday, August 24, 2019 at 04:05:49 UTC+8:

Dongji Gao

Aug 27, 2019, 2:29:35 PM
to kaldi-help
You can check utils/prepare_lang.sh and utils/subword/prepare_lang_subword.sh.
The main difference is that the subword L.fst does not allow a "middle of word" subword to be followed by the silence symbol.
For example: international -> inter@@ nation@@ al
The suffix "@@" indicates that the subword is at the beginning or in the middle of a word. Only "al" can be followed by silence.
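
As a concrete sketch, the corresponding lexicon entries (with graphemes as the phonetic units) would look like:

inter@@   i n t e r
nation@@  n a t i o n
al        a l

and the L.fst built by utils/subword/prepare_lang_subword.sh allows optional silence after "al" but not after "inter@@" or "nation@@".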

mili lali

Aug 28, 2019, 2:55:48 PM
to kaldi-help
Thanks all.
I read your answers and have been trying them; I didn't reply earlier just to say thanks, so as not to take up your time.
My text corpus is too big, and I can't train the maxent 3-gram language model because it runs out of memory:

Maxent 3-grams
-------------------
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

How can I handle it?

best regards

Daniel Povey

Aug 28, 2019, 3:14:40 PM
to kaldi-help
You can just delete that whole section of the script.  Maxent won't usually be the best anyway.
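
If memory is a bottleneck for the other LMs as well, SRILM's make-big-lm wrapper is worth knowing about; roughly (file names and options here are placeholders):

# count n-grams once, then build the LM from the counts with lower peak memory
ngram-count -order 3 -text corpus.txt -write counts.gz
make-big-lm -read counts.gz -name biglm -order 3 -kndiscount -interpolate -lm lm.gz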



mili lali

Sep 6, 2019, 6:12:46 PM
to kaldi-help
In this lecture it says to interpolate the bigram language model with the unigram LM.
I have also often seen a big LM interpolated with another, smaller text.
What is the best strategy for building a good LM from a large-scale text corpus for ASR (LM interpolation, smoothing method, etc.)?

best regards

Jan Trmal

Sep 7, 2019, 4:30:17 AM
to kaldi...@googlegroups.com
IMO there might not be a one-size-fits-all strategy/prescription. I'd suggest optimizing perplexity at each stage; a sketch of that workflow is below.
I think Dan told me that in his experience GT-discounted LMs behave best when interpolated or pruned, but I just use whatever gives the lowest perplexity, similar to what the iban recipe in the Kaldi egs does.
Y.
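
For instance, with SRILM the perplexity-driven mixing workflow looks roughly like this (file names and the final weight are placeholders; compute-best-mix prints the weight to use):

# train an LM on each source, then score a held-out dev set with each
ngram-count -order 3 -text big.txt -kndiscount -interpolate -lm big.lm.gz
ngram-count -order 3 -text small.txt -kndiscount -interpolate -lm small.lm.gz
ngram -lm big.lm.gz -ppl dev.txt -debug 2 > big.ppl
ngram -lm small.lm.gz -ppl dev.txt -debug 2 > small.ppl

# estimate the best mixture weight, then write the interpolated LM
compute-best-mix big.ppl small.ppl
ngram -lm big.lm.gz -mix-lm small.lm.gz -lambda 0.7 -write-lm mixed.lm.gz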


Daniel Povey

Sep 7, 2019, 10:30:33 AM
to kaldi-help
Usually it will be modified Kneser-Ney with interpolation, as a 3-gram or 4-gram. That's what you'll see in most of the example scripts that use SRILM.
[And yes, Yenda is right about interpolation, but I think when you said `interpolation` you just meant between different n-gram orders, not between datasets.]
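
With SRILM that is, e.g. (file names are placeholders):

# interpolated modified Kneser-Ney 3-gram; use -order 4 for a 4-gram
ngram-count -order 3 -text corpus.txt -kndiscount -interpolate -lm lm.gz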
