Build Language Model


uRic Oresths

May 28, 2019, 10:59:20 AM
to kaldi-help
Hello

I am trying to familiarise myself with Kaldi and I want to train a model for the Dutch language.

I would like to create a custom language model (maybe a combination of domain-specific and general, as was suggested in the forum), but I am confused and don't know how to create my own language model from text.

If there is any tutorial or anything else that can help (for commercial and non-commercial use), it would be appreciated.




Daniel Povey

May 28, 2019, 11:24:09 AM
to kaldi-help
There are toolkits that can help you do this.  In different example scripts we use different ones, e.g. kaldi_lm, srilm, pocolm.  srilm is the standard one but its license is not free for commercial use.

An example is
egs/tedlium/s5_r3/local/ted_train_lm.sh 

but other scripts with 'train_lm' in their name are relevant too.


Dan


--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/6c9fa53f-e0cb-4bce-a3f3-aa4bb4425a09%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Daniel Povey

May 28, 2019, 11:24:32 AM
to kaldi-help
Also, read "A Bit of Progress in Language Modeling, Extended Version" for a basic introduction to language modeling.
You don't have to read all the way through.

uRic Oresths

May 29, 2019, 3:19:19 AM
to kaldi-help
Thank you!

Are any of kaldi_lm, srilm and pocolm better than the others?

I just saw that ted_train_lm.sh downloads some text, so I guess the script has to be changed in order to use my own text.
In addition, the description says that it trains on acoustic and text data, but the language model only uses text data for training, right?



uRic Oresths

May 29, 2019, 6:40:51 AM
to kaldi-help
In addition, I would like to ask whether the language model is independent of the other files. That is, do I only have to provide a text file, which is then converted to a language model by one of the tools (kaldi_lm etc.), or do I have to take care of other files as well?



Jonathan K

May 29, 2019, 11:49:03 AM
to kaldi-help
The language model is independent when it is created (as an ARPA file), but to use it together with an acoustic model you need to convert the ARPA file to an FST and then build HCLG from it using mkgraph.sh. I think you should start with Kaldi for Dummies first, to get an idea of how things are built and connected to get decoding done.
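
The two steps mentioned above (ARPA to FST, then HCLG via mkgraph.sh) can be sketched roughly as follows. This is a minimal sketch and not something posted in the thread: it only assembles and prints the command lines, it does not run them. The function name build_graph_commands and all paths (data/local/lm/lm.arpa, data/lang_test, exp/tri3) are hypothetical placeholders, and it assumes the lang directory already contains a words.txt symbol table.

```python
import shlex


def build_graph_commands(arpa_path, lang_dir, model_dir, graph_dir):
    """Return the command lines (as argument lists) for ARPA -> G.fst -> HCLG.

    All arguments are placeholder paths supplied by the caller; nothing is executed.
    """
    # Step 1: compile the ARPA language model into G.fst, using the word
    # symbol table of the lang directory (assumed to exist already).
    arpa2fst = [
        "arpa2fst",
        "--disambig-symbol=#0",
        f"--read-symbol-table={lang_dir}/words.txt",
        arpa_path,
        f"{lang_dir}/G.fst",
    ]
    # Step 2: build the decoding graph (HCLG) from the lang directory and
    # a trained acoustic model directory.
    mkgraph = ["utils/mkgraph.sh", lang_dir, model_dir, graph_dir]
    return [arpa2fst, mkgraph]


# Print the commands so they can be inspected before being run by hand.
for cmd in build_graph_commands(
    "data/local/lm/lm.arpa",  # hypothetical ARPA file
    "data/lang_test",         # hypothetical lang directory
    "exp/tri3",               # hypothetical acoustic model directory
    "exp/tri3/graph",         # hypothetical output graph directory
):
    print(shlex.join(cmd))
```

The commands would have to be run from a recipe directory (for example egs/mini_librispeech/s5) after sourcing path.sh there, since utils/mkgraph.sh is referenced by a relative path.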

uRic Oresths

May 30, 2019, 7:01:54 AM
to kaldi-help
Thank you for the reply.

I had trouble running kaldi/egs/librispeech/s5/local/lm/train_lm.sh.

Therefore I decided to find another tool. I found the milt tool, but it creates a .lm file instead of an ARPA file. Is it possible to convert it to an ARPA file and subsequently to an FST?

Jonathan K

May 30, 2019, 9:52:15 AM
to kaldi-help
I am not sure, but I believe that the .lm file is in ARPA format, just with a different extension.
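
One way to check this for a particular file is to look for the markers that ARPA files carry. The sketch below is an illustration and not something from the thread: looks_like_arpa and SAMPLE are hypothetical names, and the check is only a heuristic (it looks for the \data\ header, at least one "ngram N=count" line, and the closing \end\ marker), not a full validator.

```python
import re

# Header lines of an ARPA file look like "ngram 1=12345".
NGRAM_COUNT = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)$")


def looks_like_arpa(lines):
    """Heuristically decide whether an iterable of text lines is an ARPA LM."""
    saw_data = False
    saw_counts = False
    saw_end = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not saw_data:
            # Skip any free-form text that precedes the \data\ header.
            if line == "\\data\\":
                saw_data = True
            continue
        if NGRAM_COUNT.match(line):
            saw_counts = True
        elif line == "\\end\\":
            saw_end = True
    return saw_data and saw_counts and saw_end


# A tiny hand-written bigram model, used only to exercise the check.
SAMPLE = r"""
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.6 <s> -0.3
-0.6 hallo -0.3
-0.6 </s>

\2-grams:
-0.3 <s> hallo
-0.3 hallo </s>

\end\
"""

print(looks_like_arpa(SAMPLE.splitlines()))
```

For a real file, the open file object can be passed directly, e.g. looks_like_arpa(open("model.lm", encoding="utf-8")), where model.lm is a placeholder name. If the check passes, renaming the file is unnecessary, since the file extension is only a naming convention.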

Daniel Povey

May 30, 2019, 11:48:04 AM
to kaldi-help
Look at egs/mini_librispeech/s5/run.sh for an example of how to run train_lm.sh in context.  You wouldn't just run it in isolation, you have to download data, set up the paths, etc.; and it has to be done from the directory egs/mini_librispeech/s5/.
