combination of domain specific and generic LM


uRic Oresths

Jun 4, 2019, 6:30:24 AM
to kaldi-help
Hello 

I read that combining a generic and a domain-specific LM improves the accuracy of a speech recognition model, so I wanted to create such an LM. However, I would like to give a higher weight to the domain-specific LM.

I am using the mitlm tool to create language models, and in the mitlm wiki I saw that there is an n-gram weighting technique, but I am not sure whether it does what I want. If this technique is for specifying the weight you want to give to a language model, then I don't understand how it works.

If anybody is familiar with this tool and this technique, please let me know.

Itai Peer

Jun 4, 2019, 8:12:33 AM
to kaldi-help

I'm not familiar with these tools, but usually an easy way to increase the weight of a portion of the training data, for any purpose, is to duplicate it while leaving the rest as it is.
As long as training time is not an issue, of course; and since LM training time is not so painful, you can easily do that.
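To make the duplication trick concrete, here is a toy Python sketch (the corpora and counts are made up for illustration) showing how repeating the in-domain text shifts the maximum-likelihood unigram estimates toward that domain:

```python
from collections import Counter

# Toy corpora (hypothetical): a generic text and a small in-domain text.
generic = "the cat sat on the mat".split()
in_domain = "open a support ticket".split()

def unigram_probs(tokens):
    """Maximum-likelihood unigram probabilities from raw counts."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

# Without duplication, "ticket" gets 1/10 of the probability mass.
plain = unigram_probs(generic + in_domain)

# Duplicating the in-domain text 3x reweights it: "ticket" now gets 3/18.
weighted = unigram_probs(generic + 3 * in_domain)

print(plain["ticket"], weighted["ticket"])
```

The same idea carries over to higher-order n-grams, since duplicating text multiplies all of its n-gram counts.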



On Tuesday, June 4, 2019 at 13:30:24 UTC+3, uRic Oresths wrote:

Jan Trmal

Jun 4, 2019, 10:07:09 AM
to kaldi-help
Typically you find an interpolation weight that minimizes perplexity on a dev set.
y.
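To illustrate what tuning that weight looks like, here is a minimal Python sketch with made-up toy unigram LMs and a made-up dev set; it grid-searches the linear interpolation weight that minimizes dev-set perplexity (real toolkits estimate the weight with EM instead of a grid):

```python
import math

# Hypothetical toy unigram LMs over a tiny shared vocabulary.
p_domain  = {"open": 0.4, "ticket": 0.3, "the": 0.2, "hello": 0.1}
p_generic = {"open": 0.1, "ticket": 0.05, "the": 0.45, "hello": 0.4}

# Made-up held-out dev set, mostly in-domain.
dev_set = ["open", "the", "ticket", "hello", "open", "ticket"]

def perplexity(lam, dev):
    """Perplexity of the mixture p = lam*p_domain + (1-lam)*p_generic on dev."""
    log_prob = sum(math.log(lam * p_domain[w] + (1 - lam) * p_generic[w])
                   for w in dev)
    return math.exp(-log_prob / len(dev))

# Grid-search the interpolation weight lam over [0, 1].
best_lam = min((l / 100 for l in range(101)),
               key=lambda l: perplexity(l, dev_set))
print(best_lam, perplexity(best_lam, dev_set))
```

Because the dev set here is mostly in-domain, the search lands on a high weight for the in-domain LM, which is exactly the "higher weight on the domain specific LM" effect the original question asks for.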


--
Go to http://kaldi-asr.org/forums.html to find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/6697d635-2028-483c-9b15-4c266594eabc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Daniel Povey

Jun 4, 2019, 10:24:19 AM
to kaldi-help
Yes, if you do 
git grep interpolate '*.sh'
at the top of the Kaldi repo, you'll see examples of using SRILM to do interpolation.
The trick of repeating part of the data is relevant for training RNNLMs, but probably not for n-gram LMs, as it violates the assumptions used in their update formulas.
Dan


uRic Oresths

Jun 5, 2019, 5:02:46 AM
to kaldi-help
Is the dev set part of the language model?

I ask because the mitlm tool has an "optimisation" parameter which uses the dev set to improve perplexity,

but the dev set is part of the language model.

So, for example, to create a language model from a "test.txt" file and use the optimisation parameter, there is a "dev.txt" which is a part of "test.txt",

and that makes me think that it does use duplication.

In addition, I don't see any weight values or anything like that.

Do I have a wrong understanding of what a "dev set" is?

Armando

Jun 5, 2019, 5:55:56 AM
to kaldi-help
The dev set has to be different from the text you use for training the language model.
If the content of dev.txt is included in test.txt, delete that content from test.txt (assuming you are using test.txt as input to the LM estimation toolkit).

uRic Oresths

Jun 5, 2019, 6:17:12 AM
to kaldi-help
I thought that having dev.txt be part of test.txt was how you give more weight to the content of dev.txt.

Could you please explain what the purpose of dev.txt is, then?

uRic Oresths

Jun 5, 2019, 7:27:31 AM
to kaldi-help
I just found an example which interpolates a generic and a domain-specific model:


estimate-ngram -text voicemail.txt -write-lm your_model.lm 

"voicemail.txt" is the domain-specific text that he wants to turn into a language model.


interpolate-ngram -lm "your_model.lm, lm_giga_5k_nvp_3gram.arpa" -interpolation LI -op voicemail.txt -wl Lectures+Textbook.LI.lm

"lm_giga_5k_nvp_3gram.arpa" is the generic language model, and then he uses the "-op" parameter, which is for perplexity optimization, giving "voicemail.txt" as input, i.e. in the role of "dev.txt",

which is even more confusing, because the same txt file is used both for the language model and as "dev.txt".

If you have any idea what is going on, please let me know.

Armando

Jun 5, 2019, 9:32:11 AM
to kaldi-help
Well, for sure, if you include the dev set in one of the sources for language model estimation, the language model trained on that text source will have a (much) higher interpolation weight with respect to all other language models.
A good practice, though, is not to mix these corpora: training, development, evaluation.
The development set should be representative of your evaluation data, and you should let the LM interpolation toolkit estimate the interpolation weights for the different input LMs itself, so as to minimize the perplexity on the development set. If one of the sources is "more similar" to the dev set, then its interpolation weight will be higher anyway.
The thing is, with your methodology you're putting a huge bias on one of the sources. This is not necessarily optimal when you go on to test the final model on real data.

If I were you, I would select a subset of the in-domain training data and use it only as a dev set to compute interpolation weights.
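A sketch of that split (the corpus and sizes are hypothetical placeholders): hold out a slice of the in-domain sentences purely for weight tuning, and train the in-domain LM on the rest.

```python
import random

# Placeholder in-domain corpus: one sentence per list entry.
sentences = [f"in-domain sentence {i}" for i in range(1000)]

random.seed(0)            # reproducible split
random.shuffle(sentences)

split = int(0.9 * len(sentences))
lm_train = sentences[:split]   # trains the in-domain LM (e.g. via estimate-ngram)
dev      = sentences[split:]   # used only for tuning interpolation weights

print(len(lm_train), len(dev))
```

The key property is that the two parts are disjoint: the dev set never feeds the LM estimation, so the tuned interpolation weight is not biased toward the source that happens to contain it.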

uRic Oresths

Jun 6, 2019, 3:58:32 AM
to kaldi-help
For example, with the SRILM tool there is the option to give weights to each language model that you want to merge (with interpolation).
Isn't this a way to add bias somehow?
I mean, my goal was not only to merge two language models, but to give priority/higher weight to one of them (the domain-specific one).
Is your suggestion to interpolate the domain-specific text with a generic text to create a language model, and in addition use half of the domain-specific text as the dev set?

Armando

Jun 6, 2019, 4:54:39 AM
to kaldi-help


On Thursday, June 6, 2019 at 9:58:32 AM UTC+2, uRic Oresths wrote:
For example, with the SRILM tool there is the option to give weights to each language model that you want to merge (with interpolation).
Isn't this a way to add bias somehow?


Usually you compute the interpolation weights with another executable, something like compute-best-mix IIRC, where the inputs are the different text sources (each for a specific domain, plus the dev set), and you give those weights to the interpolation binary that produces the final LM.
In the absence of a dev set, you can set the weights arbitrarily, according to your own prior beliefs about their relative importance. I think that if you include your dev set in one of your training sources, the weight for that source and its LM will be very high, like too high.
Try that and let me know what interpolation weight is given to the domain-specific source. I bet it's higher than 95%.
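For what it's worth, the weight estimation such tools perform is essentially EM over the per-word probabilities that each LM assigns on the dev set. Here is a toy sketch of that iteration for two LMs (the probability streams below are hypothetical placeholders, not output of any real toolkit):

```python
# P(w_t) assigned to each dev-set word by the two LMs (made-up numbers).
p_domain  = [0.20, 0.01, 0.30, 0.10, 0.25]
p_generic = [0.02, 0.30, 0.03, 0.12, 0.04]

lam = 0.5  # initial weight of the in-domain LM
for _ in range(200):
    # E-step: posterior that each word was generated by the in-domain LM.
    post = [lam * d / (lam * d + (1 - lam) * g)
            for d, g in zip(p_domain, p_generic)]
    # M-step: the new weight is the average posterior.
    lam = sum(post) / len(post)

print(round(lam, 2))
```

On this data the weight settles strictly between 0 and 1, because each LM is better on some of the dev words; a dev set hidden inside one training source would instead push its weight toward 1.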

 
I mean, my goal was not only to merge two language models, but to give priority/higher weight to one of them (the domain-specific one).
Is your suggestion to interpolate the domain-specific text with a generic text to create a language model, and in addition use half of the domain-specific text as the dev set?


Usually your dev set will be much smaller than the domain-specific training set; but maybe in your case you have very little data.
What I'm suggesting, though, is more about the standard methodology; if you don't have enough data to build the training, test and dev sets, then I guess you will be forced to relax some of these requirements.

uRic Oresths

Jun 6, 2019, 7:37:13 AM
to kaldi-help
So I tried to combine, let's say, "test1.lm" (the domain-specific LM) and "test2.lm" (the generic LM), and for the dev set the text is the same as the one behind "test1.lm".

I noticed that the probabilities of the words that are not in "test1.lm" decreased, and the overall WER became worse.

However, I want to emphasise the domain-specific language model in the interpolation (since I don't use the SRILM tool but the MITLM tool), and the only thing I can do to achieve this is to use the "-op" parameter, which optimises the perplexity given a dev set as input when I run the interpolation.

Then I guess, according to what you are saying, I should split the text behind "test1.lm" (the domain-specific one) into 2 parts, use the first part for "test1.lm" and the second part as the dev set, and leave "test2.lm" (the generic LM) as it is.

Armando

Jun 6, 2019, 9:00:48 AM
to kaldi-help
Yes, you can try that.
Leave most of it for test1.lm and the rest for the dev set.
What are the sizes of those data sets? I would leave at least 10k words for the dev set.