Replacing <unk> in decode output by best matching phone combination

Armin Oliya

Sep 11, 2017, 3:43:51 PM
to kaldi-help
Would it be possible to replace <unk>s in the decode output with best-matching phone combinations?

If I understand it correctly, not having '<unk>' paths in the lattice would 'force' the system to produce output without <unk>, but that output would still consist of in-vocabulary words (which would have no chance of matching the reference word if it's OOV).


Thanks 
Armin

Daniel Povey

Sep 11, 2017, 3:51:44 PM
to kaldi-help
If you don't want to decode `<unk>`, you can just remove it from the
LM. You can even do it at the HCLG level by removing arcs that have
the unknown word as the word (the olabel); it's easy to do in text
form.
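
For concreteness, a minimal sketch of that text-form edit (the graph directory is an example, and this is just one way to do it; fstprint, fstcompile and fstconnect are standard OpenFst tools):

    # Drop every HCLG arc whose output label (the word) is <unk>.
    graph=exp/tri3/graph        # example path
    unk_id=$(awk '$1 == "<unk>" {print $2}' $graph/words.txt)

    # Arc lines from fstprint look like "src dst ilabel olabel [cost]";
    # final-state lines have only one or two fields, so keep those.
    fstprint $graph/HCLG.fst \
      | awk -v unk="$unk_id" 'NF < 4 || $4 != unk' \
      | fstcompile \
      | fstconnect > $graph/HCLG_nounk.fst

(fstconnect just trims any states that became unreachable; the result is a plain vector FST, which the decoders should read fine.)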

It generally won't affect the WER substantially though, as unk is
rarely decoded and if it is decoded, generally the alternative would
be an error.

Armin Oliya

Sep 11, 2017, 4:10:05 PM
to kaldi-help
Quite interesting, thanks Dan.

Daniel Galvez

Sep 11, 2017, 4:17:30 PM
to kaldi-help
Question: Can't the TEDLIUM r2 recipe's "unknown phone language model" be used to achieve Armin's purpose?

I admit that I never really tracked how useful that was, and given Dan's message above it may not be useful at all, assuming your vocabulary size is large enough. (Presumably you could compute the edit distance between the transcribed phones and the known correct phones for a word if you wanted an evaluation metric.)




--
Daniel Galvez

Jan Trmal

Sep 11, 2017, 4:22:04 PM
to kaldi-help
It could probably be used (if it's the phoneme bigram loop), but an additional step is necessary to get the phoneme sequence (if you just decode, you will get <UNK> in the word lattice).
y.

Armin Oliya

May 11, 2018, 9:26:06 AM
to kaldi-help
Thank you all, 

I guess the focus here is more to increase recall for OOV words than to improve WER.


From my understanding, if we have an OOV lexicon ready, we can use run_unk_model.sh in the tedlium s5_r2 recipe (thanks Daniel).
What it does is force <unk>s to be decoded as one of the entries in the OOV lexicon, so there will be no <unk>s in the final output.
Here the main LM/lexicon doesn't have to be kept updated, and as long as the user provides an OOV lexicon (plain-text OOV list > g2p > lexicon),
new combined lang_unk LMs (and new graphs) can be created and used for decoding.

This is also, I guess, how services like Voice Base allow their users to add custom vocabulary; this is especially interesting because LM weights don't need to be tinkered with.
Please let me know if my understanding of the above is incorrect.



There's yet another case where the OOV lexicon is not available or changes frequently -- or we don't have the time required to rebuild the graph.
An example could be a company like Amazon, with all sorts of new products added hourly -- and we'd like to be able to search for product names in customer call-center transcriptions.
The call volume is high and the transcription pipeline is already fully utilized. I guess this is also related to what people refer to as 'phonetic search'.

The two options you proposed:
  • Having the model produce actual phonemes instead of <unk>s; in this case, user OOV queries can first be converted to phonemes and then (fuzzy-)searched against the transcript, like a traditional full-text search.
    • Yenda, you suggested this -- could you please give a few more pointers on how to implement it?
  • Removing <unk> arcs from the graph (as Dan suggested) and getting the graph to produce the closest matching word(s) (ref. the discussion here). Is there a way then to know that those forced words are actually OOVs, e.g. if they always have low confidence, and match them against OOV queries?


Armin 


Daniel Povey

May 11, 2018, 6:06:27 PM
to kaldi-help
> I guess the focus here is more to increase recall for oov words, rather than
> improving wer.
>
>
> From my understanding, if we have an oov lexicon ready, we can use
> run_unk_model.sh in tedlium s5_r2 recipe (thanks Daniel).
> What it does is forcing unks to be decoded as one of the entries in the oov
> lexicon, and there will be no unks in the final output.

Actually that's not quite accurate (although it wouldn't be hard to
modify the script to behave as you describe). What the script
actually has is a phone-level n-gram LM that it inserts into the graph
where <unk> would normally be decoded. So it will still decode <unk>,
but it will have a sequence of real phones that can be turned back
into a word by post-processing if you want.
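
The relevant steps in that recipe look roughly like this (directory names follow the tedlium s5_r2 layout, the model dir is an example, and this is a sketch rather than the exact script):

    # Train a phone-level LM on the pronunciations in the existing lexicon,
    # then make it the "pronunciation" of <unk> in the lang directory.
    utils/lang/make_unk_lm.sh data/local/dict exp/unk_lang_model

    utils/prepare_lang.sh --unk-fst exp/unk_lang_model/unk_fst.txt \
      data/local/dict "<unk>" data/local/lang_tmp data/lang_unk

    # After putting a G.fst into data/lang_unk (built from your usual LM),
    # compile the graph and decode with it as normal.
    utils/mkgraph.sh data/lang_unk exp/tri3 exp/tri3/graph_unk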

> Here the main LM/lexicon doesn't have to be kept updated, and as long as
> user provides oov lexicon (plain text oov list > g2p > lexicon),
> new combined lang_unk LMs (and new graphs) can be created and used for
> decoding.
>
> This is also i guess how services like Voice Base allow their users to add
> custom vocabulary; this is especially interesting because LM weights don't
> need to tinkered.
> pls let me know if my understanding of above is incorrect.
>
>
>
> There's yet another case where the oov lexicon is not available or changes
> frequently -- or we don't have the time required to rebuild the graph.
> An example could be a company like Amazon with all sorts of new products
> added hourly - and we'd like to be able to search product names in customer
> call center transcriptions.
> The call volume is high and the transcription pipeline is already fully
> utilized. I guess this is also related to what people refer to as 'phonetic
> search'.
>
> The two options you proposed:
>
> Having the model produce actual phonemes instead of unks; in this case, user
> oov queries can be first converted to phonemes and then (fuzzy) searched
> against the transcript; like a traditional full-text search
>
> Yenda you suggested this, could you pls give a few more pointers how to
> implement it?

I think it would be better to run something like run_unk_model.sh
which is more of a hybrid-- real words stay as words, but OOVs will
(hopefully) be decoded as sequences of phones.

>
> Remove unk arcs from the graph (as Dan suggested) and get the graph to
> produce the closest matching word(s) (ref. the discussion here). Is there a
> way then to know those forced-words are actually unks, eg. if they always
> have low confidence, and match them against oov-queries?

Confidence in this type of scenario is hard.

Dan

Armin Oliya

May 22, 2018, 4:55:16 PM
to kaldi-help
Thanks Dan, 

So with the combined FST (default + unk-fst) and an OOV like "mailbags", we should get something like "m ey l b ae g z" in the transcription if it would otherwise be decoded as <unk>.

I'm following the tedlium run_unk_model.sh, but:
  • I don't see anything that looks like a phone sequence in the transcription.
  • <unk> still appears in the transcriptions; specifically, why are there still arcs with <unk> in the combined L.fst?


Armin

Daniel Povey

May 22, 2018, 5:00:07 PM
to kaldi-help, Xiaohui Zhang
The word labels will still be unk but they will have a reasonable
phone transcript associated with them.
You have to do post-processing on the lattice to turn it back into
words (or to display the phone sequences alongside the <unk>).
Samuel, can you remind us how this is done?

Xiaohui Zhang

May 23, 2018, 3:10:16 PM
to kaldi-help
Hi Dan, to summarize: basically we first grab all the <unk> arcs from the lattice using lattice-arc-post, which gives us a list of unk words with their phone transcriptions. By running p2g we get the recovered words, then insert these recovered extra words into L.fst and G.fst using utils/prepare_extended_lang.sh, and then replace the IDs of the <unk>s with the IDs of the recovered words using lattice-copy and an awk command. Then we can get CTMs that include the recovered words, or re-score the lattice to improve both recall of OOVs and general WER.
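
A rough sketch of that first step (paths and the acoustic scale are examples; the column positions follow lattice-arc-post's documented output format, so double-check against its usage message):

    lang=data/lang_unk
    dir=exp/chain/tdnn/decode_test     # example decode dir
    mdl=exp/chain/tdnn/final.mdl
    unk_id=$(awk '$1 == "<unk>" {print $2}' $lang/words.txt)

    # Each output line has utterance, start frame, length, posterior, word id,
    # and (with the model supplied) the arc's phones; keep only the <unk> arcs.
    lattice-arc-post --acoustic-scale=0.1 $mdl \
        "ark:gunzip -c $dir/lat.*.gz |" - \
      | awk -v unk="$unk_id" '$5 == unk'

The phone ids on the surviving lines can then be mapped with utils/int2sym.pl and fed to whatever p2g tool is used.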

The working recipe is here: /export/b15/xzhang/swbd/s5c/oov_recovery.sh. It's for my swbd experiments. Probably I should commit a recipe for tedlium?

Xiaohui

Armin Oliya

Jun 4, 2018, 6:38:57 AM
to kaldi-help
Thank you both, 

I managed to extract phones and recover words via p2g.
I don't use all <unk> arcs though (a lot of them are noisy?); rather, I first get the CTM, find the <unk>s there, and choose the <unk> arc with the highest posterior within a +/- 5 frame window. I then update the CTM with the output of p2g.
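
For reference, here is roughly how the phone sequences attached to <unk> can be pulled from the lattices, along the lines of steps/get_prons.sh (paths are examples; check the output column order against nbest-to-prons):

    lang=data/lang_unk
    dir=exp/chain/tdnn/decode_test     # example decode dir
    mdl=exp/chain/tdnn/final.mdl

    # 1-best path -> word-aligned lattice -> per-word pronunciations.
    # (lattice-align-words needs phones/word_boundary.int; if your setup
    # lacks it, lattice-align-words-lexicon is the alternative.)
    lattice-1best --acoustic-scale=0.1 "ark:gunzip -c $dir/lat.*.gz |" ark:- \
      | lattice-align-words $lang/phones/word_boundary.int $mdl ark:- ark:- \
      | nbest-to-prons $mdl ark:- - \
      | utils/int2sym.pl -f 4 $lang/words.txt \
      | awk '$4 == "<unk>"' \
      | utils/int2sym.pl -f 5- $lang/phones.txt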


Still curious how you "replace the IDs of the <unk>s with IDs of the recovered words by using lattice-copy", so would be great if you could share that script. 


A few questions about unk modeling: 
  • All the run_unk_model.sh scripts seem to use the default data/local/dict with make_unk_lm.sh -- technically the default lexicon. Is there also a scenario where you'd use a different/extended lexicon for modeling OOVs? For example:
    • when dealing with a very large lexicon, for example one including German numbers (one word per number), product names or medical terms?
    • when you'd rather keep your default lexicon fixed, so that it's consistent with a pretrained RNN-LM.
  • Is it right to say that OOV modeling also helps with addressing the issue where we do have a certain word in the lexicon but it doesn't get decoded often due to low LM probability (not enough occurrences of that word in the text corpus)?
  • when using --unk-fst with prepare_lang.sh it's practically required that the LM/G.fst be built with pocolm using the --limit-unk-history=true option. Is there a way to adjust an existing non-pocolm ARPA LM for this purpose?
  • given a user-created list of OOV words, would it make sense to adjust their unk-arc posteriors to increase the chances of having them in the final transcript?


Thanks!
Armin 

Xiaohui Zhang

Jun 4, 2018, 12:25:38 PM
to kaldi-help
Hi Armin, glad to hear your good news.


On Monday, June 4, 2018 at 6:38:57 AM UTC-4, Armin Oliya wrote:
Thank you both, 

I managed to extract phones and recover words via p2g. 
I don't use all <unk> arcs though (a lot of them are noisy?); rather, I first get the CTM, find the <unk>s there, and choose the <unk> arc with the highest posterior within a +/- 5 frame window. I then update the CTM with the output of p2g.
 
Very interesting. Did you do this at the C++ level or the script level?
 


Still curious how you "replace the IDs of the <unk>s with IDs of the recovered words by using lattice-copy", so would be great if you could share that script. 

I'll commit the whole pipeline when I finish it. For now I'll send you relevant parts by email.
 

A few questions about unk modeling: 
  • All the run_unk_model.sh scripts seem to use the default data/local/dict with make_unk_lm.sh -- technically the default lexicon. Is there also a scenario where you'd use a different/extended lexicon for modeling OOVs? For example:
    • when dealing with a very large lexicon, for example one including German numbers (one word per number), product names or medical terms?
    • when you'd rather keep your default lexicon fixed, so that it's consistent with a pretrained RNN-LM.
Definitely you can choose other lexicons to train the unk_lm, not necessarily the default one.
 
  • Is it right to say that OOV modeling also helps with addressing the issue where we do have a certain word in the lexicon but it doesn't get decoded often due to low LM probability (not enough occurrences of that word in the text corpus)?
I think it's correct. 
  • when using --unk-fst with prepare_lang.sh it's practically required that the LM/G.fst be built with pocolm using the --limit-unk-history=true option. Is there a way to adjust an existing non-pocolm ARPA LM for this purpose?
I'm not sure. @Dan do you have an idea? 
  • given a user-created list of OOV words, would it make sense to adjust their unk-arc posteriors to increase the chances of having them in the final transcript?
If you want to boost the unigram prob of <unk> in the LM, you can use kaldi/egs/wsj/s5/utils/lang/adjust_unk_arpa.pl.
If you have a list of OOV words, I suggest adding them to the lexicon & LM and adjusting their unigram probs using kaldi/egs/wsj/s5/utils/lang/add_unigrams_arpa.pl.

Daniel Povey

Jun 4, 2018, 3:13:33 PM
to kaldi-help
>> when using --unk-fst with prepare_lang.sh it's practically required that
>> the LM/G.fst be build with pocolm using --limit-unk-history=true option. Is
>> there a way to adjust an existing non-pocolm arpa LM for this purpose?
>
> I'm not sure. @Dan do you have an idea?


I think you can just delete any n-gram states of the form
foo <unk> -> X
which (if I recall correctly how the ARPA format works) means that you
could just delete any n-grams of order > 2 where <unk> is the
second-to-last word, for instance a line
-3.243 mr. <unk> is
would be deleted. I think that also means that for lines of the form
-4.09 every <unk> -0.64
(i.e. 2-gram or higher states where the predicted word is <unk>) you
have to delete the final term (-0.64, in this case) because that is
the backoff prob from a state that no longer exists.
And of course you'd have to adjust the n-gram counts in the header.

If you could create a script to do this it would be great.
e.g.
steps/utils/lang/limit_arpa_unk_history.pl
(doesn't have to be perl, that's just an example).

Dan
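
A rough two-pass sketch of that edit (assuming the literal token <unk>, n-gram orders below 10, and an uncompressed ARPA file; illustrative only, not a tested script):

    arpa=lm.arpa            # example input
    out=lm_limited.arpa     # example output

    # Pass 1: drop n-grams of order > 2 whose second-to-last word is <unk>,
    # and strip the backoff weight from n-grams of order >= 2 that end in
    # <unk> (the backoff state they describe no longer exists).
    awk -v unk="<unk>" '
      /^\\[0-9]-grams:/ { order = substr($0, 2, 1) + 0; print; next }
      order == 0 || NF == 0 || /^\\/ { print; next }
      {
        # fields: $1 = logprob, $2..$(order+1) = words, optional backoff last
        if (order > 2 && $order == unk) next
        if (order >= 2 && $(order + 1) == unk && NF == order + 2) {
          line = $1; for (i = 2; i <= order + 1; i++) line = line " " $i
          print line; next
        }
        print
      }' "$arpa" > "$out.tmp"

    # Pass 2: recount the surviving n-grams and rewrite the \data\ header.
    awk '
      NR == FNR {
        if (/^\\[0-9]-grams:/) order = substr($0, 2, 1) + 0
        else if (order > 0 && NF > 0 && $0 !~ /^\\/) count[order]++
        next
      }
      /^ngram [0-9]+=/ { split($2, a, "="); printf("ngram %s=%d\n", a[1], count[a[1]]); next }
      { print }
    ' "$out.tmp" "$out.tmp" > "$out" && rm "$out.tmp"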

Armin Oliya

Jun 5, 2018, 4:56:59 PM
to kaldi-help
Thank you both. 

@Xiaohui I'm doing it at the script level with bash and python.
I noticed those unigram add/update scripts just overwrite the probs as given;
  • would it be safe, and is there a suggested range that won't mess with the other probs?
  • how effective is adding/adjusting unigram probs without changing higher-order n-grams?

On top of those scripts, I'm especially interested in options that don't require recreating the graph. Let's say I have a transcription API and each user is passing an OOV list. It would be quite impractical to rebuild the graph.


Also a question about the case where we use a 3-gram for decoding, followed by a 4-gram and an RNN-LM for rescoring: the decoding 3-gram is the only LM that needs to be composed with the unk LM, right?


@Dan, thanks got it. 

Xiaohui Zhang

Jun 9, 2018, 2:27:58 PM
to kaldi-help


On Tuesday, June 5, 2018 at 4:56:59 PM UTC-4, Armin Oliya wrote:
Thank you both. 

@Xiaohui I'm doing it at the script level with bash and python.
I noticed those unigram add/update scripts just overwrite the probs as given;
  • would it be safe, and is there a suggested range that won't mess with the other probs?
You have to tune it yourself I guess. 
  • how effective is adding/adjusting unigram probs without changing higher-order n-grams?
 add_unigram_arpa.pl only adds new unigrams; these added words never appear in higher-order n-grams.
adjust_unk_arpa.pl doesn't adjust <unk>'s prob in higher-order n-grams. That doesn't matter when the n-grams including <unk> have very low probs; otherwise it does matter. I'm going to fix this by providing an option to scale the probs of higher-order n-grams that include <unk>.

On top of those scripts, I'm especially interested in options that don't require recreating the graph. Let's say I have a transcription API and each user is passing an OOV list. It would be quite impractical to rebuild the graph.

That's definitely possible, and I've done it before. You can use an awk command to scale the score of arcs in HCLG.fst whose output symbol is <unk> (or whatever word is of interest). @Dan if you think it's useful we can commit a simple script doing that.
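
One simple variant of that, adding a constant to the graph cost of every <unk> arc rather than scaling it (paths and the offset are examples):

    graph=exp/chain/tdnn/graph_unk     # example path
    unk_id=$(awk '$1 == "<unk>" {print $2}' $graph/words.txt)
    offset=-1.0   # negative graph cost = make <unk> (and phone recovery) more likely

    # Arc lines from fstprint are "src dst ilabel olabel [cost]"; missing cost means 0.
    fstprint $graph/HCLG.fst \
      | awk -v unk="$unk_id" -v off="$offset" '
          NF >= 4 && $4 == unk { c = (NF >= 5 ? $5 : 0) + off; print $1, $2, $3, $4, c; next }
          { print }' \
      | fstcompile > $graph/HCLG_unk_adjusted.fst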
 

Also a question about the case where we use a 3-gram for decoding, followed by a 4-gram and an RNN-LM for rescoring: the decoding 3-gram is the only LM that needs to be composed with the unk LM, right?

Probably you should look into the scripts for more details. The unk-lm is inserted into L.fst, as the pronunciation of <unk>, not composed with G.fst.

Xiaohui

Daniel Povey

Jun 9, 2018, 2:32:10 PM
to kaldi-help
>>
> That's definitely possible and I did this before. You can use an awk command
> to scale the score of arcs in HCLG.fst whose output symbol is <unk> (or
> whatever of your interest). @Dan if you think it's useful we can commit a
> simple script doing that.

Sure.

Armin Oliya

Aug 3, 2018, 8:18:02 AM
to kaldi-help
Thanks Xiaohui for the feedback. 
In my experiments, decoding with an "unk'd" graph is 2-3 times slower than with a normal graph; are there ways/tradeoffs to make it faster?

Armin


Daniel Povey

Aug 3, 2018, 2:53:15 PM
to kaldi-help
You should make sure there aren't too many copies of the unk FST in the graph-- if there are, it will make the graph large and slow to decode with.  If you used pocolm to build the LM, and used that special option (I forget, it has "unk" as part of its name, like --limit-unk-context, or something), then it should be OK.  Otherwise you should use utils/lang/adjust_unk_arpa.pl on the ARPA, before starting the building process.
Dan
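
A quick way to sanity-check this (illustrative paths) is just to compare graph sizes:

    fstinfo exp/chain/tdnn/graph/HCLG.fst     | grep -E '# of (states|arcs)'
    fstinfo exp/chain/tdnn/graph_unk/HCLG.fst | grep -E '# of (states|arcs)'

If the unk'd graph is many times bigger, the LM history after <unk> probably wasn't limited.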



Rudolf A. Braun

Sep 16, 2020, 8:58:43 AM
to kaldi-help
Did any of the scripts discussed above, like "oov_recovery.sh", ever get committed?

On Friday, August 3, 2018 at 8:53:15 PM UTC+2 Dan Povey wrote:
You should make sure there aren't too many copies of the unk FST in the graph-- if there are, it will make the graph large and slow to decode with.  If you used pocolm to build the LM, and used that special option (I forget, it has "unk" as part of its name, like --limit-unk-context, or something), then it should be OK.  Otherwise you should use utils/lang/adjust_unk_arpa.pl on the ARPA, before starting the building process.
Dan


On Fri, Aug 3, 2018 at 5:18 AM, Armin Oliya <armin...@gmail.com> wrote:
Thanks Xiaohui for the feedback. 
In my experiments, decoding with an "unk'd" graph is 2-3 times slower than with a normal graph; are there ways/tradeoffs to make it faster?

Armin



Daniel Povey

Sep 16, 2020, 11:24:50 AM
to kaldi-help
It may be this:
egs/tedlium/s5_r2/local/run_learn_lex_greedy.sh
But this was aimed at a training scenario, where the words are known but their pronunciations are not.
So not the same as attempted decoding with OOV recovery.
I think we just didn't commit anything like that as we found the problem too hard.

Rudolf A. Braun

Sep 17, 2020, 11:36:26 AM
to kaldi-help
Okay thank you

Tanel Alumäe

Sep 21, 2020, 4:43:14 AM
to kaldi-help
Hi!

We are using decoding with unknown-word recovery in our Estonian system. We use FST-based G2P rules for generating a pronunciation lexicon, so in order to turn the recovered phoneme sequence back into an orthographic word, we use an inverse of the G2P transducer (i.e., a P2G transducer). This process is of course ambiguous, so it is composed with a transducer that represents letter n-grams (I believe we use something like pruned 10-grams).
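
A minimal sketch of that P2G idea with plain OpenFst tools (all file names are hypothetical, and the real system has more machinery around it):

    # g2p.fst maps letters -> phones, so its inverse maps phones -> letters.
    fstinvert g2p.fst > p2g.fst

    # phone_seq.fst is a linear acceptor over the recovered phone sequence;
    # letter_lm.fst is the letter n-gram acceptor. All three FSTs must use
    # consistent symbol tables.
    fstcompose phone_seq.fst p2g.fst \
      | fstcompose - letter_lm.fst \
      | fstshortestpath \
      | fstrmepsilon | fsttopsort \
      | fstprint --isymbols=phones.syms --osymbols=letters.syms

The output labels along the single remaining path are the best-scoring spelling.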
In our experience, the unknown-word recovery actually works really well. It typically recovers two types of words: rare inflections of (somewhat) rare words (that didn't make it into the 200k lexicon) and names. Of course, with foreign names the resulting words are very often incorrect (since they are recovered using Estonian P2G rules), but for human readers they are much better than the typical ASR errors caused by unknown words.

This paper also has a section on it: http://ebooks.iospress.nl/publication/50297


Hope this helps,
Tanel


Rudolf A. Braun

Sep 21, 2020, 11:21:27 AM
to kaldi-help
Thank you, that sounds really useful!