Right way to work with OOV words

Sergei Tushev

Jul 31, 2019, 9:35:20 AM

to kaldi-help

Hello.

Could you please tell me the right way to handle OOV (out-of-vocabulary) words?

For example, suppose the audio says "I live in London". The word "London" is OOV in my model, so my ASR outputs "I live in Monday".

The word "Monday" can get a high confidence, up to 100%, even though it is wrong.

1. How can I get the word "<unk>" instead of "Monday"?

2. How can I get a lower confidence (e.g. 20%) for the word "Monday"?

P.S. My LM contains the word <unk>. Maybe I am doing something wrong?

Thank you.

Daniel Povey

Jul 31, 2019, 3:21:26 PM

to kaldi-help

It's very hard to fix that kind of problem unless you model the OOV words in a more thorough way.

One way is to build a phone LM to model the unk words properly; you can look for scripts called `run_unk_model.sh`.

However, the way that I think we should adopt going forward is to build a word-piece system.

In this kind of system the acoustic modeling units would be graphemes, but the tokens in the language model would be word pieces, so it can naturally cover any word.

Dan
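
To make the coverage argument concrete, here is a toy sketch of why a word-piece inventory can spell any word, including one like "London" that is missing from a word-level lexicon. It is not taken from any Kaldi recipe: the `segment` function, the `PIECES` inventory, and the greedy longest-match rule are all assumptions made for illustration. Real systems usually learn the pieces from training text (for example with BPE) and also mark word boundaries.

```python
def segment(word, pieces):
    """Greedy longest-match-first split of `word` into units from `pieces`."""
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate first and shrink until one is in the inventory.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            # Only reachable if the inventory lacks the single character word[i].
            raise ValueError(f"no piece covers {word[i]!r}")
    return out


# Hypothetical inventory: every lowercase letter (which guarantees coverage)
# plus a handful of longer pieces.
PIECES = set("abcdefghijklmnopqrstuvwxyz") | {"lon", "don", "mon", "day", "live", "in"}

print(segment("london", PIECES))  # ['lon', 'don']
print(segment("monday", PIECES))  # ['mon', 'day']
print(segment("kaldi", PIECES))   # ['k', 'a', 'l', 'd', 'i']
```

Because single graphemes are always in the inventory, no input is ever mapped to `<unk>`; an unseen word is simply spelled out with shorter pieces.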

Daniel Povey

Jul 31, 2019, 3:21:56 PM

to kaldi-help, Dongji Gao

Forgot to mention: currently I think our only example of a word-piece system is `egs/gale_arabic/s5c`.

Dongji (cc'd) may be working on another though.
