OOV words


Ana Montalvo

Oct 4, 2016, 11:50:26 AM
to kaldi-help
Hi all,
Do all the words in my corpus.txt have to be defined in lexicon.txt?
I thought that a word not declared in the lexicon was simply treated as OOV. Isn't that the case?
What options do I have if my corpus.txt contains a lot of names? Do I have to add them to the lexicon one by one?

I am getting this error:

===== MAKING G.fst =====

arpa2fst -
Processing 1-grams
Processing 2-grams
Connected 0 states without outgoing arcs.
FATAL: FstCompiler: Symbol "ABBOUD" is not mapped to any integer arc ilabel, symbol table = /home/ana/Desktop/kaldi/kaldi-trunk/egs/wsjcam0/data/lang/words.txt, source = standard input, line = 11


Thanks in advance.

Daniel Povey

Oct 4, 2016, 1:28:28 PM
to kaldi-help
All the tools that build LMs have an option to specify the vocabulary
to use, so you could pass in the vocabulary (word list) from your
lexicon. Alternatively, you could choose a word list (e.g. based on
word counts in your corpus) and use g2p to get pronunciations for
those words (search for train_g2p.sh and apply_g2p.sh).
Dan
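
Both options above start from knowing which corpus words are missing from the lexicon. The following is a minimal Python sketch of that check, not a Kaldi tool. It assumes that each lexicon.txt line starts with the word followed by its phones and that corpus.txt is whitespace-tokenized; the function names and the toy data are hypothetical.

```python
from collections import Counter


def lexicon_words(lexicon_lines):
    """Return the set of words in a lexicon with lines like 'WORD ph1 ph2 ...'."""
    words = set()
    for line in lexicon_lines:
        fields = line.split()
        if fields:  # skip blank lines
            words.add(fields[0])
    return words


def missing_word_counts(corpus_lines, vocab):
    """Count how often each corpus word that is absent from vocab occurs."""
    counts = Counter()
    for line in corpus_lines:
        for word in line.split():
            if word not in vocab:
                counts[word] += 1
    return counts


# Toy data for illustration only; in practice, pass open file handles
# for lexicon.txt and corpus.txt instead of these lists.
lexicon = ["HELLO hh ah l ow", "WORLD w er l d"]
corpus = ["HELLO ABBOUD", "HELLO WORLD", "ABBOUD SAID HELLO"]

vocab = lexicon_words(lexicon)
missing = missing_word_counts(corpus, vocab)

# Most frequent missing words first.
for word, count in missing.most_common():
    print(word, count)
# prints:
# ABBOUD 2
# SAID 1
```

The set returned by `lexicon_words` could be written out one word per line as the vocabulary file for the LM tool, and the most frequent missing words are natural candidates for g2p.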

Danijel Korzinek

Oct 4, 2016, 5:17:44 PM
to kaldi-help
OOV refers to the language model. The word still has to be in the lexicon; otherwise there would be no way to link the word to phonemes or acoustic observations. The system HAS to know how a word is pronounced in order to recognize it.
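
One way to keep the LM training text consistent with the lexicon, sketched here as an illustration rather than something proposed in this thread, is to replace every out-of-lexicon word with a single unknown-word symbol before building the LM. The symbol `<UNK>` and the function name `map_oov` are assumptions; whichever symbol is used must itself have an entry in the lexicon.

```python
def map_oov(corpus_lines, vocab, unk="<UNK>"):
    """Replace every word that is not in vocab with the unknown-word symbol."""
    return [
        " ".join(word if word in vocab else unk for word in line.split())
        for line in corpus_lines
    ]


# Toy vocabulary and corpus for illustration only.
vocab = {"HELLO", "WORLD", "<UNK>"}
corpus = ["HELLO ABBOUD", "HELLO WORLD"]

for line in map_oov(corpus, vocab):
    print(line)
# prints:
# HELLO <UNK>
# HELLO WORLD
```

With the text mapped this way, every symbol in the resulting ARPA file also appears in words.txt. The trade-off is that names such as ABBOUD can then never be recognized as themselves, so this only suits words you do not need in the output.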