> I guess the focus here is more to increase recall for OOV (out-of-vocabulary)
> words rather than to improve WER.
>
>
> From my understanding, if we have an OOV lexicon ready, we can use
> run_unk_model.sh in the tedlium s5_r2 recipe (thanks Daniel).
> What it does is force unks to be decoded as one of the entries in the OOV
> lexicon, so there will be no unks in the final output.
Actually that's not quite accurate (although it wouldn't be hard to
modify the script to behave as you describe). What the script
actually has is a phone-level n-gram LM that it inserts into the graph
where <unk> would normally be decoded. So it will still decode <unk>,
but it will have a sequence of real phones that can be turned back
into a word by post-processing if you want.
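For what it's worth, that post-processing can be as simple as
nearest-neighbor matching of the decoded phone run against a
g2p-generated OOV lexicon. A minimal sketch in Python (how you extract
the per-<unk> phone sequence from the alignment or a phone-level CTM
is up to your pipeline, and the names below are made up):

def edit_distance(a, b):
    """Plain Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (pa != pb)))  # substitution
        prev = cur
    return prev[-1]

def load_oov_lexicon(path):
    """Lexicon format: one 'WORD ph1 ph2 ...' entry per line."""
    lex = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts:
                lex.append((parts[0], parts[1:]))
    return lex

def recover_word(unk_phones, lexicon, max_dist=2):
    """Return the OOV word whose pronunciation is closest to the
    decoded phone run, or None if nothing is within max_dist edits."""
    best_word, best_dist = None, max_dist + 1
    for word, phones in lexicon:
        d = edit_distance(unk_phones, phones)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word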
> Here the main LM/lexicon doesn't have to be kept updated; as long as the
> user provides an OOV lexicon (plain-text OOV list > g2p > lexicon),
> new combined lang_unk LMs (and new graphs) can be created and used for
> decoding.
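To be concrete about the "OOV list > g2p > lexicon" step: g2p() below
is a hypothetical stand-in for whatever G2P tool you trained (Sequitur,
Phonetisaurus, ...), and after writing the lexicon you would rebuild
the lang directory and graph (prepare_lang.sh / mkgraph.sh) as usual.

def g2p(word):
    # Hypothetical stand-in: call your trained G2P model here and
    # return a list of phones, e.g. ["K", "AE1", "L", "D", "IY0"].
    raise NotImplementedError

def make_oov_lexicon(oov_list_path, lexicon_out_path):
    """Turn a plain-text OOV word list into lexicon.txt entries."""
    with open(oov_list_path) as fin, open(lexicon_out_path, "w") as fout:
        for line in fin:
            word = line.strip()
            if word:
                fout.write(word + " " + " ".join(g2p(word)) + "\n")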
>
> This is also, I guess, how services like VoiceBase allow their users to add
> custom vocabulary; this is especially interesting because the LM weights
> don't need to be tinkered with.
> Please let me know if my understanding of the above is incorrect.
>
>
>
> There's yet another case where the OOV lexicon is not available or changes
> frequently -- or we don't have the time required to rebuild the graph.
> An example could be a company like Amazon, with all sorts of new products
> added hourly -- and we'd like to be able to search product names in customer
> call-center transcriptions.
> The call volume is high and the transcription pipeline is already fully
> utilized. I guess this is also related to what people refer to as 'phonetic
> search'.
>
> The two options you proposed:
>
> Having the model produce actual phonemes instead of unks; in this case, user
> OOV queries can first be converted to phonemes and then (fuzzy-)searched
> against the transcript, like a traditional full-text search.
>
> Yenda, you suggested this; could you please give a few more pointers on how
> to implement it?
I think it would be better to run something like run_unk_model.sh,
which is more of a hybrid -- real words stay as words, but OOVs will
(hopefully) be decoded as sequences of phones.
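Whichever way the phones are produced, the search side can start out
very simple: g2p the query, then slide a window over the phone stream
and keep the near matches. A self-contained sketch using the Python
stdlib's difflib (the threshold and window sizes are arbitrary; a real
system would weight substitutions by phone confusability):

from difflib import SequenceMatcher

def fuzzy_phone_search(query_phones, stream_phones, min_ratio=0.8):
    """Yield (start, end, score) for windows of the phone stream
    that look like the query pronunciation."""
    n = len(query_phones)
    for win in (n - 1, n, n + 1):    # tolerate slight length mismatch
        if win < 1:
            continue
        for start in range(len(stream_phones) - win + 1):
            window = stream_phones[start:start + win]
            score = SequenceMatcher(None, query_phones, window).ratio()
            if score >= min_ratio:
                yield (start, start + win, score)

# e.g.: hits = list(fuzzy_phone_search(g2p("amazon"), transcript_phones))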
>
> Remove unk arcs from the graph (as Dan suggested) and get the graph to
> produce the closest matching word(s) (ref. the discussion here). Is there
> then a way to know that those forced words are actually unks, e.g. if they
> always have low confidence, and to match them against OOV queries?
Confidence in this type of scenario is hard.
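If you want to experiment anyway, a crude heuristic (with no guarantee
it separates OOVs from ordinary recognition errors) is to take per-word
confidences -- e.g. a CTM from Kaldi's lattice-to-ctm-conf, whose lines
look like "utt chan start dur word conf" -- and treat the low-confidence
words as candidate sites to match against the OOV queries:

def low_confidence_words(ctm_path, threshold=0.5):
    """Return (utt, start, dur, word) for words below the threshold."""
    hits = []
    with open(ctm_path) as f:
        for line in f:
            utt, chan, start, dur, word, conf = line.split()
            if float(conf) < threshold:
                hits.append((utt, float(start), float(dur), word))
    return hits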
Dan