Extend Kaldi ASR to new words


Billy

Feb 18, 2017, 12:06:47 AM2/18/17
to kaldi-help
Hi,

Is there a way to get the phoneme-level WFSA or possible phoneme sequences for unseen new words (for example, drug names like methamphetamine, ketamine, etc.) with the Kaldi ASR system?

If we directly decode the new word pronunciations using a model trained with a standard Kaldi script (say under egs/swbd/s5b/run.sh), the decoding result for a new word will be a sequence of pre-existing dictionary words, or a phoneme sequence corresponding to those pre-existing words, which may deviate from the real pronunciation because the decoding graph is constrained by a lexicon and grammar that do not include the new word.

Thanks.


Daniel Povey

Feb 18, 2017, 12:11:12 AM2/18/17
to kaldi-help
In egs/tedlium/s5_r2/run.sh, you'll see a commented-out line that says:

# local/run_unk_model.sh     

This script demonstrates how you can add something to the decoding graph that makes it possible to decode arbitrary phone sequences in addition to regular words; those phone sequences fill the slot where the "unknown word" would normally appear in the LM.
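To make the mechanism concrete: the --unk-fst option of utils/prepare_lang.sh takes an FST in OpenFst text format whose labels are phones. The real exp/unk_lang_model/unk_fst.txt encodes a trained phone-level LM, but as a purely illustrative stand-in (the phone names and costs here are invented; costs are negative log-probabilities), a uniform phone loop would look something like:

```
0 0 AH AH 2.3
0 0 K K 2.3
0 0 S S 2.3
0 0.5
```

Each arc line is "source-state dest-state input-label output-label cost", and the last line marks state 0 as final with its final cost.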

But don't expect the resulting decoded phone sequences to be particularly accurate.  Also, getting the actual phone sequences is not 100% trivial; the simplest method would be to pipe the lattices through lattice-best-path (with the correct acoustic scale), then lattice-align-words, then lattice-arc-post, then int2sym a couple of times with suitable options to convert the words and phones to text form.
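A rough command-line sketch of that pipeline (a sketch only, not something tested here: the directory paths, the acoustic scale, and the field numbers passed to int2sym.pl are all assumptions to be adapted to your setup; note also that the binary that emits a one-best *lattice* suitable for piping onward is lattice-1best, while lattice-best-path writes transcriptions and alignments):

```
# Recover the phone sequence on the best path of decoded lattices,
# assuming a trained model and an unk-model lang directory already exist.
. ./path.sh
lattice-1best --acoustic-scale=0.1 "ark:gunzip -c exp/tri3/decode/lat.1.gz |" ark:- | \
  lattice-align-words data/lang_unk/phones/word_boundary.int exp/tri3/final.mdl ark:- ark:- | \
  lattice-arc-post --acoustic-scale=0.1 exp/tri3/final.mdl ark:- - | \
  utils/int2sym.pl -f 5 data/lang_unk/words.txt | \
  utils/int2sym.pl -f 6- data/lang_unk/phones.txt > arc_posteriors.txt
```

This requires a working Kaldi installation and a decoded experiment directory, so treat it as a template rather than a runnable recipe.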



Dan



--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Billy

Feb 18, 2017, 12:37:09 AM2/18/17
to kaldi-help
Thank you so much for the detailed answer. I will look into the script you pointed out.

Would it also be possible to get the phoneme-level WFSA for a new word from "local/run_unk_model.sh"? By phoneme-level WFSA I mean one that includes multiple hypotheses of the new word's pronunciation.

Thanks. 

Daniel Povey

Feb 18, 2017, 12:39:36 AM2/18/17
to kaldi-help
That would be quite tricky to do.  It would be possible-- by changing lattice determinization options-- to get multiple pronunciations of the unk word to remain in the lattice, but extracting them as a WFSA would be quite tricky due to interaction with surrounding context words.  One possible way would be to get the best path, then compose the lattice with the word sequence from that best path, doing lattice-align-words, and extracting just the bits aligning with the word you want.  But it would require deep knowledge of FSTs and of Kaldi and I don't recommend that you attempt it in the near future.


Billy

Feb 18, 2017, 12:53:31 AM2/18/17
to kaldi-help, dpo...@gmail.com
Thank you, Dan!

If I just want to get multiple hypothesized phoneme sequences of the new word, is it correct to go through the following steps as you described?

1. Decode with modified lattice determinization options so that multiple pronunciations of the unk word remain in the lattice.

2. Pipe the lattices through lattice-best-path (with the correct acoustic scale), then lattice-align-words, then lattice-arc-post, then int2sym a couple of times with suitable options to convert the words and phones to text form.

Thanks.

Daniel Povey

Feb 18, 2017, 1:21:56 AM2/18/17
to Billy, kaldi-help
No, that would not work.
It's more complicated and you won't be able to do it, sorry.

Billy

Feb 21, 2017, 12:08:52 AM2/21/17
to kaldi-help
I see. Thank you, Dan. I have run "local/run_unk_model.sh" and I really admire the wonderful idea behind it.

Now I only want to get the phoneme sequence for unseen new words. In our dataset, each training utterance contains exactly one occurrence of the new word. In this case, a word-level LM (grammar G) may not be necessary for decoding these utterances.

If I use the HCLG produced by "local/run_unk_model.sh" for decoding, the phoneme sequence may still correspond to pre-existing words, since the word-level LM may assign large weights to decoding paths that we do not want.

Would you suggest that we build another graph, without word-level LM information, for decoding the single-word utterances? Is there a simple implementation in Kaldi?

Thanks.

Daniel Povey

Feb 21, 2017, 12:21:27 AM2/21/17
to kaldi-help
You do need the language model; otherwise you'll get nonsense.  I suspect the language model that script produces is close to the best you can do in this scenario.


Billy

Feb 21, 2017, 6:23:14 PM2/21/17
to kaldi-help
Thank you, Dan. I will definitely try that out.

Sorry that my wording last time was a little misleading. I actually meant that each training utterance consists of only a single pronunciation of the unseen new word (no other words), and I want to figure out the best HCLG for decoding these utterances.

For example, the word-level transcription (ground truth) for the training wav file "methamphetamine_1.wav" is methamphetamine; the word-level transcription (ground truth) for the training wav file "methamphetamine_2.wav" is methamphetamine...

I agree that the HCLG from "local/run_unk_model.sh" should be my best choice. But I was just wondering why the word-level LM is still necessary for decoding these wav files, given that a word-level LM is a distribution over sequences of words and there is only one word in my case.

Thanks.

Daniel Povey

Feb 21, 2017, 6:24:21 PM2/21/17
to kaldi-help
That HCLG is not really a word-level LM, it's phone level.
There are "words" as far as the graph building setup is concerned, but they are really phones.



Billy

Feb 21, 2017, 10:42:06 PM2/21/17
to kaldi-help
Thank you, Dan.

But in "kaldi/egs/tedlium/s5_r2/local/run_unk_model.sh", we see the following three lines.

utils/prepare_lang.sh --unk-fst exp/unk_lang_model/unk_fst.txt data/local/dict "<unk>" data/local/lang data/lang_unk

cp data/lang/G.fst data/lang_unk/G.fst

utils/mkgraph.sh data/lang_unk exp/tri3 exp/tri3/graph_unk

In "utils/prepare_lang.sh", the "<unk>" (more specifically "#2") in "L.fst" is replaced by the phone-level LM "unk_fst.txt". Then in "utils/mkgraph.sh", "L.fst" is composed with the word-level LM "G.fst" from "data/lang".

Does that mean the final HCLG is still constrained by the word-level grammar?

Daniel Povey

Feb 21, 2017, 11:18:06 PM2/21/17
to kaldi-help
Yes, there is still a word-level LM, but it will also decode arbitrary phone sequences when it outputs the word "<unk>".



ark...@onvego.com

Oct 9, 2017, 12:19:23 PM10/9/17
to kaldi-help
Hi Daniel
I made an unk model and tried to get the phonemes for words that are not in the lexicon.
I followed your instructions (the simplest method would be to pipe ..) and did everything in C++.
Here is my code:
kaldi::BaseFloat min_post = 0.0001;
CompactLattice aligned_clat;
WordBoundaryInfoNewOpts opts;
WordBoundaryInfo info(opts, "/path/to/word_boundary.int");
bool ok = WordAlignLattice(best_path_clat, trans_model, info, 0, &aligned_clat);
// Apply LM/acoustic scaling to the lattice we are about to use:
fst::ScaleLattice(fst::LatticeScale(2.5, 0.4), &aligned_clat);
kaldi::TopSortCompactLatticeIfNeeded(&aligned_clat);
kaldi::ArcPosteriorComputer computer(aligned_clat, min_post, false, &trans_model);
std::vector<int32> phoneme = computer.OutputPosteriors();

I made minor changes in several methods.

The result: I get the sequences of phonemes for the words in my lexicon, but for words that are not in the lexicon, I get the "sil" phoneme. What's wrong?

Daniel Povey

Oct 9, 2017, 1:22:28 PM10/9/17
to kaldi-help
If there was no "unk" in your language model it would never decode
unk; look at your G.fst and see if there is unk on the labels. There
are options when you create the language model.

Also from the word-sequence point of view it would just output "unk",
not the phone sequence. You'd have to do something like
lattice-align-words and then lattice-arc-post if you want to see the
phone sequence.

ark...@onvego.com

Oct 9, 2017, 1:44:06 PM10/9/17
to kaldi-help
Thanks
I'll look at G.fst 
It is important to note that before I built the unk model, I used the graph I built following the aspire recipe, and I did get unk, so doesn't that mean it already exists in G.fst?

Correct me if I'm wrong, but I do exactly what you prescribed:

1. lattice-best-path, as follows:

CompactLattice clat;
bool end_of_utterance = true;
decoder->GetLattice(end_of_utterance, &clat);
CompactLattice best_path_clat;
CompactLatticeShortestPath(clat, &best_path_clat);
Lattice best_path_lat;
ConvertLattice(best_path_clat, &best_path_lat);

2. lattice-align-words, as follows:

WordBoundaryInfoNewOpts opts;
WordBoundaryInfo info(opts, "/path/to/word_boundary.int");
bool ok = WordAlignLattice(best_path_clat, trans_model, info, 0, &aligned_clat);

3. lattice-arc-post, as follows:

kaldi::TopSortCompactLatticeIfNeeded(&aligned_clat);
kaldi::ArcPosteriorComputer computer(aligned_clat, min_post, false, &trans_model);
std::vector<int32> phoneme = computer.OutputPosteriors();

And instead of printing the phonemes, I return them from the OutputPosteriors function.
Isn't that the right way to get the phonemes? For words that are in my lexicon, I do get the phonemes this way...

Daniel Povey

Oct 9, 2017, 1:46:39 PM10/9/17
to kaldi-help
Try increasing the probability of unk by modifying the costs in G.fst
and rebuilding your graph; perhaps it was too unlikely, so it never got
decoded.

ark...@onvego.com

Oct 11, 2017, 6:01:49 AM10/11/17
to kaldi-help
That was the problem, I took care of it and now I get the phonemes as I wanted.
Thank you !
The next problem is that sometimes I get phonemes that are far from reflecting what I said.
I saw that you wrote above that the process should be performed several times
until I get maximum accuracy - what is the purpose of the repetitions?

Daniel Povey

Oct 11, 2017, 6:50:30 PM10/11/17
to kaldi-help
Rerunning the exact same thing will never make a difference. You'd
have to provide me with an exact quote of what I said in context for
me to clarify.