Union of n-gram FSTs failing at decode time...


marc....@protonmail.com

Nov 16, 2022, 8:56:15 AM
to kaldi-developers
Hello,

I am trying to build an HCLG decoding graph for a union of small n-gram LMs (each modeling a small, well-defined task, with separate text data for each, and with the tasks competing against each other and against a filler model). To do so, I convert each ARPA LM into a WFST and add one disambiguation symbol at the end of every sequence of each task (a new disambiguation symbol per task, since there can be some word-sequence overlap across tasks). I then take the union of the task WFSTs and close the whole thing into a loop. I also replace the back-off disambiguation symbol of each n-gram WFST with a new one, though I think that case is already protected by the task-wise disambiguation symbol.
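As a miniature illustration of the per-task disambiguation idea (plain Python, not Kaldi code; the task contents and the #task1/#task2 symbol names are made up), tagging each task's sequences with its own end symbol keeps overlapping sequences apart in the union:

```python
# Toy illustration: each task's accepted word sequences get a
# task-specific end marker, so the same word sequence stays distinct
# per task after taking the union.

def tag_task(sequences, disambig):
    """Append a task-specific disambiguation symbol to every sequence."""
    return {seq + (disambig,) for seq in sequences}

# Hypothetical tasks with an overlapping sequence ("call", "home").
task1 = {("call", "home"), ("call", "office")}
task2 = {("call", "home")}

union = tag_task(task1, "#task1") | tag_task(task2, "#task2")
# The overlap survives as two distinct tagged paths (3 sequences in
# total), which is what lets determinization of the union FST keep
# the tasks separate.
```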

Now, this works and I can build the decoding graph successfully. At decode time (online incremental decoding), though, lattice determinization fails with the message:

     ERROR (online2-wav-nnet3-latgen-incremental[5.5.1189~1-f3cf2]:DeterminizeLatticePhonePrunedWrapper():determinize-lattice-pruned.cc:1495) Topological sorting of state-level lattice failed (probably your lexicon has empty words or your LM has epsilon cycles).

Since I have no empty words in the lexicon, I thought this might be caused by each n-gram WFST accepting zero words (i.e., epsilon cycles after the disambiguation symbols are removed) in addition to the n-gram word sequences. What is the easiest way to force the n-gram FST to accept at least one word while keeping all other n-gram costs untouched? I have spent quite some time forcing the main loop state in the n-gram WFST to be entered from an additional state that swallows at least one word transition (with 1-gram, 2-gram, ... arcs going to the corresponding states) before reaching the loop state (the idea being to implement a +-closure from a *-closure graph, though the n-gram WFST can be quite complex itself). This seems to work with tiny test n-gram models but not really with larger ones, so something I did is still wrong. Any ideas what I should be careful about? Would it be possible to achieve this at the ARPA LM level at all? Or would it be easier to handle at the arpa2fst conversion stage?
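For what it's worth, the failure mode named in the error message can be reproduced in miniature: topological sorting only exists for acyclic graphs, and a closed union whose component FSTs accept zero words contributes an epsilon cycle once the disambiguation symbols are removed. A toy check (plain Python, nothing Kaldi-specific; label 0 stands for epsilon, as in OpenFst):

```python
# Toy check: topological sort fails exactly when the epsilon-only
# subgraph has a cycle, which is what the Kaldi error reports.

def has_eps_cycle(arcs, num_states):
    """arcs: list of (src, dst, label) tuples; label 0 means epsilon."""
    eps = [[] for _ in range(num_states)]
    for src, dst, label in arcs:
        if label == 0:
            eps[src].append(dst)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = [WHITE] * num_states
    def dfs(u):
        color[u] = GRAY
        for v in eps[u]:
            if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                return True  # back-edge in the eps subgraph: cycle
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and dfs(u) for u in range(num_states))

# A closure whose component can accept zero words leaves an epsilon
# self-loop at the loop state; a closure that forces at least one word
# per pass does not.
eps_loop  = [(0, 1, 5), (1, 0, 0), (0, 0, 0)]  # eps self-loop at state 0
word_loop = [(0, 1, 5), (1, 0, 0)]             # one word per cycle
print(has_eps_cycle(eps_loop, 2))   # True: topological sort would fail
print(has_eps_cycle(word_loop, 2))  # False
```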

Thanks for any hint,

Marc

Nikolay Malkovskiy

Nov 17, 2022, 1:22:57 PM
to kaldi-developers
Empty words might be referring to this check in lexicon creation:
https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/make_lexicon_fst.pl#L76
which is missing in the sibling make_lexicon_fst_silprob script. Check whether you have eps in the lexicon; it shouldn't be there.

Next, it is not clear how you work with the disambiguation symbols. Essentially they are used to optimize HCLG creation, but in the end they become eps transitions and thus might cause the problem under discussion.

Finally, I would actually recommend performing the union at the n-gram level. Usually any n-gram package allows you to merge several LMs into a single one; for example, the FST-based OpenGRM library has this.



Wednesday, November 16, 2022 at 16:56:15 UTC+3, marc....@protonmail.com:

Daniel Povey

Nov 19, 2022, 10:45:27 AM
to kaldi-de...@googlegroups.com
You could compose the FSA with an FSA that accepts one or more words, that would be the easiest way I think.
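A toy version of this suggestion (a plain-Python product construction, not OpenFst; the two-state "one or more words" acceptor and the word set {"call", "home"} are made up for illustration): intersecting a *-closure grammar with an acceptor that requires at least one word removes exactly the empty path.

```python
# Sketch of "compose with an FSA that accepts one or more words" as a
# DFA product construction. A DFA is (start, finals, delta), where
# delta[(state, symbol)] -> next_state.
from collections import deque

def product(a, b):
    """Product construction: accepts the intersection of two DFAs."""
    start = (a[0], b[0])
    delta, finals = {}, set()
    syms = {s for (_, s) in list(a[2]) + list(b[2])}
    seen, q = {start}, deque([start])
    while q:
        p, r = q.popleft()
        if p in a[1] and r in b[1]:
            finals.add((p, r))
        for s in syms:
            if (p, s) in a[2] and (r, s) in b[2]:
                nxt = (a[2][(p, s)], b[2][(r, s)])
                delta[((p, r), s)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    q.append(nxt)
    return (start, finals, delta)

def accepts(dfa, seq):
    start, finals, delta = dfa
    state = start
    for s in seq:
        if (state, s) not in delta:
            return False
        state = delta[(state, s)]
    return state in finals

# *-closure grammar over {"call", "home"}: one state, accepts every
# sequence including the empty one.
star = (0, {0}, {(0, "call"): 0, (0, "home"): 0})
# "One or more words" acceptor: start state 0 is non-final, every word
# leads to the final looping state 1.
sigma_plus = (0, {1}, {(0, "call"): 1, (0, "home"): 1,
                       (1, "call"): 1, (1, "home"): 1})

plus = product(star, sigma_plus)
print(accepts(plus, ()))               # False: empty path removed
print(accepts(plus, ("call", "home"))) # True: other paths untouched
```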


--
visit http://kaldi-asr.org/forums.html to find out how to join.
---
You received this message because you are subscribed to the Google Groups "kaldi-developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-develope...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-developers/5d225b20-6efa-4682-bf58-549590b8b4c2n%40googlegroups.com.

marc....@protonmail.com

Nov 21, 2022, 8:25:25 AM
to kaldi-developers
Thanks Nikolay,


Nikolay Malkovskiy schrieb am Donnerstag, 17. November 2022 um 19:22:57 UTC+1:
Empty words might be referring to this check in lexicon creation:
https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/make_lexicon_fst.pl#L76
which is missing in the sibling make_lexicon_fst_silprob script. Check whether you have eps in the lexicon; it shouldn't be there.
I will check that, thanks.
 

Next, it is not clear how you work with the disambiguation symbols. Essentially they are used to optimize HCLG creation, but in the end they become eps transitions and thus might cause the problem under discussion.
When I take the union of the n-gram WFSTs there can be word-sequence overlap across them. I think adding a new disambiguation symbol at the end of each sequence makes sure that identical word sequences stay distinct per task. That should help determinization later.
 

Finally, I would actually recommend performing the union at the n-gram level. Usually any n-gram package allows you to merge several LMs into a single one; for example, the FST-based OpenGRM library has this.
I want to tag each task with input/output words, so merging or interpolating the LMs would not allow me to do that.

Thanks for your input, Nikolay!

marc....@protonmail.com

Nov 21, 2022, 8:30:27 AM
to kaldi-developers
Thanks Dan! That seems to be an easy and much safer way to do it. I think that will work!