Hello,
I am trying to build an HCLG decoding graph for a union of small n-gram LMs, each modeling a small, well-defined task with its own text data, with the tasks competing against each other and against a filler model. To do so, I convert each ARPA LM into a WFST and append one disambiguation symbol at the end of every word sequence of each task (a new disambiguation symbol per task, since some word sequences can overlap across tasks). I then take the union of the task WFSTs and close the result so the whole thing can loop. I also replace the back-off disambiguation symbol of each n-gram WFST with a new one, though I think the task-wise disambiguation symbol already protects against that anyway.
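In OpenFst terms, the construction is roughly this (a sketch with hypothetical file names; each end_taskN.fst is a single-arc FST emitting that task's disambiguation symbol):

    # task1.fst, task2.fst, ...: the per-task ARPA LMs compiled with arpa2fst.
    # end_taskN.fst: one-arc FST emitting the task-specific disambiguation
    # symbol (#task1, #task2, ...), appended so overlapping word sequences
    # from different tasks stay distinguishable.
    fstconcat task1.fst end_task1.fst > task1_marked.fst
    fstconcat task2.fst end_task2.fst > task2_marked.fst
    fstunion task1_marked.fst task2_marked.fst | fstclosure > G.fst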
This works, and I can build the decoding graph successfully. At decode time (online incremental decoding), though, lattice determinization fails with:
ERROR (online2-wav-nnet3-latgen-incremental[5.5.1189~1-f3cf2]:DeterminizeLatticePhonePrunedWrapper():determinize-lattice-pruned.cc:1495) Topological sorting of state-level lattice failed (probably your lexicon has empty words or your LM has epsilon cycles).
Since there are no empty words in my lexicon, I suspect this is due to each n-gram WFST accepting the empty word sequence (i.e. epsilon cycles, once the disambiguation symbols are removed) in addition to the real n-gram word sequences. What is the easiest way to force the n-gram FST to accept at least one word, while keeping all other n-gram costs untouched?

I have spent quite some time forcing the main loop state of the n-gram WFST to be entered from an additional state that swallows at least one word transition (with the 1-gram, 2-gram, ... arcs going to the corresponding states) before reaching the loop state; the idea is to turn the *-closure into a +-closure, though the n-gram WFST can be quite complex itself. This works with tiny test n-gram models but not with larger ones, so something I am doing must still be wrong. Any ideas what I should be careful about? Would it be possible to achieve this at the ARPA LM level at all? Or would it be easier to handle at arpa2fst conversion time?
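For what it is worth, the effect I am after should be the same as composing each task FST with a zero-cost "one or more words" acceptor, something like this sketch (hypothetical file names, assuming the usual words.txt symbol table with <eps> = 0 and #-prefixed disambiguation symbols):

    # Two-state acceptor: state 0 -> 1 on any real word, then word self-loops
    # on final state 1; disambiguation symbols loop freely on both states.
    # All arcs have weight One, so composing cannot change any n-gram cost.
    awk '$1 == "<eps>" { next }
         $1 ~ /^#/    { print 0, 0, $2; print 1, 1, $2; next }
                      { print 0, 1, $2; print 1, 1, $2 }
         END          { print 1 }' words.txt |
      fstcompile --acceptor > sigma_plus.fst

    # Composition removes the empty path from the task FST while leaving
    # every other path and weight untouched.
    fstarcsort --sort_type=olabel task1.fst |
      fstcompose - sigma_plus.fst > task1_plus.fst

The appeal of the composition route is that it avoids touching the n-gram topology at all, which is where I suspect my manual state surgery goes wrong on the larger models.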
Thanks for any hint,
Marc