what G.fst to use when rescoring a lattice made by a make-grammar-fst created graph


vahid K

Mar 8, 2021, 11:28:10 AM
to kaldi-help
Hi,

I am using make-grammar-fst to extend the vocabulary of my ASR system (I have a list of words that are absent from the training set, and the list can change at any time).
This works great for the initial decoding. But for rescoring with an RNNLM, I am a bit confused about which G.fst should be fed into "lattice-lmrescore-kaldi-rnnlm-pruned".
There is the G.fst for the new words, which includes only the new words. There is also the original G.fst, which includes the #nonterm:unk symbols; I suppose those are not appropriate for the RNNLM.
So which G.fst should I use? Should I combine the two and create a new one? I am afraid I would mess up some IDs that way.
(I have already extended the RNNLM vocabulary with rnnlm/change_vocab to include the list of new words.)

I'd appreciate your advice on this.
Thanks,
Vahid

Daniel Povey

Mar 8, 2021, 12:01:19 PM
to kaldi-help
You should probably use the entire FST, with the small piece inserted into the large one. It may be possible to do this with `fstreplace` once you figure out how it works (I advise experimenting with toy examples).
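
A toy experiment along these lines might look like the sketch below. Everything in it is made up for illustration: the file names (`root.txt`, `sub.txt`), the integer labels, the choice of label 10 as a stand-in for the nonterminal, and label 100 as an otherwise unused label for the root FST. The OpenFst steps only run if the binaries are on the PATH, and the printed result is meant to be inspected by hand rather than taken as a recipe.

```shell
# Toy fstreplace experiment (all labels and file names are hypothetical).
workdir=$(mktemp -d)

# Root FST in OpenFst text format: src dst ilabel olabel, then a final state.
# Label 10 plays the role of the nonterminal that should be expanded.
cat > "$workdir/root.txt" <<'EOF'
0 1 1 1
1 2 10 10
2 3 2 2
3
EOF

# Small FST to be inserted wherever label 10 appears in the root.
cat > "$workdir/sub.txt" <<'EOF'
0 1 3 3
1 2 4 4
2
EOF

if command -v fstcompile >/dev/null 2>&1 &&
   command -v fstreplace >/dev/null 2>&1 &&
   command -v fstprint >/dev/null 2>&1; then
  fstcompile "$workdir/root.txt" "$workdir/root.fst"
  fstcompile "$workdir/sub.txt" "$workdir/sub.fst"
  # Usage: fstreplace root.fst rootlabel [rule.fst label ...] [out.fst]
  fstreplace "$workdir/root.fst" 100 "$workdir/sub.fst" 10 "$workdir/out.fst"
  # Look for epsilon arcs or leftover label-10 arcs around the inserted piece.
  fstprint "$workdir/out.fst"
else
  echo "OpenFst binaries not found; text FSTs written to $workdir" >&2
fi
```

Printing the output for a case this small makes it easy to see which labels end up on the arcs entering and leaving the inserted FST, and to repeat the experiment with the inputs inverted.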


vahid kh

Mar 8, 2021, 5:30:47 PM
to kaldi...@googlegroups.com
Thanks so much Dan,
I was able to make some progress using fstreplace to replace the #nonterm:unk symbols of the bigger G.fst with the smaller FST containing the OOV words.
These are the steps I took:

fstinvert G.fst > G_i.fst
fstinvert Gsmall.fst > Gsmall_i.fst
fstreplace G_i.fst $gfst_id Gsmall_i.fst $nontermunk_id > newG_i.fst
fstinvert newG_i.fst > newG.fst
fstdeterminizestar newG.fst > newG_determinized.fst

With these steps I was able to rescore the lattices and actually get some of the introduced OOVs in the rescored lattices. But some other lattices are not successfully rescored, and for those I am getting the "Composed lattice has no states: something went wrong" message.
I was wondering whether these steps look reasonable, or whether I have missed something.
(My Gsmall.fst contains the #nonterm_begin and #nonterm_end symbols. With the steps above, I can see they still exist in the final newG_determinized.fst. Could that be the cause?)

Thanks,
Vahid


Daniel Povey

Mar 8, 2021, 10:23:45 PM
to kaldi-help
You should remove #nonterm_begin and #nonterm_end; they will cause failures in matching.
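
One way to do this, sketched here under assumptions, is to map those labels to epsilon (0) in the text form of the FST and recompile. The helper name `map_to_eps` is hypothetical, and the integer IDs have to be looked up in whatever symbol table the G.fst was built with. The helper only assumes the standard OpenFst text layout (`src dst ilabel olabel [weight]` for arcs, `state [weight]` for final states).

```shell
# Map selected labels to epsilon (0) in an OpenFst text-format FST on stdin.
# Usage: map_to_eps "ID1 ID2 ..." < in.txt > out.txt
# (hypothetical helper; output fields are tab-separated)
map_to_eps() {
  awk -v ids="$1" '
    BEGIN {
      OFS = "\t"
      n = split(ids, a, " ")
      for (i = 1; i <= n; i++) drop[a[i]] = 1
    }
    NF >= 4 {                  # arc line: src dst ilabel olabel [weight]
      if ($3 in drop) $3 = 0
      if ($4 in drop) $4 = 0
    }
    { $1 = $1; print }         # final-state lines pass through unchanged
  '
}

# Possible use, with IDs taken from your own symbol table (assumed variables):
#   fstprint newG.fst | map_to_eps "$begin_id $end_id" | fstcompile > newG_clean.fst
```

Since this introduces new epsilon arcs, any later determinization step would need to be rerun on the cleaned FST.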

Armando

Oct 8, 2021, 3:24:39 PM
to kaldi-help
I am not sure whether, following Dan's advice, you were able to rescore the lattices correctly. Reading your message, I was wondering how you were able, in certain cases, to enter the secondary FST correctly if 1) your lattice does not contain any "nonterm" output and 2) your secondary FST does, and at the very beginning. I would have expected a matching failure to occur every time during composition.
(btw, I'm not sure why you need to invert the FST before the replace)

Unless I have missed something, this grammar-fst lattice rescoring task has never been addressed in a definitive way. I think the cause might be something as simple as an epsilon arc introduced by fstreplace at the transition into the secondary FST, which would always cause matching failures.

I actually came up with a different implementation, but it should be conceptually similar (or equivalent):

1- I retain all nonterminals in the output (nonterm_unk, nonterm_begin, nonterm_end)
2- I traverse the lattice depth-first and mark each arc with the FST that generated it, which I can tell from the nonterminal arcs
3- I compose each lattice arc with the corresponding arcs of the corresponding FST (this is done both for subtracting the old LM scores and for adding the new ones, so it's the usual rescoring pipeline)

Then again, you need to be careful about matching at the transition. fstconcat of nonterm_begin and the n-gram secondary FST actually generates an epsilon arc between the two, and since I use BackoffDeterministicOnDemandFst, that arc was treated as the backoff arc, so I was never really composing with the actual secondary n-gram.
I suspect my struggle with fstreplace was related to a similar issue.
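
A quick diagnostic for this kind of problem is to list the arcs with an epsilon input label in the text form of the combined FST and check whether any sit at the entry to the secondary FST. The helper name `eps_input_arcs` below is hypothetical, and it only assumes the standard OpenFst text layout; it does not tell you which epsilon arcs are legitimate backoff arcs.

```shell
# Print arcs whose input label is epsilon (0) from a text-format FST on stdin.
# (hypothetical helper; arc lines are: src dst ilabel olabel [weight])
eps_input_arcs() {
  awk 'NF >= 4 && $3 == 0 { print }'
}

# Possible use on a compiled FST (requires OpenFst):
#   fstprint newG.fst | eps_input_arcs | head
```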