Hello,
I am trying to build a decoding graph for an externally trained DNN that uses CTC. I built the recipe to be as compatible as possible with the standard Kaldi scripts (including the <sil> phone, which is not actually trained in CTC), so the dict and lang directories look as usual in Kaldi, except that I use position-independent characters as units. lang/phones/sets.{int,txt} has 32 rows (31 characters + 1 CTC blank), matching the number of transition-model pdfs; each unit has a 1-state HMM with a single pdf (the forward and self-loop pdf-ids are the same).
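Just to make sure the model really matches that description, I can run a small sanity check along these lines (only my own sketch, using the standard TransitionModel calls; the function name is made up):

#include <unordered_map>
#include "base/kaldi-common.h"
#include "hmm/transition-model.h"

// Sketch: check that every transition-id of a unit maps to the same pdf-id
// (1-state HMMs whose forward and self-loop transitions share one pdf) and
// that the number of distinct pdfs equals the number of units.
void CheckOnePdfPerUnit(const kaldi::TransitionModel &trans_model) {
  using namespace kaldi;
  std::unordered_map<int32, int32> phone_to_pdf;
  for (int32 tid = 1; tid <= trans_model.NumTransitionIds(); tid++) {
    int32 phone = trans_model.TransitionIdToPhone(tid);
    int32 pdf = trans_model.TransitionIdToPdf(tid);
    if (phone_to_pdf.count(phone) == 0)
      phone_to_pdf[phone] = pdf;
    else
      KALDI_ASSERT(phone_to_pdf[phone] == pdf);
  }
  KALDI_ASSERT(phone_to_pdf.size() == static_cast<size_t>(trans_model.NumPdfs()));
}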
Now I want to take the existing HCLG decoding graph and modify it so that it can swallow the blank after each "emitting" transition in HCLG, as recently described in:
- "Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces", Frank Zhang et al, INTERSPEECH 2020 (
https://arxiv.org/pdf/2005.09150.pdf )
Please note that I am using characters as speech units, but I still build the context FST C in the recipe, with N=1 and P=0. I assume that effectively makes C a pass-through, mapping each unit to itself as a context-independent "context"...
I have written the Kaldi code below to perform this graph conversion, and it decodes only parts of the utterances correctly; maybe some 50% of the words are correct. The decoded utterances are usually shorter than they should be, and some of them do not reach any final state (the decoder outputs partial hypotheses once in a while). So there is something broken in the decoding graph I am building.
Here is the code that transforms HCLG.fst into HCLG-ctc.fst, which should work on CTC pdfs. First, I get a vector of tids assigned to the <blank> pdf in the TransitionModel:
std::vector<int32> blank_tids;
GetPdfToTransitionIds(trans_model, blank_pdf_id, blank_tids);

GetPdfToTransitionIds queries trans_model.TransitionIdToPdf() for every transition-id and collects those that map to the desired pdf, the CTC blank here.
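For completeness, a minimal sketch of that helper (this is essentially all it does; the signature matches the call above):

// Collect all transition-ids whose pdf-id equals the given pdf
// (the CTC blank pdf in this case). Transition-ids are 1-based in Kaldi.
void GetPdfToTransitionIds(const kaldi::TransitionModel &trans_model,
                           kaldi::int32 pdf_id,
                           std::vector<kaldi::int32> &tids) {
  tids.clear();
  for (kaldi::int32 tid = 1; tid <= trans_model.NumTransitionIds(); tid++)
    if (trans_model.TransitionIdToPdf(tid) == pdf_id)
      tids.push_back(tid);
}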
Then I start by copying the original decoding graph, so that looping over the states and arcs of the original does not interfere with the changes I make. I assume the copy is an exact copy with the same state and arc enumeration:
fst::VectorFst<fst::StdArc> *fst_ctc = fst->Copy();

Next I loop over the states and the arcs of each state in the original graph, duplicating states and adding the corresponding arcs as in Figure 1 of the paper above:
// (Arc, StateId, Weight are the usual fst::StdArc typedefs.)
for (StateId s = 0; s < fst->NumStates(); s++) {
  // vectors storing the emitting self-loop / forward arcs leaving s
  std::vector<Arc> self_loop_arcs;
  std::vector<Arc> forward_arcs;
  for (fst::MutableArcIterator<fst::VectorFst<Arc> > aiter(fst, s);
       !aiter.Done(); aiter.Next()) {
    Arc arc = aiter.Value();
    // convert <sil> tid to blank tid
    ...
    // if it is an emitting arc...
    if (arc.ilabel > 0) {
      // store forward and self-loop arcs separately
      if (arc.nextstate == s)
        self_loop_arcs.push_back(arc);
      else
        forward_arcs.push_back(arc);
    }
  }
  // prepare a new state that will swallow blanks for state s
  StateId ctc_s = fst_ctc->AddState();
  // blank self-loops on ctc_s, one per blank transition-id
  for (int32 blank_tid : blank_tids)
    fst_ctc->AddArc(ctc_s, Arc(blank_tid, 0, Weight(0), ctc_s));
  // move the emitting forward arcs of s onto ctc_s
  for (const Arc &arc : forward_arcs)
    fst_ctc->AddArc(ctc_s, arc);
  // if s was final, make ctc_s final as well
  if (fst_ctc->Final(s) != Weight::Zero())
    fst_ctc->SetFinal(ctc_s, Weight::One());
  // remove all original arcs from s in the copy...
  fst_ctc->DeleteArcs(s);
  // ...and re-add only the stored emitting self-loop arcs
  for (const Arc &arc : self_loop_arcs)
    fst_ctc->AddArc(s, arc);
  // finally, an <eps>:<eps> arc from s to ctc_s
  fst_ctc->AddArc(s, Arc(0, 0, Weight(0), ctc_s));
}
These changes to the graph are admittedly quite raw, applied directly to the already composed and optimized HCLG graph, which makes it hard to understand (or visualize) what is going on. One thing I am wondering is whether this sort of transformation can be done at the transition-id level (as I have done here), or whether I need to operate on transition states, which I have tried to avoid for the moment.
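As a quick (if crude) check on the result, I can trim a copy of the transformed graph and compare state counts: fst::Connect() removes states that are not both accessible and coaccessible, so any states it drops would point to dead-end paths introduced by the transformation (which could explain the hypotheses that never reach a final state). Just a sketch:

#include <iostream>
#include "fst/fstlib.h"

// Sketch: count how many states of the transformed graph cannot reach a
// final state (or cannot be reached at all) by trimming a copy.
void CheckConnected(const fst::VectorFst<fst::StdArc> &fst_ctc) {
  fst::VectorFst<fst::StdArc> trimmed(fst_ctc);
  fst::Connect(&trimmed);
  std::cerr << "states before trimming: " << fst_ctc.NumStates()
            << ", after: " << trimmed.NumStates() << std::endl;
}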
Maybe some of you Kaldi experts have some comments that shed some light on the code above?
Thanks,
Marc