Decoding graph for CTC network...


marc....@protonmail.com

Feb 10, 2021, 12:42:09 PM
to kaldi-developers
Hello,

I am trying to build a decoding graph for an externally trained DNN that uses CTC. I set up the whole recipe to be as compatible as possible with the Kaldi scripts (including the <sil> phone, which is not actually trained in CTC), so the dict and lang directories are the usual Kaldi ones, except that I use position-independent characters as units. lang/phones/sets.{int,txt} has 32 rows (31 characters + 1 CTC blank), matching the number of transition-model pdfs, with one state and one pdf per HMM (the forward and self-loop pdf-ids are the same).
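For reference, that single-state, single-pdf setup corresponds to a Kaldi topo file roughly like the sketch below (a sketch only; the phone-id range 1..32 and the 0.5 transition probabilities are assumptions, not taken from my actual setup):

```
<Topology>
<TopologyEntry>
<ForPhones> 1 2 3 ... 32 </ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.5 <Transition> 1 0.5 </State>
<State> 1 </State>
</TopologyEntry>
</Topology>
```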

Now I want to take the existing HCLG decoding graph and modify it so that it can swallow the blank symbol after each "emitting" transition in HCLG, just as recently published in this paper:

  - "Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces", Frank Zhang et al, INTERSPEECH 2020 ( https://arxiv.org/pdf/2005.09150.pdf )

Please note that I am using characters as speech units, but I still use the context FST C in the recipe, with N=1 and P=0. I assume that makes the context expansion a pass-through for the phones...

I have written the Kaldi code below to perform this graph conversion, but it decodes only parts of the utterances correctly; maybe some 50% of the words are right. The decoded utterances are usually shorter than they should be, and some of them never reach a final state (the decoder outputs partial hypotheses once in a while). So something is broken in the decoding graph I am building.

Here is the code that transforms HCLG.fst into HCLG-ctc.fst, which should work on CTC pdfs. First, I get a vector of tids assigned to the <blank> pdf in the TransitionModel:

   std::vector<int32> blank_tids;
   GetPdfToTransitionIds(trans_model, blank_pdf_id, blank_tids);


GetPdfToTransitionIds queries trans_model.TransitionIdToPdf() for every tid and collects those that match the desired pdf, here the CTC blank.

Then I start by copying the original decoding graph, to make sure that looping over its states and arcs does not interfere with the changes I make. I assume the copy is an exact copy with the same state and arc enumeration:

  fst::VectorFst<fst::StdArc> *fst_ctc = fst->Copy();


and then loop over the states and the arcs per state in the graph, duplicating states and adding the corresponding arcs as in Figure 1 of the paper above:

  for (StateId s = 0; s < fst->NumStates(); s++) {

    // vectors storing the self-loop / forward arcs leaving s
    std::vector<Arc> self_loop_arcs;
    std::vector<Arc> forward_arcs;

    for (ArcIterator<VectorFst<Arc> > aiter(*fst, s);
         !aiter.Done();
         aiter.Next()) {

      Arc arc = aiter.Value();

      // convert <sil> tid to blank tid
      ...

      // if it is an emitting arc...
      if (arc.ilabel > 0) {
        // store forward and self-loop arcs separately
        if (arc.nextstate == s)
          self_loop_arcs.push_back(arc);
        else
          forward_arcs.push_back(arc);
      }
    }

    // prepare a new state that will swallow blanks for state s
    StateId ctc_s = fst_ctc->AddState();
    for (int32 blank_tid : blank_tids)
      fst_ctc->AddArc(ctc_s, Arc(blank_tid, 0, Weight::One(), ctc_s));

    // add the forward arcs of s to ctc_s
    for (const Arc &arc : forward_arcs)
      fst_ctc->AddArc(ctc_s, arc);

    // replicate finality: if s was final, make ctc_s final too,
    // preserving the final weight
    if (fst_ctc->Final(s) != Weight::Zero())
      fst_ctc->SetFinal(ctc_s, fst_ctc->Final(s));

    // fix the arcs of state s
    fst_ctc->DeleteArcs(s);
    // re-add the stored self-loop arcs of s
    for (const Arc &arc : self_loop_arcs)
      fst_ctc->AddArc(s, arc);
    // add an <eps>:<eps> transition from s to ctc_s
    fst_ctc->AddArc(s, Arc(0, 0, Weight::One(), ctc_s));
  }

These are quite raw changes, applied directly to the already composed and optimized HCLG graph, which makes it tricky to understand (let alone visualize) what is going on. One thing I am wondering is whether this sort of transformation can be done at the tid level (as I have done), or whether I need to operate on transition states, which I have tried to avoid for the moment.

Maybe some of you Kaldi experts have comments that could shed some light on the code above?

Thanks,

Marc


Daniel Povey

Feb 11, 2021, 3:19:19 AM
to kaldi-developers
You could look at how we prepared the CTC graphs in snowfall; look for decode_ctc.py:
 https://github.com/k2-fsa/snowfall
It requires k2 as a dependency, which may not be super easy to install, but see its GitHub page, https://github.com/k2-fsa/k2.
I think it would be easier to start from the LG and either compose it with something or modify it, since you don't really need context dependency.
I think one problem with what you have done is that you are not allowing repeats of the phone symbol.
Plus, I am not sure it is possible to get the right results here when using a transition model; it is not intended for use with non-Kaldi models.

Dan



Xiaohui Zhang

Feb 11, 2021, 9:40:54 PM
to kaldi-developers
You can also try the CTC-CRF toolkit. Its CTC topology is the same as in k2, and it is as Kaldi-friendly as possible, which is probably what you want.

Xiaohui

Peter Mihajlik

Feb 12, 2021, 3:50:42 PM
to kaldi-de...@googlegroups.com
Marc,

Your approach seems overcomplicated to me. Using the Kaldi decoder with an externally (CTC-)trained end-to-end DNN as a character-based acoustic model is quite simple:
- use the "space" character model from CTC training instead of <sil>
- build the pronunciation dictionary like this: TOP     _  T  _  O  _  P  _
  where _ means the blank character
- regarding the Ha transducer, build the same structure for the blank character as for the other characters
- in the last step of HCLG generation, when you add the self-loops, add an epsilon transition to the _ arcs so that they become skippable.

And that's all. You may also add a "space" character to the beginning or the end of the graphemic transcription.

We did this - it works fine.

Peter

marc....@protonmail.com

Feb 15, 2021, 11:20:05 AM
to kaldi-developers
Hi,

thank you everyone for your replies. I eventually made it work. What I did was copy the AddSelfLoops function to prepare the CTC decoding graph after the self-loops are added in the Kaldi HCLG preparation recipe (instead of the code above), but in the process I missed the transitions for the disambiguation symbols (for the non-transition states, indeed). That left the graph with broken paths that prevented decoding from reaching final states, which is consistent with the errors I got. From Peter's comment I can see that including the CTC changes while adding the self-loops is probably the easier way to go, but the work is done now anyway :). Using CTC decoding on the Kaldi HCLG graph, instead of k2, also lets me use context-dependent units, which I think might help reduce error rates. My recipe is now working as expected. I am looking forward to trying the Lhotse/k2 recipes though.

Thanks so much,

Marc