Decoding - Handwriting Recognition


Dayvid Castro

Sep 12, 2017, 6:49:00 PM
to kaldi-help
Hi!

I'm trying to use Kaldi to decode a handwriting recognition model. My situation is quite similar to what was described in the following discussion; the difference is the database I'm using, which has 80 classes (78 chars + space + CTC blank label). I'm studying the tool but I still have some doubts. I intend to use decode-faster-mapped, which takes a matrix of log-likelihoods, a decoding graph, and the HMM transition model. For now, I would like some help with the input format of the log-likelihoods matrix. For example, considering 80 classes and a test set with only one image, my model outputs a matrix of log-likelihoods like this:

Shape: [18,80]
Truncated representation:
[[-13.487 -8.355 -8.416 ..., -13.693 -0.010 -4.797]
 [-9.132 -7.163 -6.488 ..., -13.671 -0.014 -5.855]
 [-9.017 -6.391 -4.293 ..., -9.185 -7.233 -5.949]
 ..., 
 [-17.873 -16.232 -13.409 ..., -16.146 -10.533 -0.029]
 [-19.022 -16.719 -16.815 ..., -17.294 -16.461 -10.670]
 [-22.810 -17.978 -19.435 ..., -20.250 -12.571 -0.004]]
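(For context, a matrix of this shape can be obtained as the frame-wise log-softmax of the network outputs; a minimal numpy sketch, with random logits standing in for the real RNN outputs:)

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
rnn_out = rng.normal(size=(18, 80))   # 18 time steps, 80 classes
loglikes = log_softmax(rnn_out)
print(loglikes.shape)                 # (18, 80)
```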

Given this, which of the following options is the appropriate format for the matrix of log-likelihoods used for decode-faster-mapped?

I. One bracketed matrix per key, with newlines separating the rows

image_01 [-13.487 -8.355 -8.416 ..., -13.693 -0.010 -4.797
 -9.132 -7.163 -6.488 ..., -13.671 -0.014 -5.855
 -9.017 -6.391 -4.293 ..., -9.185 -7.233 -5.949
 ..., 
 -17.873 -16.232 -13.409 ..., -16.146 -10.533 -0.029
 -19.022 -16.719 -16.815 ..., -17.294 -16.461 -10.670
 -22.810 -17.978 -19.435 ..., -20.250 -12.571 -0.004]

II. All log-likelihoods on one line, with each time step in its own brackets

image_01 
[ -13.487 -8.355 -8.416 ... -13.693 -0.010 -4.797] [-9.132 -7.163 -6.488 ... -13.671 -0.014 -5.855] [-9.017 -6.391 -4.293 ... -9.185 -7.233 -5.949] ...

III. Other

Thanks in advance,
Dayvid Castro

Daniel Povey

Sep 12, 2017, 6:52:06 PM
to kaldi-help
It's actually 'decode-faster' that you want, since I imagine you won't
have a transition model.
It's the first of those two formats.
BTW, we have been working on handwriting recognition recently; you can
look at @hhadian's repos on his GitHub page and see if he has anything,
as he is leading that. We have been using 'chain' models with mostly
the same configuration as we'd use for acoustic modeling, except with
different topologies; and obviously the features are different.

Dan
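A minimal sketch of writing a matrix in that first format (a Kaldi-style text archive: the key, then bracketed rows, one per line). The key `image_01` and the toy 3x4 matrix are placeholders for the real [18,80] log-likelihoods:

```python
import numpy as np

def write_text_ark(path, mats):
    """Write {utterance_id: 2-D array} as a Kaldi-style text archive."""
    with open(path, "w") as f:
        for key, mat in mats.items():
            f.write(key + "  [\n")
            for i, row in enumerate(mat):
                f.write("  " + " ".join(f"{x:.3f}" for x in row))
                # Close the bracket on the last row, as in the example above.
                f.write(" ]\n" if i == len(mat) - 1 else "\n")

write_text_ark("loglikes.ark", {"image_01": np.arange(12.0).reshape(3, 4)})
```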

Dayvid Castro

Sep 15, 2017, 1:01:46 AM
to kaldi-help
Thanks for helping me out, I really appreciate it. I think I can make good use of the repository you mentioned to prepare the language model and the lexicon, since I am working with IAM as well, like @hhadian. I thought the transition model would be necessary because I want to build an HMM-based decoder (with an RNN acting as the emission probability model). The decoding strategy concerning the graph is as follows:
  • HMM transducer: Each character or form was represented as a Hidden Markov Model (HMM) with only one state, a self-loop transition, and a transition to the next model.
  • Lexicon FST: Transform a sequence of characters and blank symbols into words.
  • Grammar FST: Standard n-gram language model.
Since you already helped me with my doubt about the log-likelihoods matrix, now I'm trying to understand how to build the H.fst. Please correct me if I'm wrong:
  1. First, I need to define the HMM topology in text format.

    • Here I have one question: if I want one HMM per character, should I specify a TopologyEntry for each character, or can I just list all the characters in <ForPhones>? I mean, like this:
      <Topology>
      <TopologyEntry>
      <ForPhones> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 </ForPhones>
      <State> 0 <PdfClass> 0
      <Transition> 0 0.5 
      <Transition> 1 0.5
      </State>
      <State> 1 </State>
      </TopologyEntry>
      </Topology>


  2. Then, I use gmm-init-mono to generate the transition model:

    • gmm-init-mono hmm-topology 39 mono.mdl mono.tree
      Here I also have a basic question: what is this dim parameter? Should I set it to 80, since I need 80 HMMs/states representing the 80 classes?

  3. Finally, use make-h-transducer to generate the H.fst.
  • make-h-transducer ilabel_info 1.tree 1.mdl > H.fst. Here there are some aspects I don't fully understand: make-h-transducer seems to deal with context-dependency, which is not my case; I only want the HLG.fst decoding graph. Can I build the H.fst without context-dependent phones? Also, make-h-transducer needs the ilabel_info containing the integer mappings, but this file is generated only when LG.fst is composed with C.fst, so I'm wondering whether I could build this file manually and proceed.
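(As an aside, the topology in step 1 can be generated programmatically rather than listing all 80 phone ids by hand; a small sketch that reproduces the single-state, self-loop entry shown above:)

```python
def make_topo(num_phones):
    """Generate a Kaldi <Topology> with one single-state HMM entry
    shared by all phones 1..num_phones, matching the example above."""
    phones = " ".join(str(p) for p in range(1, num_phones + 1))
    return "\n".join([
        "<Topology>",
        "<TopologyEntry>",
        f"<ForPhones> {phones} </ForPhones>",
        "<State> 0 <PdfClass> 0",
        "<Transition> 0 0.5",
        "<Transition> 1 0.5",
        "</State>",
        "<State> 1 </State>",
        "</TopologyEntry>",
        "</Topology>",
    ])

print(make_topo(80))
```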
I would be grateful if you could help me confirm whether the whole process is correct and clarify my questions about these steps.

Thanks again,
Dayvid Castro.

Daniel Povey

Sep 15, 2017, 1:06:13 AM
to kaldi-help
Doing it that way is going to be complicated, if not impossible.
You should create the FST manually and decode without the transition
model (decode-faster).
Read openfst.org to understand the basic notation of FSTs.
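To illustrate "create the FST manually": OpenFst's AT&T text format is one line per arc (source state, destination state, input label, output label, optional weight), plus a line per final state, compiled with fstcompile. A toy sketch for a single character with CTC-style collapsing, where the integer labels are assumptions (1 = blank, 2 = the character, 0 = epsilon):

```python
# Toy FST in OpenFst's AT&T text format.
# Assumed labels: 0 = epsilon, 1 = <blank>, 2 = one character.
arcs = [
    "0 0 1 0",  # skip leading blank frames (output epsilon)
    "0 1 2 2",  # first frame of the character: emit it
    "1 1 2 0",  # repeated frames of the same character: emit nothing
    "1 1 1 0",  # skip trailing blank frames
    "1",        # state 1 is final
]
with open("H.fst.txt", "w") as f:
    f.write("\n".join(arcs) + "\n")
# Compile with, e.g.:  fstcompile H.fst.txt H.fst
```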