Thanks, Dan.
I first convert the alignments to pdf-ids with ali-to-pdf, then use show-transitions to look up the phone and HMM state for each pdf-id. Start and end times are computed from the ali-to-pdf output.
The sentence-middle silence appears as follows in the ali-to-pdf output file:
... 367 0 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 0 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 2707 ...
while the show-transitions output for silence is:
Transition-state 1: phone = sil hmm-state = 0 pdf = 0
Transition-state 2: phone = sil hmm-state = 1 pdf = 1
Transition-state 3: phone = sil hmm-state = 2 pdf = 2
Transition-state 4: phone = sil hmm-state = 3 pdf = 3
Transition-state 5: phone = sil hmm-state = 4 pdf = 4
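For reference, the lookup step can be sketched roughly like this in Python. It is only a sketch: the function name parse_show_transitions is made up for illustration, the regular expression assumes the line layout shown above, and it keeps only the first (phone, hmm-state) pair seen for each pdf-id, which is fine for the silence lines here but may be wrong for pdfs shared between phones.

```python
import re

# The five silence lines quoted above, copied verbatim.
SHOW_TRANSITIONS_SAMPLE = """\
Transition-state 1: phone = sil hmm-state = 0 pdf = 0
Transition-state 2: phone = sil hmm-state = 1 pdf = 1
Transition-state 3: phone = sil hmm-state = 2 pdf = 2
Transition-state 4: phone = sil hmm-state = 3 pdf = 3
Transition-state 5: phone = sil hmm-state = 4 pdf = 4
"""

# Assumed line layout; any line that does not match is skipped.
LINE_RE = re.compile(
    r"Transition-state\s+(\d+):\s+phone\s+=\s+(\S+)\s+"
    r"hmm-state\s+=\s+(\d+)\s+pdf\s+=\s+(\d+)"
)


def parse_show_transitions(text):
    """Map pdf-id -> (phone, hmm_state), keeping the first pair seen per pdf-id."""
    pdf_to_label = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        _, phone, hmm_state, pdf = match.groups()
        pdf_to_label.setdefault(int(pdf), (phone, int(hmm_state)))
    return pdf_to_label


if __name__ == "__main__":
    for pdf, (phone, state) in sorted(parse_show_transitions(SHOW_TRANSITIONS_SAMPLE).items()):
        print(pdf, phone, state)
```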
so this produces results like:
xxx xxx sil[0]
xxx xxx sil[1]
xxx xxx sil[4]
xxx xxx sil[0]
xxx xxx sil[1]
xxx xxx sil[2]
...
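To make the procedure concrete, here is roughly how the segments are derived from the pdf-id sequence. This is a sketch under assumptions: the 10 ms frame shift is assumed, not taken from my setup, the function name is made up, and the short pdf-id list is a shortened stand-in for the real sequence. Consecutive equal pdf-ids are merged into one run, so two adjacent phones that happen to share a pdf would not be separated.

```python
from itertools import groupby

FRAME_SHIFT = 0.01  # seconds per frame; an assumed value, adjust to the feature config

# pdf-id -> (phone, hmm-state) for silence, as in the show-transitions lines.
PDF_TO_LABEL = {pdf: ("sil", pdf) for pdf in range(5)}


def pdf_ids_to_segments(pdf_ids, pdf_to_label):
    """Collapse runs of equal pdf-ids into (start_frame, end_frame, label) tuples."""
    segments = []
    frame = 0
    for pdf, run in groupby(pdf_ids):
        length = len(list(run))
        phone, state = pdf_to_label[pdf]
        segments.append((frame, frame + length, f"{phone}[{state}]"))
        frame += length
    return segments


if __name__ == "__main__":
    # Shortened stand-in for the silence stretch in the ali-to-pdf output.
    example = [0, 1, 4, 4, 4, 0, 1, 1, 2]
    for start, end, label in pdf_ids_to_segments(example, PDF_TO_LABEL):
        print(f"{start * FRAME_SHIFT:.2f} {end * FRAME_SHIFT:.2f} {label}")
```

With this stand-in sequence the labels come out in the same order as the listing: sil[0], sil[1], sil[4], sil[0], sil[1], sil[2].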
I checked the model and found that the (default) silence model does not have a left-to-right topology. The "real" phones, on the other hand, do not have this issue.
If this is the reason, I guess in my case I need to modify the topology of the silence model, or create one by hand, but I have no idea whether this will lead to accuracy loss.
<TopologyEntry>
<ForPhones>
1
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.25 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 3 <PdfClass> 3 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 4 <PdfClass> 4 <Transition> 4 0.75 <Transition> 5 0.25 </State>
<State> 5 </State>
</TopologyEntry>
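A quick way to confirm the non-left-to-right structure is to list the arcs that go from a higher-numbered state back to a lower-numbered one. The sketch below parses the entry above with regular expressions; the helper names are made up, and the three-state entry used for comparison is what I assume a typical non-silence phone looks like, not something copied from my topo file.

```python
import re

# The silence entry quoted above (ForPhones lines omitted).
SILENCE_ENTRY = """\
<State> 0 <PdfClass> 0 <Transition> 0 0.25 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 3 <PdfClass> 3 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 4 <PdfClass> 4 <Transition> 4 0.75 <Transition> 5 0.25 </State>
<State> 5 </State>
"""

# Assumed typical 3-state left-to-right entry, for comparison only.
LEFT_TO_RIGHT_ENTRY = """\
<State> 0 <PdfClass> 0 <Transition> 0 0.75 <Transition> 1 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.75 <Transition> 2 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State>
<State> 3 </State>
"""


def parse_transitions(entry):
    """Return {source_state: [(destination_state, probability), ...]}."""
    transitions = {}
    for match in re.finditer(r"<State>\s+(\d+)(.*?)</State>", entry, flags=re.S):
        arcs = re.findall(r"<Transition>\s+(\d+)\s+([\d.]+)", match.group(2))
        transitions[int(match.group(1))] = [(int(d), float(p)) for d, p in arcs]
    return transitions


def backward_arcs(transitions):
    """List (source, destination) arcs that go to a lower-numbered state."""
    return [(s, d) for s, arcs in transitions.items() for d, _ in arcs if d < s]


if __name__ == "__main__":
    print("silence:", backward_arcs(parse_transitions(SILENCE_ENTRY)))
    print("left-to-right:", backward_arcs(parse_transitions(LEFT_TO_RIGHT_ENTRY)))
```

For the silence entry this reports the arcs 2 -> 1, 3 -> 1 and 3 -> 2, which is why states 1 to 3 can repeat in any order inside a single silence segment.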
On Sunday, July 28, 2019 at 11:41:10 AM UTC+8, Dan Povey wrote: