I am trying to get word timings for the decoded output of a chain model.
The output of nbest-to-ctm seems to be wrong: the word timings are not correct.
What am I doing wrong?
Here is the script I am running to get the word timings:
online2-wav-nnet3-latgen-faster --do-endpointing=false \
--online=false \
--config=conf/decode.config \
--max-active=7000 --beam=15.0 --lattice-beam=6.0 \
--mfcc-config=conf/mfcc_hires.conf \
--feature-type=mfcc --frame-subsampling-factor=3 \
--acoustic-scale=1.0 --word-symbol-table=data/decode_460/tdnn_sp/words.txt \
--ivector-extraction-config=data/decode_460/conf/ivector_extractor.conf \
data/decode_460/tdnn_sp/final.mdl data/decode_460/tdnn_sp/HCLG.fst \
"ark:echo utterance-id1 utterance-id1|" "scp:echo utterance-id1 /home/gnani/Downloads/SoundRecord-2018-05-08-12-07-02_8k.wav|" \
ark:| lattice-1best ark:- ark: | \
lattice-align-words data/lang/phones/word_boundary.int data/decode_460/tdnn_sp/final.mdl ark:- ark:- | \
nbest-to-ctm --frame-shift=0.01 --print-silence=true ark:- - | \
utils/int2sym.pl -f 5 data/decode_460/tdnn_sp/words.txt
The output I am getting is this:
LOG (online2-wav-nnet3-latgen-faster[5.4.122~3-08012]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (online2-wav-nnet3-latgen-faster[5.4.122~3-08012]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (online2-wav-nnet3-latgen-faster[5.4.122~3-08012]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (online2-wav-nnet3-latgen-faster[5.4.122~3-08012]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (online2-wav-nnet3-latgen-faster[5.4.122~3-08012]:Collapse():nnet-utils.cc:1314) Added 1 components, removed 2
LOG (online2-wav-nnet3-latgen-faster[5.4.122~3-08012]:CompileLooped():nnet-compile-looped.cc:334) Spent 0.187792 seconds in looped compilation.
utterance-id1 WHICH COINCIDENTALLY PRETTY MUCH MATCHES THE WORDS DEFINITION
LOG (online2-wav-nnet3-latgen-faster[5.4.122~3-08012]:main():online2-wav-nnet3-latgen-faster.cc:286) Decoded utterance utterance-id1
LOG (online2-wav-nnet3-latgen-faster[5.4.122~3-08012]:Print():online-timing.cc:55) Timing stats: real-time factor for offline decoding was 0.304333 = 1.64477 seconds / 5.4045 seconds.
LOG (online2-wav-nnet3-latgen-faster[5.4.122~3-08012]:main():online2-wav-nnet3-latgen-faster.cc:292) Decoded 1 utterances, 0 with errors.
LOG (online2-wav-nnet3-latgen-faster[5.4.122~3-08012]:main():online2-wav-nnet3-latgen-faster.cc:294) Overall likelihood per frame was 2.32917 per frame over 180 frames.
utterance-id1 1 0.000 0.380 <eps>
utterance-id1 1 0.380 0.110 WHICH
utterance-id1 1 0.490 0.270 COINCIDENTALLY
utterance-id1 1 0.760 0.110 PRETTY
utterance-id1 1 0.870 0.090 MUCH
utterance-id1 1 0.960 0.130 MATCHES
utterance-id1 1 1.090 0.040 THE
utterance-id1 1 1.130 0.160 WORDS
utterance-id1 1 1.290 0.030 <eps>
utterance-id1 1 1.320 0.250 DEFINITION
utterance-id1 1 1.570 0.230 <eps>
LOG (lattice-1best[5.4.122~3-08012]:main():lattice-1best.cc:92) Done converting 1 to best path, 0 had errors.
LOG (lattice-align-words[5.4.122~3-08012]:main():lattice-align-words.cc:125) Successfully aligned 1 lattices; 0 had errors.
LOG (nbest-to-ctm[5.4.122~3-08012]:main():nbest-to-ctm.cc:119) Converted 1 linear lattices to ctm format; 0 had errors.
The sox output of the file:
Input File : '/home/gnani/Downloads/SoundRecord-2018-05-08-12-07-02_8k.wav'
Channels : 1
Sample Rate : 8000
Precision : 16-bit
Duration : 00:00:05.40 = 43236 samples ~ 405.337 CDDA sectors
File Size : 86.5k
Bit Rate : 128k
Sample Encoding: 16-bit Signed Integer PCM
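
A quick sanity check on the numbers above, as a rough sketch rather than a confirmed diagnosis. The helper names `ctm_end` and `frames_to_seconds` are made up for illustration; the values are copied from the log, the CTM, and the sox output. The CTM ends at 1.80 s, which is 180 frames times the 0.01 passed as --frame-shift, while the file is 5.40 s long. Multiplying 0.01 by the frame-subsampling-factor of 3 gives 180 * 0.03 = 5.40 s, so the timings may simply be off by that factor.

```shell
#!/usr/bin/env bash
# Sanity check: compare the end of the last CTM entry with the audio duration.
# ctm_end and frames_to_seconds are hypothetical helpers, not Kaldi tools.

ctm_end() {
  # Read CTM lines (utt channel start duration word) on stdin and print the
  # largest start+duration, i.e. the time at which the last entry ends.
  awk '{ e = $3 + $4; if (e > max) max = e } END { printf "%.2f\n", max }'
}

frames_to_seconds() {
  # $1 = number of decoder output frames, $2 = frame shift in seconds
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.2f\n", n * s }'
}

# Last two CTM entries from the output above.
printf '%s\n' \
  'utterance-id1 1 1.320 0.250 DEFINITION' \
  'utterance-id1 1 1.570 0.230 <eps>' | ctm_end

frames_to_seconds 180 0.01   # what --frame-shift=0.01 implies for 180 frames
frames_to_seconds 180 0.03   # 0.01 scaled by frame-subsampling-factor=3
```

If that reading is right, the thing to try would be --frame-shift=0.03 in the nbest-to-ctm call, leaving the rest of the pipeline unchanged.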