Time alignments off by a factor of 3 when using nnet3-align-compiled

John Steinberg

Mar 8, 2018, 10:54:08 AM
to kaldi-help
Hi all,

I recently trained a chain model and am trying to use it to generate phone-level timing information by piping nnet3-align-compiled into ali-to-phones. However, the timing information is off by a factor of 3: the CTM-like output from ali-to-phones places the utterance-final silence at ~1 second for a 3-second cut. If I use --frame-subsampling-factor=3 (which is what I expected to use, since this is a chain model) I see this behavior, but if I use --frame-subsampling-factor=1 I get what appear to be reasonable alignments. It seems as if the frame subsampling factor is somehow being applied twice. A totally made-up example for a 3-second cut of someone saying "cat" is below:


For --frame-subsampling-factor=3, ali-to-phones gives:

utt1 1 0.0 0.1 sil
utt1 1 0.1 0.2 c
utt1 1 0.3 0.3 a
utt1 1 0.6 0.2 t
utt1 1 0.8 0.2 sil

For --frame-subsampling-factor=1, ali-to-phones gives:

utt1 1 0.0 0.3 sil
utt1 1 0.3 0.6 c
utt1 1 0.9 0.9 a
utt1 1 1.8 0.6 t
utt1 1 2.4 0.6 sil

I wrote my script to generally follow steps/nnet3/align.sh; the only real difference is that I generate features on demand for whatever I'm trying to align. My basic setup is to generate MFCCs and iVectors using online2-wav-dump-features and ivector-extract-online2, respectively, and then pass them to a command like this:

compile-train-graphs tree final.mdl L.fst "ark:utils/sym2int.pl -f 2- words.txt $data/text |" ark:- \
    | nnet3-align-compiled \
      --use-gpu=$use_gpu \
      --acoustic-scale=$acoustic_scale \
      --beam=$beam \
      --frame-subsampling-factor=$frame_subsampling_factor \
      --online-ivector-period=$online_ivector_period \
      --online-ivectors=scp:$data/ivector_online.scp \
      --transition-scale=$transition_scale \
      --self-loop-scale=$self_loop_scale \
      final.mdl ark:- scp:$data/feats.scp ark:- \
    | ali-to-phones --ctm-output final.mdl ark:- - \
    | ./utils/int2sym.pl -f 5 phones.txt \
              > $ctm || exit 1

So, long story short, do you have any idea how or why --frame-subsampling-factor might be getting applied twice (or could it be a different matter entirely)? It's possible I screwed up somewhere, but I've double-checked my setup and I can't find any other point where I introduce an extra factor of 3.

Any thoughts would be helpful. Thanks!

John Steinberg

Mar 8, 2018, 11:00:16 AM
to kaldi-help
Oh, and I should probably mention that I'm using the 5.0 branch of Kaldi to do this since I'm using additional toolkits that are only compatible with that branch.

entn-at

Mar 8, 2018, 12:12:45 PM
to kaldi-help
I believe you have to pass "--frame-shift=0.03" as a parameter to ali-to-phones to compensate for the chain model's frame subsampling factor.
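
To make the arithmetic concrete, here is a minimal sketch, assuming the features use a 10 ms frame shift (so 0.01 x 3 = 0.03 s per alignment frame) and that the CTM has the five columns shown in the made-up example above. The function names effective_frame_shift and rescale_ctm are hypothetical helpers written for illustration; they are not part of Kaldi.

```shell
# Hypothetical helper: seconds covered by one alignment frame,
# i.e. the feature frame shift times the frame subsampling factor.
# Assumes a 10 ms (0.01 s) feature frame shift unless told otherwise.
effective_frame_shift() {
  awk -v s="${1:-0.01}" -v f="${2:-3}" 'BEGIN { printf "%.2f\n", s * f }'
}

# Hypothetical helper: fix up a CTM that was already written with the
# wrong frame shift by scaling the start time (field 3) and the
# duration (field 4). Assumes 5 fields: utt, channel, start, dur, phone.
rescale_ctm() {
  awk -v f="${1:-3}" '{ printf "%s %s %.2f %.2f %s\n", $1, $2, $3 * f, $4 * f, $5 }'
}

# Possible usage (needs Kaldi on the PATH, so it is left as a comment):
#   ... | ali-to-phones --ctm-output \
#           --frame-shift="$(effective_frame_shift 0.01 3)" final.mdl ark:- - | ...
# Or, for a CTM that already exists (file names are placeholders):
#   rescale_ctm 3 < wrong.ctm > fixed.ctm
```

Passing the right frame shift to ali-to-phones is the cleaner option; rescaling afterwards is only a fallback for output that has already been generated.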

John Steinberg

Mar 8, 2018, 1:29:43 PM
to kaldi-help
Ah hah, you're right! I can't believe I missed that. Thanks for the pointer.

-John