Hi all,
I recently trained a chain model and am trying to use it to generate phone-level timing information using nnet3-align-compiled piped to ali-to-phones. However, the timing information comes out off by a factor of 3: the ctm-like output from ali-to-phones puts the utterance-final silence at ~1 second for a 3-second cut. If I use --frame-subsampling-factor=3 (which is what I expected to use, since this is a chain model) I see this behavior, but if I use --frame-subsampling-factor=1 I get what appear to be reasonable alignments. It looks as if the frame subsampling factor is being applied twice somehow.
A totally made-up example for a 3-second cut of someone saying "cat" is below:
For --frame-subsampling-factor=3, ali-to-phones gives:
utt1 1 0.0 0.1 sil
utt1 1 0.1 0.2 c
utt1 1 0.3 0.3 a
utt1 1 0.6 0.2 t
utt1 1 0.8 0.2 sil
For --frame-subsampling-factor=1, ali-to-phones gives:
utt1 1 0.0 0.3 sil
utt1 1 0.3 0.6 c
utt1 1 0.9 0.9 a
utt1 1 1.8 0.6 t
utt1 1 2.4 0.6 sil
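To make the "factor of 3" claim concrete, here is a small sketch that multiplies the start and duration columns of the first listing by 3 and compares the result with the second listing. The helper name scale_ctm is made up for illustration, the column layout (utterance, channel, start, duration, phone) is assumed from the listings above, and the numbers are the same invented ones, so this only demonstrates the relationship rather than explaining where it comes from.

```shell
# Hypothetical helper: scale the start (col 3) and duration (col 4) of
# CTM-like rows by a constant factor, printing one decimal place.
scale_ctm() {
  LC_ALL=C awk -v f="$1" '{ printf "%s %s %.1f %.1f %s\n", $1, $2, $3 * f, $4 * f, $5 }'
}

# Made-up output with --frame-subsampling-factor=3 (from the example above).
fsf3='utt1 1 0.0 0.1 sil
utt1 1 0.1 0.2 c
utt1 1 0.3 0.3 a
utt1 1 0.6 0.2 t
utt1 1 0.8 0.2 sil'

# Made-up output with --frame-subsampling-factor=1 (from the example above).
fsf1='utt1 1 0.0 0.3 sil
utt1 1 0.3 0.6 c
utt1 1 0.9 0.9 a
utt1 1 1.8 0.6 t
utt1 1 2.4 0.6 sil'

scaled=$(printf '%s\n' "$fsf3" | scale_ctm 3)
if [ "$scaled" = "$fsf1" ]; then
  echo "rows match after scaling by 3"
else
  echo "rows differ after scaling by 3"
fi
```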
I wrote my script to broadly follow steps/nnet3/align.sh; the only real difference is that I generate features on demand for whatever I'm trying to align. My basic setup is to generate MFCCs and ivectors using online2-wav-dump-features and ivector-extract-online2 respectively, and then pass them to a command like this:
compile-train-graphs tree final.mdl L.fst "ark:utils/sym2int.pl -f 2- words.txt $data/text |" ark:- \
| nnet3-align-compiled \
--use-gpu=$use_gpu \
--acoustic-scale=$acoustic_scale \
--beam=$beam \
--frame-subsampling-factor=$frame_subsampling_factor \
--online-ivector-period=$online_ivector_period \
--online-ivectors=scp:$data/ivector_online.scp \
--transition-scale=$transition_scale \
--self-loop-scale=$self_loop_scale \
final.mdl ark:- scp:$data/feats.scp ark:- \
| ali-to-phones --ctm-output final.mdl ark:- - \
| ./utils/int2sym.pl -f 5 phones.txt \
> $ctm || exit 1
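For anyone wanting to check this on real output rather than by eye, something like the sketch below could flag utterances whose CTM span disagrees with the known audio length. It assumes a two-column durations file (utterance id, seconds), such as the utt2dur file that Kaldi data directories often contain. The helper name check_ctm_span and the 0.1 s default tolerance are my own choices for illustration, not part of Kaldi.

```shell
# Hypothetical helper (not part of Kaldi): report utterances whose CTM span
# disagrees with a known duration.
#   $1: CTM-like file with columns: utt channel start dur phone
#   $2: durations file with columns: utt seconds (assumed to exist)
#   $3: tolerance in seconds (optional, default 0.1)
check_ctm_span() {
  LC_ALL=C awk -v tol="${3:-0.1}" '
    NR == FNR { dur[$1] = $2; next }        # first file: expected durations
    {                                       # second file: CTM rows
      end = $3 + $4
      if (end > span[$1]) { span[$1] = end }  # keep the latest end time per utt
    }
    END {
      for (u in span) {
        if (!(u in dur)) { continue }       # no reference duration, skip
        diff = span[u] - dur[u]
        if (diff < 0) { diff = -diff }
        ratio = (span[u] > 0) ? dur[u] / span[u] : 0
        if (diff > tol) {
          printf "%s ctm=%.2f expected=%.2f ratio=%.2f\n", u, span[u], dur[u], ratio
        }
      }
    }' "$2" "$1"
}

# Demo on the made-up "cat" example: the CTM ends at 1.0 s, the cut is 3 s long.
tmpdir=$(mktemp -d)
printf '%s\n' 'utt1 1 0.0 0.1 sil' 'utt1 1 0.1 0.2 c' 'utt1 1 0.3 0.3 a' \
  'utt1 1 0.6 0.2 t' 'utt1 1 0.8 0.2 sil' > "$tmpdir/ctm"
printf 'utt1 3.0\n' > "$tmpdir/durations"
check_ctm_span "$tmpdir/ctm" "$tmpdir/durations"
rm -r "$tmpdir"
```

A ratio close to 3 across all utterances would point to a constant scaling issue rather than bad alignments for individual files.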
So long story short, do you have any idea how or why the --frame-subsampling-factor might be getting applied twice (or could it be a different matter entirely)? It's possible I screwed up somewhere, but I've double-checked my setup and can't find any other point where I introduce an extra factor of 3.
Any thoughts would be helpful. Thanks!