alignments from nnet3-align-compiled only have one frame per phone

stilsen

Jun 12, 2024, 8:48:11 AM

to kaldi-help

Hi,

I am trying to use nnet3-align-compiled with the Gigaword XL model on my own dataset, but the alignments I get have only one frame per phone. I don't see any errors or warnings that explain why this is happening. What am I doing wrong, and how should I go about debugging it? Here are my steps:

1. First I generate my utt2words, compute the utterance-level CMVN stats, and apply them to my MFCCs.

> No errors or warnings here, except that I do have some OOV items that are not in the Gigaword lexicon.
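
For reference, here is a minimal sketch of how the OOV items could be listed. It assumes a Kaldi-style transcript file (utterance ID followed by the words) and the lang directory's words.txt (word in the first column). The function name `list_oov_words` and the `$data/text` path are hypothetical, not part of my actual setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count words in a transcript that are missing from words.txt.
# Assumed formats:
#   text file : "<utt-id> word1 word2 ..."
#   words.txt : "<word> <integer-id>"
list_oov_words() {
  local text_file="$1" words_file="$2"
  awk '
    NR == FNR { vocab[$1] = 1; next }   # first file: remember every known word
    {
      for (i = 2; i <= NF; i++)         # field 1 is the utterance ID, skip it
        if (!($i in vocab)) print $i
    }
  ' "$words_file" "$text_file" | sort | uniq -c | awk '{ print $1, $2 }'
}
```

Called as `list_oov_words $data/text $lang/words.txt`, this prints one "count word" line per OOV item.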

2. Then I extract the i-vectors like this; nothing seems to go wrong here:

gmm-global-get-post ./extractor/final.dubm scp:/mnt/m/Data/polylect_audio/SEGMENTS/06.05.24a/mfcc/mfcc.scp ark,t:ivectors/post.txt
LOG (gmm-global-get-post[5.5.1012~1-dd107]:main():gmm-global-get-post.cc:115) Done 63 files, 0 with errors, average UBM log-likelihood is -1476.73 over 32938 frames.
/mnt/n/Github/kaldi/src/ivectorbin/ivector-extract --num-threads=8 extractor/final.ie scp:/mnt/m/Data/polylect_audio/SEGMENTS/06.05.24a/mfcc/mfcc.scp ark:ivectors/post.txt ark,t:ivectors/ivectors.txt
LOG (ivector-extract[5.5.1139~1549-67548]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (ivector-extract[5.5.1139~1549-67548]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (ivector-extract[5.5.1139~1549-67548]:main():ivector-extract.cc:314) Done 63 files, 0 with errors. Total (weighted) frames 32938

LOG (ivector-extract[5.5.1139~1549-67548]:main():ivector-extract.cc:317) Overall average objective-function change from estimating ivector was 139.309 per frame over 32938 (weighted) frames.

3. Then I compile the graphs:

compile-train-graphs --reorder=false --read-disambig-syms=$lang/phones/disambig.int $model/tree $model/final.mdl $lang/L.fst ark:$data/utt2words.int ark:$model/graphs.fsts
LOG (compile-train-graphs[5.5.1012~1-dd107]:main():compile-train-graphs.cc:147) compile-train-graphs: succeeded for 63 graphs, failed for 0

4. Then I get the alignments:

scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"

beam=10
retry_beam=40
careful=false

nnet3-align-compiled $scale_opts --write-per-frame-acoustic-loglikes=ark,t:$align/per_frame_logprobs --ivectors=ark:ivectors/ivectors.txt --beam=$beam --retry-beam=$retry_beam --careful=$careful $model/final.mdl ark:$model/graphs.fsts scp:$mfcc/mfcc.scp ark,t:$align/alignments.ali

But every non-silence phone in my alignments has a duration equal to the MFCC frame shift, e.g.:

audio_playerId_50_round_025_type_2_time_06.05.14.31.03_0_002 1 0.00 10.00 SIL
audio_playerId_50_round_025_type_2_time_06.05.14.31.03_0_002 1 10.00 10.00 HH_B
audio_playerId_50_round_025_type_2_time_06.05.14.31.03_0_002 1 20.00 10.00 AH0_I
audio_playerId_50_round_025_type_2_time_06.05.14.31.03_0_002 1 30.00 10.00 L_I
audio_playerId_50_round_025_type_2_time_06.05.14.31.03_0_002 1 40.00 10.00 OW1_E
audio_playerId_50_round_025_type_2_time_06.05.14.31.03_0_002 1 50.00 1830.00 SIL
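
To quantify this across all utterances, a small sketch like the following could tally how many non-silence phones last exactly one frame shift. It assumes the column layout shown above (utterance, channel, start, duration, phone) and that silence is labeled `SIL`. The function name `count_one_frame_phones` and the file name in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count non-silence phones whose duration equals the frame shift.
# Assumed columns: <utt-id> <channel> <start> <duration> <phone>
# The frame shift must be given in the same units as the duration column.
count_one_frame_phones() {
  local ctm_file="$1"
  local frame_shift="${2:-10.00}"
  awk -v shift_len="$frame_shift" '
    $5 != "SIL" {                              # ignore silence segments
      total++
      if ($4 + 0 == shift_len + 0) one++       # numeric comparison of the duration
    }
    END { printf "%d %d\n", one + 0, total + 0 }   # "<one-frame phones> <all non-silence phones>"
  ' "$ctm_file"
}
```

For example, `count_one_frame_phones $align/phones.ctm 10.00` prints two numbers: the count of one-frame phones and the total count of non-silence phones. If the two are equal, every non-silence phone is a single frame long.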

Do you have any guesses what might be going wrong?

Thanks,

Sam
