Getting very bad results (WER and CER), especially with long utterances, when decoding online (using an LSTM model)

Kerolos Ghobrial

Apr 26, 2021, 10:04:29 AM
to kaldi-help
I trained two models, one with TDNN layers and one with LSTM layers, with everything else the same, such as the lattices, tree, and i-vectors.

1. TDNN 
1.1 TDNN config:
  cat <<EOF > $dir/configs/network.xconfig
  input dim=100 name=ivector
  input dim=40 name=input

  # please note that it is important to have input layer with the name=input
  # as the layer immediately preceding the fixed-affine-layer to enable
  # the use of short notation for the descriptor
  fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat

  # the first splicing is moved before the lda layer, so no splicing here
  relu-batchnorm-dropout-layer name=tdnn1 $affine_opts dim=1536
  tdnnf-layer name=tdnnf2 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=1
  tdnnf-layer name=tdnnf3 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=1
  tdnnf-layer name=tdnnf4 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=1
  tdnnf-layer name=tdnnf5 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=0
  tdnnf-layer name=tdnnf6 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf7 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf8 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf9 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf10 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf11 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf12 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf13 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf14 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf15 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf16 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf17 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  linear-component name=prefinal-l dim=256 $linear_opts

  prefinal-layer name=prefinal-chain input=prefinal-l $prefinal_opts big-dim=1536 small-dim=256
  output-layer name=output include-log-softmax=false dim=$num_targets $output_opts

  prefinal-layer name=prefinal-xent input=prefinal-l $prefinal_opts big-dim=1536 small-dim=256
  output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor $output_opts
EOF
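
(For reference, the xconfig above gets turned into the actual nnet3 configs in the usual way; the call below is only a sketch using the same $dir as above, not part of the original script.)

  steps/nnet3/xconfig_to_configs.py \
    --xconfig-file $dir/configs/network.xconfig \
    --config-dir $dir/configs/
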
1.2 TDNN training parameters:
    CUDA_VISIBLE_DEVICES=0,1 steps/nnet3/chain/train.py --stage -10 \
    --cmd "$train_cmd" \
    --use-gpu=yes \
    --feat.online-ivector-dir $train_ivector_dir \
    --feat.cmvn-opts "--norm-means=false --norm-vars=false" \
    --chain.xent-regularize $xent_regularize \
    --chain.leaky-hmm-coefficient 0.1 \
    --chain.l2-regularize 0.0 \
    --chain.apply-deriv-weights false \
    --chain.lm-opts="--num-extra-lm-states=2000" \
    --egs.dir "$common_egs_dir" \
    --egs.stage $get_egs_stage \
    --egs.opts "--frames-overlap-per-eg 0 --constrained false" \
    --egs.chunk-width $frames_per_eg \
    --trainer.dropout-schedule $dropout_schedule \
    --trainer.add-option="--optimization.memory-compression-level=2" \
    --trainer.num-chunk-per-minibatch $minibatch_size \
    --trainer.frames-per-iter 2500000 \
    --trainer.num-epochs $num_epochs \
    --trainer.optimization.num-jobs-initial $num_jobs_initial \
    --trainer.optimization.num-jobs-final $num_jobs_final \
    --trainer.optimization.initial-effective-lrate $initial_effective_lrate \
    --trainer.optimization.final-effective-lrate $final_effective_lrate \
    --trainer.max-param-change 2.0 \
    --cleanup.remove-egs $remove_egs \
    --feat-dir $train_data_dir \
    --tree-dir $tree_dir \
    --lat-dir $lat_dir \
    --dir $dir  || exit 1;
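
(A quick sanity check of the trained model, as a sketch using the same $dir: chain_dir_info.pl summarizes the training run, and nnet3-am-info reports the model's left/right context and dimensions.)

  # summary of iterations, objective values and parameter count
  steps/info/chain_dir_info.pl $dir
  # model info, including left-context / right-context
  nnet3-am-info $dir/final.mdl | head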

2. LSTM 
2.1 LSTM config:
  cat <<EOF > $dir_lstm/configs/network.xconfig
  input dim=100 name=ivector
  input dim=40 name=input

  # please note that it is important to have input layer with the name=input
  # as the layer immediately preceding the fixed-affine-layer to enable
  # the use of short notation for the descriptor
  fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir_lstm/configs/lda.mat

  # the first splicing is moved before the lda layer, so no splicing here
  relu-batchnorm-layer name=tdnn1 dim=$hidden_dim
  relu-batchnorm-layer name=tdnn2 input=Append(-1,0,1) dim=$hidden_dim
  relu-batchnorm-layer name=tdnn3 input=Append(-1,0,1) dim=$hidden_dim

  fast-lstmp-layer name=lstm1 cell-dim=$cell_dim recurrent-projection-dim=$projection_dim non-recurrent-projection-dim=$projection_dim delay=-3 dropout-proportion=0.0 $lstm_opts
  relu-batchnorm-layer name=tdnn4 input=Append(-3,0,3) dim=$hidden_dim
  relu-batchnorm-layer name=tdnn5 input=Append(-3,0,3) dim=$hidden_dim
  fast-lstmp-layer name=lstm2 cell-dim=$cell_dim recurrent-projection-dim=$projection_dim non-recurrent-projection-dim=$projection_dim delay=-3 dropout-proportion=0.0 $lstm_opts
  relu-batchnorm-layer name=tdnn6 input=Append(-3,0,3) dim=$hidden_dim
  relu-batchnorm-layer name=tdnn7 input=Append(-3,0,3) dim=$hidden_dim
  fast-lstmp-layer name=lstm3 cell-dim=$cell_dim recurrent-projection-dim=$projection_dim non-recurrent-projection-dim=$projection_dim delay=-3 dropout-proportion=0.0 $lstm_opts
  relu-batchnorm-layer name=tdnn8 input=Append(-3,0,3) dim=$hidden_dim
  relu-batchnorm-layer name=tdnn9 input=Append(-3,0,3) dim=$hidden_dim
  fast-lstmp-layer name=lstm4 cell-dim=$cell_dim recurrent-projection-dim=$projection_dim non-recurrent-projection-dim=$projection_dim delay=-3 dropout-proportion=0.0 $lstm_opts

  ## adding the layers for chain branch
  output-layer name=output input=lstm4 output-delay=$label_delay include-log-softmax=false dim=$num_targets max-change=1.5

  # adding the layers for xent branch
  # This block prints the configs for a separate output that will be
  # trained with a cross-entropy objective in the 'chain' models... this
  # has the effect of regularizing the hidden parts of the model.  we use
  # 0.5 / args.xent_regularize as the learning rate factor- the factor of
  # 0.5 / args.xent_regularize is suitable as it means the xent
  # final-layer learns at a rate independent of the regularization
  # constant; and the 0.5 was tuned so as to make the relative progress
  # similar in the xent and regular final layers.
  output-layer name=output-xent input=lstm4 output-delay=$label_delay dim=$num_targets learning-rate-factor=$learning_rate_factor max-change=1.5
EOF
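
(The config references shell variables set earlier in the script; typical values in the standard tdnn-lstm chain recipes look roughly like the sketch below. These are illustrative defaults, not necessarily the values used here.)

  label_delay=5
  hidden_dim=1024
  cell_dim=1024
  projection_dim=256
  lstm_opts="decay-time=40"
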
2.2 LSTM training parameters:
  CUDA_VISIBLE_DEVICES=0,1 steps/nnet3/chain/train.py --stage -10 \
    --cmd "$decode_cmd" \
    --use-gpu=wait \
    --feat.online-ivector-dir $train_ivector_dir \
    --feat.cmvn-opts "--norm-means=false --norm-vars=false" \
    --chain.xent-regularize $xent_regularize \
    --chain.leaky-hmm-coefficient 0.1 \
    --chain.l2-regularize 0.00005 \
    --chain.apply-deriv-weights false \
    --chain.lm-opts="--num-extra-lm-states=2000" \
    --trainer.dropout-schedule $dropout_schedule \
    --trainer.num-chunk-per-minibatch 64,32 \
    --trainer.frames-per-iter 1500000 \
    --trainer.max-param-change 2.0 \
    --trainer.num-epochs $num_epochs \
    --trainer.optimization.shrink-value 0.99 \
    --trainer.optimization.num-jobs-initial 2 \
    --trainer.optimization.num-jobs-final 2 \
    --trainer.optimization.initial-effective-lrate 0.001 \
    --trainer.optimization.final-effective-lrate 0.0001 \
    --trainer.optimization.momentum 0.0 \
    --trainer.deriv-truncate-margin 8 \
    --egs.stage $get_egs_stage \
    --egs.opts "--frames-overlap-per-eg 0 --generate-egs-scp true" \
    --egs.chunk-width 160,140,110,80 \
    --egs.chunk-left-context $chunk_left_context \
    --egs.chunk-right-context $chunk_right_context \
    --egs.chunk-left-context-initial 0 \
    --egs.chunk-right-context-final 0 \
    --egs.dir "$common_egs_dir" \
    --cleanup.remove-egs $remove_egs \
    --feat-dir $train_data_dir \
    --tree-dir $tree_dir_lstm \
    --lat-dir $lat_dir_lstm \
    --dir $dir_lstm  || exit 1;
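
(For recurrent models the chunk-context variables above matter for both training and decoding; in the standard recipes they are typically set as in the sketch below, and the offline decode is then run with matching extra context. Illustrative values only.)

  chunk_left_context=40
  chunk_right_context=0
  # offline decoding in the recipes then passes matching context, e.g.
  #   --extra-left-context $chunk_left_context --extra-right-context $chunk_right_context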

3. Online decoding:
online2-tcp-nnet3-decode-faster --config=${config}  --print-args=true \
   --samp-freq=16000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=1.0 \
   --frames-per-chunk=50  --extra-left-context-initial=0  --frame-subsampling-factor=3 \
   --min-active=200  --max-active=7000 \
   --port-num=5050 \
   ${mdl} ${graph}/HCLG.fst ${graph}/words.txt
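
(To separate the online setup from the model itself, one could also run a normal offline decode on the same test set; this is a sketch with hypothetical data/i-vector paths and the context values mentioned above.)

  steps/nnet3/decode.sh --nj 8 --cmd "$decode_cmd" \
    --acwt 1.0 --post-decode-acwt 10.0 \
    --extra-left-context 40 --extra-right-context 0 \
    --extra-left-context-initial 0 --extra-right-context-final 0 \
    --frames-per-chunk 140 \
    --online-ivector-dir exp/nnet3/ivectors_test \
    $graph data/test $dir_lstm/decode_test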

4. Results:
WER 12.2 (TDNN) vs. WER 101 (LSTM)

Daniel Povey

Apr 26, 2021, 11:02:14 AM
to kaldi-help
I'd guess tree/model mismatch or other severe configuration error in testing.
But check objective function values in the logs.
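
(For anyone debugging the same thing, the two checks above could look roughly like this sketch; the log names are the ones train.py writes, and the lang/graph paths are hypothetical.)

  # objective values on the train/valid subsets across iterations
  grep Overall $dir_lstm/log/compute_prob_train.*.log | tail
  grep Overall $dir_lstm/log/compute_prob_valid.*.log | tail
  # rebuild the decoding graph from the tree dir that matches the LSTM model
  utils/mkgraph.sh --self-loop-scale 1.0 data/lang_test $tree_dir_lstm $tree_dir_lstm/graph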
