Error increases when perplexity decreases


mili lali

Nov 12, 2019, 12:03:07 PM
to kaldi-help
Hi,
I trained a TDNN model based on the WSJ recipe, and I trained two language models: one has perplexity 3000 and the other 600 on the training-set transcripts.
But my error increases when decoding with the language model that has the lower perplexity:
WER with the 3000-perplexity LM = 17%
WER with the 600-perplexity LM = 21%

Can anyone guess why?

Kirill Katsnelson

Nov 12, 2019, 1:48:56 PM
to kaldi-help
Wondering why you compare the LM perplexity on the train set, but WER on the test set? You've probably made some assumptions and deduced that these should be comparable, but it would help if you explained them explicitly. Absent that, if the train and test sets are from different domains, they cannot be compared. Is that the case?

Also make sure you normalize perplexity the same way. The numbers 3000 and 600 are high; it looks like they are per utterance? Is the length of utterances different in the two sets? Then even in-domain they are incomparable. Try -ppl1 in SRILM.

 -kkm

mili lali

Nov 13, 2019, 2:48:40 AM
to kaldi-help
You're right, I should share information about my train and test datasets.
The train and test sets are in the same domain, with no difference between them.
Here are the perplexities on the test corpus and the corresponding WERs:
dev file test.txt: 11697 sentences, 65592 words, 0 OOVs, 0 zeroprobs
3gram.kn012.gz  logprob=-205753.1  ppl=459.3325  ppl1=1370.451  WER=21.32
3gram.me.gz     logprob=-219334.9  ppl=688.4208  ppl1=2207.637  WER=17.23
2gram.me.gz     logprob=-229224.9  ppl=924.3056  ppl1=3123.982  WER=21.69
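These numbers can be cross-checked from the logprob totals; a small Python sketch, assuming SRILM's standard definitions (ppl counts the </s> token once per sentence, ppl1 does not):

```python
# Reproduce SRILM's ppl and ppl1 for 3gram.kn012.gz from the totals above.
sentences, words = 11697, 65592
logprob = -205753.1  # total log10 probability

ppl = 10 ** (-logprob / (words + sentences))   # normalized over words + </s> tokens
ppl1 = 10 ** (-logprob / words)                # normalized over words only
print(f"ppl={ppl:.2f} ppl1={ppl1:.2f}")        # matches the reported 459.33 / 1370.45
```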

Kirill Katsnelson

Nov 13, 2019, 3:58:10 PM
to kaldi-help
On Tuesday, November 12, 2019 at 11:48:40 PM UTC-8, mili lali wrote:
dev file test.txt: 11697 sentences, 65592 words, 0 OOVs, 0 zeroprobs
3gram.kn012.gz  logprob=-205753.1  ppl=459.3325  ppl1=1370.451  WER=21.32
3gram.me.gz     logprob=-219334.9  ppl=688.4208  ppl1=2207.637  WER=17.23

Yes, that really looks strange. One possibility is that the training text is not as much in-domain as you think.

Assuming the WER values are the best for each run, see if the best LM weights are in the same ballpark, and that none of them lands at the end of the tested range (e.g., if you score with LM weights 7...17 and find the best WER at 17, that's not right; be suspicious of test-set sentences sneaking into the LM's training set, and extend the range). Check the dynamics of WER vs LM weight. Ideally it should be slightly convex and very smooth.
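A hypothetical sketch of that check, with entirely made-up WER numbers:

```python
def check_sweep(wer_by_lmwt):
    """Best LM weight from a WER sweep; flag if it sits at the sweep edge."""
    weights = sorted(wer_by_lmwt)
    best = min(weights, key=lambda w: wer_by_lmwt[w])
    return best, best in (weights[0], weights[-1])

# Made-up sweep: smooth, slightly convex, minimum well inside the range.
sweep = dict(zip(range(7, 18),
                 [19.8, 19.1, 18.5, 18.0, 17.6, 17.3, 17.2, 17.4, 17.8, 18.3, 18.9]))
best, at_edge = check_sweep(sweep)
print(best, at_edge)  # 13 False -- if at_edge were True, extend the sweep
```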

Sorry, I was getting ahead of myself when I replied "try -ppl1". What I meant was: look at ppl1 (to me it seems the more sensible metric: no one expects an </s> in the middle of an utterance, so chopping it off and adjusting for that makes sense; YMMV), and use -debug 1 in SRILM. You may spot a pattern. If that reveals nothing, select only the sentences with word errors and use a deeper debug level.

 -kkm

mili lali

Nov 14, 2019, 2:31:03 PM
to kaldi-help
Thanks,

Yes, that really looks strange. One possibility is that the training text is not as much in-domain as you think.

The results with LM weights between 7 and 17 are almost the same.

The corpus of my language model is different from the training set.
I checked ppl on both the train and test set texts; both ppls decrease with these LMs.
I think my acoustic model is overfitting because my train and test set sentences were generated automatically and almost all sentences follow the same pattern. (I think the sequences of phones occurring are practically the same in both the train and test sets.)
Here is the information on the TDNN model:

$ steps/info/chain_dir_info.pl exp/chain/tdnn1i_sp/
exp/chain/tdnn1i_sp/: num-iters=422 nj=3..12 num-params=10.4M dim=40+100->3512 combine=-0.043->-0.042 (over 8) xent:train/valid[280,421,final]=(-0.972,-0.820,-0.815/-0.984,-0.857,-0.844) logprob:train/valid[280,421,final]=(-0.060,-0.043,-0.042/-0.061,-0.047,-0.048)

Augmented data with reverb, noise, and babble:
$ steps/info/chain_dir_info.pl exp/chain/tdnn1i_online_cmvn_aug_sp/
exp/chain/tdnn1i_online_cmvn_aug_sp/: num-iters=979 nj=3..12 num-params=10.4M dim=40+100->3512 combine=-0.072->-0.071 (over 10) xent:train/valid[651,978,final]=(-1.16,-0.976,-0.956/-1.41,-1.26,-1.22) logprob:train/valid[651,978,final]=(-0.089,-0.068,-0.068/-0.100,-0.086,-0.083)


What do you think?

How can I check for overfitting? How can we reduce it?


Kirill Katsnelson

Nov 18, 2019, 2:51:22 AM
to kaldi-help
On Thursday, November 14, 2019 at 11:31:03 AM UTC-8, mili lali wrote:
The results with LM weights between 7 and 17 are almost the same.

So it looks like the LM either had no say at all, or the opposite: it grabbed the steering wheel and ignored the AM. Make sure that you scale lattices when you decode with the chain model. With a chain model we actually expect the LM weight to be around 1.0, but this is not a "typical" value for the scoring scripts. It's usual to scale the lattice AM scores by 10, to bring the LM weight into the "familiar" range (you would typically see the best LM:AM weight in the range 13...18 with a GMM model), so that the expected optimal weight is around 10. When you sweep 7...17, the real LM:AM ratio explored is 0.7...1.7.

Check the decode logs in chain/tdnn...online_/decode.../logs and see if lattice-scale is in the pipelines. Make sure you're decoding with the '--acwt 1.0 --post-decode-acwt 10.0' arguments to [online]/nnet3/decode.sh, like this: https://github.com/kaldi-asr/kaldi/blob/d97f1d824/egs/mini_librispeech/s5/local/chain/tuning/run_tdnn_1h.sh#L254. Otherwise you are looking at the result of decoding with the LM weight, in "typical" units, varying from 70 to 170, which would be bad.
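A minimal sketch of that scaling arithmetic (assuming the standard --post-decode-acwt 10.0 setup):

```python
POST_DECODE_ACWT = 10.0  # lattice AM scores are multiplied by this after decoding

def effective_lm_am_ratio(lmwt, post_decode_acwt=POST_DECODE_ACWT):
    """The real LM:AM ratio explored when scoring with integer weight lmwt."""
    return lmwt / post_decode_acwt

# Sweeping LMWT 7...17 on a properly scaled chain lattice explores 0.7...1.7,
# bracketing the expected optimum near 1.0; without scaling it would be 70...170.
print(effective_lm_am_ratio(7), effective_lm_am_ratio(17))  # 0.7 1.7
```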

As a quick check, decode with LMWT from 1 to 2. If you see better results, then it's a scaling issue.

I think my acoustic model is overfitting because my train and test set sentences were generated automatically and almost all sentences follow the same pattern. (I think the sequences of phones occurring are practically the same in both the train and test sets.)

First, there is a script generate_plots.py; it will show the performance dynamics on a held-out set vs the training set. It only reads the logs, which you already have. You'll need to pass '--is_chain true' to it. If the plots diverge badly, then yes, it overfits. If only slightly, maybe not. The held-out set is small, so don't jump to conclusions based on these plots alone.
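As a rough back-of-the-envelope proxy, you can already read the final train/valid chain log-probs off the chain_dir_info outputs posted above and compare the gaps directly:

```python
# Final chain log-prob (train, valid) copied from the two chain_dir_info
# outputs earlier in the thread. A large train-valid gap is one symptom
# of overfitting; neither gap here is dramatic.
runs = {
    "tdnn1i_sp": (-0.042, -0.048),
    "tdnn1i_online_cmvn_aug_sp": (-0.068, -0.083),
}
for name, (train, valid) in runs.items():
    print(f"{name}: train-valid gap = {train - valid:.3f}")
```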

Second, pass '--cleanup.preserve-model-interval=N' to train.py to save a checkpoint model every N iterations, and decode with some of these interim models (they will be saved as e.g. 50.mdl, 100.mdl, etc. if you specify 50). I'd use N=50 for the second experiment with 979 iterations, or 25 for the first; it will save about 20 checkpoints in each.
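A quick sketch of which checkpoints that keeps (assuming the preserved models are simply every N-th iteration's N.mdl):

```python
def preserved_models(num_iters, interval):
    """Model files kept with --cleanup.preserve-model-interval=interval."""
    return [f"{i}.mdl" for i in range(interval, num_iters + 1, interval)]

# The second experiment: 979 iterations, keep every 50th model.
kept = preserved_models(979, 50)
print(len(kept), kept[:3])  # 19 ['50.mdl', '100.mdl', '150.mdl']
```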

 -kkm