How to improve the ASR system


micheal...@gmail.com

Mar 15, 2016, 10:33:11 AM
to kaldi-help
Dear all,

Recently I have been using Kaldi to build an ASR system on my own corpus (a Tibetan corpus). I have split the corpus into train, dev and test sets. The details are below:

Data set    Number of speakers     Number of audios    Total duration
Train       7 female, 10 male      36,127              32.09 h
Test        3 female, 3 male        2,664               2.41 h
Dev         7 female, 10 male       1,700               1.51 h

In my corpus, every speaker recorded the same 3,000 sentences. I randomly picked 500 of those sentences and took the recordings of them from 6 speakers as the test set (in fact there are only 2,664 audios, because some were dropped due to bad quality). Each of the 17 remaining speakers has about 2,500 sentences that are not contained in the test set; from these I randomly chose 1,700 utterances as the dev set, and the rest became the train set. So there is no sentence overlap between the train and test sets, but the train and dev sets share the same speakers.

There are 3,194 Tibetan characters in total in my pronunciation dictionary. The text data I use to train the LM consists only of the train-set transcripts plus some sentences from a primary-school Tibetan textbook.

For the acoustic models, I have trained mono, tri1 (MFCC), tri2 (LDA+MLLT), tri2a (LDA+MLLT+MMI), tri2b (LDA+MLLT+MPE) and tri3 (LDA+MLLT+SAT). The results are listed below:
---------------------------------------decoding dev set-----------------------------------------------------------------
%WER 57.54 [ 8883 / 15439, 196 ins, 750 del, 7937 sub ] exp/mono/decode_dev/wer_9

%WER 31.45 [ 4855 / 15439, 114 ins, 400 del, 4341 sub ] exp/tri1/decode_dev/wer_14

%WER 27.13 [ 4189 / 15439, 125 ins, 385 del, 3679 sub ] exp/tri2/decode_dev/wer_15

%WER 22.00 [ 3397 / 15439, 102 ins, 269 del, 3026 sub ] exp/tri2a/decode_it3_dev/wer_12
%WER 21.41 [ 3306 / 15439, 93 ins, 265 del, 2948 sub ] exp/tri2a/decode_it4_dev/wer_12

%WER 23.84 [ 3680 / 15439, 103 ins, 301 del, 3276 sub ] exp/tri3/decode_dev/wer_15
%WER 28.67 [ 4426 / 15439, 140 ins, 343 del, 3943 sub ] exp/tri3/decode_dev.si/wer_13

----------------------------------------decoding test set-----------------------------------------------------------------
%WER 52.25 [ 12594 / 24103, 306 ins, 611 del, 11677 sub ] exp/tri1/decode_test/wer_12

%WER 48.55 [ 11701 / 24103, 300 ins, 529 del, 10872 sub ] exp/tri2/decode_test/wer_12

%WER 47.70 [ 11498 / 24103, 270 ins, 570 del, 10658 sub ] exp/tri2a/decode_it3_test/wer_11
%WER 47.91 [ 11548 / 24103, 285 ins, 598 del, 10665 sub ] exp/tri2a/decode_it4_test/wer_11

%WER 46.14 [ 11122 / 24103, 285 ins, 438 del, 10399 sub ] exp/tri2b/decode_it3_test/wer_12
%WER 45.71 [ 11017 / 24103, 267 ins, 541 del, 10209 sub ] exp/tri2b/decode_it4_test/wer_13

%WER 40.32 [ 9719 / 24103, 211 ins, 301 del, 9207 sub ] exp/tri3/decode/wer_11
%WER 47.90 [ 11545 / 24103, 283 ins, 495 del, 10767 sub ] exp/tri3/decode.si/wer_11

Based on these results, I have found that recognition accuracy degrades a lot on the test set compared with the dev set. I don't know whether this is normal, but I have noticed that in most of the RESULTS files in the Kaldi example recipes, the dev-set numbers are only slightly better than the test-set ones. To be honest, I don't even know why we need a dev set. I have heard that the dev set is used to tune hyperparameters such as the language-model weight and the decoding beams. Is that right? Can there be overlap between the train and dev sets (different utterances, but from the same speakers)?
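
For reference, this is roughly how the standard Kaldi recipes use the dev set (a minimal sketch, assuming the usual exp/ layout and the stock utils/best_wer.sh script): pick the language-model weight (the wer_N suffix) on the dev set, then report the test result at that same weight instead of re-tuning on test.

# Show the best WER (and hence the best wer_N weight) for each dev decode:
for x in exp/*/decode_dev; do
  [ -d $x ] && grep WER $x/wer_* | utils/best_wer.sh
done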

Looking into the bad test-set results, I have checked my configuration. There may be two causes:

One is that I found there are 20 disambiguation symbols, which means at least one pronunciation is shared by 20 Tibetan characters. Can this cause the bad recognition results?
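
(A rough way to check this, assuming a standard single-space-separated data/local/dict/lexicon.txt, is to count how many words share each pronunciation:)

# Print the pronunciations shared by the most words, with their counts:
cut -d' ' -f2- data/local/dict/lexicon.txt | sort | uniq -c | sort -rn | head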

The other is the stochasticity of the decoding graph. I ran 'fstisstochastic' on G.fst and the result is:

fstisstochastic data/lang_test/G.fst
2.05167   -1.13199

With my limited knowledge of WFSTs, I only know that this G.fst is not stochastic, which seems bad. It was converted directly from the ARPA file. Can anyone point me in the right direction for making G.fst more stochastic? Is it related to choosing more in-domain text data?

I am sorry for asking so many questions; maybe some of them are simply the wrong questions. But I really want to know how to improve my ASR system.

Any help will be appreciated.

Best,

Micheal 

Daniel Povey

Mar 15, 2016, 1:36:37 PM
to kaldi-help
I think the difference between dev and test is that your dev speakers are in your training set but your test speakers are not, so the model works better on the dev set.
Traditionally, both the dev and test sets should contain speakers who are not in the training set.
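
A minimal sketch of one way to make such a split, assuming the standard Kaldi data-directory layout (a data/all directory with an spk2utt file; holding out two speakers is just an example):

# Hold out whole speakers so dev is speaker-disjoint from train:
awk '{print $1}' data/all/spk2utt > all_spk.list
head -n 2 all_spk.list > dev_spk.list
grep -v -x -F -f dev_spk.list all_spk.list > train_spk.list
utils/subset_data_dir.sh --spk-list dev_spk.list data/all data/dev
utils/subset_data_dir.sh --spk-list train_spk.list data/all data/train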
Dan



micheal...@gmail.com

Mar 16, 2016, 4:16:52 AM
to kaldi-help, dpo...@gmail.com
Thanks, Dan.

I also want to make sure of two things.

One is whether too many disambiguation symbols, say 20, can lead to bad recognition results.

The other is that I just used SRILM to train the LM and produce the ARPA file. Why is it so non-stochastic when converted into an FST? What can I do to avoid this?

Best,

Micheal


On Wednesday, March 16, 2016 at 1:36:37 AM UTC+8, Dan Povey wrote:

Daniel Povey

Mar 16, 2016, 4:56:15 PM
to micheal...@gmail.com, kaldi-help
I also want to make sure of two things.

One is whether too many disambiguation symbols, say 20, can lead to bad recognition results.

It shouldn't make a difference.  Usually this happens because you have a lot of homophones (?), i.e. words pronounced the same but with different spellings.

The other is that I just used SRILM to train the LM and produce the ARPA file. Why is it so non-stochastic when converted into an FST? What can I do to avoid this?

I'm not sure at what stage you're measuring this, but it could be due to words with many pronunciations combined with not using pronunciation probabilities.  Typically, having too many pronunciations is not good for WER.
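
If you did want pronunciation probabilities, a minimal sketch of the usual way to add them (assuming the standard steps/ and utils/ scripts, an existing tri3 system with alignments, and illustrative paths; "<UNK>" stands for whatever your OOV symbol is):

# Count pronunciations from the tri3 alignments, attach probabilities to the
# dictionary, and rebuild the lang directory:
steps/get_prons.sh data/train data/lang exp/tri3
utils/dict_dir_add_pronprobs.sh data/local/dict \
  exp/tri3/pron_counts_nowb.txt data/local/dict_pp
utils/prepare_lang.sh data/local/dict_pp "<UNK>" data/local/lang_tmp_pp data/lang_pp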
Dan

micheal...@gmail.com

Mar 17, 2016, 4:28:26 AM
to kaldi-help, dpo...@gmail.com
Thanks a lot for the reply.

There are indeed many homophones in my pronunciation dictionary; in the most extreme case, 20 different characters share the same pronunciation. That is why I asked in my last email whether this could cause the accuracy degradation.

I checked the stochasticity right after producing the ARPA LM with the SRILM toolkit, using the command below:

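# Standard Kaldi G.fst build: strip n-grams with illegal <s>/</s> positions,
# convert the ARPA LM to FST text form, remove OOV words, replace backoff
# epsilons with the #0 disambiguation symbol (eps2disambig.pl), map <s>/</s>
# to epsilon (s2eps.pl), then compile, remove epsilons and arc-sort: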
cat $lmdir/lm.arpa | \
  grep -v '<s> <s>' | \
  grep -v '</s> <s>' | \
  grep -v '</s> </s>' | \
  arpa2fst - | fstprint | \
  utils/remove_oovs.pl $tmpdir/oovs.txt | \
  utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$test/words.txt \
    --osymbols=$test/words.txt  --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst
fstisstochastic $test/G.fst
 
The result is:

fstisstochastic data/lang_test/G.fst
2.05167   -1.13199


I have checked my dictionary and confirmed that only two Tibetan characters have two pronunciations each; every other character has exactly one pronunciation.

So I don't know why G.fst is so non-stochastic. Is it due to a bad statistical language model? I used a trigram model with Witten-Bell discounting, trained on the training-data transcriptions plus the Tibetan textbook.
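
For reference, the SRILM command was roughly the following (file names are placeholders); an interpolated Kneser-Ney model might also be worth comparing:

# Trigram LM with Witten-Bell discounting (what I used):
ngram-count -order 3 -wbdiscount -text lm_train.txt -lm lm.arpa
# Interpolated Kneser-Ney, for comparison:
ngram-count -order 3 -kndiscount -interpolate -text lm_train.txt -lm lm_kn.arpa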

Do you know what I could try to solve this problem? I think the current bad results could be improved if I fix it.

Best,

Micheal



On Thursday, March 17, 2016 at 4:56:15 AM UTC+8, Dan Povey wrote:

Daniel Povey

Mar 17, 2016, 4:49:07 PM
to micheal...@gmail.com, kaldi-help

Thanks a lot for the reply.

There are indeed many homophones in my pronunciation dictionary; in the most extreme case, 20 different characters share the same pronunciation. That is why I asked in my last email whether this could cause the accuracy degradation.

I checked the stochasticity right after producing the ARPA LM with the SRILM toolkit, using the command below:

cat $lmdir/lm.arpa | \
  grep -v '<s> <s>' | \
  grep -v '</s> <s>' | \
  grep -v '</s> </s>' | \
  arpa2fst - | fstprint | \
  utils/remove_oovs.pl $tmpdir/oovs.txt | \
  utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$test/words.txt \
    --osymbols=$test/words.txt  --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst
fstisstochastic $test/G.fst
 
The result is:

fstisstochastic data/lang_test/G.fst
2.05167   -1.13199

It's normal for the second number to be negative; it corresponds to LM states whose arc probabilities 'sum to' more than one, which can happen because of the way backoff is implemented during the conversion to an FST.

The first number, though, should not really be positive. It corresponds to states that 'sum to' less than one. (It is positive because OpenFst represents these quantities as negative logs interpreted as costs; for example, a state whose outgoing probabilities sum to 0.5 shows up as -log 0.5 ≈ 0.69.)

This could happen if you have a lot of OOVs, or a language model state that gives high probability mass to OOV words.
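
A quick sanity check you could run (just a sketch; the paths are assumptions based on your earlier commands) is to count how many tokens in the LM training text fall outside words.txt:

# Count LM-training tokens that are missing from the decoding vocabulary:
awk 'NR==FNR {vocab[$1]=1; next}
     {for (i=1; i<=NF; i++) if (!($i in vocab)) oov++; total+=NF}
     END {printf "%d of %d tokens are OOV (%.2f%%)\n", oov, total, 100*oov/total}' \
  data/lang_test/words.txt lm_train.txt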
Dan