Dear all,
I have recently been using Kaldi to build an ASR system on my own corpus (a Tibetan corpus). I have split the corpus into train, dev and test sets. The details are below:
Data set   Speakers             Audios   Duration
Train      7 female, 10 male    36,127   32.09 h
Test       3 female, 3 male      2,664    2.41 h
Dev        7 female, 10 male     1,700    1.51 h
In my corpus, each speaker recorded the same 3,000 sentences. I randomly picked 500 of those sentences and took them from 6 speakers to form the test set (in practice there are 2,664 audios, because some were dropped due to bad quality). The remaining 17 speakers each have about 2,500 sentences that are not in the test set. From those I randomly chose 1,700 utterances as the dev set, and the rest became the train set. So there is no overlap between the train and test sets, but the train and dev sets share the same speakers.
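To make the split above concrete, here is a minimal sketch of how I built it; the sentence IDs and speaker labels are placeholders, not my real metadata:

```python
import random

# Hypothetical metadata: every speaker recorded the same 3,000 sentence IDs.
sentences = [f"sent{i:04d}" for i in range(3000)]
test_speakers = ["f1", "f2", "f3", "m1", "m2", "m3"]        # 3 female + 3 male
train_speakers = [f"spk{i:02d}" for i in range(17)]          # 7 female + 10 male

random.seed(0)
# 500 sentences, taken only from the 6 held-out speakers, form the test set.
test_sents = set(random.sample(sentences, 500))
test_set = [(spk, s) for spk in test_speakers for s in test_sents]

# The other 17 speakers keep only the remaining 2,500 sentences each.
pool = [(spk, s) for spk in train_speakers for s in sentences
        if s not in test_sents]

# 1,700 of those utterances become the dev set; the rest are the train set.
dev_set = set(random.sample(pool, 1700))
train_set = [u for u in pool if u not in dev_set]

# No sentence appears in both train and test, but dev and train share speakers.
assert test_sents.isdisjoint(s for _, s in train_set)
```

(The real sets are smaller because bad-quality audios were dropped; the sketch ignores that.)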
There are 3,194 Tibetan characters in total in my pronunciation dictionary. The text data I use to train the LM comes only from the train-set transcripts plus some sentences from a primary-school Tibetan textbook.
For the acoustic models, I have so far trained mono, tri1 (MFCC), tri2 (LDA+MLLT), tri2a (LDA+MLLT+MMI), tri2b (LDA+MLLT+MPE) and tri3 (LDA+MLLT+SAT). The results are listed below:
---------------------------------- decoding dev set ----------------------------------
%WER 57.54 [ 8883 / 15439, 196 ins, 750 del, 7937 sub ] exp/mono/decode_dev/wer_9
%WER 31.45 [ 4855 / 15439, 114 ins, 400 del, 4341 sub ] exp/tri1/decode_dev/wer_14
%WER 27.13 [ 4189 / 15439, 125 ins, 385 del, 3679 sub ] exp/tri2/decode_dev/wer_15
%WER 22.00 [ 3397 / 15439, 102 ins, 269 del, 3026 sub ] exp/tri2a/decode_it3_dev/wer_12
%WER 21.41 [ 3306 / 15439, 93 ins, 265 del, 2948 sub ] exp/tri2a/decode_it4_dev/wer_12
%WER 23.84 [ 3680 / 15439, 103 ins, 301 del, 3276 sub ] exp/tri3/decode_dev/wer_15
---------------------------------- decoding test set ----------------------------------
%WER 52.25 [ 12594 / 24103, 306 ins, 611 del, 11677 sub ] exp/tri1/decode_test/wer_12
%WER 48.55 [ 11701 / 24103, 300 ins, 529 del, 10872 sub ] exp/tri2/decode_test/wer_12
%WER 47.70 [ 11498 / 24103, 270 ins, 570 del, 10658 sub ] exp/tri2a/decode_it3_test/wer_11
%WER 47.91 [ 11548 / 24103, 285 ins, 598 del, 10665 sub ] exp/tri2a/decode_it4_test/wer_11
%WER 46.14 [ 11122 / 24103, 285 ins, 438 del, 10399 sub ] exp/tri2b/decode_it3_test/wer_12
%WER 45.71 [ 11017 / 24103, 267 ins, 541 del, 10209 sub ] exp/tri2b/decode_it4_test/wer_13
%WER 40.32 [ 9719 / 24103, 211 ins, 301 del, 9207 sub ] exp/tri3/decode/wer_11
%WER 47.90 [ 11545 / 24103, 283 ins, 495 del, 10767 sub ] exp/tri3/decode.si/wer_11
Based on these results, recognition accuracy degrades a lot on the test set compared with the dev set. I don't know whether this is normal, but I have noticed that in most of the RESULTS files in the Kaldi examples, dev-set recognition is only a little better than test-set recognition. To be honest, I don't even know why we need a dev set. I have heard that the dev set is used to tune hyperparameters such as the language-model weight and the decoding beams. Is that right? And can the train and dev sets overlap (different utterances, but from the same speakers)?
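If my understanding of the dev set is right, the tuning loop amounts to the sketch below: pick the decoding parameter on dev, then read off the test number at that same setting (the WER values here are made up; in Kaldi the wer_N files play this role, N being the language-model weight):

```python
# Dev-set WER at each language-model weight (hypothetical numbers).
dev_wer  = {9: 23.1, 10: 22.4, 11: 21.9, 12: 21.41, 13: 21.6}
# Test-set WER at the same weights (also hypothetical).
test_wer = {9: 48.0, 10: 47.5, 11: 47.2, 12: 47.3, 13: 47.8}

# Choose the weight that minimises WER on the dev set only...
best_lmwt = min(dev_wer, key=dev_wer.get)

# ...then report the test-set WER at that weight, without re-tuning on test.
print(best_lmwt, test_wer[best_lmwt])  # -> 12 47.3
```

The point, as I understand it, is that the test set stays untouched during tuning, so its WER remains an honest estimate.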
To explain the bad test-set result, I have checked my configuration. There may be two reasons:
The first is that there are 20 disambiguation symbols, which means at least one pronunciation is shared by 20 Tibetan characters. Can this cause the bad recognition result?
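My understanding is that the number of disambiguation symbols is driven mainly by the largest group of words sharing an identical pronunciation (plus pronunciations that are prefixes of others). A toy check on a lexicon, with made-up entries, might look like:

```python
from collections import Counter

# Toy lexicon: word -> pronunciation (phone sequence). Entries are made up.
lexicon = {
    "ka":  ("k", "a"),
    "kha": ("k", "a"),   # homophone of "ka" in this toy example
    "ga":  ("g", "a"),
    "nga": ("g", "a"),   # homophone of "ga"
    "ca":  ("ts", "a"),
}

# Count how many words share each pronunciation; the largest homophone
# group sets roughly how many #N disambiguation symbols L.fst needs.
counts = Counter(lexicon.values())
max_homophones = max(counts.values())
print(max_homophones)  # -> 2
```

In my real lexicon this maximum would be 20, which is why I see symbols up to #20.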
The second concerns the stochasticity of the decoding graph. When I check G.fst with 'fstisstochastic', the result is
fstisstochastic data/lang_test/G.fst
2.05167 -1.13199
From my limited knowledge of WFSTs, I only know that this G.fst is not stochastic. It was converted directly from the ARPA file. Can anyone point me in the right direction for making G.fst more stochastic? Is it related to choosing more in-domain text data?
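For what it's worth, my understanding of what fstisstochastic reports is the spread, over states, of -log of the total outgoing probability mass (in OpenFst's log semiring); both numbers near zero would mean stochastic. A toy version over a hand-built FST, with made-up states and probabilities, might look like this:

```python
import math

# Toy FST: state -> outgoing arc probabilities (made-up numbers).
# Backoff arcs in an ARPA-derived G.fst add extra mass on top of the
# n-gram arcs, so per-state sums can exceed 1.0, which is exactly what
# makes such a graph non-stochastic.
arcs = {
    0: [0.5, 0.4, 0.1],   # sums to 1.0: a stochastic state
    1: [0.7, 0.5],        # sums to 1.2: too much mass (e.g. a backoff arc)
    2: [0.2, 0.1],        # sums to 0.3: too little mass
}

# Per-state deviation of -log(total probability) from zero.
devs = [-math.log(sum(probs)) for probs in arcs.values()]
print(min(devs), max(devs))  # nonzero values flag non-stochastic states
```

If that picture is right, some deviation is expected for any G.fst built from a backoff LM, and I mainly need to know how much deviation is acceptable.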
I am sorry for asking so many questions; some of them may simply be the wrong questions. But I really want to know how to improve my ASR system.
Any help will be appreciated.
Best,
Micheal