optimize chain model


allab...@gmail.com

Jan 3, 2019, 8:34:37 AM
to kaldi-help
Hi,
I ran the chain model on my dataset. My dataset has 120 hours for training and 6 hours for testing, and I followed the WSJ script.
Here are my WERs.
My language model is based only on my training text corpus, so I also report decoding with the language model weight equal to one.



tri3:
  20.73  (lm_wt=1)
  10.85  (lm_wt=17)

And for the chain model, following the 1g recipe:

chain:
  12.78  (lm_wt=1)
   3.52  (lm_wt=14)


steps/info/chain_dir_info.pl exp/chain/tdnn1g_sp
exp/chain/tdnn1g_sp: num-iters=132 nj=2..8 num-params=8.5M dim=40+100->3160 combine=-0.044->-0.044 (over 2) xent:train/valid[87,131,final]=(-1.31,-0.876,-0.865/-1.35,-0.971,-0.954) logprob:train/valid[87,131,final]=(-0.072,-0.044,-0.043/-0.094,-0.066,-0.065)
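(For reference, a common way to summarize all decode directories and pick the best LM weight is something like the following; utils/best_wer.sh ships with Kaldi, and the directory pattern here is only illustrative for a setup like this one:

for d in exp/chain/tdnn1g_sp/decode*; do
  grep WER $d/wer_* | utils/best_wer.sh   # prints the best wer_<lmwt>_<wip> line per decode dir
done
)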

My training dataset has short waves, e.g. one-word utterances such as city names.

1- What is your opinion? Can I change the TDNN hyperparameters to get a better model?
2- In earlier posts I think it was said that CNNs give good performance, but the WSJ recipe says they overfit.
What is the best model in Kaldi (in summary: robustness to noise, ..., computation, ...)?

best regards

Daniel Povey

Jan 3, 2019, 3:12:13 PM
to kaldi-help
There's not much I can say based on what you said, you weren't very specific.  It's probably working OK though.



allab...@gmail.com

Jan 8, 2019, 6:59:02 AM
to kaldi-help
Hi, thanks Dan.
1- What additional details should I give to be more specific?
When I record waves myself, recognition is a bit worse; it is sensitive to the start and end of the speech (the moments the microphone is turned on and off) and to hesitations.

2- Is it a good idea to augment the data? I followed the data augmentation in run.sh in the sre16 egs.
What are your suggestions?

It gives this error:
steps/data/reverberate_data_dir.py --rir-set-parameters 0.5, RIRS_NOISES/simulated_rirs/smallroom/rir_list --rir-set-parameters 0.5, RIRS_NOISES/simulated_rirs/mediumroom/rir_list --speech-rvb-probability 1 --pointsource-noise-addition-probability 0 --isotropic-noise-addition-probability 0 --num-replications 1 --source-sampling-rate 8000 data/train_cleaned data/train_cleaned_reverb
Number of RIRs is 40000
Traceback (most recent call last):
  File "steps/data/reverberate_data_dir.py", line 657, in <module>
    Main()
  File "steps/data/reverberate_data_dir.py", line 654, in Main
    max_noises_per_minute = args.max_noises_per_minute)
  File "steps/data/reverberate_data_dir.py", line 422, in CreateReverberatedCopy
    pointsource_noise_addition_probability, max_noises_per_minute)
  File "steps/data/reverberate_data_dir.py", line 345, in GenerateReverberatedWavScp
    speech_dur = durations[recording_id]
KeyError: 'skh000003W0000003'

Here are my keys in data/train_cleaned:
$less reco2dur
skh000003W0000003-1 1.45
skh000003W0001791-1 1.83

$less wav.scp
skh000003W0000003 sox ....
skh000003W0001791 ....
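The KeyError above suggests that the recording IDs in wav.scp do not match the keys in reco2dur (note the extra "-1" suffix in reco2dur). A quick way to see which IDs disagree, assuming the standard data-dir layout (the paths are just the ones from this run):

awk '{print $1}' data/train_cleaned/wav.scp  | sort > wav_ids
awk '{print $1}' data/train_cleaned/reco2dur | sort > dur_ids
comm -3 wav_ids dur_ids | head              # IDs present in only one of the two files
utils/validate_data_dir.sh --no-feats data/train_cleaned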



Daniel Povey

Jan 8, 2019, 3:15:52 PM
to kaldi-help
I am hoping someone else can respond to this.


allab...@gmail.com

Jan 9, 2019, 12:06:41 AM
to kaldi-help
Sorry, but I think it is because of the segments file.
I ran it on data/train (which has no segments file and assumes every wave contains one utterance) instead of data/train_cleaned (the training set cleaned by the script),
and it ran OK.

Is it a bug, or am I doing something wrong?




./run_agument_data.sh
steps/data/reverberate_data_dir.py --rir-set-parameters 0.5, RIRS_NOISES/simulated_rirs/smallroom/rir_list --rir-set-parameters 0.5, RIRS_NOISES/simulated_rirs/mediumroom/rir_list --speech-rvb-probability 1 --pointsource-noise-addition-probability 0 --isotropic-noise-addition-probability 0 --num-replications 1 --source-sampling-rate 8000 data/train data/train_reverb
Number of RIRs is 40000

utils/copy_data_dir.sh: copied data from data/train_reverb to data/train_reverb.new
utils/validate_data_dir.sh: Successfully validated data-directory data/train_reverb.new
Preparing data/musan...
In music directory, processed 645 files; 0 had missing wav data
In speech directory, processed 426 files; 0 had missing wav data
In noise directory, processed 930 files; 0 had missing wav data
utils/fix_data_dir.sh: file data/musan/utt2spk is not in sorted order or not unique, sorting it
utils/fix_data_dir.sh: file data/musan/wav.scp is not in sorted order or not unique, sorting it
fix_data_dir.sh: kept all 2001 utterances.
fix_data_dir.sh: old files are kept in data/musan/.backup
utils/subset_data_dir.sh: reducing #utt from 2001 to 645
utils/subset_data_dir.sh: reducing #utt from 2001 to 426
utils/subset_data_dir.sh: reducing #utt from 2001 to 930
fix_data_dir.sh: kept all 645 utterances.
fix_data_dir.sh: old files are kept in data/musan_music/.backup
fix_data_dir.sh: kept all 426 utterances.
fix_data_dir.sh: old files are kept in data/musan_speech/.backup
fix_data_dir.sh: kept all 930 utterances.
fix_data_dir.sh: old files are kept in data/musan_noise/.backup
utils/data/get_utt2dur.sh: segments file does not exist so getting durations from wave files
utils/data/get_utt2dur.sh: could not get utterance lengths from sphere-file headers, using wav-to-duration
utils/data/get_utt2dur.sh: computed data/musan_speech/utt2dur
utils/data/get_utt2dur.sh: segments file does not exist so getting durations from wave files
utils/data/get_utt2dur.sh: could not get utterance lengths from sphere-file headers, using wav-to-duration
utils/data/get_utt2dur.sh: computed data/musan_noise/utt2dur
utils/data/get_utt2dur.sh: segments file does not exist so getting durations from wave files
utils/data/get_utt2dur.sh: could not get utterance lengths from sphere-file headers, using wav-to-duration
utils/data/get_utt2dur.sh: computed data/musan_music/utt2dur
steps/data/augment_data_dir.py --utt-suffix noise --fg-interval 1 --fg-snrs 15:10:5:0 --fg-noise-dir data/musan_noise data/train data/train_noise
steps/data/augment_data_dir.py --utt-suffix music --bg-snrs 15:10:8:5 --num-bg-noises 1 --bg-noise-dir data/musan_music data/train data/train_music
steps/data/augment_data_dir.py --utt-suffix babble --bg-snrs 20:17:15:13 --num-bg-noises 3:4:5:6:7 --bg-noise-dir data/musan_speech data/train data/train_babble
utils/combine_data.sh data/train_aug data/train_reverb data/train_noise data/train_music data/train_babble
utils/combine_data.sh: combined utt2uniq
utils/combine_data.sh [info]: not combining segments as it does not exist
utils/combine_data.sh: combined utt2spk
utils/combine_data.sh [info]: not combining utt2lang as it does not exist
utils/combine_data.sh [info]: **not combining utt2dur as it does not exist everywhere**
utils/combine_data.sh [info]: **not combining reco2dur as it does not exist everywhere**
utils/combine_data.sh [info]: not combining feats.scp as it does not exist
utils/combine_data.sh: combined text
utils/combine_data.sh [info]: not combining cmvn.scp as it does not exist
utils/combine_data.sh [info]: not combining vad.scp as it does not exist
utils/combine_data.sh [info]: not combining reco2file_and_channel as it does not exist
utils/combine_data.sh: combined wav.scp
utils/combine_data.sh [info]: **not combining spk2gender as it does not exist everywhere**
fix_data_dir.sh: kept all 510088 utterances.
fix_data_dir.sh: old files are kept in data/train_aug/.backup
utils/subset_data_dir.sh: reducing #utt from 510088 to 200000
fix_data_dir.sh: kept all 200000 utterances.
fix_data_dir.sh: old files are kept in data/train_aug_200k/.backup
steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 40 --cmd run.pl  data/train_aug_200k exp/make_mfcc /data/master/egs/me/s5/mfcc.
...
...
...

allab...@gmail.com

Jan 13, 2019, 5:15:35 PM
to kaldi-help
Hi Dan
I used a text corpus of about 400M words to build a language model.
But the WER increased a lot.

Best result, with the language model based on the training text corpus:
%WER 6.53 [ 4331 / 66364, 871 ins, 2065 del, 1395 sub ] exp/chain/tdnn1g_sp/decode_cleaned/wer_16_0.0

Big language model, about 400M words:
%WER 34.91 [ 23166 / 66364, 4006 ins, 2963 del, 16197 sub ] exp/chain/tdnn1g_sp/decode_cleaned_big_lm/wer_10_1.0

(Some test waves were lost in this run, so the WER is inflated compared with the first reported result.)

I guess this is because my lexicon is limited to 7k words, so most of the language model is UNK, and if I extend my lexicon the WER will decrease.

What is your opinion?
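(One quick way to check this guess is to measure how many test-set word types are missing from the lexicon; the paths below are only illustrative for a standard data layout like this one:

cut -d' ' -f2- data/test/text | tr ' ' '\n' | sort -u > test_words   # word types in the test transcripts
awk '{print $1}' data/local/dict/lexicon.txt | sort -u > lex_words   # word types covered by the lexicon
comm -23 test_words lex_words | wc -l                                # count of out-of-vocabulary word types
)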


Daniel Povey

Jan 13, 2019, 5:19:06 PM
to kaldi-help
You should probably have a lexicon that covers the bulk of words in the LM.
But out of domain data is not always expected to help.  It has to be reasonably similar to the data you want to recognize.
Or you can interpolate the LM with the in-domain data. Search the Kaldi scripts for the -interp option to SRILM, e.g. see

egs/ami/s5b/local/ami_train_lms.sh


Dan
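(For reference, a minimal SRILM interpolation sketch, assuming ARPA LMs have already been built from the in-domain and out-of-domain text; the file names and the 0.8 weight are illustrative, and scripts like the AMI one above tune such weights on held-out data:

ngram -order 3 \
  -lm in_domain.lm.gz \
  -mix-lm out_of_domain_400M.lm.gz -lambda 0.8 \
  -write-lm interpolated.lm.gz
# -lambda is the weight given to the first (-lm) model; the -mix-lm gets 1 - lambda
)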


allab...@gmail.com

Jan 15, 2019, 4:29:06 PM
to kaldi-help
Hi,
For training, is it important to have a lexicon larger than the set of unique words in the training corpus?
I mean, I have 7k unique words in my training set; must I have more words in the lexicon during the training phase?
(I think it is not important.)
thanks

Daniel Povey

Jan 15, 2019, 4:29:56 PM
to kaldi-help
No, you don't have to have more words; the presence of extra words in the lexicon while training will not make any difference to the resulting model.
