Extended TDNN x-vector xconfig


Min Wang

Oct 3, 2019, 4:04:17 PM
to kaldi-help
Hi,

I am trying to re-run egs/voxceleb/v2 with the Extended TDNN x-vector architecture described by:


Here are my changes to egs/voxceleb/v2/local/nnet3/xvector/my_run_xvector.sh:

  input dim=${feat_dim} name=input
  relu-batchnorm-layer name=tdnn1 input=Append(-2,-1,0,1,2) dim=512
  relu-batchnorm-layer name=tdnn11 dim=512                                                       <--added
  relu-batchnorm-layer name=tdnn2 input=Append(-2,0,2) dim=512
  relu-batchnorm-layer name=tdnn21 dim=512                                                      <--added
  relu-batchnorm-layer name=tdnn3 input=Append(-3,0,3) dim=512
  relu-batchnorm-layer name=tdnn31 dim=512                                                       <--added
  relu-batchnorm-layer name=tdnn3 input=Append(-4,0,4) dim=512                      <--added
  relu-batchnorm-layer name=tdnn4 dim=512
  relu-batchnorm-layer name=tdnn5 dim=1500

  # The stats pooling layer. Layers after this are segment-level.
  # In the config below, the first and last argument (0, and ${max_chunk_size})

...
steps/nnet3/train_raw_dnn.py
  ...
--trainer.optimization.minibatch-size=512         <-- updated from 64
--trainer.num-epochs=6 \                                  <-- updated from 3 

...


Are those changes correct?



Thanks

min
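One thing that can be checked mechanically in a listing like the one above is whether any layer name is reused; the posted lines show name=tdnn3 twice, which may just be a typo in the post. The following is a minimal sketch, assuming layer names in an xconfig are meant to be unique. The temporary sample file and the helper duplicate_layer_names are made up for illustration, and the "<--added" annotations are left out of the sample.

```shell
# Sketch: report layer names that occur more than once in an xconfig file.
# The sample copies the layer lines from the post; the temp file is a stand-in.
xconfig=$(mktemp)
cat > "$xconfig" <<'EOF'
input dim=${feat_dim} name=input
relu-batchnorm-layer name=tdnn1 input=Append(-2,-1,0,1,2) dim=512
relu-batchnorm-layer name=tdnn11 dim=512
relu-batchnorm-layer name=tdnn2 input=Append(-2,0,2) dim=512
relu-batchnorm-layer name=tdnn21 dim=512
relu-batchnorm-layer name=tdnn3 input=Append(-3,0,3) dim=512
relu-batchnorm-layer name=tdnn31 dim=512
relu-batchnorm-layer name=tdnn3 input=Append(-4,0,4) dim=512
relu-batchnorm-layer name=tdnn4 dim=512
relu-batchnorm-layer name=tdnn5 dim=1500
EOF

# Hypothetical helper: print every field of the form name=<layer>, strip the
# "name=" prefix, and keep only the names that appear more than once.
duplicate_layer_names() {
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^name=/) print substr($i, 6) }' "$1" \
    | sort | uniq -d
}

dups=$(duplicate_layer_names "$xconfig")
echo "duplicate layer names: ${dups:-none}"
rm -f "$xconfig"
```

If a name is reported, renaming one of the two layers (and any later input= reference to it) would be the obvious thing to try.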


David Snyder

Oct 3, 2019, 4:08:37 PM
to kaldi-help
The paper used a minibatch of size 128, but other than that it looks correct to me. 

Min Wang

Oct 3, 2019, 4:21:24 PM
to kaldi-help
Hi David

Oh, thanks!

min

Min Wang

Oct 11, 2019, 5:41:07 PM
to kaldi-help
Hi,

My results using the extended TDNN on VoxCeleb1/2 data were:

EER: 2.858%
minDCF(p-target=0.01): 0.2981
minDCF(p-target=0.001): 0.4409

In a quick test on real-world data, the extended TDNN performed roughly the same as (or slightly worse than) the previous pre-trained model (https://kaldi-asr.org/models/m8).
Sometimes it gave an even worse result and predicted the wrong person with a high score.


That seems odd. Could anything be wrong?


thanks.


min

David Snyder

Oct 12, 2019, 11:40:07 AM
to kaldi-help
In the papers we used more training data. Instead of subsampling the augmented training data before combining it with the clean data (e.g., https://github.com/kaldi-asr/kaldi/blob/master/egs/sitw/v2/run.sh#L128), we used all of the augmented copies. That will probably help, as the extended TDNN has more parameters. Also, you may need to tune other aspects of the recipe. We've found that it's better to have smaller training archives, to encourage more frequent model averaging. Alternatively, you might be able to achieve the same result by reducing the learning rate somewhat. You'll have to experiment with tuning various things.

Min Wang

Oct 13, 2019, 7:03:44 PM
to kaldi-help
Hi,

@david, OK, thanks.

What does "have smaller training archives" mean, and how do I do it?

Also, would you mind publishing your pre-trained model?



thanks

min

Daniel Povey

Oct 13, 2019, 7:18:15 PM
to kaldi-help
He probably means the frames-per-iter or chunks-per-iter value, whatever it's called, should be smaller.



David Snyder

Oct 13, 2019, 9:21:52 PM
to kaldi-help
Yes, I believe you can achieve that by decreasing the frames-per-iter value. You can play with the options to get_egs.sh, and see how it affects the output in the egs/ directory. If you look at the ranges.* files (in the egs/temp/ directory), you should be able to determine how many archives there are, and how many examples there are per archive. It seems that smaller tends to be better, maybe because it enables more frequent model averaging. But, I wonder if simply decreasing the learning rate would have the same effect. 
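A rough way to read the archive count and examples per archive out of the ranges.* files is sketched below. It rests on two assumptions that should be checked against a real file first: that each line describes one training example, and that the third field is the archive index the example goes to. The helper summarize_ranges and the toy files are made up for illustration.

```shell
# Sketch: summarize archive allocation from ranges.* files.
# Assumptions: one example per line; the archive index is in column 3
# (pass a different column number as the second argument if that is wrong).
egs_temp=$(mktemp -d)   # stand-in for the egs temp/ directory
printf '%s\n' 'uttA 0 1 0 300 5' 'uttB 1 2 0 250 7' > "$egs_temp/ranges.1"
printf '%s\n' 'uttC 0 1 100 300 2' 'uttD 1 2 50 250 9' 'uttE 1 2 0 250 4' > "$egs_temp/ranges.2"

summarize_ranges() {
  cat "$1"/ranges.* | awk -v col="${2:-3}" '
    { count[$col]++ }                      # examples seen per archive index
    END {
      for (a in count) { n++; total += count[a] }
      if (n == 0) { print "no examples found"; exit }
      printf "archives=%d examples=%d mean_per_archive=%.1f\n", n, total, total / n
    }'
}

summary=$(summarize_ranges "$egs_temp")
echo "$summary"
rm -rf "$egs_temp"
```

Running it before and after changing frames-per-iter shows whether the archives actually became smaller.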

I personally don't have any plans on updating the pretrained x-vector models at the moment. Maybe others do. 


Min Wang

Oct 15, 2019, 1:11:59 PM
to kaldi-help
Hi,

@dan, @david, thanks for the info.

Training on those 2M utterances took us about 1 week on 8 GPUs.
With the 6M data (without subsampling), I guess it will take 3-4 weeks? How long did it take to train on that data in your case?

It would be great if someone in the Kaldi community could publish their pre-trained extended TDNN model.

I noticed that the voxceleb/v2 recipe uses only the original data/train (without the augmented training data) to train the PLDA.
Does it make sense to train just the PLDA part with all of the 6M data (without retraining the E-TDNN)?
Or is the E-TDNN x-vector part more important for performance than the PLDA part?


Thanks

min
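The 3-4 week guess can be sanity-checked with a back-of-envelope calculation. This is only a sketch under the assumption that training time grows linearly with the number of utterances when the number of epochs and GPUs stays fixed; the helper weeks_for is made up for illustration.

```shell
# Sketch: linear scaling estimate of training time.
# usage: weeks_for <baseline_weeks> <baseline_utterances> <new_utterances>
weeks_for() {
  awk -v w="$1" -v b="$2" -v n="$3" 'BEGIN { printf "%.1f\n", w * n / b }'
}

# 1 week for about 2M utterances, scaled to about 6M utterances.
estimate=$(weeks_for 1 2000000 6000000)
echo "estimated weeks: $estimate"
```

Real runs can deviate from this, for example through I/O limits or different archive sizes.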

David Snyder

Oct 15, 2019, 3:36:53 PM
to kaldi-help
> Training those 2M took us about 1 week on 8 GPUs.

For me it took about a week, I think, to train it on 16 GPUs. You can certainly subsample the augmented training data if you want to train faster. I can't predict how much degradation, if any, you'll have, though.

> I noticed that voxceleb/v2 recipe just used original data/train (without train_augmented data) to train the PLDA,
> does it make sense to just train the PLDA part with all those 6M data (without re-training the e-tdnn)?
> Or is e-tdnn x-vector part more important for the performance than plda part?

Augmentation will almost always help the DNN training. It might improve the PLDA component, if it helps to make the training data look more like the enroll/test data. In the case of the Voxceleb recipe, the training data already closely matches the enroll/test data, and so it probably isn't going to help. If the enroll/test data has more noise or reverberation than the training data, it will probably help to use augmentation in the PLDA training.

Min Wang

Oct 21, 2019, 11:41:17 AM
to kaldi-help
Hi,

I tried to use the 6M data, but in stage 4, prepare_feats_for_egs.sh only keeps about 1.8M utterances.

Here are some log messages:

utils/combine_data.sh [info]: not combining spk2gender as it does not exist
fix_data_dir.sh: kept all 6384440 utterances.                               <----- 6m in train_combined
fix_data_dir.sh: old files are kept in data/train_combined/.backup

>  local/nnet3/xvector/prepare_feats_for_egs.sh --nj 40 --cmd "$train_cmd" \
    data/train_combined data/train_combined_no_sil exp/train_combined_no_sil

local/nnet3/xvector/prepare_feats_for_egs.sh --nj 40 --cmd run.pl data/train_combined data/train_combined_no_sil exp/train_combined_no_sil
local/nnet3/xvector/prepare_feats_for_egs.sh: Succeeded creating xvector features for train_combined
fix_data_dir.sh: kept 1871456 utterances out of 6384440         <---- it only keep 1.8 m instead of 6m.


I looked at local/nnet3/xvector/prepare_feats_for_egs.sh and did not see anything that limits or filters the wav files.

Am I missing anything?


thanks

min

David Snyder

Oct 21, 2019, 3:47:45 PM
to kaldi-help
Try deleting data/train_combined_no_sil and rerun stage 4.

My guess is there are some leftover files in that directory from before you increased the amount of training data.

Min Wang

Oct 21, 2019, 6:19:52 PM
to kaldi-help
Hi,

@david, thanks. I removed data/train_combined_no_sil and re-ran stage 4 like this:

///////////////////////////////////////////////////////////
  local/nnet3/xvector/prepare_feats_for_egs.sh --nj 40 --cmd "$train_cmd" \
    data/train_combined data/train_combined_no_sil exp/train_combined_no_sil

  echo "before fix_data dir"
  wc data/train_combined_no_sil/wav.scp

  utils/fix_data_dir.sh data/train_combined_no_sil
  echo "after fix_data dir"
  wc data/train_combined_no_sil/wav.scp
///////////////////////////////////////////////////////////

$ more my.log.step4

local/nnet3/xvector/prepare_feats_for_egs.sh --nj 40 --cmd run.pl data/train_combined data/train_combined_no_sil exp/train_combined_no_sil
local/nnet3/xvector/prepare_feats_for_egs.sh: Succeeded creating xvector features for train_combined
before fix_data dir
   6384440  141292684 2308439195 data/train_combined_no_sil/wav.scp
fix_data_dir.sh: kept 1871456 utterances out of 6384440
fix_data_dir.sh: old files are kept in data/train_combined_no_sil/.backup
after fix_data dir
  1871456  22817630 382929048 data/train_combined_no_sil/wav.scp


It looks like fix_data_dir.sh only kept 1871456 utterances.

I also checked data/train_combined_no_sil/.backup/utt2num_frames; it only has 1.8M lines.

$ wc data/train_combined_no_sil/.backup/utt2num_frames
 1871456  3742912 62229302 utt2num_frames

Shouldn't utt2num_frames have 6M lines, just like train_combined/utt2num_frames?



thanks

min

David Snyder

Oct 21, 2019, 6:37:07 PM
to kaldi-help
Are you sure you're not losing these files in stage 5? In these scripts, there's usually a stage 5 where we reduce the size of the dataset, and only keep utterances of a certain length, and speakers with several recordings.

Also, can you report the output of  wc -l data/train_combined/*. I want to see if there's some file in there with 1.8 million lines. 

Min Wang

Oct 21, 2019, 6:58:00 PM
to kaldi-help


Hi,

@david, thanks.

I only re-ran stage 4, then exited.

Here is the result for train_combined/:

$ wc -l data/train_combined/*

   6384440 data/train_combined/feats.scp
         1 data/train_combined/frame_shift
      7323 data/train_combined/spk2utt
wc: data/train_combined/split200: Is a directory
         0 data/train_combined/split200
wc: data/train_combined/split40: Is a directory
         0 data/train_combined/split40
   6384440 data/train_combined/utt2dur
   6384440 data/train_combined/utt2num_frames
   6384440 data/train_combined/utt2spk
   6384440 data/train_combined/utt2uniq
   6384440 data/train_combined/vad.scp
   6384440 data/train_combined/wav.scp
  44698404 total

Thanks

min

David Snyder

Oct 21, 2019, 7:07:30 PM
to kaldi-help
Run wc -l on data/train_combined_no_sil/* but *before* you run fix_data_dir.sh. There's probably some file in there that it is filtering against. 
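Since wc -l on a glob also trips over subdirectories such as split40/, a small wrapper that only looks at regular files keeps the listing clean. This is a minimal sketch; the helper count_lines and the toy directory are made up for illustration.

```shell
# Sketch: print line counts for the regular files directly inside a data
# directory, skipping subdirectories (split40/, .backup/, ...).
count_lines() {
  find "$1" -maxdepth 1 -type f | sort | while read -r f; do
    n=$(( $(wc -l < "$f") ))          # arithmetic expansion strips padding
    printf '%8d %s\n' "$n" "$(basename "$f")"
  done
}

# Toy directory standing in for data/train_combined_no_sil.
data_dir=$(mktemp -d)
mkdir "$data_dir/split40"
printf '%s\n' 'utt1 a.ark:1' 'utt2 a.ark:2' > "$data_dir/feats.scp"
printf '%s\n' 'utt1 spk1' 'utt2 spk1' 'utt3 spk2' > "$data_dir/utt2spk"

counts=$(count_lines "$data_dir")
echo "$counts"
rm -rf "$data_dir"
```

A file with noticeably fewer lines than utt2spk (other than per-speaker files like spk2utt) is the likely filter that fix_data_dir.sh is applying.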

Min Wang

Oct 21, 2019, 10:26:06 PM
to kaldi-help
Hi,

@david, fix_data_dir.sh backs up the data, so here is the result from the backup:

wc -l data/train_combined_no_sil/.backup/*
   1871456 data/train_combined_no_sil/.backup/feats.scp
      7323 data/train_combined_no_sil/.backup/spk2utt
   1871456 data/train_combined_no_sil/.backup/utt2num_frames
   6384440 data/train_combined_no_sil/.backup/utt2spk
   6384440 data/train_combined_no_sil/.backup/wav.scp
  16519115 total




thanks.


min

Daniel Povey

Oct 21, 2019, 10:30:37 PM
to kaldi-help
Looks like they might have been lost in feature generation.


David Snyder

Oct 21, 2019, 10:36:45 PM
to kaldi-help
Yeah, that's what it looks like. You should inspect the log file and see if something went wrong. 
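Before digging through the logs, it can help to see which utterances went missing. The sketch below lists IDs that are in utt2spk but not in feats.scp and groups them by their last hyphen-separated suffix. The helper missing_from_feats is made up for illustration, and the toy data assumes that augmented copies carry suffixes such as -babble and -reverb and that the clean ID has none; the archive path is hypothetical.

```shell
# Sketch: list utterance IDs present in utt2spk but absent from feats.scp.
missing_from_feats() {
  awk 'FILENAME == ARGV[1] { have[$1] = 1; next }   # first file: feats.scp
       !($1 in have)       { print $1 }' "$1/feats.scp" "$1/utt2spk"
}

# Toy directory: only the clean utterance kept its features.
d=$(mktemp -d)
base=id00820-zhX-Mjuc_dc-00083
printf '%s\n' "$base feats.ark:42" > "$d/feats.scp"
for suffix in "" -babble -music -noise -reverb; do
  printf '%s\n' "$base$suffix id00820"
done > "$d/utt2spk"

missing=$(missing_from_feats "$d")
# Count the missing IDs per suffix to see whether one augmentation type dominates.
printf '%s\n' "$missing" | sed 's/.*-//' | sort | uniq -c
rm -rf "$d"
```

If nearly all missing IDs are augmented copies, that points at something specific to the augmented data rather than at the script itself.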

Min Wang

Oct 22, 2019, 11:38:00 AM
to kaldi-help
Hi,

@dan, @david, thanks.

I looked at the logs in exp/train_combined_no_sil/log/create_xvector_feats_train_combined.xx.log.
They contain warnings and errors like these:

WARNING (select-voiced-frames[5.5.506~1-7b4c5]:main():select-voiced-frames.cc:76) Mismatch in number for frames 563 for features and VAD 557, for utterance id00820-zhX-Mjuc_dc-00083-babble
WARNING (select-voiced-frames[5.5.506~1-7b4c5]:main():select-voiced-frames.cc:76) Mismatch in number for frames 563 for features and VAD 557, for utterance id00820-zhX-Mjuc_dc-00083-music
WARNING (select-voiced-frames[5.5.506~1-7b4c5]:main():select-voiced-frames.cc:76) Mismatch in number for frames 563 for features and VAD 557, for utterance id00820-zhX-Mjuc_dc-00083-noise
LOG (apply-cmvn-sliding[5.5.506~1-7b4c5]:main():apply-cmvn-sliding.cc:75) Applied sliding-window cepstral mean normalization to 162845 utterances, 0 had errors.
WARNING (select-voiced-frames[5.5.506~1-7b4c5]:main():select-voiced-frames.cc:76) Mismatch in number for frames 563 for features and VAD 557, for utterance id00820-zhX-Mjuc_dc-00083-reverb
LOG (select-voiced-frames[5.5.506~1-7b4c5]:main():select-voiced-frames.cc:106) Done selecting voiced frames; processed 32569 utterances, 130276 had errors.
LOG (copy-feats[5.5.506~1-7b4c5]:main():copy-feats.cc:143) Copied 32569 feature matrices.


I think the VAD comes from stage 3. Here is my stage 3:

if [ $stage -le 3 ]; then
  # Take a random subset of the augmentations 5m
  #utils/subset_data_dir.sh data/train_aug 5000000 data/train_aug_1m
  # use all of them
  utils/subset_data_dir.sh data/train_aug 5107552 data/train_aug_1m
  utils/fix_data_dir.sh data/train_aug_1m

  # Make MFCCs for the augmented data.  Note that we do not compute a new
  # vad.scp file here.  Instead, we use the vad.scp from the clean version of
  # the list.
  steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 40 --cmd "$train_cmd" \
    data/train_aug_1m exp/make_mfcc $mfccdir

  # Combine the clean and augmented VoxCeleb2 list.  This is now roughly
  # double the size of the original clean list.
  utils/combine_data.sh data/train_combined data/train_aug_1m data/train
fi





min
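The "Done selecting voiced frames" summary lines can be totalled across all job logs to see how many utterances were dropped overall. This is a minimal sketch that assumes every log ends with a summary line in the same format as the one quoted above; the helper sum_vad_errors and the temporary log files are made up, and the same quoted line is reused in both toy logs.

```shell
# Sketch: add up "processed N utterances, M had errors." across log files.
sum_vad_errors() {
  grep -h 'Done selecting voiced frames' "$@" | awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "processed") ok += $(i + 1)
        if ($i == "had" && $(i + 1) == "errors.") bad += $(i - 1)
      }
    }
    END { printf "processed=%d errors=%d\n", ok, bad }'
}

# Two toy logs, each holding the summary line quoted in the post.
logdir=$(mktemp -d)
line='LOG (select-voiced-frames[5.5.506~1-7b4c5]:main():select-voiced-frames.cc:106) Done selecting voiced frames; processed 32569 utterances, 130276 had errors.'
printf '%s\n' "$line" > "$logdir/job.1.log"
printf '%s\n' "$line" > "$logdir/job.2.log"

totals=$(sum_vad_errors "$logdir"/*.log)
echo "$totals"
rm -rf "$logdir"
```

In the quoted log the error count is exactly four times the processed count, which would fit one clean utterance surviving for every four augmented copies that failed.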

David Snyder

Oct 22, 2019, 2:18:04 PM
to kaldi-help
Regenerating the vad.scp for data/train_combined should fix the issue.

If everything was done correctly, the number of frames in the feats.scp and vad.scp would be equal. It's not going to be easy for us to tell you where this went wrong. I could see something like this happening if you were experimenting with different frame-shift values in the MFCC configs.

Anyway, the simplest fix is to just regenerate the vad.scp file. 
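To confirm the mismatch before and after regenerating, the per-utterance frame counts can be compared directly. The sketch below assumes you have two text tables in utt2num_frames format ("<utt-id> <num-frames>"), one for the features and one for the VAD; how the VAD table is produced is left open here. The helper frame_mismatches is made up, and the frame count for the clean utterance in the toy data is an assumption.

```shell
# Sketch: print utterances whose frame counts differ between two tables.
# usage: frame_mismatches <feats_lengths> <vad_lengths>
frame_mismatches() {
  awk 'FILENAME == ARGV[1] { feats[$1] = $2; next }
       ($1 in feats) && feats[$1] != $2 { print $1, feats[$1], $2 }' "$1" "$2"
}

# Toy tables based on the warning quoted in the thread (563 vs 557 frames).
t=$(mktemp -d)
printf '%s\n' 'id00820-zhX-Mjuc_dc-00083 557' \
              'id00820-zhX-Mjuc_dc-00083-babble 563' > "$t/feats_len"
printf '%s\n' 'id00820-zhX-Mjuc_dc-00083 557' \
              'id00820-zhX-Mjuc_dc-00083-babble 557' > "$t/vad_len"

mismatches=$(frame_mismatches "$t/feats_len" "$t/vad_len")
echo "$mismatches"
rm -rf "$t"
```

An empty result after regenerating vad.scp would indicate the two now agree.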

David Snyder

Oct 22, 2019, 2:20:56 PM
to kaldi-help
Just run the following before you rerun prepare_feats_for_egs.sh:

sid/compute_vad_decision.sh --nj 40 --cmd "$train_cmd" \
  data/train_combined exp/make_vad $vaddir

utils/fix_data_dir.sh data/train_combined

Min Wang

Oct 22, 2019, 4:14:02 PM
to kaldi-help
Hi,

@david, thanks a lot!

Yes, that fixed the stage 4 problem.

Now I have 6M utterances and have continued to stage 5.





min

Min Wang

Oct 31, 2019, 6:54:57 PM
to kaldi-help
Hi,

Just FYI, my training finally finished, and I got the following result:

EER: 2.344%
minDCF(p-target=0.01): 0.2418
minDCF(p-target=0.001): 0.4219


These numbers are better than my previous result trained on 2M utterances.

However, when I tried real-world data, the performance improvement was not obvious at all.

The main issue seems to be that performance degrades a lot when the enrollment and test acoustic conditions are mismatched.

For example:
one person records enrollment samples on one device/microphone,
then talks over another type of microphone at a different distance under different conditions, and the performance degrades a lot.


In a real-life environment, it is hard to ask people to match the enrollment and test acoustic conditions.

I guess current speaker identification technology still cannot solve this issue (enrollment and test acoustic condition mismatch), right?

Any suggestions?


thanks

David Snyder

Nov 1, 2019, 1:20:19 AM
to kaldi-help
> I guess current speaker identification technology still can not solve this issue (enrollment and test acoustic condition mismatch), right?

I'm not sure what you would consider "solving" this issue. Performance will always be better if the enrollment and test conditions match (e.g., same recording device, same phrase, etc.). Performance degrades as the enrollment and test conditions differ.

> Any suggestions?

Your best bet is to collect some data that reflects the variability you expect to see in your enrollment and test data, or to simulate it if you can. Use this data to train or adapt the PLDA model in some way. If you have enough of this data, it may help to include it in the DNN embedding training set.

Min Wang

Nov 1, 2019, 1:14:21 PM
to kaldi-help
Hi,

@david, thanks.

What I mean is: the performance should degrade only a little when the enrollment and test acoustic conditions are mismatched.

For example: a person enrolls using a recording made on a PC microphone in a typical office environment,
and then talks in a large conference room using a wireless microphone;
we should be able to identify him without much performance degradation.

I assume the VoxCeleb1/2 data has that kind of variability?

But it seems that the current x-vector/E-TDNN trained on VoxCeleb etc. does not perform well in those situations.

Considering that there are many possible test conditions,
is there any scientific way to collect data that reflects the variability between the enrollment and test data?
For example, what if a person talks from different distances or over different microphones?
What exactly are those variabilities?



thanks.

min