Extended TDNN x-vector xconfig

Min Wang

Oct 3, 2019, 4:04:17 PM

to kaldi-help

Hi

I am trying to re-run egs/voxceleb/v2 with the Extended TDNN x-vector architecture described in:

https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2979.pdf

Here are my changes to egs/voxceleb/v2/local/nnet3/xvector/my_run_xvector.sh:

# use http://www.danielpovey.com/files/2019_icassp_multi_speaker.pdf

input dim=${feat_dim} name=input

relu-batchnorm-layer name=tdnn1 input=Append(-2,-1,0,1,2) dim=512

relu-batchnorm-layer name=tdnn11 dim=512 <--added

relu-batchnorm-layer name=tdnn2 input=Append(-2,0,2) dim=512

relu-batchnorm-layer name=tdnn21 dim=512 <--added

relu-batchnorm-layer name=tdnn3 input=Append(-3,0,3) dim=512

relu-batchnorm-layer name=tdnn31 dim=512 <--added

relu-batchnorm-layer name=tdnn32 input=Append(-4,0,4) dim=512 <--added

relu-batchnorm-layer name=tdnn4 dim=512

relu-batchnorm-layer name=tdnn5 dim=1500

# The stats pooling layer. Layers after this are segment-level.

# In the config below, the first and last argument (0, and ${max_chunk_size})

...

steps/nnet3/train_raw_dnn.py

...

--trainer.optimization.minibatch-size=512 <-- updated from 64

--trainer.num-epochs=6 \ <-- updated from 3

...

Are those changes correct?

Thanks

min

David Snyder

Oct 3, 2019, 4:08:37 PM

to kaldi-help

The paper used a minibatch of size 128, but other than that it looks correct to me.
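When layers are added to an xconfig by hand, it is easy to reuse a layer name by accident, and layer names are generally expected to be unique. Below is a minimal sketch of a quick check, assuming each layer line carries a single name=<id> token as in the config above. The function name dup_layer_names and the path in the usage comment are illustrative only, not part of Kaldi.

```shell
# Print every layer name that appears more than once in an xconfig file.
# ASSUMPTION: each layer line has one whitespace-separated "name=<id>" token.
dup_layer_names() {
  # $1: path to the xconfig file
  awk '
    /^[ \t]*#/ { next }                  # skip comment lines
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^name=/) {
          n = substr($i, 6)              # text after "name="
          if (++seen[n] == 2) print n    # report each duplicate once
        }
    }' "$1"
}

# Usage (hypothetical path):
#   dup_layer_names exp/xvector_nnet_1a/configs/network.xconfig
```

An empty result means no name was repeated. This only checks names; it does not validate dimensions or splicing indexes.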

Min Wang

Oct 3, 2019, 4:21:24 PM

to kaldi-help

Hi David

Oh, thanks!

min

Min Wang

Oct 11, 2019, 5:41:07 PM

to kaldi-help

Hi

My results from training the extended TDNN on VoxCeleb1/2 data were:

EER: 2.858%

minDCF(p-target=0.01): 0.2981

minDCF(p-target=0.001): 0.4409

I did a quick test on some real live data. The extended TDNN seems to perform roughly the same as (or a little worse than) the previous pre-trained model (https://kaldi-asr.org/models/m8),

and sometimes it gave an even worse result, predicting the wrong person with a high score.

That seems odd. I am wondering if anything could be wrong?

thanks.

min

David Snyder

Oct 12, 2019, 11:40:07 AM

to kaldi-help

In the papers we used more training data. Instead of subsampling the augmented training data before combining it with the clean data (e.g., https://github.com/kaldi-asr/kaldi/blob/master/egs/sitw/v2/run.sh#L128), we used all of the augmented copies. That will probably help, as the extended TDNN has more parameters. Also, you may need to tune other aspects of the recipe. We've found that it's better to have smaller training archives, to encourage more frequent model averaging. Alternatively, you might be able to achieve the same result by reducing the learning rate somewhat. You'll have to experiment with tuning various things.

Min Wang

Oct 13, 2019, 7:03:44 PM

to kaldi-help

Hi

@david, OK, thanks.

What does "have smaller training archives" mean, and how do I do it?

Also, would you mind publishing your pre-trained model?

thanks

min

Daniel Povey

Oct 13, 2019, 7:18:15 PM

to kaldi-help

He probably means the frames-per-iter or chunks-per-iter value, whatever it's called, should be smaller.

--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/b4451bf2-5958-470f-9f68-27b77eb36b74%40googlegroups.com.

David Snyder

Oct 13, 2019, 9:21:52 PM

to kaldi-help

Yes, I believe you can achieve that by decreasing the frames-per-iter value. You can play with the options to get_egs.sh, and see how it affects the output in the egs/ directory. If you look at the ranges.* files (in the egs/temp/ directory), you should be able to determine how many archives there are, and how many examples there are per archive. It seems that smaller tends to be better, maybe because it enables more frequent model averaging. But, I wonder if simply decreasing the learning rate would have the same effect.

I personally don't have any plans on updating the pretrained x-vector models at the moment. Maybe others do.
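The archive inspection described above can be sketched as a small shell function. It assumes the ranges.* files are whitespace-separated with the absolute archive index in the third field; check a few lines of your own files before relying on that. The function name examples_per_archive and the path in the usage comment are illustrative.

```shell
# Count examples per archive from one or more ranges.* files.
# ASSUMPTION: the 3rd whitespace-separated field is the absolute archive index.
examples_per_archive() {
  # Prints "<archive-index> <num-examples>" lines, sorted by archive index.
  cat "$@" | awk '{ n[$3]++ } END { for (a in n) print a, n[a] }' | sort -n
}

# Usage (hypothetical path):
#   examples_per_archive exp/xvector_nnet_1a/egs/temp/ranges.* | wc -l   # number of archives
#   examples_per_archive exp/xvector_nnet_1a/egs/temp/ranges.* | head    # examples per archive
```

Re-running this after changing the frames-per-iter value should show whether the number of archives went up and the examples per archive went down.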


Min Wang

Oct 15, 2019, 1:11:59 PM

to kaldi-help

Hi

@dan, @david, thanks for the info.

Training on those 2M utterances took us about 1 week on 8 GPUs.

With the 6M data (without subsampling), I guess it will take 3-4 weeks? How long did it take to train on that data in your case?

It would be great if someone in the Kaldi community could publish their pre-trained extended/TDNN model.

I noticed that the voxceleb/v2 recipe just uses the original data/train (without the augmented data) to train the PLDA.

Does it make sense to train just the PLDA part with all of the 6M data (without retraining the e-tdnn)?

Or is the e-tdnn x-vector part more important for performance than the PLDA part?

Thanks

min

David Snyder

Oct 15, 2019, 3:36:53 PM

to kaldi-help

> Training on those 2M utterances took us about 1 week on 8 GPUs.

For me it took about a week, I think, to train it on 16 GPUs. You can certainly subsample the augmented training data if you want to train faster. I can't predict how much degradation, if any, you'll have, though.

> I noticed that the voxceleb/v2 recipe just uses the original data/train (without the augmented data) to train the PLDA.
> Does it make sense to train just the PLDA part with all of the 6M data (without retraining the e-tdnn)?
> Or is the e-tdnn x-vector part more important for performance than the PLDA part?

Augmentation will almost always help the DNN training. It might improve the PLDA component, if it helps to make the training data look more like the enroll/test data. In the case of the Voxceleb recipe, the training data already closely matches the enroll/test data, and so it probably isn't going to help. If the enroll/test data has more noises or reverberation than the training data, it will probably help to use augmentation in the PLDA training.

Min Wang

Oct 21, 2019, 11:41:17 AM

to kaldi-help

Hi

I tried to use the 6M data, but somehow in stage 4, prepare_feats_for_egs.sh only gets about 1.8M utterances.

Here are some log messages:

utils/combine_data.sh [info]: not combining spk2gender as it does not exist

fix_data_dir.sh: kept all 6384440 utterances. <----- 6m in train_combined

fix_data_dir.sh: old files are kept in data/train_combined/.backup

> local/nnet3/xvector/prepare_feats_for_egs.sh --nj 40 --cmd "$train_cmd" \
    data/train_combined data/train_combined_no_sil exp/train_combined_no_sil

local/nnet3/xvector/prepare_feats_for_egs.sh --nj 40 --cmd run.pl data/train_combined data/train_combined_no_sil exp/train_combined_no_sil

local/nnet3/xvector/prepare_feats_for_egs.sh: Succeeded creating xvector features for train_combined

fix_data_dir.sh: kept 1871456 utterances out of 6384440 <---- it only keeps 1.8M instead of 6M.

I looked at local/nnet3/xvector/prepare_feats_for_egs.sh and did not see anything that limits or filters those wav files.

Am I missing anything?

thanks

min

David Snyder

Oct 21, 2019, 3:47:45 PM

to kaldi-help

Try deleting data/train_combined_no_sil and rerun stage 4.

My guess is there are some leftover files in that directory from before you increased the amount of training data.

Min Wang

Oct 21, 2019, 6:19:52 PM

to kaldi-help

Hi

@david, thanks. I removed data/train_combined_no_sil and re-ran stage 4 like this:

///////////////////////////////////////////////////////////

local/nnet3/xvector/prepare_feats_for_egs.sh --nj 40 --cmd "$train_cmd" \
  data/train_combined data/train_combined_no_sil exp/train_combined_no_sil
echo "before fix_data dir"
wc data/train_combined_no_sil/wav.scp
utils/fix_data_dir.sh data/train_combined_no_sil
echo "after fix_data dir"
wc data/train_combined_no_sil/wav.scp

///////////////////////////////////////////////////////////

$ more my.log.step4

local/nnet3/xvector/prepare_feats_for_egs.sh --nj 40 --cmd run.pl data/train_combined data/train_combined_no_sil exp/train_combined_no_sil

local/nnet3/xvector/prepare_feats_for_egs.sh: Succeeded creating xvector features for train_combined

before fix_data dir

6384440 141292684 2308439195 data/train_combined_no_sil/wav.scp

fix_data_dir.sh: kept 1871456 utterances out of 6384440

fix_data_dir.sh: old files are kept in data/train_combined_no_sil/.backup

after fix_data dir

1871456 22817630 382929048 data/train_combined_no_sil/wav.scp

It looks like fix_data_dir.sh only kept 1871456 utterances.

I also checked data/train_combined_no_sil/.backup/utt2num_frames, and it only has 1.8M lines.

$ wc data/train_combined_no_sil/.backup/utt2num_frames

1871456 3742912 62229302 utt2num_frames

Shouldn't utt2num_frames have 6M lines, just like train_combined/utt2num_frames?

thanks

min

David Snyder

Oct 21, 2019, 6:37:07 PM

to kaldi-help

Are you sure you're not losing these files in stage 5? In these scripts, there's usually a stage 5 where we reduce the size of the dataset, and only keep utterances of a certain length, and speakers with several recordings.

Also, can you report the output of wc -l data/train_combined/*. I want to see if there's some file in there with 1.8 million lines.

Min Wang

Oct 21, 2019, 6:58:00 PM

to kaldi-help

Hi

@david, thanks.

I only re-ran stage 4, then exited.

Here is the result for train_combined/:

$ wc -l data/train_combined/*

6384440 data/train_combined/feats.scp

1 data/train_combined/frame_shift

7323 data/train_combined/spk2utt

wc: data/train_combined/split200: Is a directory

0 data/train_combined/split200

wc: data/train_combined/split40: Is a directory

0 data/train_combined/split40

6384440 data/train_combined/utt2dur

6384440 data/train_combined/utt2num_frames

6384440 data/train_combined/utt2spk

6384440 data/train_combined/utt2uniq

6384440 data/train_combined/vad.scp

6384440 data/train_combined/wav.scp

44698404 total

Thanks

min

David Snyder

Oct 21, 2019, 7:07:30 PM

to kaldi-help

Run wc -l on data/train_combined_no_sil/* but *before* you run fix_data_dir.sh. There's probably some file in there that it is filtering against.
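Beyond comparing line counts, it can help to see exactly which utterances are missing from one file relative to another. Here is a minimal sketch, assuming both files are Kaldi-style text tables whose first whitespace-separated field is the utterance ID; the function name missing_utts is illustrative.

```shell
# List utterance IDs that appear in the first table but not in the second.
# ASSUMPTION: the first field of every line in both files is the utterance ID.
missing_utts() {
  # $1: reference table (e.g. wav.scp), $2: table to check (e.g. feats.scp)
  awk '
    FILENAME == ARGV[1] { have[$1] = 1; next }   # first pass: IDs in the table to check
    !($1 in have)       { print $1 }             # second pass: reference IDs not seen
  ' "$2" "$1"
}

# Usage, with the backup files from this thread:
#   missing_utts data/train_combined_no_sil/.backup/wav.scp \
#                data/train_combined_no_sil/.backup/feats.scp | head
```

Looking at the suffixes of the missing IDs (for example whether they are all augmented copies) can hint at which step dropped them.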

Min Wang

Oct 21, 2019, 10:26:06 PM

to kaldi-help

Hi

@david, since fix_data_dir.sh backs up the data, here is the result from the backup:

wc -l data/train_combined_no_sil/.backup/*

1871456 data/train_combined_no_sil/.backup/feats.scp

7323 data/train_combined_no_sil/.backup/spk2utt

1871456 data/train_combined_no_sil/.backup/utt2num_frames

6384440 data/train_combined_no_sil/.backup/utt2spk

6384440 data/train_combined_no_sil/.backup/wav.scp

16519115 total

thanks.

min

Daniel Povey

Oct 21, 2019, 10:30:37 PM

to kaldi-help

Looks like they might have been lost in feature generation.


David Snyder

Oct 21, 2019, 10:36:45 PM

to kaldi-help

Yeah, that's what it looks like. You should inspect the log file and see if something went wrong.


Min Wang

Oct 22, 2019, 11:38:00 AM

to kaldi-help

Hi

@dan, @david, thanks.

I looked at the logs: exp/train_combined_no_sil/log/create_xvector_feats_train_combined.xx.log

They contain warnings and errors like these:

WARNING (select-voiced-frames[5.5.506~1-7b4c5]:main():select-voiced-frames.cc:76) Mismatch in number for frames 563 for features and VAD 557, for utterance id00820-zhX-Mjuc_dc-00083-babble

WARNING (select-voiced-frames[5.5.506~1-7b4c5]:main():select-voiced-frames.cc:76) Mismatch in number for frames 563 for features and VAD 557, for utterance id00820-zhX-Mjuc_dc-00083-music

WARNING (select-voiced-frames[5.5.506~1-7b4c5]:main():select-voiced-frames.cc:76) Mismatch in number for frames 563 for features and VAD 557, for utterance id00820-zhX-Mjuc_dc-00083-noise

LOG (apply-cmvn-sliding[5.5.506~1-7b4c5]:main():apply-cmvn-sliding.cc:75) Applied sliding-window cepstral mean normalization to 162845 utterances, 0 had errors.

WARNING (select-voiced-frames[5.5.506~1-7b4c5]:main():select-voiced-frames.cc:76) Mismatch in number for frames 563 for features and VAD 557, for utterance id00820-zhX-Mjuc_dc-00083-reverb

LOG (select-voiced-frames[5.5.506~1-7b4c5]:main():select-voiced-frames.cc:106) Done selecting voiced frames; processed 32569 utterances, 130276 had errors.

LOG (copy-feats[5.5.506~1-7b4c5]:main():copy-feats.cc:143) Copied 32569 feature matrices.

I think the VAD comes from stage 3. Here is my stage 3:

if [ $stage -le 3 ]; then
  # Take a random subset of the augmentations 5m
  #utils/subset_data_dir.sh data/train_aug 5000000 data/train_aug_1m
  # use all of them
  utils/subset_data_dir.sh data/train_aug 5107552 data/train_aug_1m
  utils/fix_data_dir.sh data/train_aug_1m

  # Make MFCCs for the augmented data. Note that we do not compute a new
  # vad.scp file here. Instead, we use the vad.scp from the clean version of
  # the list.
  steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 40 --cmd "$train_cmd" \
    data/train_aug_1m exp/make_mfcc $mfccdir

  # Combine the clean and augmented VoxCeleb2 list. This is now roughly
  # double the size of the original clean list.
  utils/combine_data.sh data/train_combined data/train_aug_1m data/train
fi

min

David Snyder

Oct 22, 2019, 2:18:04 PM

to kaldi-help

Regenerating the vad.scp for data/train_combined should fix the issue.

If everything was done correctly, the number of frames in the feats.scp and vad.scp would be equal. It's not going to be easy for us to tell you where this went wrong. I could see something like this happening if you were experimenting with different frame-shift values in the MFCC configs.

Anyway, the simplest fix is to just regenerate the vad.scp file.
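To quantify how many utterances were dropped by the frame-count mismatch, the summary lines that select-voiced-frames writes to the logs can be totalled. In the log excerpt posted above, 32569 utterances were processed and 130276 had errors, which together match the 162845 utterances that apply-cmvn-sliding reported. Below is a minimal sketch, assuming the summary line has the same wording as in that excerpt; the function name vad_error_totals is illustrative.

```shell
# Sum the "processed N utterances, M had errors" counts across log files.
# ASSUMPTION: the summary line is worded as in the log excerpt in this thread.
vad_error_totals() {
  # Prints "<total-processed> <total-errors>".
  sed -n 's/.*Done selecting voiced frames; processed \([0-9][0-9]*\) utterances, \([0-9][0-9]*\) had errors.*/\1 \2/p' "$@" |
    awk '{ ok += $1; bad += $2 } END { printf "%d %d\n", ok, bad }'
}

# Usage, with the log location from this thread:
#   vad_error_totals exp/train_combined_no_sil/log/create_xvector_feats_train_combined.*.log
```

After regenerating vad.scp and re-running the feature preparation, the second number should drop to zero or close to it if the mismatch was the only problem.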

David Snyder

Oct 22, 2019, 2:20:56 PM

to kaldi-help

Just run

sid/compute_vad_decision.sh --nj 40 --cmd "$train_cmd" \
data/train_combined exp/make_vad $vaddir

utils/fix_data_dir.sh data/train_combined

before you rerun prepare_feats_for_egs.sh.

Min Wang

Oct 22, 2019, 4:14:02 PM

to kaldi-help

Hi

@david, thanks a lot!!

Yes, it did fix the stage 4 problem.

Now I get 6M utterances and have continued to stage 5.

min

Min Wang

Oct 31, 2019, 6:54:57 PM

to kaldi-help

Hi

Just FYI, my training finally finished. I got the following results:

EER: 2.344%

minDCF(p-target=0.01): 0.2418

minDCF(p-target=0.001): 0.4219

Those numbers are improved compared to my previous results from training on 2M utterances.

However, when I tried it on real live data, the performance improvement was not obvious at all.

The main issue seems to be that performance degrades a lot if the enrollment and test acoustic conditions do not match.

For example:

one person records enrollment samples on one device/microphone,

then talks over another type of microphone at a different distance under different conditions, and the performance degrades a lot.

In a real-life environment, it is hard to ask people to match enrollment and test acoustic conditions.

I guess current speaker identification technology still cannot solve this issue (enrollment and test acoustic condition mismatch), right?

Any suggestions?

thanks

David Snyder

Nov 1, 2019, 1:20:19 AM

to kaldi-help

> I guess current speaker identification technology still cannot solve this issue (enrollment and test acoustic condition mismatch), right?

I'm not sure what you would consider "solving" this issue. Performance will always be better if the enrollment and test conditions match (e.g., same recording device, same phrase, etc). Performance degrades as the enrollment and test conditions differ.

> Any suggestions?

Your best bet is to collect some data that reflects the variability you expect to see in your enrollment and test data, or to simulate it if you can. Use this data to train or adapt the PLDA model in some way. If you have enough of this data, it may help to include it in the DNN embedding training set.

Min Wang

Nov 1, 2019, 1:14:21 PM

to kaldi-help

Hi

@david, thanks.

What I mean is: the performance should degrade only a little if the enrollment and test acoustic conditions mismatch.

For example: a person enrolls using a recording made on a PC microphone in a typical office environment,

and now he talks in a big conference room using a wireless microphone.

We should be able to identify him without much performance degradation.

I assume the VoxCeleb v1/v2 data should have that typical variability?

But it seems the current x-vector/e-tdnn models trained on VoxCeleb etc. did not perform well in those situations.

Considering there are many possible test conditions,

is there any scientific way to collect data that reflects the variability between the enrollment and test data?

For example, what if a person talks at different distances or over different microphones?

What are those variabilities exactly?

thanks.

min
