Kaldi Speaker Identification

Yi Yang

Oct 21, 2019, 5:43:15 AM
to kaldi-help
Hi All,

I am looking for the most accurate speaker identification method/script in Kaldi.

For your info, I have trained a DNN model on my own data, with speaker IDs, following the Kaldi WSJ recipe.

Now I am looking for how to identify the speaker ID of an input wave file (a new wave file that is not in the training data) with high accuracy.

I would very much appreciate your advice on the above.

Thank you.
YiYang

David Snyder

Oct 21, 2019, 10:56:29 AM
to kaldi-help
The best speaker recognition systems are based on DNN embeddings. In Kaldi these are called x-vectors. Look at egs/voxceleb/v2 or egs/sitw/v2 for wideband recipes (i.e., microphone) and egs/sre16/v2/ for a telephone based recipe. We also have a few pretrained models for these recipes available online.
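
As a rough illustration of what comparing embeddings means: the recipes above score pairs of x-vectors with PLDA, but even plain cosine similarity conveys the idea. This is only a sketch; the vectors below are invented.

```python
# Comparing two fixed-dimensional speaker embeddings (such as x-vectors)
# with cosine similarity. This is a simpler stand-in for the PLDA scoring
# used in the Kaldi recipes; the vectors here are invented for illustration.
import math

def cosine_score(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

enroll = [0.2, -0.5, 0.8, 0.1]     # embedding of the enrolled speaker
test = [0.25, -0.45, 0.75, 0.05]   # embedding of the test utterance

print(round(cosine_score(enroll, test), 3))  # score near 1.0 -> likely the same speaker
```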

Yi Yang

Oct 23, 2019, 6:27:36 AM
to kaldi-help
Hi David,

In the recipes mentioned above, what is the PLDA model used for, and what is the PLDA scoring?

And is there any example of using the pretrained models for these recipes to do speaker recognition?

Thank you and best regards,
YiYang

David Snyder

Oct 23, 2019, 10:09:38 AM
to kaldi-help
In the recipes mentioned above, what is the PLDA model used for, and what is the PLDA scoring?

Google "probabilistic linear discriminant analysis" and "speaker recognition" and you should find plenty of information online. 

And is there any example of using the pretrained models for these recipes to do speaker recognition?

The pretrained models are generated from existing recipes. For example, the voxceleb model was generated by https://github.com/kaldi-asr/kaldi/blob/master/egs/voxceleb/v2/run.sh. If you follow the steps in this recipe, you should get an idea of how it was trained and how to use it.

Yi Yang

Nov 8, 2019, 5:54:42 AM
to kaldi-help
Hi David,

I am working on the voxceleb recipe and have changed the train and test data to my own data.

But I encounter an error at stage 11. Is the scoring stage the part where the speaker recognition is done?

And what is the "trials" file downloaded by local/make_voxceleb1_v2.pl used for?

Thank you and regards,
YiYang

David Snyder

Nov 8, 2019, 8:39:13 AM
to kaldi-help
But I encounter an error at stage 11. Is the scoring stage the part where the speaker recognition is done?

Scoring is where you compare pairs of embeddings and determine how likely they are to have been generated by the same speaker versus different speakers. The comparison results in a score, which can be converted into a speaker recognition decision. E.g., if the score is sufficiently high, we'll say the two embeddings belong to the same speaker (this is a "target" trial), otherwise they're from different speakers (a "nontarget" trial). 
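
A minimal sketch of that decision step; the scores and the threshold below are invented, and in practice the threshold is tuned on a held-out development set.

```python
# Turning trial scores into target/nontarget decisions by thresholding.
# Scores and the threshold are invented for illustration; real systems
# tune the threshold on held-out data.
scores = {
    ("spk-id-A", "utt-id-A"): -10.9,
    ("spk-id-A", "utt-id-B"): -61.1,
    ("spk-id-B", "utt-id-B"): 5.8,
}
threshold = -20.0

decisions = {
    trial: ("target" if score >= threshold else "nontarget")
    for trial, score in scores.items()
}
for (model, utt), decision in decisions.items():
    print(model, utt, decision)
```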

And what is the "trials" file downloaded by local/make_voxceleb1_v2.pl used for?

The trials file defines the comparisons we are going to perform. In the Kaldi scripts, we also provide a column that says what the outcome of the comparison should be (either "target" or "nontarget"), but I believe that last column is only used when computing error rates, not by the scoring binaries themselves.

The trials file might look something like this:

spk-id-A utt-id-A target
spk-id-A utt-id-B nontarget
spk-id-A utt-id-C nontarget
spk-id-B utt-id-A nontarget
spk-id-B utt-id-B target

In the first line, it says we want to compare a speaker model spk-id-A with an utterance utt-id-A, and that comparison should result in a high score, since it is a target trial. The output of the scoring should be a file that looks something like this:

spk-id-A utt-id-A -10.87238
spk-id-A utt-id-B -61.12823
spk-id-A utt-id-C -80.87298
spk-id-B utt-id-A -47.72377
spk-id-B utt-id-B 5.81908

The last column is the PLDA score, which is the log likelihood ratio between the embeddings with the IDs given in the first and second column belonging to the same speaker, versus belonging to different speakers. 
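
As a toy illustration of a log likelihood ratio (not Kaldi's actual PLDA model): score a 1-D observation under a "same speaker" Gaussian versus a "different speakers" Gaussian. All parameters below are invented.

```python
# Toy log likelihood ratio: two 1-D Gaussians stand in for PLDA's
# "same speaker" and "different speakers" hypotheses. All parameters
# are invented; real PLDA scores high-dimensional embeddings.
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def llr(x):
    # log p(x | same speaker) - log p(x | different speakers)
    return log_gauss(x, mean=1.0, var=0.5) - log_gauss(x, mean=-1.0, var=0.5)

print(llr(0.9) > 0)    # positive score: evidence for "same speaker"
print(llr(-1.2) < 0)   # negative score: evidence for "different speakers"
```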

Yi Yang

Nov 12, 2019, 5:18:04 AM
to kaldi-help
Hi David,

I have tried creating the trials file and succeeded in running the scoring.

The results I get are as below:
cso001_VL180810115318108_001 cso001_VL180810124444108_001 -0.05909699
cso001_VL180810115318108_001 cso002_VL180810120047200_001 -1.734944
cso001_VL180810115318108_001 cso001_VL180810124444108_003 -2.204333
cso001_VL180810115318108_001 cso003_VL180810123942162_001 -0.2915734
cso001_VL180810115318108_001 cso001_VL180810124444108_006 -0.1330951
cso001_VL180810115318108_001 cust001_VL180810115318108_001 -2.969074
cso001_VL180810115318108_001 cso001_VL180810124444108_009 -0.9908003

cust002_VL180810120047200_007 cso001_VL180810124444108_001 -0.07748856
cust002_VL180810120047200_007 cso002_VL180810120047200_001 -2.998007
cust002_VL180810120047200_007 cso001_VL180810124444108_003 -4.153507
cust002_VL180810120047200_007 cso003_VL180810123942162_001 -0.4131091
cust002_VL180810120047200_007 cso001_VL180810124444108_006 -2.56669
cust002_VL180810120047200_007 cust001_VL180810115318108_001 -2.118264
cust002_VL180810120047200_007 cso001_VL180810124444108_009 -4.937249

From the scores, I can't get a higher score for speaker ID "cust002" against utterances from speaker ID "cust002".

And it shows the EER is 40%. For the EER percentage, does a lower percentage mean a more accurate model?
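
For reference, the EER is the error rate at the operating point where the false-accept and false-reject rates are equal, so a lower EER does mean a more accurate system. A minimal sketch of computing it from scored trials, with invented scores and labels:

```python
# A minimal equal error rate (EER) computation from scored trials,
# along the lines of what Kaldi's compute_eer utilities do.
# The scores and labels below are invented for illustration.
def compute_eer(scores, labels):
    """EER: the point where the false-accept rate (FAR) equals the
    false-reject rate (FRR), found by sweeping a threshold over every
    observed score. labels are True for target trials."""
    n_tgt = sum(labels)
    n_non = len(labels) - n_tgt
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(scores):
        far = sum(1 for s, t in zip(scores, labels) if not t and s >= thr) / n_non
        frr = sum(1 for s, t in zip(scores, labels) if t and s < thr) / n_tgt
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Three target and three nontarget trials, with one error of each kind.
scores = [2.1, 1.5, -0.3, -1.0, 0.4, -2.2]
labels = [True, True, True, False, False, False]
print(round(compute_eer(scores, labels), 3))  # -> 0.333
```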

Yi Yang

Nov 19, 2019, 5:12:11 AM
to kaldi-help
Hi David,

I know that the lower the EER value, the higher the accuracy of the model.

And referring to the voxceleb recipe, I have trained it with my own train and test data.

I also did not include the "RIRS_noise" and "musan" data in my training.

My train data is about 17+ hours, and currently I get an EER of around 40% with it.

What can I do to improve the EER? Will just adding more training data do?

Thank you; I would much appreciate your kind advice.

Thanks and regards,
YiYang


David Snyder

Nov 20, 2019, 9:33:06 PM
to kaldi-help
Yes, you need a lot more than 17h of speech to train the x-vector DNN. For example, a commonly used dataset is Voxceleb, which has 2,000 hours of training data. Also, for speaker ID, it's really important to have diversity of training speakers. We usually expect several thousand training speakers. There's no minimum training size I can point to, but I would be skeptical of training an x-vector DNN on fewer than 1000 speakers with less than 500 hours of speech. Maybe an i-vector system would work better with that amount of data.

I suggest adding some publicly available datasets. Voxceleb is a good one, but it's primarily (only?) English. Still, a lot of out-of-domain data is better than a small amount of in-domain data. I've heard that a Chinese version of Voxceleb exists, but I don't know much about it. Finally, you can look for datasets on the LDC, but those cost money.

Jan Trmal

Nov 20, 2019, 11:52:03 PM
to kaldi-help
There is also CN-Celeb, available now (since last week) on openslr.org, if that helps.
y.

--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/b36d390e-2bf6-428d-97bd-659478de3d76%40googlegroups.com.

Yi Yang

Nov 28, 2019, 3:38:28 AM
to kaldi-help
Hi David,

Thank you for your advice. 

Before trying to add the publicly available datasets to my own train data, I have tried running the "Voxceleb" recipe again on my system, and currently I encounter the error shown below:

sid/nnet3/xvector/get_egs.sh: Shuffling order of archives on disk
bash: line 1: 35764 Killed                  ( nnet3-shuffle-egs --srand=45 ark:./exp/xvector_nnet_1a/egs/egs_temp.45.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.45.ark,./exp/xvector_nnet_1a/egs/egs.45.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.45.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.45.log
bash: line 1: 35816 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=49 ark:./exp/xvector_nnet_1a/egs/egs_temp.49.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.49.ark,./exp/xvector_nnet_1a/egs/egs.49.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.49.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.49.log
bash: line 1: 35702 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=40 ark:./exp/xvector_nnet_1a/egs/egs_temp.40.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.40.ark,./exp/xvector_nnet_1a/egs/egs.40.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.40.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.40.log
...
bash: line 1: 36233 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=81 ark:./exp/xvector_nnet_1a/egs/egs_temp.81.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.81.ark,./exp/xvector_nnet_1a/egs/egs.81.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.81.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.81.log
bash: line 1: 36246 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=82 ark:./exp/xvector_nnet_1a/egs/egs_temp.82.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.82.ark,./exp/xvector_nnet_1a/egs/egs.82.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.82.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.82.log
bash: line 1: 36260 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=83 ark:./exp/xvector_nnet_1a/egs/egs_temp.83.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.83.ark,./exp/xvector_nnet_1a/egs/egs.83.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.83.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.83.log
bash: line 1: 36274 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=84 ark:./exp/xvector_nnet_1a/egs/egs_temp.84.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.84.ark,./exp/xvector_nnet_1a/egs/egs.84.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.84.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.84.log
run.pl: 39 / 84 failed, log is in ./exp/xvector_nnet_1a/egs/log/shuffle.*.log

At first I encountered an error caused by not enough memory on my system, so I reduced the value of $nj and ran the run.sh script again.

Then I encountered the error shown in the log above; it might be because of not enough disk space on my system.

Currently my system has around 800GB of disk space, but how much disk space does the "Voxceleb" recipe need?

Regards,
YiYang

Yi Yang

Nov 28, 2019, 3:42:37 AM
to kaldi-help
Hi Yenda,

Thank you for your help. Currently my aim is an English dataset; maybe in the future I will include a Chinese dataset.

Thank you and appreciate your help.

Regards,
YiYang

Daniel Povey

Nov 28, 2019, 6:37:46 AM
to kaldi-help
There was a recent conversation on this list about that, with Bar Madar I think. I pointed out a code-level issue which makes the egs 4x larger than they need to be. You might be able to fix it if you are good at C++.


Yi Yang

Dec 4, 2019, 4:28:19 AM
to kaldi-help
Hi Dan,

Is the conversation you mean in this discussion: https://groups.google.com/d/msg/kaldi-help/Duqa5XEAJek/Pm2tybddAAAJ

Is that the only way I can solve it, or can I try to reduce the number of archives?

Thanks and Regards,
YiYang

Daniel Povey

Dec 4, 2019, 4:31:09 AM
to kaldi-help
Turns out I was mistaken; the egs are compressed later on in that code.
You'll just have to use less data or fewer perturbations. The number of archives does not make a difference.



kodamvenk...@gmail.com

Aug 14, 2020, 2:01:45 AM
to kaldi-help
Hi David,

I am also training a speaker verification model. I am at stage 11 (PLDA scoring), and I am getting an error at this stage, shown below.
I am running this command:

$run.pl exp/scores/log/test_scoring.log \
    ivector-plda-scoring --normalize-length=true \
    "ivector-copy-plda --smoothing=0.0 exp/xvector_nnet_1a/xvector_train/plda - |" \
    "ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "cat 'data/test/trials' | cut -d\  --fields=1,2 |" exp/scores_test 

Then I get an error like this:
run.pl: job failed, log is in exp/scores/log/test_scoring.log
My test_scoring.log is shown below (I am also attaching my file):

# ivector-plda-scoring --normalize-length=true "ivector-copy-plda --smoothing=0.0 exp/xvector_nnet_1a/xvector_train/plda - |" "ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" "ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" "cat 'data/test/tri' | cut -d\  --fields=1,2 |" exp/scores_test 
# Started at Fri Aug 14 11:22:19 IST 2020
#
ivector-plda-scoring --normalize-length=true 'ivector-copy-plda --smoothing=0.0 exp/xvector_nnet_1a/xvector_train/plda - |' 'ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |' 'ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |' 'cat '\''data/test/tri'\'' | cut -d\  --fields=1,2 |' exp/scores_test 
ivector-copy-plda --smoothing=0.0 exp/xvector_nnet_1a/xvector_train/plda - 
ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- 
transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- 
ivector-normalize-length ark:- ark:- 
LOG (ivector-subtract-global-mean[5.5.640~1487-04a0c]:main():ivector-subtract-global-mean.cc:108) Wrote 16 mean-subtracted iVectors
LOG (transform-vec[5.5.640~1487-04a0c]:main():transform-vec.cc:85) Applied transform to 16 vectors.
LOG (ivector-normalize-length[5.5.640~1487-04a0c]:main():ivector-normalize-length.cc:90) Processed 16 iVectors.
LOG (ivector-normalize-length[5.5.640~1487-04a0c]:main():ivector-normalize-length.cc:94) Average ratio of iVector to expected length was 2.26529, standard deviation was 0.246129
transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- 
ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- 
LOG (ivector-subtract-global-mean[5.5.640~1487-04a0c]:main():ivector-subtract-global-mean.cc:108) Wrote 16 mean-subtracted iVectors
LOG (transform-vec[5.5.640~1487-04a0c]:main():transform-vec.cc:85) Applied transform to 16 vectors.
ivector-normalize-length ark:- ark:- 
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:96) Reading train iVectors
LOG (ivector-normalize-length[5.5.640~1487-04a0c]:main():ivector-normalize-length.cc:90) Processed 16 iVectors.
LOG (ivector-normalize-length[5.5.640~1487-04a0c]:main():ivector-normalize-length.cc:94) Average ratio of iVector to expected length was 2.26529, standard deviation was 0.246129
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:122) Read 16 training iVectors, errors on 0
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:126) Average renormalization scale on training iVectors was 1.00594
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:129) Reading test iVectors
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:147) Read 16 test iVectors.
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:150) Average renormalization scale on test iVectors was 1.00594
WARNING (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:170) Key 111_call_1176 not present in training iVectors.
(the same warning is repeated 16 times in total, once per trial)
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:217) Processed 0 trials, 16 had errors.
# Accounting: time=0 threads=1
# Ended (code 1) at Fri Aug 14 11:22:19 IST 2020, elapsed time 0 seconds


And my trials file is (I have only one speaker for testing):

111_call_1176 111_call_1176-0000524-0000982 target
111_call_1176 111_call_1176-0001134-0001552 target
111_call_1176 111_call_1176-0006447-0007252 target
111_call_1176 111_call_1176-0010021-0010446 target
111_call_1176 111_call_1176-0010476-0010963 target
111_call_1176 111_call_1176-0012020-0012437 target
111_call_1176 111_call_1176-0021600-0022012 target
111_call_1176 111_call_1176-0025994-0026586 target
111_call_1176 111_call_1176-0026586-0026995 target
111_call_1176 111_call_1176-0028624-0029100 target
111_call_1176 111_call_1176-0031733-0032369 target
111_call_1176 111_call_1176-0033412-0033826 target
111_call_1176 111_call_1176-0034358-0034953 target
111_call_1176 111_call_1176-0036669-0037197 target
111_call_1176 111_call_1176-0037197-0037719 target
111_call_1176 111_call_1176-0037719-0038293 target
Can you please help me sort this out?

Thank you,
Venkat Sai

Ho Yin Chan

Aug 16, 2020, 6:38:20 AM
to kaldi-help
You didn't put the training-speaker x-vectors (spk_xvector.scp) into the PLDA scoring; you put scp:exp/xvector_nnet_1a/xvector_test/xvector.scp in both rspecifiers. The first rspecifier should point at the enrollment/training x-vectors instead.

Usage: ivector-plda-scoring <plda> <train-ivector-rspecifier> <test-ivector-rspecifier>

john harvey

Aug 16, 2020, 10:10:04 AM
to kaldi-help
I am trying to do speaker recognition on my own test file, with enrollment utterances for the 3 speakers that are in the audio file. Can you please clarify how the trials file is generated? I found a few threads here, and I was following the steps in one of them, but since I am trying the pre-trained voxceleb model on my test file (so I think I need to create the file without the third column), I was not sure about the first 2 columns. Is the file generated from the segmented spk2utt or utt2spk files?
Also, in the above ivector-plda-scoring command you mentioned scp:exp/xvector_nnet_1a/xvector_test/xvector.scp; where are the enrollment utterances used for comparison?
Can you please clarify?

Ho Yin Chan

Aug 16, 2020, 11:33:42 AM
to kaldi-help
You have to at least separate the enrollment data and test data during embedding extraction.