Kaldi Speaker Identification

Yi Yang

Oct 21, 2019, 5:43:15 AM
to kaldi-help
Hi All,

I am looking for the most accurate speaker identification method/script in Kaldi.

For your info, I have trained a DNN model on my own data, with speaker IDs, following the Kaldi WSJ recipe.

Now I am looking for how to identify the speaker ID of an input wave file (a new wave file that is not in the training data) with high accuracy.

I would very much appreciate your advice on the above.

Thank you.
YiYang

David Snyder

Oct 21, 2019, 10:56:29 AM
to kaldi-help
The best speaker recognition systems are based on DNN embeddings. In Kaldi these are called x-vectors. Look at egs/voxceleb/v2 or egs/sitw/v2 for wideband recipes (i.e., microphone) and egs/sre16/v2/ for a telephone based recipe. We also have a few pretrained models for these recipes available online.
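
As a rough illustration of what comparing embeddings means: the recipes above score pairs of x-vectors with PLDA, but even plain cosine similarity conveys the idea. This is only a sketch; the vectors below are invented.

```python
# Comparing two fixed-dimensional speaker embeddings (such as x-vectors)
# with cosine similarity. This is a simpler stand-in for the PLDA scoring
# used in the Kaldi recipes; the vectors here are invented for illustration.
import math

def cosine_score(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

enroll = [0.2, -0.5, 0.8, 0.1]     # embedding of the enrolled speaker
test = [0.25, -0.45, 0.75, 0.05]   # embedding of the test utterance

print(round(cosine_score(enroll, test), 3))  # score near 1.0 -> likely the same speaker
```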

Yi Yang

Oct 23, 2019, 6:27:36 AM
to kaldi-help
Hi David,

In the recipes mentioned above, what is the PLDA model used for, and what is the PLDA scoring?

And is there any example of using the pretrained models for these recipes to do speaker recognition?

Thank you and best regards,
YiYang

David Snyder

Oct 23, 2019, 10:09:38 AM
to kaldi-help
In the recipes mentioned above, what is the PLDA model used for, and what is the PLDA scoring?

Google "probabilistic linear discriminant analysis" and "speaker recognition" and you should find plenty of information online. 

And is there any example of using the pretrained models for these recipes to do speaker recognition?

The pretrained models are generated from existing recipes. For example, the voxceleb model was generated by https://github.com/kaldi-asr/kaldi/blob/master/egs/voxceleb/v2/run.sh. If you follow the steps in this recipe, you should get an idea of how it was trained and how to use it.

Yi Yang

Nov 8, 2019, 5:54:42 AM
to kaldi-help
Hi David,

I am working on the voxceleb recipe and have changed the train and test data to my own data.

But I encounter an error at stage 11. Is the scoring stage the part where the speaker recognition is done?

And what is the "trials" file downloaded by local/make_voxceleb1_v2.pl used for?

Thank you and regards,
YiYang

David Snyder

Nov 8, 2019, 8:39:13 AM
to kaldi-help
But I encounter an error at stage 11. Is the scoring stage the part where the speaker recognition is done?

Scoring is where you compare pairs of embeddings and determine how likely they are to have been generated by the same speaker versus different speakers. The comparison results in a score, which can be converted into a speaker recognition decision. E.g., if the score is sufficiently high, we'll say the two embeddings belong to the same speaker (this is a "target" trial), otherwise they're from different speakers (a "nontarget" trial). 
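
A minimal sketch of that decision step; the scores and the threshold below are invented, and in practice the threshold is tuned on a held-out development set.

```python
# Turning trial scores into target/nontarget decisions by thresholding.
# Scores and the threshold are invented for illustration; real systems
# tune the threshold on held-out data.
scores = {
    ("spk-id-A", "utt-id-A"): -10.9,
    ("spk-id-A", "utt-id-B"): -61.1,
    ("spk-id-B", "utt-id-B"): 5.8,
}
threshold = -20.0

decisions = {
    trial: ("target" if score >= threshold else "nontarget")
    for trial, score in scores.items()
}
for (model, utt), decision in decisions.items():
    print(model, utt, decision)
```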

And what is the "trials" file downloaded by local/make_voxceleb1_v2.pl used for?

The trials file defines the comparisons we are going to perform. In the Kaldi scripts, we also provide a column that says what the outcome of the comparison should be (either "target" or "nontarget"), but I believe that last column is only used when computing error rates, not by the scoring binaries themselves.

The trials file might look something like this:

spk-id-A utt-id-A target
spk-id-A utt-id-B nontarget
spk-id-A utt-id-C nontarget
spk-id-B utt-id-A nontarget
spk-id-B utt-id-B target

In the first line, it says we want to compare a speaker model spk-id-A with an utterance utt-id-A, and that comparison should result in a high score, since it is a target trial. The output of the scoring should be a file that looks something like this:

spk-id-A utt-id-A -10.87238
spk-id-A utt-id-B -61.12823
spk-id-A utt-id-C -80.87298
spk-id-B utt-id-A -47.72377
spk-id-B utt-id-B 5.81908

The last column is the PLDA score, which is the log likelihood ratio between the embeddings with the IDs given in the first and second column belonging to the same speaker, versus belonging to different speakers. 
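
As a toy illustration of a log likelihood ratio (not Kaldi's actual PLDA model): score a 1-D observation under a "same speaker" Gaussian versus a "different speakers" Gaussian. All parameters below are invented.

```python
# Toy log likelihood ratio: two 1-D Gaussians stand in for PLDA's
# "same speaker" and "different speakers" hypotheses. All parameters
# are invented; real PLDA scores high-dimensional embeddings.
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def llr(x):
    # log p(x | same speaker) - log p(x | different speakers)
    return log_gauss(x, mean=1.0, var=0.5) - log_gauss(x, mean=-1.0, var=0.5)

print(llr(0.9) > 0)    # positive score: evidence for "same speaker"
print(llr(-1.2) < 0)   # negative score: evidence for "different speakers"
```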

Yi Yang

Nov 12, 2019, 5:18:04 AM
to kaldi-help
Hi David,

I have tried creating the trials file and succeeded in running the scoring.

The results I get are as below:
cso001_VL180810115318108_001 cso001_VL180810124444108_001 -0.05909699
cso001_VL180810115318108_001 cso002_VL180810120047200_001 -1.734944
cso001_VL180810115318108_001 cso001_VL180810124444108_003 -2.204333
cso001_VL180810115318108_001 cso003_VL180810123942162_001 -0.2915734
cso001_VL180810115318108_001 cso001_VL180810124444108_006 -0.1330951
cso001_VL180810115318108_001 cust001_VL180810115318108_001 -2.969074
cso001_VL180810115318108_001 cso001_VL180810124444108_009 -0.9908003

cust002_VL180810120047200_007 cso001_VL180810124444108_001 -0.07748856
cust002_VL180810120047200_007 cso002_VL180810120047200_001 -2.998007
cust002_VL180810120047200_007 cso001_VL180810124444108_003 -4.153507
cust002_VL180810120047200_007 cso003_VL180810123942162_001 -0.4131091
cust002_VL180810120047200_007 cso001_VL180810124444108_006 -2.56669
cust002_VL180810120047200_007 cust001_VL180810115318108_001 -2.118264
cust002_VL180810120047200_007 cso001_VL180810124444108_009 -4.937249

From the scores, I can't get a higher score for speaker ID "cust002" against utterances from speaker ID "cust002".

And it shows the EER is 40%. For the EER percentage, does a lower percentage mean a more accurate model?
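
For reference, the EER is the error rate at the operating point where the false-accept and false-reject rates are equal, so a lower EER does mean a more accurate system. A minimal sketch of computing it from scored trials, with invented scores and labels:

```python
# A minimal equal error rate (EER) computation from scored trials,
# along the lines of what Kaldi's compute_eer utilities do.
# The scores and labels below are invented for illustration.
def compute_eer(scores, labels):
    """EER: the point where the false-accept rate (FAR) equals the
    false-reject rate (FRR), found by sweeping a threshold over every
    observed score. labels are True for target trials."""
    n_tgt = sum(labels)
    n_non = len(labels) - n_tgt
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(scores):
        far = sum(1 for s, t in zip(scores, labels) if not t and s >= thr) / n_non
        frr = sum(1 for s, t in zip(scores, labels) if t and s < thr) / n_tgt
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Three target and three nontarget trials, with one error of each kind.
scores = [2.1, 1.5, -0.3, -1.0, 0.4, -2.2]
labels = [True, True, True, False, False, False]
print(round(compute_eer(scores, labels), 3))  # -> 0.333
```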

Yi Yang

Nov 19, 2019, 5:12:11 AM
to kaldi-help
Hi David,

I know that the lower the EER value, the higher the accuracy of the model.

And referring to the voxceleb recipe, I have trained it with my own train and test data.

I also did not include the "RIRS_noise" and "musan" data in my training.

My train data is about 17+ hours, and currently I get an EER of around 40% with it.

What can I do to improve the EER? Will just adding more training data do?

Thank you; I would much appreciate your kind advice.

Thanks and regards,
YiYang


David Snyder

Nov 20, 2019, 9:33:06 PM
to kaldi-help
Yes, you need a lot more than 17h of speech to train the x-vector DNN. For example, a commonly used dataset is Voxceleb, which has 2,000 hours of training data. Also, for speaker ID, it's really important to have diversity of training speakers. We usually expect several thousand training speakers. There's no minimum training size I can point to, but I would be skeptical of training an x-vector DNN on fewer than 1000 speakers with less than 500 hours of speech. Maybe an i-vector system would work better with that amount of data.

I suggest adding some publicly available datasets. Voxceleb is a good one, but it's primarily (only?) English. Still, a lot of out-of-domain data is better than a small amount of in-domain data. I've heard that a Chinese version of Voxceleb exists, but I don't know much about it. Finally, you can look for datasets on the LDC, but those cost money.

Jan Trmal

Nov 20, 2019, 11:52:03 PM
to kaldi-help
There is also CN-Celeb, available now (since last week) on openslr.org, if that helps.
y.

--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/b36d390e-2bf6-428d-97bd-659478de3d76%40googlegroups.com.

Yi Yang

Nov 28, 2019, 3:38:28 AM
to kaldi-help
Hi David,

Thank you for your advice. 

Before trying to add the publicly available datasets to my own train data, I have tried running the "Voxceleb" recipe again on my system, and currently I encounter the error shown below:

sid/nnet3/xvector/get_egs.sh: Shuffling order of archives on disk
bash: line 1: 35764 Killed                  ( nnet3-shuffle-egs --srand=45 ark:./exp/xvector_nnet_1a/egs/egs_temp.45.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.45.ark,./exp/xvector_nnet_1a/egs/egs.45.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.45.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.45.log
bash: line 1: 35816 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=49 ark:./exp/xvector_nnet_1a/egs/egs_temp.49.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.49.ark,./exp/xvector_nnet_1a/egs/egs.49.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.49.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.49.log
bash: line 1: 35702 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=40 ark:./exp/xvector_nnet_1a/egs/egs_temp.40.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.40.ark,./exp/xvector_nnet_1a/egs/egs.40.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.40.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.40.log
...
bash: line 1: 36233 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=81 ark:./exp/xvector_nnet_1a/egs/egs_temp.81.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.81.ark,./exp/xvector_nnet_1a/egs/egs.81.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.81.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.81.log
bash: line 1: 36246 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=82 ark:./exp/xvector_nnet_1a/egs/egs_temp.82.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.82.ark,./exp/xvector_nnet_1a/egs/egs.82.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.82.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.82.log
bash: line 1: 36260 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=83 ark:./exp/xvector_nnet_1a/egs/egs_temp.83.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.83.ark,./exp/xvector_nnet_1a/egs/egs.83.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.83.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.83.log
bash: line 1: 36274 Aborted                 (core dumped) ( nnet3-shuffle-egs --srand=84 ark:./exp/xvector_nnet_1a/egs/egs_temp.84.ark ark,scp:./exp/xvector_nnet_1a/egs/egs.84.ark,./exp/xvector_nnet_1a/egs/egs.84.scp ) 2>> ./exp/xvector_nnet_1a/egs/log/shuffle.84.log >> ./exp/xvector_nnet_1a/egs/log/shuffle.84.log
run.pl: 39 / 84 failed, log is in ./exp/xvector_nnet_1a/egs/log/shuffle.*.log

At first I encountered an error caused by not enough memory on my system, so I reduced the value of $nj and ran the run.sh script again.

Then I encountered the error shown in the log above; it might be because of not enough disk space on my system.

Currently my system has around 800GB of disk space, but how much disk space does the "Voxceleb" recipe need?

Regards,
YiYang

Yi Yang

Nov 28, 2019, 3:42:37 AM
to kaldi-help
Hi Yenda,

Thank you for your help. Currently my aim is an English dataset; maybe in the future I will include a Chinese dataset.

Thank you and appreciate your help.

Regards,
YiYang

Daniel Povey

Nov 28, 2019, 6:37:46 AM
to kaldi-help
There was a recent conversation on this list about that, with Bar Madar I think. I pointed out a code-level issue which makes the egs 4x larger than they need to be. You might be able to fix it if you are good at C++.


Yi Yang

Dec 4, 2019, 4:28:19 AM
to kaldi-help
Hi Dan,

Is the conversation you mean in this discussion: https://groups.google.com/d/msg/kaldi-help/Duqa5XEAJek/Pm2tybddAAAJ

Is that the only way I can solve it, or can I try to reduce the number of archives?

Thanks and Regards,
YiYang

Daniel Povey

Dec 4, 2019, 4:31:09 AM
to kaldi-help
Turns out I was mistaken; the egs are compressed later on in that code.
You'll just have to use less data or fewer perturbations. The number of archives does not make a difference.



kodamvenk...@gmail.com

Aug 14, 2020, 2:01:45 AM
to kaldi-help
Hi David,

I am also training a speaker verification model. I am at stage 11 (PLDA scoring), and I am getting an error at this stage, shown below.
I am running this command:

$run.pl exp/scores/log/test_scoring.log \
    ivector-plda-scoring --normalize-length=true \
    "ivector-copy-plda --smoothing=0.0 exp/xvector_nnet_1a/xvector_train/plda - |" \
    "ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "cat 'data/test/trials' | cut -d\  --fields=1,2 |" exp/scores_test 

Then I get an error like this:
run.pl: job failed, log is in exp/scores/log/test_scoring.log
My test_scoring.log is shown below (I am also attaching my file):

# ivector-plda-scoring --normalize-length=true "ivector-copy-plda --smoothing=0.0 exp/xvector_nnet_1a/xvector_train/plda - |" "ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" "ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" "cat 'data/test/tri' | cut -d\  --fields=1,2 |" exp/scores_test 
# Started at Fri Aug 14 11:22:19 IST 2020
#
ivector-plda-scoring --normalize-length=true 'ivector-copy-plda --smoothing=0.0 exp/xvector_nnet_1a/xvector_train/plda - |' 'ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |' 'ark:ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- | transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |' 'cat '\''data/test/tri'\'' | cut -d\  --fields=1,2 |' exp/scores_test 
ivector-copy-plda --smoothing=0.0 exp/xvector_nnet_1a/xvector_train/plda - 
ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- 
transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- 
ivector-normalize-length ark:- ark:- 
LOG (ivector-subtract-global-mean[5.5.640~1487-04a0c]:main():ivector-subtract-global-mean.cc:108) Wrote 16 mean-subtracted iVectors
LOG (transform-vec[5.5.640~1487-04a0c]:main():transform-vec.cc:85) Applied transform to 16 vectors.
LOG (ivector-normalize-length[5.5.640~1487-04a0c]:main():ivector-normalize-length.cc:90) Processed 16 iVectors.
LOG (ivector-normalize-length[5.5.640~1487-04a0c]:main():ivector-normalize-length.cc:94) Average ratio of iVector to expected length was 2.26529, standard deviation was 0.246129
transform-vec exp/xvector_nnet_1a/xvector_train/transform.mat ark:- ark:- 
ivector-subtract-global-mean exp/xvector_nnet_1a/xvector_train/mean.vec scp:exp/xvector_nnet_1a/xvector_test/xvector.scp ark:- 
LOG (ivector-subtract-global-mean[5.5.640~1487-04a0c]:main():ivector-subtract-global-mean.cc:108) Wrote 16 mean-subtracted iVectors
LOG (transform-vec[5.5.640~1487-04a0c]:main():transform-vec.cc:85) Applied transform to 16 vectors.
ivector-normalize-length ark:- ark:- 
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:96) Reading train iVectors
LOG (ivector-normalize-length[5.5.640~1487-04a0c]:main():ivector-normalize-length.cc:90) Processed 16 iVectors.
LOG (ivector-normalize-length[5.5.640~1487-04a0c]:main():ivector-normalize-length.cc:94) Average ratio of iVector to expected length was 2.26529, standard deviation was 0.246129
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:122) Read 16 training iVectors, errors on 0
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:126) Average renormalization scale on training iVectors was 1.00594
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:129) Reading test iVectors
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:147) Read 16 test iVectors.
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:150) Average renormalization scale on test iVectors was 1.00594
WARNING (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:170) Key 111_call_1176 not present in training iVectors.
(the same warning is repeated 16 times in total, once per trial)
LOG (ivector-plda-scoring[5.5.640~1487-04a0c]:main():ivector-plda-scoring.cc:217) Processed 0 trials, 16 had errors.
# Accounting: time=0 threads=1
# Ended (code 1) at Fri Aug 14 11:22:19 IST 2020, elapsed time 0 seconds


And my trials file is (I have only one speaker for testing):

111_call_1176 111_call_1176-0000524-0000982 target
111_call_1176 111_call_1176-0001134-0001552 target
111_call_1176 111_call_1176-0006447-0007252 target
111_call_1176 111_call_1176-0010021-0010446 target
111_call_1176 111_call_1176-0010476-0010963 target
111_call_1176 111_call_1176-0012020-0012437 target
111_call_1176 111_call_1176-0021600-0022012 target
111_call_1176 111_call_1176-0025994-0026586 target
111_call_1176 111_call_1176-0026586-0026995 target
111_call_1176 111_call_1176-0028624-0029100 target
111_call_1176 111_call_1176-0031733-0032369 target
111_call_1176 111_call_1176-0033412-0033826 target
111_call_1176 111_call_1176-0034358-0034953 target
111_call_1176 111_call_1176-0036669-0037197 target
111_call_1176 111_call_1176-0037197-0037719 target
111_call_1176 111_call_1176-0037719-0038293 target
Can you please help me sort this out?

Thank you,
Venkat Sai

Ho Yin Chan

Aug 16, 2020, 6:38:20 AM
to kaldi-help
You didn't put the training-speaker x-vectors (spk_xvector.scp) into the PLDA scoring; you put scp:exp/xvector_nnet_1a/xvector_test/xvector.scp in both rspecifiers. The first rspecifier should point at the enrollment/training x-vectors instead.

Usage: ivector-plda-scoring <plda> <train-ivector-rspecifier> <test-ivector-rspecifier>

john harvey

Aug 16, 2020, 10:10:04 AM
to kaldi-help
I am trying to do speaker recognition on my own test file, with enrollment utterances for the 3 speakers that are in the audio file. Can you please clarify how the trials file is generated? I found a few threads here, and I was following the steps in one of them, but since I am trying the pre-trained voxceleb model on my test file (so I think I need to create the file without the third column), I was not sure about the first 2 columns. Is the file generated from the segmented spk2utt or utt2spk files?
Also, in the above ivector-plda-scoring command you mentioned scp:exp/xvector_nnet_1a/xvector_test/xvector.scp; where are the enrollment utterances used for comparison?
Can you please clarify?

Ho Yin Chan

Aug 16, 2020, 11:33:42 AM
to kaldi-help
You have to at least separate the enrollment data and test data during embedding extraction.