SRE16 Xvector Model


Srikar Yekollu

Mar 28, 2018, 2:42:24 AM
to kaldi-help
Hello Everyone,
I am looking for some help building a speaker recognition component. Given a sample audio from a speaker and another multi-person audio file, I would like to identify segments in the latter where the sample speaker is talking. I would like to use Kaldi for speaker recognition. I don't have enough speakers to build the UBM model. My area of specialisation is primarily NLP, but I need to recognize the speaker in an audio file before I can get to my objective.
I have looked at Spear and SIDEKIT too, but the sparse documentation is making it hard for me to use them. I noticed that Kaldi has an "SRE16 Xvector Model" which seemed like a good option to get past my lack of data. Does anyone know if this can be used for my purpose and, if yes, how? Is there an example?

Thanks,
Srikar

David Snyder

Mar 28, 2018, 11:18:10 AM
to kaldi-help
Hi Srikar,

There is more free data out there than you might think (for training models). For starters, look at VoxCeleb and LibriSpeech. You can train a reasonable i-vector system (and even a DNN embedding system, such as x-vectors) using just the VoxCeleb dataset. Of course, you can also buy corpora from the LDC to augment the free resources.

All the Kaldi examples live in the egs directory in Kaldi: https://github.com/kaldi-asr/kaldi/tree/master/egs. For many examples, the only documentation that exists lives in those directories. If a knowledgeable user wants to see, for example, how the x-vector model is trained, used to extract embeddings, and used to compare speakers, they would typically follow the steps in https://github.com/kaldi-asr/kaldi/blob/master/egs/sre16/v2/run.sh. Unfortunately, if you're completely new to speech processing and Kaldi, this could be overwhelming. In that case, you might have to start with some general Kaldi tutorials (these only exist for ASR currently), e.g., http://kaldi-asr.org/doc/kaldi_for_dummies.html.

Your problem is more complicated than just speaker recognition. Since you have multiple speakers per recording, you'll (probably) want to first perform speaker diarization to split the recording into segments that belong to different speakers. There's a speaker diarization example in https://github.com/kaldi-asr/kaldi/tree/master/egs/callhome_diarization but it's currently only for i-vectors (generalizing it to work with the pretrained x-vector model isn't too difficult, though). Once the diarization is performed, you can then perform speaker recognition on each of the speakers. 

Of course, this is just scratching the surface of what it will take to solve this problem. If you start making a serious effort, feel free to ask more in-depth questions on the forum. Ultimately, I think this will require some experience to do a good job, though. Some of the members of this forum are willing to do paid consulting, and that could end up being your best option if this is for a commercial application.

Best,
David

Srikar Yekollu

Mar 28, 2018, 12:03:47 PM
to kaldi-help
Hi David,
 Thank you very much for the detailed response. 
1. Let me start with the Kaldi for Dummies tutorial and build an ASR model to get familiar with Kaldi.
2. I will then try to get the x-vectors for a given audio file using the pre-trained model (since this does not need any data).
3. Once that is taken care of, I will then build a diarization system using these vectors.
At this point, as you mentioned, I should be more informed about the right questions to ask. I would also be interested in a paid consultation at that point.

Thanks,
Srikar

David Snyder

Mar 28, 2018, 12:12:07 PM
to kaldi-help
If you get to step 3 and want to start applying the pretrained x-vector model to diarization, you might find this thread helpful: https://groups.google.com/d/msg/kaldi-help/ROtSHHe3Z_I/_zlJA-qjBQAJ

Srikar Yekollu

Mar 28, 2018, 12:13:52 PM
to kaldi...@googlegroups.com
Excellent, that is very thoughtful of you to fish this out for me. 
Thank You,
Srikar


Srikar Yekollu

Apr 4, 2018, 2:26:53 AM
to kaldi-help
Hi David,
I followed your advice and did the following:
1. Set up Kaldi and trained an ASR model using the LibriSpeech 100-hour dataset.
2. Also got the ASpIRE pre-trained model running.
3. Read through the scripts for the sre16/v2 model and modified the run.sh to compare two given audio files to tell me whether they are the same speaker. I did this by using one audio file for enroll and the other as test (for testing, I used the same audio for both enroll and test).
# Compare two audio files with the pretrained SRE16 x-vector model:
# prepare a one-utterance data dir, extract MFCCs/VAD/x-vectors, then PLDA-score.
. cmd_local.sh
. path.sh
set -e

tmp_dir_name=tmp_data
src=$1                          # input FLAC file
dst=data/$tmp_dir_name
mfccdir=`pwd`/mfcc
vaddir=`pwd`/mfcc
nnet_dir=exp/xvector_nnet_1a
utt_id="temp_utterance"

utt2spk=$dst/utt2spk; [[ -f "$utt2spk" ]] && rm $utt2spk
utt2dur=$dst/utt2dur; [[ -f "$utt2dur" ]] && rm $utt2dur

mkdir -p $dst || exit 1;

# All utterances are FLAC compressed.
if ! which flac >&/dev/null; then
  echo "Please install 'flac' on ALL worker nodes!"
  exit 1
fi

# Prepare a minimal Kaldi data directory (wav.scp, utt2spk, spk2utt).
wav_scp=$dst/wav.scp; [[ -f "$wav_scp" ]] && rm $wav_scp
echo "$utt_id flac -c -d -s $src |" > $wav_scp || exit 1

echo "$utt_id S0" >> $utt2spk || exit 1
spk2utt=$dst/spk2utt
utils/utt2spk_to_spk2utt.pl <$utt2spk >$spk2utt || exit 1
echo "$0: successfully prepared data in $dst"

# Make MFCC features.
steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 1 --cmd "$train_cmd" --validate-data-dir false \
  data/${tmp_dir_name} exp/make_mfcc $mfccdir

# Compute VAD decisions.
sid/compute_vad_decision.sh --nj 1 --cmd "$train_cmd" \
  data/${tmp_dir_name} exp/make_vad $vaddir

# Extract x-vectors with the pretrained network.
sid/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd" --nj 1 \
  $nnet_dir data/$tmp_dir_name \
  exp/xvectors_$tmp_dir_name

# Get results using the adapted PLDA model.
$train_cmd exp/scores/log/${tmp_dir_name}.log \
  ivector-plda-scoring --normalize-length=true \
    --num-utts=ark:exp/xvectors_${tmp_dir_name}/num_utts.ark \
    "ivector-copy-plda --smoothing=0.0 exp/xvectors_sre16_major/plda_adapt - |" \
    "ark:ivector-mean ark:data/${tmp_dir_name}/spk2utt scp:exp/xvectors_${tmp_dir_name}/xvector.scp ark:- | ivector-subtract-global-mean exp/xvectors_sre16_major/mean.vec ark:- ark:- | transform-vec exp/xvectors_sre_combined/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "ark:ivector-subtract-global-mean exp/xvectors_sre16_major/mean.vec scp:exp/xvectors_${tmp_dir_name}/xvector.scp ark:- | transform-vec exp/xvectors_sre_combined/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "cat '/tmp/trials.txt' | cut -d ' ' -f 1,2 |" exp/scores/${tmp_dir_name} || exit 1;

My question is: what is the trials file? I mean the file sre16_trials=data/sre16_eval_test/trials, which I replaced with /tmp/trials.txt. I assumed it was an output file; it turns out this is an input file.

Do you see anything wrong with my scripts? 

Thanks,
Srikar


Srikar Yekollu

Apr 4, 2018, 4:57:59 AM
to kaldi-help
Hi David, 
Sorry to bother you; I think I will find my answer about the trials file here:


Thanks,
Srikar

David Snyder

Apr 4, 2018, 12:21:48 PM
to kaldi-help
Yeah, there have been a few questions on the Kaldi forums about this. Evaluations like the NIST SREs call the file that defines the evaluation pairs a "trials" file.

$train_cmd exp/scores/log/${tmp_dir_name}.log \
ivector-plda-scoring --normalize-length=true \
--num-utts=ark:exp/xvectors_${tmp_dir_name}/num_utts.ark \
"ivector-copy-plda --smoothing=0.0 exp/xvectors_sre16_major/plda_adapt - |" \
"ark:ivector-mean ark:data/${tmp_dir_name}/spk2utt scp:exp/xvectors_${tmp_dir_name}/xvector.scp ark:- | ivector-subtract-global-mean exp/xvectors_sre16_major/mean.vec ark:- ark:- | transform-vec exp/xvectors_sre_combined/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
"ark:ivector-subtract-global-mean exp/xvectors_sre16_major/mean.vec scp:exp/xvectors_${tmp_dir_name}/xvector.scp ark:- | transform-vec exp/xvectors_sre_combined/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |"

You can just run it up to this point, and look at the output. You should see something like <enroll-reco-id> <test-reco-id> <plda-score>. 
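To make the two formats concrete, here is a minimal sketch (the IDs and the score value below are made up for illustration):

# What ivector-plda-scoring needs from the trials file: one "<enroll-id> <test-id>" pair per line
S0 temp_utterance
# The scores file it writes (exp/scores/tmp_data above): the same pair plus the PLDA score
S0 temp_utterance 12.345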

Srikar Yekollu

Apr 4, 2018, 1:03:07 PM
to kaldi...@googlegroups.com
Yes. I noticed some more documentation in the help text for the ivector-plda-scoring command.
Does my approach/understanding of the tools sound correct? flac -> mfcc -> ivector -> ivector-plda-scoring -> unnormalized score.

I now need to figure out how to normalize this score and apply a threshold. Any suggestions for this?

Thanks,
Srikar


David Snyder

Apr 4, 2018, 1:23:37 PM
to kaldi-help
flac -> mfcc -> ivector -> ivector-plda-scoring -> unnormalized score.

Sounds fine. You might need to downsample audio to 8kHz (since that's what the pretrained model was trained on). Anyway, I think the MFCC extraction will complain if the sampling rate isn't what it expects. 
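If you do need to downsample, one way is to do it on the fly in wav.scp with a sox pipe (a sketch, assuming sox is installed; the utterance ID and path are placeholders):

# wav.scp entry that decodes and resamples to 8 kHz on the fly:
temp_utterance sox /path/to/input.wav -t wav - rate 8000 |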

By the way, the pretrained model was trained primarily on conversational telephone speech. If your data is wideband microphone speech, you can probably train a model that will perform much better. Consider something like VoxCeleb. 

I now need to figure out how to normalize this score and apply a threshold. Any suggestions for this?

You don't necessarily need to normalize the scores, but you do need a threshold.

Normally you'll have a pile of labelled in-domain data. From that, you can construct a dev set. That will mean partitioning the data into an "enroll" set and a "test" set. You'll also need a "trials" file that defines which enrollment recordings are compared with which test recordings, and whether or not they should be considered the same speaker ("target") or different speakers ("nontarget"). Once you do that, you can determine the threshold that optimizes an error metric that is meaningful for your task. E.g., for some tasks equal error rate (EER) might be OK, in which case you can use the binary compute-eer to determine the error rate on your dataset and an appropriate threshold to achieve it. However, for many applications, false positives are much more costly than false negatives, so your threshold will need to reflect that.
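As a rough sketch of how that fits together (the file names and column positions follow the sre16 recipe's scoring step; adjust for your own layout):

# dev trials file: "<enroll-id> <test-id> target|nontarget", one trial per line
# dev scores file (from ivector-plda-scoring): "<enroll-id> <test-id> <score>"
# Pair the scores with the labels and compute the EER; the log line also reports
# the threshold at which the EER occurs:
paste data/dev/trials exp/scores/dev | awk '{print $6, $3}' | compute-eer -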

Srikar Yekollu

Apr 4, 2018, 1:29:58 PM
to kaldi...@googlegroups.com
Thank you, David.


Srikar Yekollu

Apr 9, 2018, 9:12:01 AM
to kaldi-help
Hi David, 
   Thanks for helping me get here. My question this time is around Diarization. Do I need to create a separate thread for that?

I read through,
and 

I think that (2) is focused on training an x-vector-based diarization system from scratch.
I am looking to use the pretrained SRE16 Xvector model to generate the x-vectors.

To reiterate: my goal is to start from a wav/flac audio file and get it split into segments where each segment has one speaker (ignoring overlapping speech for now).

So, after reading through the scripts, it sounds like the process is roughly as follows:
1. Generate the data in Kaldi format (wav.scp, spk2utt [not sure what this should look like])
2. Generate the mfcc features (steps/make_mfcc.sh)
3. Identify which segments have voice in them (sid/compute_vad_decision.sh)
4. local/nnet3/xvector/prepare_feats.sh
5. diarization/vad_to_segments.sh
6. Extract xvectors (diarization/extract_xvectors.sh)
7. Compute PLDA score based on the pretrained model (diarization/score_plda.sh)
8. Cluster both xvector and plda scores using some form of hierarchical clustering (diarization/cluster.sh)

Do you think this is accurate? Have I missed any other step here?

Thanks,
Srikar

David Snyder

Apr 9, 2018, 2:55:37 PM
to kaldi-help
Hi Srikar,

You should be able to take the pretrained x-vector DNN from here: http://kaldi-asr.org/models.html and use it in the run.sh from here:  https://github.com/david-ryan-snyder/kaldi/blob/xvector-diarization/egs/callhome_diarization/v2/run.sh.

You'll probably want to start at stage 8 (it will probably need further script-level modifications to work). Also, you may want to retrain the PLDA model that comes with the pretrained x-vector system. It will be better if the PLDA model is trained on some in-domain data and on short segments (e.g., 1-3s long).
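A rough sketch of what that wiring might look like (the directory names and stage number here are assumptions to check against the recipe, not a tested procedure):

# In egs/callhome_diarization/v2, after unpacking the pretrained SRE16 x-vector
# system (from http://kaldi-asr.org/models.html) so that final.raw, extract.config,
# the PLDA model, etc. live under exp/xvector_nnet_1a:
#   nnet_dir=exp/xvector_nnet_1a    # point the recipe at the pretrained DNN
#   stage=8                         # skip the DNN training stages
# then run the remaining stages of run.sh on a data directory prepared from your audio.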

All of your steps look reasonable except for step 3.

3. Identify which segments have voice in them (sid/compute_vad_decision.sh)

The callhome diarization recipe assumes that you have the oracle speech segments. Since you don't, you'll need to run a real speech activity detection (SAD) system first, to obtain speech/nonspeech segments. The script above is actually only used when creating the PLDA training list. 

I'll ask someone to point you to a SAD recipe.

Best,
David


Vimal Manohar

Apr 9, 2018, 3:36:12 PM
to kaldi-help
Hi,

See https://github.com/kaldi-asr/kaldi/blob/master/egs/swbd/s5c/local/run_asr_segmentation.sh for SAD. You can use the model trained using that to get speech/nonspeech segments.

Vimal

Srikar Yekollu

Apr 10, 2018, 12:01:12 AM
to kaldi...@googlegroups.com
Hi,
Vimal, is there a pre-trained model for SAD?
I do not have access to the Switchboard corpus. Is there a different set I can use? LibriSpeech/VoxCeleb?


David,
1. Can I not get away with using VAD instead of SAD? Especially since I know that my audio is composed of conversations and does not contain music.
2. I do not have domain specific data for PLDA. Is it a no-go to use the PLDA model which comes with the pre-trained x-vector model?
3. If I start from step 8, I do not have the MFCCs. Also, it looks like I will need more than the run.sh from https://github.com/david-ryan-snyder/kaldi/blob/xvector-diarization/egs/callhome_diarization/v2/run.sh
     There is a diarization/nnet3/xvector directory which needs more scripts. Just checking to make sure my understanding is correct.



To unsubscribe from this group and all its topics, send an email to kaldi-help+unsubscribe@googlegroups.com.

To post to this group, send email to kaldi...@googlegroups.com.

Srikar Yekollu

Apr 10, 2018, 3:16:29 AM
to kaldi-help
David,
  Are there any differences between the mfcc -> xvector -> plda_embedding extraction stages between,

and 


Thanks,
Srikar

Armin Oliya

Apr 10, 2018, 4:05:42 AM
to kaldi-help
Hi Srikar, 


>>1. Can I not get away with using VAD instead of SAD? Especially since I know that my audio is composed of conversations and does not contain music.

I suggest you start with VAD first and see if you're happy with the results; chances are you will be, if there aren't long non-speech parts or if you don't mind some non-speech included in a segment. Once you have the pipeline working with VAD, it's easy to add SAD if you need it.
You can use public datasets like VoxCeleb and LibriSpeech for SAD training, if needed.

2. I do not have domain specific data for PLDA. Is it a no-go to use the PLDA model which comes with the pre-trained x-vector model?
Again, try it first: use the pretrained PLDA and judge the quality. I didn't get good results on my non-English domain, so I had to train my own PLDA; the good thing is that you don't need a ton of data for this part (compared to what's needed for the x-vector/i-vector extractor; check other forum discussions for details).

3. If I start from step 8, I do not have the mfcc. Also, it looks like I will need more than the run.sh from https://github.com/david-ryan-snyder/kaldi/blob/xvector-diarization/egs/callhome_diarization/v2/run.sh
     There is a diarization/nnet3/xvector directory which needs more scripts. Just checking to make sure my understanding is correct.

Yes, you still need to extract MFCC features and other things like VAD. Read the code and comment out the steps which are specifically used for x-vector/PLDA training.
Don't just glance through it; try to understand what each part does, and then you'll be able to glue the right pieces together.

  Are there any differences between the mfcc -> xvector -> plda_embedding extraction stages between,

Not sure what you mean, but those recipes have a lot in common, and their MFCC/x-vector extraction is the same. There are differences in how the PLDA scoring is done, though, and you can find them by checking what each recipe uses (like ivector-adapt-plda vs ivector-plda-scoring-dense). For now I'd say get your diarization pipeline ready before doing speaker recognition (if you're following David's suggestion of diarization -> recognition).


Armin 


Srikar Yekollu

Apr 10, 2018, 4:16:44 AM
to kaldi-help
Thanks for the response, Armin.

I did already get the recognition pipeline working, which is why I am now trying to see if there is any difference between the PLDA scoring I used for recognition (where I think I need a score) vs. diarization, where I need the PLDA projections/embeddings. Do I understand this correctly?

Thanks,
Srikar

Srikar Yekollu

Apr 10, 2018, 12:34:09 PM
to kaldi-help
Also, what is the role of utterance_id and speaker_id in wav.scp and the other files? I intend to start with an audio file and expect to get a set of labelled segments indicating the speaker in each segment. If I need to process just one audio file, do I just create dummy speaker and utterance IDs to get the scripts to work (and modify the splitting logic to not do any splitting)?

Thanks,
Srikar

Daniel Povey

Apr 10, 2018, 3:07:38 PM
to kaldi-help
Also, what is the role of utterance_id and speaker_id in wav.scp and the other files? I intend to start with an audio file and expect to get a set of labelled segments indicating the speaker in each segment. If I need to process just one audio file, do I just create dummy speaker and utterance IDs to get the scripts to work (and modify the splitting logic to not do any splitting)?

Yes-- in general, prior to diarization, the speaker-id and utterance-id would be the same, IIRC, just reflecting the identity of the audio file you are splitting.  But David or Vimal may correct me.
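For a single input recording, that means the data directory can be as minimal as this sketch (the IDs and path are placeholders):

# data/mydata/wav.scp    -- <recording-id> <path or decode pipeline>
rec1 /path/to/audio.wav
# data/mydata/utt2spk    -- before diarization, utterance-id == speaker-id == recording-id
rec1 rec1
# data/mydata/spk2utt    -- generate with utils/utt2spk_to_spk2utt.pl
rec1 rec1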

 

Vimal Manohar

Apr 10, 2018, 7:37:46 PM
to kaldi...@googlegroups.com
On Tue, Apr 10, 2018 at 3:07 PM Daniel Povey <dpo...@gmail.com> wrote:
Also, what is the role of utterance_id and speaker_id in wav.scp and the other files? I intend to start with an audio file and expect to get a set of labelled segments indicating the speaker in each segment. If I need to process just one audio file, do I just create dummy speaker and utterance IDs to get the scripts to work (and modify the splitting logic to not do any splitting)?

Yes-- in general, prior to diarization, the speaker-id and utterance-id would be the same, IIRC, just reflecting the identity of the audio file you are splitting.  But David or Vimal may correct me.
Depending on how it's set up, the speaker-id might be the recording-id. David would know whether that is what's used in the diarization recipe.
You can modify run_asr_segmentation.sh to work with LibriSpeech. You might have to specify the sampling rate as 16000 to reverberate_data_dir.py.

--
Vimal Manohar
PhD Student
Electrical & Computer Engineering
Johns Hopkins University

Srikar Yekollu

Apr 10, 2018, 11:11:33 PM
to kaldi-help
Got it, thanks!

Srikar Yekollu

Apr 17, 2018, 5:20:10 AM
to kaldi-help
Hi David,
What is the difference between the xvectors_sre16_major and xvectors_sre_combined directories in the "SRE16 Xvector" model? Both have a mean vector. I am pretty sure I have misunderstood something in the run.sh of the sre16 model.

Thanks,
Srikar

David Snyder

Apr 17, 2018, 1:08:52 PM
to kaldi-help
Hi Srikar,

You can probably find this info in the run.sh.

The sre_combined is the combination of multiple NIST SREs (prior to 2016), with data augmentation.

The sre16_major is a dev set that was distributed with NIST SRE 2016.

If you're not using this for SRE16, you should probably use the mean.vec from xvectors_sre_combined. Even better would be to compute a mean.vec from your own data (use the binary ivector-mean). You won't need to retrain the LDA or PLDA model if you only change the mean.vec. 
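If you do compute your own mean.vec, it's a one-liner over the x-vectors extracted from your data (the directory name here is just a placeholder):

# Average your own extracted x-vectors to get a new centering vector:
ivector-mean scp:exp/xvectors_my_data/xvector.scp exp/xvectors_my_data/mean.vec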

Best,
David

Srikar Yekollu

Apr 21, 2018, 11:24:20 PM
to kaldi-help
Thank you everyone. I got decent performance without using the SAD module and using the models right out of the box. Thank you for all the help. 

Srikar Yekollu

Apr 24, 2018, 6:37:27 AM
to kaldi-help
I had a working setup on my laptop. I tried replicating the setup and running the exact same version on a different machine, and I run into the following error when extracting x-vectors using the pretrained SRE16 x-vector model.

Does anyone have an idea of what might cause this?

# Started at Tue Apr 24 16:04:11 IST 2018
#
nnet3-xvector-compute --use-gpu=no --min-chunk-size=25 --chunk-size=10000 'nnet3-copy --nnet-config=/home/srikar/workspace/kaldi/egs/callhome_diarization/v1/exp/xvector_nnet_1a/extract.config /home/srikar/workspace/kaldi/egs/callhome_diarization/v1/exp/xvector_nnet_1a/final.raw - |' scp:/home/srikar/tmp/diarization_michelle/exp/xvectors_tmp/subsegments_data/feats.scp ark,scp:/home/srikar/tmp/diarization_michelle/exp/xvectors_tmp/xvector.ark,/home/srikar/tmp/diarization_michelle/exp/xvectors_tmp/xvector.scp 
LOG (nnet3-xvector-compute[5.4.76~1-72739]:SelectGpuId():cu-device.cc:123) Manually selected to compute on CPU.
nnet3-copy --nnet-config=/home/srikar/workspace/kaldi/egs/callhome_diarization/v1/exp/xvector_nnet_1a/extract.config /home/srikar/workspace/kaldi/egs/callhome_diarization/v1/exp/xvector_nnet_1a/final.raw - 
ERROR (nnet3-copy[5.4.76~1-72739]:ExpectToken():io-funcs.cc:212) Expected token "<BiasParams>", got instead "asParams>".

Srikar Yekollu

Apr 24, 2018, 12:41:03 PM
to kaldi-help
Never mind, my mistake. I added the binary model files to Git without the right settings, and I think that corrupted them, which led to these issues.

Srikar Yekollu

May 1, 2018, 1:09:51 AM
to kaldi-help
Hi David and Armin,
Thanks for helping me out so far. While I got better performance than with my previous system, I noticed that there are still quite a few errors in the diarization despite clear audio. I am trying to understand where the next set of improvements might come from.
1. You mentioned that training the PLDA model will help with my dataset specifically. Do you know how much data is needed for this?
2. Are there any parameters that I can tune to improve, e.g. target_energy / window / overlap / MFCC extraction?
3. What is the difference between the plda_adapt in xvectors_sre16_major and the plda model in xvectors_sre_combined (pretrained x-vector models)?
4. You mentioned that I could also get paid consultation/help. Do you know who can help me with this and how much it would cost?

Thanks,
Srikar

David Snyder

May 1, 2018, 4:44:55 PM
to kaldi-help
1. You mentioned that training the plda model will help with my dataset in specific. Do you know how much data is needed for this?

More is better. I'd aim for at least 1,000 speakers, with several recordings per speaker. 

I suggest obtaining VoxCeleb1 and VoxCeleb2 (http://www.robots.ox.ac.uk/~vgg/data/voxceleb/ and http://www.robots.ox.ac.uk/~vgg/data/voxceleb2/). In total, that will give you over 7,000 speakers with plenty of recordings per speaker. This data will be a better match for your application (which is wideband mic, right?). Also, you'll have enough data to train a new x-vector DNN from scratch (see egs/voxceleb/v2 for some help with that), with wideband features.

2. Are there any parameters that I can tune to improve. e.g. target_energy/window/overlap/ mfcc extraction?

The biggest impact will be from tuning the agglomerative clustering stopping threshold. You could also try increasing the target-energy option (try something like 0.95 and decrease from there). There's nothing you can do about MFCC extraction without retraining the x-vector DNN.
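As a sketch of where those knobs live (option names as I recall them from the callhome_diarization v2 scripts; the values and directory names are placeholders to tune and verify against the recipe):

# PLDA scoring of the sliding-window x-vectors; --target-energy controls how much
# PCA variance is kept per recording:
#   diarization/score_plda.sh --cmd "$train_cmd" --nj 1 --target-energy 0.95 \
#     exp/xvectors_plda_train exp/xvectors_my_data exp/xvectors_my_data/plda_scores
# Agglomerative clustering; --threshold is the stopping criterion, swept on held-out data:
#   diarization/cluster.sh --cmd "$train_cmd" --nj 1 --threshold 0.0 \
#     exp/xvectors_my_data/plda_scores exp/xvectors_my_data/plda_scores_clustered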

3. What is the difference between the plda_adapt in xvectors_sre16_major and plda model in xvectors_sre_combined  (pretrained xvector models)
 
This adaptation is specific to the SRE16 recipe. The SRE16 eval consists of Cantonese and Tagalog speech, but most of our training data is English, so the PLDA model in xvectors_sre_combined is trained mostly on English. We adapted it to a small pile of Cantonese and Tagalog data, to get the adapted PLDA model. The adapted model trained on Cantonese and Tagalog is probably not going to be helpful for you, unless they are your target language. 

4. You mentioned that I could also get paid consultation/help. Do you know who can help me with this and how much it would cost? 

I'll let you know if I think of something. Someone might contact you if they're interested.

Srikar Yekollu

May 2, 2018, 3:34:36 AM
to kaldi-help
Hi David,
 Thanks for the response. Please find my responses inline.


On Wednesday, May 2, 2018 at 2:14:55 AM UTC+5:30, David Snyder wrote:
1. You mentioned that training the plda model will help with my dataset in specific. Do you know how much data is needed for this?

More is better. I'd aim for at least 1,000 speakers, with several recordings per speaker. 

I suggest obtaining VoxCeleb1 and VoxCeleb2 (http://www.robots.ox.ac.uk/~vgg/data/voxceleb/ and http://www.robots.ox.ac.uk/~vgg/data/voxceleb2/). In total, that will give you over 7,000 speakers with plenty of recordings per speaker. This data will be a better match for your application (which is wideband mic, right?). Also, you'll have enough data to train a new x-vector DNN from scratch (see egs/voxceleb/v2 for some help with that), with wideband features.
- My audio is not wideband mic; it is telephony at 8000 Hz. I think Switchboard (I'd need to pay for this) or LibriSpeech might be the relevant ones, correct?


2. Are there any parameters that I can tune to improve. e.g. target_energy/window/overlap/ mfcc extraction?

The biggest impact will be from tuning the agglomerative clustering stopping threshold. You could also try increasing the target-energy option (try something like 0.95 and decrease from there). There's nothing you can do about MFCC extraction without retraining the x-vector DNN.

target-energy is the total variance captured by the PCA, right? With the MFCC features, I was concerned about generating them in a way that is inconsistent with how they were generated for the x-vector DNN model. Does using the same MFCC config assure me of consistency here?
 

3. What is the difference between the plda_adapt in xvectors_sre16_major and plda model in xvectors_sre_combined  (pretrained xvector models)
 
This adaptation is specific to the SRE16 recipe. The SRE16 eval consists of Cantonese and Tagalog speech, but most of our training data is English, so the PLDA model in xvectors_sre_combined is trained mostly on English. We adapted it to a small pile of Cantonese and Tagalog data, to get the adapted PLDA model. The adapted model trained on Cantonese and Tagalog is probably not going to be helpful for you, unless they are your target language. 

Got it. 

4. You mentioned that I could also get paid consultation/help. Do you know who can help me with this and how much it would cost? 

I'll let you know if I think of something. Someone might contact you if they're interested.

Thank you.