Xvector Recipe


Arkadi Gurevich

Mar 11, 2018, 1:36:52 PM
to kaldi-help
Hey,
I've been using Kaldi for some time now.
I want to try the "SRE16 Xvector Model".
I downloaded it, but I don't quite understand how to integrate it into my setup to do recognition.
I am currently using the ASpIRE chain model.
I would be happy if someone could guide me.

David Snyder

Mar 11, 2018, 2:00:51 PM
to kaldi-help
This is intended for speaker recognition (as a standalone application), not speech recognition. In principle, you could use the embeddings (called xvectors) in place of ivectors for adaptation in an ASR acoustic model. However, it wasn't designed for this purpose, and if you wanted to do that, you'd need to write some extra *.cc and *.sh code to make it possible.

In short, this system is probably not relevant for you, unless you're working on speaker recognition. 

Arkadi Gurevich

Mar 11, 2018, 2:04:21 PM
to kaldi-help
Thank you David

Arkadi Gurevich

Jun 4, 2018, 3:45:49 AM
to kaldi-help
Hi David,
I started working on speaker recognition using xvectors. I read the paper "X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION" and went through the script (in sre16/v2), but I'm not sure I understand how the method works.

From what I understood, I need to provide a file named trials.
In that file, for each speaker and utt_id, I specify whether the pair is a "target" or an "impostor". What do "target" and "impostor" mean?

Suppose I have n speakers
1, 2, ..., n and I want to identify which one is arkadi (that's me).
I also want to use an existing model (the same downloadable model) rather than train a new one.
How do I do it?

Thanks in advance,
Arkadi

On Sunday, March 11, 2018 at 11:00:51 AM UTC-7, David Snyder wrote:

David Snyder

Jun 4, 2018, 9:29:50 AM
to kaldi-help
The "trials" file is usually part of the evaluation data. It specifies which enrolled speakers are compared with which test utterances, along with whether they are "target" trials or "nontarget" trials. E.g.,

spk-id-A utt-id-A target
spk-id-A utt-id-B nontarget
spk-id-A utt-id-C nontarget
spk-id-B utt-id-A nontarget
spk-id-B utt-id-B target
.
.
.

It's not something you need to train the system, but rather to compute an error rate (e.g., EER or minDCF) on the corresponding evaluation dataset.
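To make the role of the target / nontarget labels concrete, here is a minimal sketch of how an equal error rate could be computed from (score, label) pairs. It is a plain-Python illustration, not Kaldi's own implementation; the function name compute_eer and the brute-force threshold sweep are assumptions made for this example.

```python
def compute_eer(scored_trials):
    """Approximate the equal error rate for a list of (score, label) pairs.

    label is "target" or "nontarget"; a higher score is assumed to mean
    "more likely the same speaker".
    """
    targets = [s for s, label in scored_trials if label == "target"]
    nontargets = [s for s, label in scored_trials if label == "nontarget"]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one nontarget trial")

    best_gap, eer = None, None
    # Try every observed score as the accept threshold (accept if score >= threshold).
    for threshold in sorted(set(targets + nontargets)):
        miss_rate = sum(s < threshold for s in targets) / len(targets)
        false_alarm_rate = sum(s >= threshold for s in nontargets) / len(nontargets)
        gap = abs(miss_rate - false_alarm_rate)
        # Keep the threshold where the two error rates are closest to each other.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss_rate + false_alarm_rate) / 2
    return eer


# Made-up scores, only to show the call.
separable = [(2.0, "target"), (1.0, "target"), (-1.0, "nontarget"), (-2.0, "nontarget")]
overlapping = [(1.0, "target"), (-1.0, "target"), (0.0, "nontarget"), (-2.0, "nontarget")]
print(compute_eer(separable), compute_eer(overlapping))
```

The sweep is quadratic in the number of trials, which is fine for a sanity check but probably too slow for a full evaluation set.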

If you have some test recording called "utt-id" (for example), and you don't know the identity, you can create something like the "trials" file, but without the 3rd column. E.g.,

spk-id-A utt-id
spk-id-B utt-id
spk-id-C utt-id
spk-id-D utt-id
spk-id-E utt-id
.
.

A file like this will be one of the arguments to ivector-plda-scoring (see how it's used here: https://github.com/kaldi-asr/kaldi/blob/master/egs/sre16/v2/run.sh#L291).
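A two-column file like the one above could be generated with a few lines of Python. This is only a sketch: the helper name make_identification_trials and the speaker list are hypothetical, and the one assumption about the format is that IDs contain no whitespace.

```python
def make_identification_trials(enrolled_speakers, utt_id):
    """Return one "spk-id utt-id" line per enrolled speaker for a single test utterance."""
    lines = []
    for spk in enrolled_speakers:
        # Fields are whitespace-separated, so an ID with spaces would corrupt the file.
        if len(spk.split()) != 1 or len(utt_id.split()) != 1:
            raise ValueError("IDs must be non-empty and contain no whitespace")
        lines.append(f"{spk} {utt_id}")
    return lines


lines = make_identification_trials(["spk-id-A", "spk-id-B", "spk-id-C"], "utt-id")
print("\n".join(lines))
```

Writing the joined lines to a file gives something that can stand in for the trials argument when the identity of the test recording is unknown.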

Also, be aware that ivector-plda-scoring just provides log likelihood ratios, not binary same-speaker or different-speaker decisions. If you want binary decisions (and it's an open set problem), you'll still need to decide on a threshold (e.g., a log likelihood ratio above this score means "same-speaker", otherwise it's "different-speaker"). This will probably involve creating some kind of evaluation set using your own data.
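As a rough sketch of the thresholding step, the snippet below picks the best-scoring enrolled speaker for each utterance and rejects it if the score falls below a threshold. It assumes the scores file has three whitespace-separated columns (speaker, utterance, score); the function names, the scores, and the threshold of 0.0 are all made up for illustration, and a real threshold would have to be tuned on your own evaluation data.

```python
def parse_scores(lines):
    """Parse "spk-id utt-id score" lines into (spk, utt, score) tuples."""
    scores = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got: {line!r}")
        spk, utt, score = fields
        scores.append((spk, utt, float(score)))
    return scores


def identify(scores, threshold):
    """Map each utterance to its best-scoring speaker, or None if below the threshold."""
    best = {}
    for spk, utt, score in scores:
        if utt not in best or score > best[utt][1]:
            best[utt] = (spk, score)
    return {utt: (spk if score >= threshold else None)
            for utt, (spk, score) in best.items()}


# Made-up log likelihood ratios, only to show the call.
example = [
    "spk-id-A utt-1 -12.5",
    "spk-id-B utt-1 8.3",
    "spk-id-A utt-2 -20.1",
    "spk-id-B utt-2 -3.4",
]
decisions = identify(parse_scores(example), threshold=0.0)
print(decisions)
```

Returning None for utt-2 is what makes this an open-set decision: no enrolled speaker scored high enough, so the utterance is treated as coming from someone unknown.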

Also, search the forums for the word "trials." You'll find that questions have been asked about it before.

huangz...@gmail.com

Jun 4, 2018, 11:45:09 AM
to kaldi-help
Hi David, I am trying to reproduce your result with PyTorch, but I ran into a problem. It seems that your TDNN has a padding operation. Suppose you select frames -2, 0, 2 for the first frame in the utterance; how do you deal with the -2 frame? I mean, what is your padding strategy? Thanks!

On Monday, June 4, 2018 at 9:29:50 PM UTC+8, David Snyder wrote:

David Snyder

Jun 4, 2018, 11:59:34 AM
to kaldi-help
Since the x-vector DNN is built in the nnet3 library, it does whatever nnet3 does by default to handle padding. 

My guess is that it pads by copying the first or last frames as many times as is needed. To me this makes sense, but I haven't checked to see if that's actually what is happening. 

huangz...@gmail.com

Jun 4, 2018, 10:58:42 PM
to kaldi-help
Thanks for your rapid reply! Instead of padding in every layer, it seems that all the padding is done in the first layer. That's why I didn't get exactly the same output.
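The difference can be sketched in plain Python: pad the input once by the total context of all layers, then splice in each layer without any further padding. Edge replication is only the guess made earlier in this thread, not verified nnet3 behavior, and the layer offsets below are hypothetical rather than the actual x-vector configuration. The affine transforms and nonlinearities between layers are left out so that only the frame bookkeeping is shown.

```python
def pad_edges(frames, left, right):
    """Repeat the first frame `left` times and the last frame `right` times."""
    if not frames:
        raise ValueError("need at least one frame")
    return [frames[0]] * left + list(frames) + [frames[-1]] * right


def splice(frames, offsets):
    """Concatenate frames[t + offset] for every t where all offsets are in range.

    Assumes min(offsets) <= 0 <= max(offsets). No padding is done here,
    so the output is shorter than the input.
    """
    left, right = -min(offsets), max(offsets)
    spliced = []
    for t in range(left, len(frames) - right):
        row = []
        for offset in offsets:
            row.extend(frames[t + offset])
        spliced.append(row)
    return spliced


def total_context(layer_offsets):
    """Sum the left and right context needed over all layers."""
    left = sum(-min(offsets) for offsets in layer_offsets)
    right = sum(max(offsets) for offsets in layer_offsets)
    return left, right


frames = [[0.0], [1.0], [2.0], [3.0]]   # 4 frames of dimension 1
layers = [(-2, 0, 2), (-1, 0, 1)]       # hypothetical splicing offsets
left, right = total_context(layers)
padded = pad_edges(frames, left, right)  # pad once, at the input only
h1 = splice(padded, layers[0])
h2 = splice(h1, layers[1])
print(len(frames), len(h2))
```

With this scheme the output has the same number of frames as the unpadded input. Padding inside every layer instead would replicate hidden activations rather than input frames, which is one way the outputs near the utterance edges could end up different.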

On Monday, June 4, 2018 at 11:59:34 PM UTC+8, David Snyder wrote:

Kiran Karra

Jan 7, 2019, 12:51:18 PM
to kaldi-help
Hi David,

Is there a reason why ivector-plda-scoring is set up to take a 2-column file (i.e., you cannot include the target/nontarget information in the file that goes into ivector-plda-scoring)? Later on in the run.sh script (for sre16/v1/run.sh and sre16/v2/run.sh), columns 3 and 6 are extracted from the output of the paste command, and that wouldn't work unless you have target/nontarget information in the trials file. I wanted to make sure I wasn't missing anything else subtle.

Thanks,

David Snyder

Jan 7, 2019, 1:09:42 PM
to kaldi-help
Look more carefully at this line: https://github.com/kaldi-asr/kaldi/blob/master/egs/sre16/v2/run.sh#L303

It's pasting the trials file, which contains the target / nontarget information together with the scores from the ivector-plda-scoring. Then, we use awk to extract the columns corresponding to the scores, followed by the target / nontarget information, which is all that is required to compute various error metrics.
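The same paste-then-extract step can be sketched in Python. This is not the recipe's code: the function name join_scores_with_labels is made up, and it assumes that both files list the trials in the same order and that each scores line has the form "spk-id utt-id score", which is consistent with columns 3 and 6 being picked out of the pasted lines.

```python
def join_scores_with_labels(trials_lines, score_lines):
    """Pair each trial's label with its score, like paste followed by column extraction."""
    if len(trials_lines) != len(score_lines):
        raise ValueError("trials and scores must have the same number of lines")
    pairs = []
    for trial, scored in zip(trials_lines, score_lines):
        trial_spk, trial_utt, label = trial.split()
        score_spk, score_utt, score = scored.split()
        # A side-by-side paste silently assumes matching order; check it explicitly here.
        if (trial_spk, trial_utt) != (score_spk, score_utt):
            raise ValueError(f"line mismatch: {trial!r} vs {scored!r}")
        pairs.append((float(score), label))
    return pairs


trials = ["spk-id-A utt-id-A target", "spk-id-A utt-id-B nontarget"]
scores = ["spk-id-A utt-id-A 15.2", "spk-id-A utt-id-B -30.7"]  # made-up scores
pairs = join_scores_with_labels(trials, scores)
print(pairs)
```

The resulting (score, label) pairs are exactly the input that an EER or minDCF computation needs.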

Kiran Karra

Jan 10, 2019, 8:20:45 AM
to kaldi-help
I see my mistake, thanks. I was hitting a different error, but it was masquerading as this one. I see that you do indeed pass only 2 fields to ivector-plda-scoring, using the cut command a few lines above.