DNN-based speaker embedding


Paul Lin

Sep 8, 2017, 8:00:00 AM
to kaldi-help
Hello everyone, 

I've recently become interested in DNN-based speaker verification.

I'd like to try a speaker embedding system like the one in the following research, using Kaldi.


Is there a related recipe or script, or do you have any suggestions?

Thank you very much, 


sincerely,

Paul Lin

David Snyder

Sep 8, 2017, 9:20:21 AM
to kaldi-help
Hi Paul,

We're actively working on adding something similar to this in Kaldi (based on http://www.danielpovey.com/files/2017_interspeech_embeddings.pdf). I'll put up a usable pull request on github (e.g., https://github.com/kaldi-asr/kaldi/pulls) in the next week or so. It should make its way into the master branch by the end of the month, I think.

Best,
David

Paul Lin

Sep 10, 2017, 9:40:58 PM
to kaldi-help
Hi David, 

That's cool, thank you very much : )



On Friday, September 8, 2017 at 9:20:21 PM UTC+8, David Snyder wrote:

David Snyder

Sep 24, 2017, 3:56:29 PM
to kaldi-help
There's a fairly stable PR for this here now: https://github.com/kaldi-asr/kaldi/pull/1896.

Paul Lin

Sep 26, 2017, 2:48:48 AM
to kaldi-help
Hi, David, 

OK, I will go through the scripts. Thank you for your contribution.

: )

best,
Paul Lin

On Monday, September 25, 2017 at 3:56:29 AM UTC+8, David Snyder wrote:

abhishek...@quantiphi.com

Sep 28, 2017, 5:48:47 AM
to kaldi-help
Hi David,

When will this implementation be merged into the main Kaldi git repo?

David Snyder

Sep 28, 2017, 10:05:25 AM
to kaldi-help
It hasn't been reviewed yet, but it's in Dan's queue. 

You're free to check it out in the meantime, though. Although some code might change, models generated from this PR will most likely still be interchangeable with whatever makes it into the master branch. The additions are mostly at the script level and in the code for generating training examples.

David Snyder

Oct 3, 2017, 6:11:32 PM
to kaldi-help
This has been merged into Kaldi master now: https://github.com/kaldi-asr/kaldi/pull/1896

paul89...@speech.cm.nctu.edu.tw

Oct 6, 2017, 4:03:00 AM
to kaldi-help
Hi, David,

Thank you for your contribution.

Now I want to use the WSJ database to do the same xvector training.

After going through v2/run.sh, I have two questions:

1. run.sh contains the line:

sre16_trials=data/sre16_eval_test/trials

What is the purpose of the "trials" file?

I'm still confused after reading the following in the manual:

Note: the 'trials-file' has lines of the form
<key1> <key2>
and the output will have the form
<key1> <key2> [<dot-product>]
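
So I'm guessing the trials file is just the list of pairs to compare, one trial per line, something like this (utterance IDs made up by me):

enroll_spk001 test_utt_00001
enroll_spk001 test_utt_00002

and in the SRE16 data each line also ends with a target/nontarget label, used later for scoring. Is that right?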


2. For simpler computation, could I use cosine distance instead of PLDA, via the command "ivector-compute-dot-products"?

But performance may degrade when I use the simple cosine distance, right?
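
If so, I imagine the command would look something like this (paths hypothetical; I would length-normalize the xvectors first with ivector-normalize-length so the dot products become cosine similarities, and cut the label column off the trials file since the binary only expects two keys per line):

ivector-compute-dot-products "cat data/sre16_eval_test/trials | cut -d' ' -f1,2 |" \
  scp:exp/xvectors_eval_enroll/xvector.scp \
  scp:exp/xvectors_eval_test/xvector.scp \
  exp/scores/cosine_scores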


Thank you again : )

sincerely,
Paul Lin



On Wednesday, October 4, 2017 at 6:11:32 AM UTC+8, David Snyder wrote:

paul89...@speech.cm.nctu.edu.tw

Oct 6, 2017, 4:34:58 AM
to kaldi-help
Sorry to bother you.


Thank you very much.


Sincerely, 
Paul Lin


David Snyder

Oct 6, 2017, 10:35:32 AM
to kaldi-help
Based on what you described (that you want to train on WSJ), I think you're better off starting with a traditional i-vector recipe, like egs/sre10/v1. It will be hard to use the DNN embeddings successfully if you don't have much data and aren't familiar with the standard speaker recognition corpora. I-vectors will probably also work poorly in this scenario, but they will be a little more forgiving, since most of the pipeline is unsupervised.

If you really want to use the DNN embeddings for this, we have a pretrained model at http://kaldi-asr.org/models.html. However, you'd have to downsample WSJ first, and there will be domain mismatch. This can be partially remedied by adapting the pretrained PLDA model to the WSJ data (there's a binary for that) and computing a new mean.vec from the same WSJ list.
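
A rough sketch of that second approach (paths hypothetical, and the adaptation binary is the one from my branch, so double-check names and options against the --help output):

# 1. Downsample WSJ to 8kHz to match the pretrained model, e.g. via sox in wav.scp:
#    sox wsj_utt.wav -r 8000 -t wav - |
# 2. Extract xvectors for the WSJ list, then compute a new global mean:
ivector-mean scp:exp/xvectors_wsj/xvector.scp exp/xvectors_wsj/mean.vec
# 3. Adapt the pretrained PLDA model toward the WSJ xvectors (which should be
#    mean-subtracted, transformed, and length-normalized the same way as at scoring time):
ivector-adapt-plda --within-covar-scale=0.75 --between-covar-scale=0.25 \
  exp/xvectors_sre/plda scp:exp/xvectors_wsj/xvector_processed.scp \
  exp/xvectors_wsj/plda_adapt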

paul89...@speech.cm.nctu.edu.tw

Oct 13, 2017, 6:16:49 AM
to kaldi-help
Hi, David,

Thank you very much for the detailed suggestions.
Could I ask two more questions after going through the script v2/run.sh?

(1).
I used the command "ivector-copy-plda" to convert the PLDA model from binary to text format, as follows:

<Plda>  [ -0.3278072 -0.7067469 -0.1050988 0.08867783 0.01797846 -0.0005074288 -0.02028046 ... ]
[
1.391678 0.05561691 -0.3256346 0.1296889 0.3704112 0.2042539 -0.1662901 0.006633 0.001512441 ...
-0.07409662 1.331819 -0.7165985 -0.696762 0.2092285 -0.3082701 0.07281973 0.0187183 ...
...
...
...
0.004158025 0.002392154 -0.0003522192 0.003662211 -0.01681598 -0.01174923 -0.02803592 ... ]
[ 20.41709 12.52674 9.670902 8.060157 7.153937 6.513659 5.529801 5.3912 4.52434 4.384726 ... ]
</Plda>

What do these values mean?
Are they the mean and the within-class / across-class covariance matrices, or something else?

(2).
One line in the script local/nnet3/xvector/run_xvector.sh:

stats-layer name=stats config=mean+stddev(0:1:1:${max_chunk_size})

The pooling window spans frame 0 to frame ${max_chunk_size}.
When the number of frames is less than ${max_chunk_size}, what is the output of this pooling stats-layer?
For example, if ${max_chunk_size} is 10 and there are only 5 frames,
how does it compute the mean and variance within a window whose length is only 5 instead of 10?

Thank you very much again : )

best,
Paul Lin




David Snyder

Oct 13, 2017, 10:04:55 AM
to kaldi-help
Hi Paul,

The quantities in the PLDA object are <mean> <transform_> <psi_>, where transform_ is a transformation that "makes the within-class covariance unit and diagonalizes the between-class covariance." The matrix psi_ holds "the between-class (diagonal) covariance elements, in decreasing order." I'm quoting https://github.com/kaldi-asr/kaldi/blob/master/src/ivector/plda.h and the corresponding .cc file. It shouldn't be too hard to compute the covariances before diagonalization, if that is what you want. I had to do something like this in a personal branch; take a look at lines 298-313 in https://github.com/david-ryan-snyder/kaldi/blob/domain-adaptation/src/ivector/plda.cc.
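
Concretely (writing T for transform_ and Psi for the diagonal matrix built from psi_): by construction T * Sigma_w * T^T = I and T * Sigma_b * T^T = Psi, so you can recover the original covariances as Sigma_w = T^-1 * T^-T and Sigma_b = T^-1 * Psi * T^-T. That's essentially what those lines in my branch are doing.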

The pooling window spans frame 0 to frame ${max_chunk_size}.
When the number of frames is less than ${max_chunk_size}, what is the output of this pooling stats-layer?
For example, if ${max_chunk_size} is 10 and there are only 5 frames,
how does it compute the mean and variance within a window whose length is only 5 instead of 10?

In the egs/sre16/v2 recipe, max_chunk_size=10000 frames, which is 100 seconds, so we will often encounter input segments that are shorter than this. Generally, it will still work fine. For example, with 100, 500, 1000, 5000, etc. frames, it will compute the mean and variance as you'd expect.

One small complication is the min_chunk_size, which is 25 in the recipe. This corresponds to the context needed by the TDNN, plus a few extra frames so we have something to compute the stats over. Currently, the script will not produce xvectors for features with fewer frames than the min_chunk_size, and will print a warning. We might add an option to nnet3-xvector-compute that pads the input if it is < min_chunk_size, but for now, this is something you'd have to take care of in advance of extracting xvectors.
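
For reference, extraction boils down to a call like this (simplified from the extract script in the PR; paths hypothetical):

nnet3-xvector-compute --use-gpu=no --min-chunk-size=25 --chunk-size=10000 \
  exp/xvector_nnet_1a/final.raw scp:data/test/feats.scp \
  ark:exp/xvectors_test/xvector.ark

Anything between min-chunk-size and chunk-size frames is handled as described above; anything shorter is skipped with a warning.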

Best,
David

paul89...@speech.cm.nctu.edu.tw

Oct 16, 2017, 7:49:50 AM
to kaldi-help
Hi, David,

That's such a clear explanation.
Thank you very much.

One last question, about "segment level" versus "frame level":

In the egs/sre16/v2 recipe, max_chunk_size=10000 frames, which is 100 seconds, so we will often encounter input segments that are shorter than this. Generally, it will still work fine. For example, with 100, 500, 1000, 5000, etc. frames, it will compute the mean and variance as you'd expect.

Following what you said, assume the "500 frames" case is chosen.

The stacked features are input to the x-vector DNN frame by frame.
When the number of frames input to the DNN is less than 500 (e.g., 100), what is the output of the pooling layer?

Is it correct that there is no output from the pooling layer if the number of frames input to the DNN is less than 500?
The output is computed when the frame count reaches 500, and the output is the mean and variance of those 500 frames, right?
So one speech chunk (of size 100, 500, etc.) will produce one corresponding embedding a and b, right?

Sorry for this basic question about DNN embeddings,
Thank you again,

Best,
Paul Lin

David Snyder

Oct 16, 2017, 10:36:16 AM
to kaldi-help
When the number of frames input to the DNN is less than 500 (e.g., 100), what is the output of the pooling layer?

The output is still the mean and standard deviation vectors concatenated together. If there are only 100 frames in your input segment, then those statistics are computed from 100 frame-level activations from the previous layer.

Is it correct that there is no output from the pooling layer if the number of frames input to the DNN is less than 500?

If the input segment is 500 frames, there's no output from the pooling layer until it reaches 500 frames. [At least, not in the xvector setup. Other Kaldi recipes use the same layer in a different way.]
 
The output is computed when the frame count reaches 500, and the output is the mean and variance of those 500 frames, right?
So one speech chunk (of size 100, 500, etc.) will produce one corresponding embedding a and b, right?

I think what you mean to say is correct. In the embedding DNN architecture, it's not until we reach the end of the input segment that the stats pooling layer outputs the mean and standard deviation. The way it's implemented, the stats layer has a context from 0 to T frames, where T is some large number. The binary nnet3-xvector-compute gets the output of the DNN at time t=0. However, to get the output of the stats layer, it needs to look as far right (temporally) as possible.
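
To make that concrete with the dimensions from the recipe: the last frame-level layer is 1500-dimensional, so for a 100-frame segment the stats layer averages 100 vectors of dimension 1500 and appends the per-dimension standard deviations, giving one 3000-dimensional output. A 500-frame segment yields a vector of exactly the same size, just computed over more frames.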

paul89...@speech.cm.nctu.edu.tw

Oct 17, 2017, 1:37:29 AM
to kaldi-help
I appreciate it, thanks.

This helps me a lot : )


best,
Paul Lin

On Monday, October 16, 2017 at 10:36:16 PM UTC+8, David Snyder wrote:

Linh Vu

Mar 22, 2018, 10:37:08 PM
to kaldi-help
David,

I would like to adapt both the pretrained DNN and the PLDA model to the VoxCeleb dataset. Is this possible with the current code? Do you foresee any issues this might involve? (I'm a very new Kaldi user.)

Thank you for the great work!

David Snyder

Mar 22, 2018, 10:52:07 PM
to kaldi-help
Hi Linh,

In theory it'll work, assuming you downsample VoxCeleb to 8kHz. You'll need to modify the pretrained DNN so that the correct number of speakers appears in the output layer. You can create a new nnet.config file (where the size of the output layer equals the number of speakers in VoxCeleb) and run nnet3-copy --nnet-config=new_nnet.config final.raw 0.raw. Then continue training the model, using 0.raw as a starting point.
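
A rough sketch of that flow (directory names hypothetical):

# new_nnet.config redefines the output layer so that its dimension matches
# the number of VoxCeleb training speakers
nnet3-copy --nnet-config=new_nnet.config exp/xvector_nnet_1a/final.raw \
  exp/xvector_voxceleb/0.raw
# then run the training script on the VoxCeleb egs, starting from this 0.raw
# instead of a randomly initialized model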

If your application is microphone speech, you might find it more effective to simply train a new xvector DNN from scratch on 16kHz VoxCeleb.

In either case, you'll probably want to augment your VoxCeleb data. This is almost always a good idea for xvector training. Take a look at the run.sh in egs/sre16/v2/ to see how the augmentation is done. Note that the augmentation datasets (MUSAN and a reverb set) are freely available on openslr.org.

Best,
David

David Snyder

Mar 22, 2018, 10:58:44 PM
to kaldi-help
Also, you might want to consider building an i-vector system trained on just the wideband VoxCeleb data (without augmentation). Since you only have about 1,000 speakers in this dataset, you might find that i-vectors work well enough. 