How to implement online x-vector extraction?


Bruce

Jan 19, 2018, 3:10:22 PM
to kaldi-help
[How to implement online x-vector extraction]

[Background]:
I want to follow a setup similar to the one by David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur (https://david-ryan-snyder.github.io/2017/10/04/model_sre16_v2.html) and extract an x-vector from a 10 s audio file, but I do not want to wait until I have the whole 10 s file. So I want to change the x-vector extraction setup into an online mode while keeping reasonable performance.

[What I have done]:
My current solution implements online MFCC extraction:
Step a (online part): do the online MFCC feature extraction.
Step b (offline part): do the VAD and CMN operations.
But the RunNnetComputation function (defined in http://kaldi-asr.org/doc/nnet3-xvector-compute_8cc_source.html) takes most of the computation time.

How can I go deeper and make RunNnetComputation itself operate online?
I have one tentative solution in mind:
Step 1 (online part): handle the audio frames in a streaming way and output the results of layer 5 (immediately before the statistics pooling stage).
Step 2 (offline part): after the audio stream is done, perform the statistics pooling and derive the embeddings (i.e., get the x-vectors).

Any suggestions or any reference?

Many thanks for your help.

Bruce.

Daniel Povey

Jan 19, 2018, 3:39:14 PM
to kaldi-help
There is already support for online i-vector extraction, and the difference in performance between i-vectors and x-vectors is not that huge.  Is there a reason you can't just use i-vectors?

It might be easiest to only evaluate the early layers as a neural network, and to just write custom code to implement the later layers.  It's a bit ugly though.

There is a mechanism for online decoding (search for 'looped'; it's used in ASR decoding), which can be used for incremental operation like this; however, it's not compatible with the way the statistics pooling component currently works.  It would be possible to rewrite the statistics pooling component to be recurrent, like an LSTM, and compatible with the looped computation.  However, this would require a certain amount of nnet3 coding work and script changes, and it would be very hard for you to do.  I'm doubtful that you really even need x-vectors; i-vectors might work fine.



--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/c2137ea7-6c54-4624-9ed8-f89a08d60857%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Bruce

Jan 22, 2018, 2:57:05 AM
to kaldi-help
Hi Dan,
Many thanks for your quick and kind help.
The reason to explore online x-vectors is to take advantage of their good performance on short durations in speaker verification tasks.
Thanks for pointing out the online decoding. It is very helpful.
I can get the output for each feature frame via the DecodableNnetSimpleLooped::GetOutputForFrame function.
After that, I got lost and did not know how to continue to derive the x-vector for an utterance.
At this point I use the mean vector (dim = output_dim) of the output (a frame_num-by-output_dim matrix), and the performance is much worse than with the original x-vector setup.
Can you or anybody help?

Many thanks.

Bruce

David Snyder

Jan 22, 2018, 12:36:20 PM
to kaldi-help
The reason to explore online x-vectors is to take advantage of their good performance on short durations in speaker verification tasks.

Have you been able to verify that this is true for your application? Hopefully you can perform some evaluation using the offline recipes before spending a lot of time modifying nnet3 code.

To get good performance out of x-vectors you need a lot of data (with speaker labels), and it's a good idea to supplement whatever data you have with augmentations that make sense for your domain. Depending on your application, you might see an improvement over i-vectors (e.g., see https://www.dropbox.com/s/d9y0ll65cjiw2r2/ICASSP18_xvectors.pdf?dl=0 for more recent results). It's a good idea to check that first, before you plunge into the code.

At this point I use the mean vector (dim = output_dim) of the output (a frame_num-by-output_dim matrix), and the performance is much worse than with the original x-vector setup.
Can you or anybody help?

In the current recipe, the statistics are [mean, stddev]. The x-vector is extracted from an affine layer on top of these stacked statistics. It sounds like you're using the mean vector as your embedding and omitting the stddev and the affine transform, so that's probably why your results are worse.
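The difference David describes can be sketched in plain, self-contained C++ (this is not Kaldi code; `StackedStats` and `Affine` are illustrative stand-ins, and any affine weights used with them would be placeholders rather than the trained values from the network):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stack the per-dimension mean and standard deviation over all frames.
// Using only the first half (the mean) discards the stddev statistics.
std::vector<float> StackedStats(const std::vector<std::vector<float>> &frames) {
  size_t dim = frames[0].size();
  float n = static_cast<float>(frames.size());
  std::vector<float> stats(2 * dim, 0.0f);
  // First pass: per-dimension mean.
  for (const auto &f : frames)
    for (size_t d = 0; d < dim; ++d) stats[d] += f[d] / n;
  // Second pass: per-dimension standard deviation around that mean.
  for (const auto &f : frames)
    for (size_t d = 0; d < dim; ++d) {
      float diff = f[d] - stats[d];
      stats[dim + d] += diff * diff / n;
    }
  for (size_t d = 0; d < dim; ++d) stats[dim + d] = std::sqrt(stats[dim + d]);
  return stats;  // [mean; stddev], length 2 * dim
}

// Hypothetical affine layer y = W * x + b standing in for the trained
// layer that produces the embedding; the real weights come from the
// network, not from code like this.
std::vector<float> Affine(const std::vector<std::vector<float>> &W,
                          const std::vector<float> &b,
                          const std::vector<float> &x) {
  std::vector<float> y(b);
  for (size_t i = 0; i < W.size(); ++i)
    for (size_t j = 0; j < x.size(); ++j) y[i] += W[i][j] * x[j];
  return y;
}
```

Taking only `stats[0..dim-1]` as the embedding drops both the stddev half and the trained affine transform, which is consistent with the degraded results reported above.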

Maybe an easier option for you is to compute an x-vector from chunks of speech every 3 seconds (or so) and average the x-vectors as you consume more input. I haven't thought about this in depth yet, but it sounds easier than what you're trying to do.

Daniel Povey

Jan 22, 2018, 1:42:50 PM
to kaldi-help
Hm.
It's not going to be particularly easy.  The method you describe isn't quite right.  However I don't want to spend time at the current moment helping you to do this because I'm not convinced you even need it.  Have you even verified that for your data, x-vector gives you better performance than i-vectors?
Dan



Liu Gang

Jan 22, 2018, 4:09:38 PM
to kaldi...@googlegroups.com
Hi David & Dan,

1) About the performance of i-vector vs. x-vector (EER):
i-vector:  9.8%   (based on the egs/sre10/v1 framework)
x-vector:  8.5%   (based on the egs/sre16/v2 framework)
x-vector: 15.6%   (my modification: mean version of DecodableNnetSimpleLooped::GetOutputForFrame)
My application scenario is fixed text with short duration, usually less than 3 s.
From these experiments, we do see a clear advantage for the x-vector.

2) About my code modification:
I tried to follow the example in nnet3-xvector-compute.cc:

  // Copy the features to a CuMatrix and feed them to the compiled computation.
  CuMatrix<BaseFloat> input_feats_cu(features);
  computer.AcceptInput("input", &input_feats_cu);
  computer.Run();
  // Fetch the "output" node; the x-vector is read from the first row.
  CuMatrix<BaseFloat> cu_output;
  computer.GetOutputDestructive("output", &cu_output);
  xvector->Resize(cu_output.NumCols());
  xvector->CopyFromVec(cu_output.Row(0));

My understanding is that the following call produces the x-vector (via something like some_function_of_affine_transformation([mean, stddev])):
 computer.GetOutputDestructive("output", &cu_output);

I definitely should read more of the code, but from my reading, DecodableNnetSimpleLooped::GetOutputForFrame produces something quite close to the output of GetOutputDestructive; I am just not quite sure how to close the gap.


An alternative solution I will try next is to compute an x-vector from chunks of speech every 1 or 2 seconds.

Thanks once again for the suggestions and analysis from all of you.

Bruce




--
Gang LIU 
Seattle, Washington
https://sites.google.com/site/GangLiuResearch



Daniel Povey

Jan 22, 2018, 4:13:52 PM
to kaldi-help
The problem with the way you are doing it is that the x-vector extraction neural network already does averaging within the network itself, and it will use as much data as is available.  Also, there are nonlinear operations after the averaging.  Basically this means that, without modification, the online-decoding methods aren't really applicable to x-vectors; they aren't expected to give the correct result.
Since the computation isn't very expensive, one possibility is, each second, to just feed all the data you have obtained so far into the x-vector computation.  It won't reuse pre-existing computation, but at least it will give the correct answer.

Dan
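Dan's short-term suggestion could be sketched like this (plain C++, not Kaldi code; `ExtractXvector` and `GrowingBufferExtractor` are hypothetical names, and `ExtractXvector` is a placeholder for the real nnet3 computation so the sketch compiles on its own):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Placeholder for the real nnet3 x-vector computation (the logic in
// nnet3-xvector-compute.cc); here it just returns the per-dimension
// mean of the frames so the sketch is self-contained.
std::vector<float> ExtractXvector(const std::vector<std::vector<float>> &feats) {
  if (feats.empty()) return {};
  std::vector<float> xvec(feats[0].size(), 0.0f);
  for (const auto &frame : feats)
    for (size_t d = 0; d < frame.size(); ++d)
      xvec[d] += frame[d] / feats.size();
  return xvec;
}

// Keep a growing feature buffer; every 'frames_per_update' frames
// (e.g. 100 frames = 1 second at a 10 ms shift), re-run the extractor
// on *all* frames received so far.  No computation is reused, but the
// result matches what the offline computation would produce.
class GrowingBufferExtractor {
 public:
  explicit GrowingBufferExtractor(size_t frames_per_update)
      : frames_per_update_(frames_per_update) {}

  // Returns true when a fresh x-vector has been produced.
  bool AcceptFrame(const std::vector<float> &frame) {
    buffer_.push_back(frame);
    if (buffer_.size() % frames_per_update_ != 0) return false;
    latest_ = ExtractXvector(buffer_);
    return true;
  }

  const std::vector<float> &Latest() const { return latest_; }

 private:
  size_t frames_per_update_;
  std::vector<std::vector<float>> buffer_;
  std::vector<float> latest_;
};
```

The cost grows with the utterance length, since each update reprocesses the whole buffer, but for short utterances (a few seconds) that is usually acceptable.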



David Snyder

Jan 22, 2018, 5:34:05 PM
to kaldi-help
An alternative solution I will try next is to compute an x-vector from chunks of speech every 1 or 2 seconds.

This should be very easy to test actually. 

When you call the script extract_xvectors.sh, pass in the option --chunk-size=200 (which means 2 seconds). The binary nnet3-xvector-compute will then extract embeddings from non-overlapping 200-frame chunks of features and average them to produce the final embedding. Just make sure you recompute the embeddings you use to train the backend (e.g., centering, LDA, PLDA, etc.).
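The chunk-and-average scheme can be sketched as follows (plain C++, not the actual nnet3-xvector-compute implementation; `ChunkXvector` is a stand-in for the real per-chunk network computation, and the real binary also handles a final partial chunk, which this sketch simply drops):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the per-chunk x-vector computation done by
// nnet3-xvector-compute; here it just returns the chunk's frame mean
// so the sketch is self-contained.
std::vector<float> ChunkXvector(const std::vector<std::vector<float>> &chunk) {
  std::vector<float> xvec(chunk[0].size(), 0.0f);
  for (const auto &f : chunk)
    for (size_t d = 0; d < f.size(); ++d) xvec[d] += f[d] / chunk.size();
  return xvec;
}

// Split the features into non-overlapping chunks of chunk_size frames
// (cf. --chunk-size=200, i.e. 2 seconds at a 10 ms frame shift) and
// average the per-chunk embeddings into one utterance-level embedding.
std::vector<float> AverageChunkXvectors(
    const std::vector<std::vector<float>> &feats, size_t chunk_size) {
  std::vector<float> avg;
  size_t num_chunks = 0;
  for (size_t start = 0; start + chunk_size <= feats.size();
       start += chunk_size) {
    std::vector<std::vector<float>> chunk(feats.begin() + start,
                                          feats.begin() + start + chunk_size);
    std::vector<float> xvec = ChunkXvector(chunk);
    if (avg.empty()) avg.assign(xvec.size(), 0.0f);
    for (size_t d = 0; d < xvec.size(); ++d) avg[d] += xvec[d];
    ++num_chunks;
  }
  for (float &v : avg) v /= num_chunks;
  return avg;
}
```

In an online setting the running average can be updated each time a new chunk completes, so a usable embedding is available after the first chunk and improves as more speech arrives.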

Liu Gang

Jan 30, 2018, 1:44:21 AM
to kaldi...@googlegroups.com
To Dan and David,

Many thanks for the previous suggestions and help (including the ICASSP 2018 paper).
Your suggestions helped improve the accuracy a lot.

I value every second you put into my case, so I wanted to try my best first; here comes my follow-up.
Working on smaller chunks, say 2 s, partially solves the problem, but not completely.
After reading your docs/responses, I can use nnet3-copy to play with nnet3 models.
To achieve my streaming goal, I split the original trained DNN (final.raw) into two parts/steps (something may be wrong, so I provide the details here):
#step1: only include the first 5 layers (named dnn_step1; will be used in a for-loop to handle streaming data):
nnet3-copy --prepare-for-test=true --nnet-config='echo output-node name=output input=tdnn5.batchnorm |' --edits='remove-orphans' final.raw dnn_step1

#step2: the rest of the layers, i.e., the statistics extraction/pooling layers + layer 6 (named dnn_step2):
nnet3-copy --binary=false final.raw dnn_step2.tmp
#Then remove the component-nodes of the first 5 layers and change the input node and input dim:
grep -v "component-node name=tdnn[1-5]" dnn_step2.tmp | sed -e "s|input-node name=input dim=23|input-node name=input dim=1500|; s|input=tdnn5.batchnorm|input=input|" > tmp.net
#Remove orphans:
nnet3-copy --edits='remove-orphans' tmp.net dnn_step2

In other words, I want to divide the DNN into two parts like the following:
######## dnn_step1

input-node name=input dim=23
component-node name=tdnn1.affine component=tdnn1.affine input=Append(Offset(input, -2), Offset(input, -1), input, Offset(input, 1), Offset(input, 2))
component-node name=tdnn1.relu component=tdnn1.relu input=tdnn1.affine
component-node name=tdnn1.batchnorm component=tdnn1.batchnorm input=tdnn1.relu
...
component-node name=tdnn5.affine component=tdnn5.affine input=tdnn4.batchnorm
component-node name=tdnn5.relu component=tdnn5.relu input=tdnn5.affine
component-node name=tdnn5.batchnorm component=tdnn5.batchnorm input=tdnn5.relu

########### (I want to split the dnn: final.raw from here) ##################

######## dnn_step2
component-node name=stats-extraction-0-10000 component=stats-extraction-0-10000 input=tdnn5.batchnorm
component-node name=stats-pooling-0-10000 component=stats-pooling-0-10000 input=stats-extraction-0-10000
component-node name=tdnn6.affine component=tdnn6.affine input=Round(stats-pooling-0-10000, 1)
...
component-node name=output.log-softmax component=output.log-softmax input=output.affine
output-node name=output input=output.log-softmax objective=linear


====

My plan is to rely on the RunNnetComputation function (from nnet3-xvector-compute.cc, mentioned above) to do the following:
step 1: in a for-loop, process the MFCC feature stream with dnn_step1
step 2: get the mean & stddev and derive the x-vector with dnn_step2


The No. 1 challenge here is that it complains:
 cindex output(0, 0, 0) is not computable for the following reason
Detailed info is attached:
kNotComputable.log.txt

Since I cannot make this work, I also tried the looped version (as mentioned in my earlier emails; thanks for Dan's suggestion).
This makes step 1 run successfully.
Then I continue to rely on RunNnetComputation to finish step 2.
Surprisingly, this approach (let's name it system4) gives different x-vectors, and therefore different results.
I had assumed we would get results very close to the non-split version (system2 or 3); see the following table.


To sum up, here is the performance (EER) of the different systems:
system1: i-vector:  9.8%   (based on the egs/sre10/v1 framework)
system2: x-vector:  8.5%   (based on the egs/sre16/v2 framework, max_chunk_size=5s)
system3: x-vector: 11.2%   (my modification: mean version of DecodableNnetSimpleLooped::GetOutputForFrame; previously EER=15.6%, where there was some mismatch between training & testing)
system4: x-vector: 14.2%   (looped step1 + step2)
system5: x-vector:  8.8%   (based on the egs/sre16/v2 framework but with a smaller chunk size, max_chunk_size=2s; thanks for David's suggestion)

Any suggestions? 
Thanks once again.

Bruce


Daniel Povey

Jan 30, 2018, 6:33:32 PM
to kaldi-help

It should be possible in principle to use our current 'looped' computation framework to do an online-decoding-compatible version of the stats-pooling layer that is recurrent.  David was going to mess with that code anyway for another reason; perhaps we can have him do that.  But that wouldn't be ready for a few weeks even if he does it.

In the short term, this might be easier:
The statistics-pooling layer basically sums up its input and then turns data of the form (count; mean stats; optionally diagonal 2nd-order stats) into data of the form (mean; optionally standard deviation).

You could easily implement that in code.  The first part of the network (before the statistics-pooling layer) would still be a Nnet object which you could use looped decoding for, and the rest of the network would also be a Nnet object (and you'd only evaluate it for frame zero, and give input for frame zero).

Dan
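As a rough illustration of the conversion Dan describes, here is the arithmetic in self-contained C++ (this is not Kaldi's StatisticsExtraction/StatisticsPooling components; the name `StatsAccumulator` is made up for the sketch):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Recurrent-style accumulator: update (count; sum; diagonal 2nd-order
// stats) frame by frame, then turn them into (mean; standard deviation)
// whenever an embedding is needed.  The [mean; stddev] vector is what
// would feed the second Nnet (the part after the stats-pooling layer,
// evaluated only at frame zero).
class StatsAccumulator {
 public:
  explicit StatsAccumulator(size_t dim)
      : count_(0.0), sum_(dim, 0.0), sumsq_(dim, 0.0) {}

  void AcceptFrame(const std::vector<double> &frame) {
    count_ += 1.0;
    for (size_t d = 0; d < frame.size(); ++d) {
      sum_[d] += frame[d];
      sumsq_[d] += frame[d] * frame[d];
    }
  }

  // Convert the accumulated statistics to stacked [mean; stddev].
  std::vector<double> MeanStddev() const {
    size_t dim = sum_.size();
    std::vector<double> out(2 * dim);
    for (size_t d = 0; d < dim; ++d) {
      double mean = sum_[d] / count_;
      double var = sumsq_[d] / count_ - mean * mean;
      out[d] = mean;
      out[dim + d] = std::sqrt(var > 0.0 ? var : 0.0);  // floor variance at 0
    }
    return out;
  }

 private:
  double count_;
  std::vector<double> sum_, sumsq_;
};
```

Because the accumulation is a running sum, it fits a streaming setup naturally: the first Nnet produces layer-5 frames online, this accumulator consumes them one at a time, and the second Nnet is evaluated once on the [mean; stddev] vector.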



Liu Gang

Mar 8, 2018, 8:35:26 PM
to kaldi...@googlegroups.com
Hi Dan,

Many thanks for your detailed instruction and help. Over the past weeks, I made it work by following your suggestions.

BTW,
I am also looking forward to learning more from David's implementation.

Thanks for the help from both you and David.

Best wishes,

Bruce



mili lali

Jul 24, 2019, 6:50:51 AM
to kaldi-help
Hi 
  But that wouldn't be ready for a few weeks even if he does it.
1- Is it now possible to extract online x-vectors? 
2- In your opinion, is it possible to use the online i-vector extractor (as used in ASR) in speaker verification systems? (Something like the TCP port in speech recognition.)


David Snyder

Jul 24, 2019, 9:56:08 AM
to kaldi-help
1- Is it now possible to extract online x-vectors? 

This isn't implemented in the master branch, as far as I know. I also wouldn't expect anyone else to get to this in the near future. But you're free to implement this yourself and make a pull request if you want to share it with the community. Dan outlined how this would be done.