DNN Embeddings for Speaker Verification


Rongjin Li

Oct 28, 2017, 9:48:29 PM
to kaldi-help

Thanks a lot to David for his patient help over e-mail!


The x-vector recipe is amazing and helpful, and I want to study it and reference it in a future paper. However, I have run into some questions while adapting it for language recognition. My training dataset is AP17_OLR, which contains many short utterances. My main problems are at stage 4, where the nnet examples are created:


1. What is the relationship between the arguments --min-frames-per-chunk/--max-frames-per-chunk and min_len=500 (500 frames)? Can I skip stage 3 and run stage 4 directly?


2. I find it hard to follow the explanation of some of the arguments, such as chunk, archive and --num-repeats. Could you please give more detail on each of them? How should I set them sensibly for my training dataset?


3. In the 'StatisticsPoolingComponent', left-context=0 and right-context=10000 mean that we pool over an input segment starting at frame 0 and ending at frame 10000 or earlier, right? How did you decide on an end point of frame 10000?


4. I have tried some different values, but none of them worked. My error is 'ERROR (nnet3-compute-prob[5.2.124~1-70748]:CreateComputation():nnet-compile.cc:59) Not all outputs were computable, cannot create computation.' The following screenshot shows my argument settings:



I hope you can send me more information to help me understand your algorithm. BTW, if I extract embeddings by building a CNN instead of a TDNN in Kaldi, would that work better?


Thanks!

Rongjin Li

Oct 28, 2017, 9:56:22 PM
to kaldi-help
On Sunday, October 29, 2017 at 9:48:29 AM UTC+8, Rongjin Li wrote:

The following (the screenshot was lost) shows my argument settings:


if [[ $stage -le 1 && 1 -le $endstage ]]; then
  sid/nnet3/xvector/get_egs.sh --cmd "$train_cmd" \
    --nj 5 \
    --stage 0 \
    --frames-per-iter 4000000 \
    --frames-per-iter-diagnostic 100000 \
    --min-frames-per-chunk 10 \
    --max-frames-per-chunk 30 \
    --num-diagnostic-archives 3 \
    --num-repeats 2 \
    "$data" $egs_dir
fi

David Snyder

Oct 28, 2017, 10:31:15 PM
to kaldi-help

1. What is the relationship between the arguments --min-frames-per-chunk/--max-frames-per-chunk and min_len=500 (500 frames)? Can I skip stage 3 and run stage 4 directly?



Stage 3 prepares the features from which we will later extract the training examples. Part of this preparation is removing recordings that have too little speech, as defined by $min_len.  Later on, we form training examples by picking chunks whose lengths range from --min-frames-per-chunk to --max-frames-per-chunk. A recording can't be used if it's too short, so we remove it ahead of time.

So, you probably don't want to skip this step, but you might want to decrease min_len to something smaller; just make sure that it's at least as long as max-frames-per-chunk. If you want to pad your features with more silence, you could also try reducing --vad-proportion-threshold in https://github.com/kaldi-asr/kaldi/blob/master/egs/sre16/v2/conf/vad.conf (but you need to dig into this to make sure you understand what's happening...).
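As a concrete illustration of the filtering step described above, here is a hypothetical Python sketch. The "utt-id num-frames" layout matches Kaldi's utt2num_frames files, but the actual filtering is done by the data-prep shell scripts, so treat this only as a model of the logic:

```python
# Hypothetical sketch of the min_len filter applied in stage 3:
# drop any recording with fewer than min_len frames, so that every
# surviving recording can yield a chunk of up to max-frames-per-chunk.

def filter_short_utts(utt2num_frames_lines, min_len):
    """Keep only utterance IDs with at least min_len frames."""
    kept = []
    for line in utt2num_frames_lines:
        utt, n = line.split()
        if int(n) >= min_len:
            kept.append(utt)
    return kept

lines = ["utt1 120", "utt2 480", "utt3 600"]
print(filter_short_utts(lines, 500))  # only utt3 survives min_len=500
```

If you lower min_len, the same check explains the constraint above: min_len must stay >= max-frames-per-chunk, or a kept recording could still be too short to cut a maximum-length chunk from.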

2. I find it hard to follow the explanation of some of the arguments, such as chunk, archive and --num-repeats. Could you please give more detail on each of them? How should I set them sensibly for my training dataset?


--num-repeats is (approximately) the number of times that a class label (e.g., speaker or language) repeats per archive. Given that this is language ID, and you have a lot more recordings per language than we'd have recordings per speaker, you should set this to a large number, like --num-repeats=500 or more. 
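As a back-of-envelope sketch of why a large value matters here (this is my own rough arithmetic from the numbers in this thread, not the exact bookkeeping inside get_egs.sh): --frames-per-iter bounds the frames per archive, while num_classes * --num-repeats approximates the chunks actually allocated per archive.

```python
# Rough, illustrative arithmetic only -- get_egs.sh does the real
# allocation. With --frames-per-iter 4000000 and chunks averaging
# 20 frames, one archive could hold ~200,000 chunks; with 10 language
# classes, --num-repeats=2 allocates only a tiny fraction of that.

frames_per_iter = 4_000_000
avg_chunk = (10 + 30) // 2                 # midpoint of min/max frames-per-chunk
capacity = frames_per_iter // avg_chunk    # chunk capacity of one archive
num_classes = 10                           # e.g. 10 languages

for num_repeats in (2, 500):
    allocated = num_classes * num_repeats  # approx. chunks per archive
    print(num_repeats, allocated, capacity)
```

Under this reading, --num-repeats=2 yields archives with roughly 20 chunks against a capacity of 200,000, which is why a value of 500 or more is suggested for language ID.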

For the other parameters, you might need to play with this until you get a handle on how it works. Somewhere in your exp/ directory, where the examples are created, will be a set of files called ranges.*. These files tell you which recordings the speech chunks are getting extracted from, and what the corresponding language (or speaker) label is. To avoid wasting time, you could put an exit 1 command right after the ranges are created in get_egs.sh, and then inspect them yourself. 
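To make the inspection step above concrete, here is a small sketch of tallying the ranges.* files. The column layout assumed here (utt-id, local archive index, archive index, start frame, num frames, class label) is my reading of the x-vector egs scripts; check a real ranges file from your own Kaldi version, since the field order may differ:

```python
# Hypothetical sketch: count how often each class label appears in each
# archive, given ranges.* lines. The assumed column order is
#   utt-id  local-archive  archive  start-frame  num-frames  label
# and should be verified against your own ranges files.

from collections import Counter

def label_counts_per_archive(ranges_lines):
    counts = {}
    for line in ranges_lines:
        utt, _local, ark, _start, _nframes, label = line.split()
        counts.setdefault(ark, Counter())[label] += 1
    return counts

lines = [
    "utt1 0 1 0 25 lang3",
    "utt2 0 1 40 18 lang3",
    "utt3 0 2 10 30 lang7",
]
print(label_counts_per_archive(lines))
```

A tally like this lets you verify directly whether each label really repeats roughly --num-repeats times per archive.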

I'm realizing the example creation is a little confusing to use if you start adapting it to other applications, e.g., language ID. In the future, I'll probably add some methods to make it easier to automatically set these parameters based on some properties of your data... Unfortunately, for now you'll have to play around with it until you get a feel for how it works.

3. In the 'StatisticsPoolingComponent', left-context=0 and right-context=10000 mean that we pool over an input segment starting at frame 0 and ending at frame 10000 or earlier, right? How did you decide on an end point of frame 10000?


The statistics pooling layer requires some finite right context, and I use 10,000 as a stand-in for an infinite right context. If the input is, for example, 500 frames, it will compute the mean and stddev from those 500 frames as you'd expect. If the input is over 10,000 frames, it will compute several x-vectors and average those. If you want, you could add a few extra 0s to the right context, but then things like nnet3-info will be very slow.
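A minimal sketch of what the statistics pooling step computes over an input segment: the per-dimension mean and standard deviation of the frame vectors. This is a plain-Python illustration of the idea, not Kaldi's StatisticsPoolingComponent implementation:

```python
# Illustrative statistics pooling: given T frame vectors of dimension D,
# produce a 2D-dimensional summary (means then stddevs), independent of T.

import math

def stats_pool(frames):
    """frames: list of equal-length feature vectors -> (means, stddevs)."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n
                 for d in range(dim)]
    return means, [math.sqrt(v) for v in variances]

means, stds = stats_pool([[1.0, 2.0], [3.0, 4.0]])
print(means, stds)  # [2.0, 3.0] [1.0, 1.0]
```

Because the output size does not depend on the number of frames, any segment shorter than the 10,000-frame right context is pooled in one go, which is why 10,000 works as an effectively infinite bound for typical utterances.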

4. I have tried some different values, but none of them worked. My error is 'ERROR (nnet3-compute-prob[5.2.124~1-70748]:CreateComputation():nnet-compile.cc:59) Not all outputs were computable, cannot create computation.' The following screenshot shows my argument settings:

What is your neural network configuration? It's possible your smallest chunks are smaller than the minimum temporal context required by the TDNN layers. If you're using the nnet configuration here: https://github.com/kaldi-asr/kaldi/blob/master/egs/sre16/v1/local/nnet3/xvector/tuning/run_xvector_1a.sh then the required context is 15 frames or more. You'll either need to increase --min-frames-per-chunk to 15 or more, or remove some splicing in the TDNN.
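The minimum chunk size implied by the splicing can be checked with a quick calculation. The per-layer contexts below follow my reading of the run_xvector_1a.sh config (frame-level layers splicing {-2..2}, {-2,0,2}, {-3,0,3}, then two layers with no splicing); verify them against your own config before relying on the number:

```python
# Rough context check: each layer's left/right splice offsets accumulate,
# so the network needs left + right + 1 input frames to produce a single
# output frame. Contexts below are assumed from run_xvector_1a.sh.

def min_required_frames(layer_contexts):
    left = sum(-min(c) for c in layer_contexts)
    right = sum(max(c) for c in layer_contexts)
    return left + right + 1

contexts = [(-2, -1, 0, 1, 2), (-2, 0, 2), (-3, 0, 3), (0,), (0,)]
print(min_required_frames(contexts))  # 15
```

With --min-frames-per-chunk=10, some chunks fall below this 15-frame requirement, which matches the "Not all outputs were computable" error above.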

BTW, if I extract embeddings by building a CNN instead of a TDNN in Kaldi, would that work better?

I think it's plausible that CNN layers might help in language ID, but not necessarily in speaker ID. This is something I've wanted to try myself. BTW, we are using this architecture in the current NIST language recognition evaluation, and it appears to work very well.


Best,
David

Rongjin Li

Oct 31, 2017, 8:19:26 AM
to kaldi-help
Thanks, David. Our system works now, but its results are not very good; maybe some arguments are set unreasonably somewhere. Again, thanks for your support; I hope I can still get your kind help in the future.

Bests,
Rongjin

On Sunday, October 29, 2017 at 10:31:15 AM UTC+8, David Snyder wrote:

Daniel Povey

Oct 31, 2017, 12:20:42 PM
to kaldi-help
Unless you have a lot of data this won't work well.  I'm not sure how much, but I think 1000 hours or so is probably the minimum.


--
Go to http://kaldi-asr.org/forums.html to find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/4a3335e8-318d-4787-876b-d5335cb04ec1%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Rongjin Li

Nov 10, 2017, 10:48:32 PM
to kaldi-help
Thanks Dan and David. I have run into some questions again while optimizing the model for the language dataset. (>_<)

1. What is the definition of the 'iter' in 'frames per iter' in Kaldi? Also, I found two different explanations of 'frames per eg': one at line 177 of 'steps/libs/nnet3/train/frame_level_objf/common.py', the other at line 54 of 'steps/nnet3/train_raw_dnn.py'. Do you know why two different explanations exist?

2. We want to understand how the data is processed during training. What is the relationship among num_jobs, frames_per_eg, archive, chunk and minibatch?

3. Our train_loss and valid_loss tend to increase as the iterations progress in accuracy.output.report. They shouldn't increase, should they?

Our training set includes 10 languages, each with about 10 hours. The training set is not large enough, but we want to try anyway. The current EER is 18.19% (LDA + cosine scoring). Attached are some of our experiment scripts, e.g. run_xvector.sh and accuracy.output.report, for reference.

Many thanks in advance for your kind help.

Best,
Rongjin Li

On Wednesday, November 1, 2017 at 12:20:42 AM UTC+8, Dan Povey wrote:
Attachments: accuracy.output.report, run_xvector.sh

Daniel Povey

Nov 10, 2017, 11:14:09 PM
to kaldi-help
David can answer some of the other questions, but I'm going to fix "loss" to "objective" in the report file (it was badly named).
I'll also change the documentation for frames_per_eg, which was not very clear.  It will now read:

        frames_per_eg:
            The frames_per_eg, in the context of (non-chain) nnet3 training,
            is normally the number of output (supervised) frames in each training
            example.  However, the frames_per_eg argument to this function should
            only be set to that number (greater than zero) if you intend to
            train on a single frame of each example, on each minibatch.  If you
            provide this argument >0, then for each training job a different
            frame from the dumped example is selected to train on, based on
            the option --frame=n to nnet3-copy-egs.
            If you leave frames_per_eg at its default value (-1), then the
            entire sequence of frames is used for supervision.  This is suitable
            for RNN training, where it helps to amortize the cost of computing
            the activations for the frames of context needed for the recurrence.
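The selection behaviour described in the docstring can be sketched as follows. The modulo choice of frame per job is my illustration of "a different frame ... based on the option --frame=n", not the literal nnet3 implementation:

```python
# Illustrative sketch of the two frames_per_eg regimes described above:
# with frames_per_eg > 0, each training job supervises one frame of each
# dumped example, varying the chosen frame across jobs; with
# frames_per_eg = -1, the whole frame sequence is supervised (RNN-style).

def frames_to_train_on(example_frames, frames_per_eg, job_index):
    if frames_per_eg == -1:           # supervise the entire sequence
        return example_frames
    n = job_index % frames_per_eg     # pick one frame, cycling per job
    return [example_frames[n]]

seq = ["f0", "f1", "f2", "f3"]
print(frames_to_train_on(seq, 4, 6))   # job 6 -> frame 6 % 4 = 2
print(frames_to_train_on(seq, -1, 0))  # whole sequence
```

This also clarifies why the two code locations describe frames_per_eg differently: one describes the property of the dumped example, the other the argument controlling per-job frame selection.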
