Speaker diarization using online ivectors


Danijel Korzinek

Aug 31, 2016, 6:18:43 AM
to kaldi-help
Could the existing setup for online ivector computation be used to perform speaker diarization?

I found something similar mentioned in this paper: http://hltcoe.jhu.edu/uploads/publications/papers/17309_slides.pdf

Provided I get the clustering working, would this work on the online ivectors produced by Kaldi?

Also, you mentioned before that you intend to do diarization "right". Can you quickly elaborate on what that would entail?

Thanks,
Danijel

David Snyder

Aug 31, 2016, 3:08:43 PM
to kaldi-help, Matthew Maciejewski
Hi Danijel,

I think so, yes. A few months ago Matthew (cc'd) and I were working on something like that. You can find some partial work at https://github.com/mmaciej2/kaldi/tree/kaldi-diarization . I'm providing this to give you some hints on how to proceed with your idea, but it isn't mature enough to work out of the box. The setup assumes you have some i-vector extractor, trained from a recipe like egs/sre08/v1. It also assumes that you've already run some VAD to get speech segments. The scripts egs/sre08/v1/sid/extract_ivectors_dense.sh and egs/sre08/v1/sid/cluster_ivectors.sh are used to extract and cluster the i-vectors. Looking at the binaries ivector-extract-dense.cc and ivector-cluster.cc in that branch should be helpful for you, but again, they are not complete.
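
For reference, the prerequisites look roughly like this in the sre08/v1 recipe (directory names are placeholders and the options are from memory, so check the recipe's run.sh):

# Sketch: MFCCs plus energy-based VAD over whole recordings, as assumed by
# the branch above. All paths are illustrative.
steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 8 --cmd "$train_cmd" \
  data/dev exp/make_mfcc mfcc
sid/compute_vad_decision.sh --nj 8 --cmd "$train_cmd" \
  data/dev exp/make_vad vad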

In the near future, we plan on adding a full-fledged speaker diarization recipe, but it might not be based on the above work. 

Best,
David

Danijel Korzinek

Sep 1, 2016, 1:49:53 AM
to kaldi-help, mmac...@jhu.edu
That's excellent! I'll be keeping an eye out for that.

I will also conduct a few experiments of my own and report back if I get some results. We've used SHoUT and LIUM before, but honestly the performance on real-world data was less than impressive. I hope I can get something better working with this method.

Danijel

abhishek...@quantiphi.com

Jun 12, 2017, 6:24:47 AM
to kaldi-help, mmac...@jhu.edu
Hi,

Is the speaker diarization setup usable now?

David Snyder

Jun 12, 2017, 11:41:54 AM
to kaldi-help, mmac...@jhu.edu, Vimal Manohar
Short answer is "no." AFAIK, the plan now is to focus on releasing a speech activity detection (SAD) recipe first. Once that is available in Kaldi, we're going to build on that by including speaker diarization. I'm not sure what the status is of the SAD component. Maybe Vimal (cc'd) has an idea. 

Daniel Povey

Jun 12, 2017, 1:39:07 PM
to kaldi-help, Matthew Maciejewski, Vimal Manohar
We're closer to a final version of the speech-segmentation component.  I'm hoping within a few weeks to a month we'll be able to check that in, and at that point we'll start thinking seriously about what to commit w.r.t. diarization.

Dan



vinc...@yahoo.com

Jun 12, 2017, 3:55:30 PM
to kaldi-help, mmac...@jhu.edu, vimal.m...@gmail.com, dpo...@gmail.com

May I ask: what is speech activity detection based on?

How does it separate music from speech?

Does it segment two speakers?

Does it segment the same speaker based on silence?

Also, is there a paper of some kind for this?

Thanks
V

Daniel Povey

Jun 12, 2017, 4:29:11 PM
to kaldi-help, Matthew Maciejewski, Vimal Manohar
We don't have a paper yet.  It's based on training a network with 3 targets, "speech", "silence" and "not-sure" [i.e. things that might be speech or silence... what we do with this output is flexible]; the targets can be obtained from training-data alignments and/or decodings of unsupervised data.

Right now we're focusing on making the training framework multi-purpose and easy to use so that you can adapt it to different scenarios.  We're not aiming for a single speech segmenter that you apply to all possible scenarios-- it's more a way of building a segmenter for a specific task, and you can choose how you want to classify different types of non-speech events (assuming you can figure them out from transcripts or alignments).

It will give you raw speech segments that you can then put into diarization for things like detecting speaker changes.  We're not aiming to have the segmenter attempt to figure out speaker-change boundaries at this point.

Regarding music, it would be possible to use the MUSAN corpus to add music to your training data and make it classify music in the way you desire.  But for the initial recipes, I want to keep it simple so that probably won't be part of it.

Dan




Ben Reaves

Oct 23, 2017, 7:01:16 PM
to kaldi-help
What is the current status of the speaker diarization? I'm very interested in it - for speech, not music. I have some audio files with two speakers interviewing each other, and I need to get rid of the interviewer's speech. I don't have many segments with two speakers talking simultaneously, but does your software detect that as well?

Thanks

Daniel Povey

Oct 23, 2017, 7:46:13 PM
to kaldi-help
The segmentation part is checked in (see https://github.com/kaldi-asr/kaldi/pull/1676), but it will probably be at least a month for the diarization.

On Mon, Oct 23, 2017 at 6:56 PM, Ben Reaves <ben.r...@gmail.com> wrote:
> I'm looking forward to this very much. I'm working on an application that
> requires that we keep only the speech of one of two speakers. We know we
> have 2 speakers, but we need to keep only one. And we also want to ignore
> speech where the two might be speaking over each other, or interrupting
> without pause.
>
> Thank you

Armin Oliya

Nov 27, 2017, 10:40:29 AM
to kaldi-help
Hi Dan, 

In your recent paper titled "Speaker diarization using deep neural network embeddings" you suggest eliminating i-vectors completely while getting better or comparable results.
Can I check whether someone is working on implementing and checking that in too, or will all diarization be based on i-vectors for the foreseeable future?


Thanks 
Armin

David Snyder

Nov 27, 2017, 11:59:42 AM
to kaldi-help
Hi Armin,

Yes, but it will probably be based on a slightly different architecture that we've found works better for both verification and diarization (see http://www.danielpovey.com/files/2017_interspeech_embeddings.pdf).

In this system, the embeddings (which we call "x-vectors" in Kaldi) are extracted from the DNN and used like i-vectors. This makes it convenient to plug into things like PLDA and other backends developed for i-vectors. For your purposes, this means that once the embeddings are computed, they can share the same diarization code used in the i-vector system. We have an x-vector speaker recognition recipe in Kaldi, under egs/sre16/v2, and I've found that it works well for diarization on Callhome as well (I haven't extensively tested on other diarization datasets). There's also a pretrained model from that recipe under http://kaldi-asr.org/models.html.
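
If it helps, extracting embeddings with that recipe looks roughly like this (script location and arguments are as I remember them from egs/sre16/v2, so double-check against the recipe; exp/xvector_nnet_1a stands in for the trained or downloaded model directory):

# Sketch: assumes MFCCs and VAD for data/test have been prepared as in the
# sre16/v2 run.sh. All paths are placeholders.
sid/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd" --nj 8 \
  exp/xvector_nnet_1a data/test exp/xvectors_test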

Best,
David

Armin Oliya

Nov 28, 2017, 9:55:31 AM
to kaldi-help
Thanks David, I wasn't aware of the x-vectors. So I guess the ordering of features/architectures by result quality could be:
i-vector -> no i-vector method -> x-vector -> fusion.

If you don't mind i have a few related questions: 


- if I understand the results of the x-vector paper correctly, the extractor can still be applied to out-of-domain (e.g. non-English) data without any calibration? (thanks for the pretrained model!)

- is it right to say that a diarization method based on such embedding vectors would achieve the current practical state-of-the-art results?
how do you think experimenting with other architectures (e.g. RNNs) could improve the results?
how would the results roughly compare with LIUM (to my understanding, one of the most used libraries at the moment)?
Could the comparison in this work be taken as a rough estimate, since they seem to have a similar definition for speaker embeddings (I understand the network architecture, experiment setup, and other details are different).

- Given the superiority of x-vectors/fusion for this task, is there a plan to use them instead of i-vectors in the current nnet3-based acoustic models?

- how does this method deal with music interleaved with speech (e.g. music on hold in customer-service calls)?
will the clustering algorithm applied to i/x-vectors (e.g. hierarchical clustering) do a good job of clustering music frames?
or would it be a better idea to first remove music with a recipe like bn_music_speech?



Thanks again for your feedback :)
Armin 

David Snyder

Nov 28, 2017, 1:27:03 PM
to kaldi-help
So I guess the ordering of features/architectures by result quality could be:
i-vector -> no i-vector method -> x-vector -> fusion

On the test sets we've looked at, the x-vectors have been comparable to or better than i-vectors for diarization. I think it's going to be worth looking at, if you have a diarization infrastructure in place that can interface with it (or are willing to wait until Kaldi has one). Also, fusion with i-vectors is easy, as you can just fuse the PLDA scores before doing agglomerative clustering.
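
As a rough sketch of that fusion (untested; the score directories are placeholders, and I'm assuming matrix-sum accepts an --average option, so check its usage message first):

# Average the per-recording PLDA score matrices from an i-vector system and
# an x-vector system before clustering. This only makes sense if both systems
# used identical subsegmentation, so the matrices have matching dimensions.
matrix-sum --average=true \
  scp:exp/ivectors_test/plda_scores/scores.scp \
  scp:exp/xvectors_test/plda_scores/scores.scp \
  ark,scp:exp/fused_scores/scores.ark,exp/fused_scores/scores.scp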
 
- if I understand the results of the x-vector paper correctly, the extractor can still be applied to out-of-domain (e.g. non-English) data without any calibration? (thanks for the pretrained model!)

It doesn't appear to suffer from this any more than an i-vector system would. You can still train the backend on in-domain data. However, the pretrained model was trained primarily on English, so it's likely you could create a more optimal recipe with more in-domain data.

- is it right to say that a diarization method based on such embedding vectors would achieve the current practical state-of-the-art results?
 
For diarization, I wouldn't say that quite yet. It hasn't been explored thoroughly outside of a few datasets. 

 is there a plan to use them instead of i-vectors in current nnet3 based acoustic models?

We haven't looked into it yet, but it might not be worth the hassle. The i-vectors used in the acoustic models are quite lightweight and easy to train (no speaker labels required, for example). To get the x-vectors to work well, you need a lot of labelled training data, and maybe augmentations etc. 

- how does this method deal with music interleaved with speech (e.g. music on hold in customer-service calls)?
will the clustering algorithm applied to i/x-vectors (e.g. hierarchical clustering) do a good job of clustering music frames?
or would it be a better idea to first remove music with a recipe like bn_music_speech?

I think it'll be best to remove the music first, but not with the bn_music_speech recipe. I think it'd be best to use something like Vimal's speech activity detection system (e.g., https://github.com/kaldi-asr/kaldi/blob/master/egs/swbd/s5c/local/segmentation/tuning/train_stats_asr_sad_1a.sh).
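
Once you have a model trained with that script, running it would look something like this (argument order is from my memory of the aspire recipe, so check steps/segmentation/detect_speech_activity.sh; the model directory is a placeholder):

# Sketch: run the SAD nnet over whole recordings to produce a segmented
# data directory (data/test_seg) that can then be fed into diarization.
steps/segmentation/detect_speech_activity.sh --nj 8 --cmd "$train_cmd" \
  data/test_whole exp/segmentation_1a/tdnn_stats_asr_sad_1a \
  mfcc_hires exp/sad_work data/test_seg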

Armin Oliya

Mar 7, 2018, 8:04:49 AM
to kaldi-help
Hi David, 


Based on your feedback, I'm working on a diarization pipeline for Dutch-language phone calls. I have 300 hours of labelled generic data (TV, interviews, ...) and a couple hundred hours of unlabelled phone calls.
As you suggested, I'm going to use your pretrained x-vectors and treat them as i-vectors when it comes to PLDA training/adaptation and clustering.

After an initial attempt, here are a few questions, please:
  • would it make more sense to adapt the sre_combined (sre16 recipe) PLDA to unlabelled in-domain data directly, or to train the PLDA on generic data first and then do the in-domain adaptation?
  • using unlabelled data for adaptation and testing: the scripts in callhome_diarization and those in the sre recipes require data folders to have the basic Kaldi files (wav.scp, utt2spk, segments). This doesn't seem to be a problem with, for example, the Callhome evaluation set, but what's the best way to go about it if I only have wav files? Currently I start with one speaker per recording and a segment as long as the whole recording; then I use compute_vad_decision and vad_to_segments to come up with finer segments. However, these finer segments often have more than one speaker in them, which I understand, because they haven't been diarized yet. But these segments are carried all the way to PLDA scoring and clustering "as-is", and the final labels are assigned to whole segments which contain more than one speaker. I was thinking the VAD segments should be further split by a minimum length (20ms?), get scored, and then be clustered, with some possible post-processing to favor continuous labels?
  • the output of vad_to_segments often contains short segments of silence. What's the best way to get more conservative segments?
  • would there be situations where I should consider using SAD instead of VAD?
  • the final stage of clustering, diarization/make_rttm.py, fails because of RuntimeError: Missing label for segment xxx (around 100 segments missing in the labels files). What could be the possible cause?


Thanks in advance for your feedback :)
Armin 

David Snyder

Mar 7, 2018, 10:24:26 AM
to kaldi-help
  • would it make more sense to adapt the sre_combined (sre16 recipe) PLDA to unlabelled in-domain data directly, or to train the PLDA on generic data first and then do the in-domain adaptation?

I think you'll have to experiment with this stuff... Fortunately experiments in the backend should not be too time consuming once your evaluation framework is set up. 

  • would there be situations where I should consider using SAD instead of VAD?
You need to use a real SAD to get the speech activity marks for the data you want to cluster. If you search on the Kaldi forums for some variations of "Vimal's SAD" you should be able to find something. That SAD should have the options you need for controlling the average segment length. As far as I know, the vad_to_segments script is only suitable for coarsely segmenting the PLDA training data. 

But these segments are carried all the way to PLDA scoring and clustering "as-is", and the final labels are assigned to whole segments which contain more than one speaker.

I agree that overlapping speakers is a problem, but it's not something we have a general purpose solution for yet.  

  • the final stage of clustering, diarization/make_rttm.py, fails because of RuntimeError: Missing label for segment xxx (around 100 segments missing in the labels files). What could be the possible cause?

I'm not sure. I'll see if Matthew Maciejewski, who prepared that script, can answer this one. 

A few other comments:

The x-vector system is trained on features with a sliding-window CMN, but the i-vector recipe in callhome_diarization/v1 does not use a sliding-window CMN. So obviously you'll need to make sure the features are used consistently. However, there's one complication to be aware of: if you apply CMN inside some script in diarization/*sh, you could end up applying it to really short segments, which doesn't make a lot of sense (since the mean is supposed to be computed over a 3 second window). Instead of doing this, I suggest writing the features to disk with CMN already applied, prior to segmenting them for clustering (e.g., make some data/test and a data/test_cmn).
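
For example, something along these lines (options are from memory, so verify against apply-cmvn-sliding; the 300-frame window corresponds to the 3 seconds mentioned above, and all paths are placeholders):

# Sketch: write sliding-window-CMN features to disk once, then point
# data/test_cmn/feats.scp at them before subsegmenting for clustering.
featdir=mfcc_cmn
mkdir -p $featdir
utils/copy_data_dir.sh data/test data/test_cmn
apply-cmvn-sliding --norm-vars=false --center=true --cmn-window=300 \
  scp:data/test/feats.scp \
  ark,scp:$featdir/feats_cmn.ark,$featdir/feats_cmn.scp
cp $featdir/feats_cmn.scp data/test_cmn/feats.scp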

If you want to start training new models (either i-vector or x-vector) and you don't have a lot of data, take a look at the VoxCeleb project: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/. You'll have to do a little work to acquire all of the audio, but it's freely available and very useful once you have it.

Matthew Maciejewski

Mar 7, 2018, 10:39:44 AM
to kaldi-help
Unless the segments and labels files are malformed, this means that there is no cluster label for some segments. This means either that the clustering algorithm failed to produce output for some segments (which I doubt is possible), that the PLDA model failed to produce scores for some segments (also unlikely), or that your speaker representation (i.e., i-vectors or x-vectors) failed to produce a vector for that segment (this is the most likely).

Chances are it's the last one. Some of the extraction scripts will not produce output if the segment is too short (which is something that can come up in diarization). David can probably help you if it's that one.

You can diagnose the problem, though, by going through the outputs of the above sections of the pipeline, and either checking to make sure the segment ID is in the archive, or if it's a recording-indexed matrix, that the dimensions are correct (the matrix should be N-by-N where N is the number of segments in the recording).
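
A couple of quick checks along those lines (paths and the segment name are just placeholders for whatever your experiment directories look like):

# Is the subsegment present among the extracted embeddings?
grep rec1-0001234-0001523 exp/xvectors_test/xvector.scp

# For recording-indexed score matrices, the number of rows should match the
# number of subsegments for that recording.
feat-to-len scp:exp/xvectors_test/plda_scores/scores.scp ark,t:- | head
grep -c '^rec1-' exp/xvectors_test/segments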

—Matt


David Snyder

Mar 7, 2018, 11:00:30 AM
to kaldi-help
Chances are it's the last one. Some of the extraction scripts will not produce output if the segment is too short (which is something that can come up in diarization). David can probably help you if it's that one.

Yes, since the frame-level layers of the x-vector DNN have a left and right context. You can copy this from my personal branch https://github.com/david-ryan-snyder/kaldi/blob/xvector-diarization/src/nnet3bin/nnet3-xvector-compute.cc.

Since you're using x-vectors, you might also find this useful: https://github.com/david-ryan-snyder/kaldi/blob/xvector-diarization/egs/callhome_diarization/v2/run.sh. At some point I want to merge in some x-vector-based diarization stuff, but I haven't gotten around to it yet. Feel free to use this as you see fit in the meantime.

David Snyder

Mar 7, 2018, 11:04:40 AM
to kaldi-help
To clarify, the binary https://github.com/david-ryan-snyder/kaldi/blob/xvector-diarization/src/nnet3bin/nnet3-xvector-compute.cc pads the input if it's too short. Look for an option with "pad" in its name.

Armin Oliya

Mar 7, 2018, 12:58:41 PM
to kaldi-help
Thank you both for the helpful comments.

I can see that those missing segments can't be found in xvector.scp, so I think the problem is with the x-vector calculation. I'll try the new .cc with the pad option. Thanks for sharing your branch, David!


About CMN and your suggestion to precompute features before clustering: it's not something I'm familiar with; could you please point me to which scripts to use?


Got your point about SAD, but that leaves me with one practical question for when the full diarization + transcription pipeline is used in production: there will be three nnet3 decoding stages involved (SAD, x-vector extraction, and online decoding for the actual transcription). I haven't looked at the nnet config details of the SAD and x-vector models yet, but could this mean roughly 3x the decoding time compared to a pipeline without diarization?

And since you mentioned VoxCeleb: is the volume and diversity of its data comparable to the set you personally used for training the x-vectors (the SREs, SWBD, ...)?


Thanks again
Armin

David Snyder

Mar 7, 2018, 1:15:48 PM
to kaldi-help
About CMN and your suggestion to precompute features before clustering: it's not something I'm familiar with; could you please point me to which scripts to use?

Take a look here: https://github.com/david-ryan-snyder/kaldi/blob/xvector-diarization/egs/callhome_diarization/v2/run.sh#L71 . There's a script there, local to this branch, that applies the CMN. You'll need to do that prior to segmenting the audio. It's just a wrapper on top of the binary apply-cmvn-sliding.

Got your point about SAD, but that leaves me with one practical question for when the full diarization + transcription pipeline is used in production: there will be three nnet3 decoding stages involved (SAD, x-vector extraction, and online decoding for the actual transcription). I haven't looked at the nnet config details of the SAD and x-vector models yet, but could this mean roughly 3x the decoding time compared to a pipeline without diarization?

If you put diarization in the pipeline, the whole thing will be offline since this diarization system is offline.

The models aren't all the same size so it isn't necessarily 3x slower. AFAIK both the SAD and x-vector models are smaller than the average acoustic model.

And since you mentioned VoxCeleb: is the volume and diversity of its data comparable to the set you personally used for training the x-vectors (the SREs, SWBD, ...)?

 VoxCeleb has something like 1k+ speakers from 20k YouTube videos. It's quite diverse but not as large as something like Switchboard plus all the past NIST SREs. Still, if your data is wideband, you might find that you can get comparable performance by training on just VoxCeleb. 

Armin Oliya

Mar 8, 2018, 5:01:51 PM
to kaldi-help
Thanks, David, for the tips. I think I've got it; I'm going to try it.

I'm also going to follow run_asr_segmentation.sh under swbd. To clarify your earlier comment: what the current SAD recipes do is basically label frames as speech or non-speech; they don't tell you about speaker changes, and there's currently no good solution for that. Is my understanding correct?

Daniel Povey

Mar 8, 2018, 5:15:56 PM
to kaldi-help, Vimal Manohar
That's correct, it's only for speech/non-speech segmentation.
There is a separate recipe for diarization (I believe it's for callhome; Matthew Maciejewski committed it), but it's not currently integrated with the run_asr_segmentation.sh script.

We found a bug in nnet3-get-egs and how it handles frame-subsampling-factor, which affects this setup.  Vimal (cc'd) has been experimenting with the effect of the fix-- for some reason it doesn't affect the results-- and the fix is not merged yet.  But you might want to run with the fix anyway.  Vimal, would you mind making a PR for that fix?

Dan



David Snyder

Mar 8, 2018, 5:17:53 PM
to kaldi-help
Right. So the SAD will just give you the speech/non-speech segments. If you run something similar to callhome_diarization/v1 on top of the speech segments, it will perform a (somewhat) coarse clustering of subsegments. So this will get you the speaker changes, but it won't solve the overlapping speech problem. The overlapping segments will end up assigned to one or the other speaker (or, in the worst case, to a completely wrong speaker).
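
To make the flow concrete, the scoring and clustering half of that recipe looks roughly like this (directory names and the threshold are placeholders; check the callhome_diarization scripts for the actual options and argument order):

# Sketch: PLDA scoring of subsegment embeddings, agglomerative clustering,
# then conversion of the per-subsegment labels into an RTTM file.
diarization/score_plda.sh --cmd "$train_cmd" --nj 8 \
  exp/ivectors_plda_train exp/ivectors_test exp/ivectors_test/plda_scores

# Use --reco2num-spk instead of --threshold if the number of speakers per
# recording is known in advance.
diarization/cluster.sh --cmd "$train_cmd" --nj 8 --threshold 0.5 \
  exp/ivectors_test/plda_scores exp/ivectors_test/plda_scores_clustered

diarization/make_rttm.py exp/ivectors_test/segments \
  exp/ivectors_test/plda_scores_clustered/labels exp/test.rttm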

Armin Oliya

Mar 8, 2018, 5:35:37 PM
to kaldi-help
Thank you both. 

Got you on overlapping speakers. What I still have trouble understanding is how speaker changes are detected in the callhome diarization recipes. For instance, SAD could cut a continuous 10-second-long segment with two (non-overlapping) speakers in it. The diarization scripts seem to assume there's one speaker per segment (?) and only score and label each segment, even though segments may contain more than one (non-overlapping) speaker.

David Snyder

Mar 8, 2018, 5:40:11 PM
to kaldi-help
The diarization recipe first cuts long segments into lots of short, overlapping subsegments. By default (but this is an option to diarization/extract_ivectors.sh), the chunks are 1.5 seconds long with a 0.75 second overlap. Embeddings are extracted from these chunks and clustered, to get the speaker labels. If you have a 10 second long segment, it will not assume there is only one speaker in that segment.
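
In script terms that's just the window and period options (names as I remember them; check the script's usage):

# Sketch: extract embeddings over 1.5 s windows shifted by 0.75 s. A longer
# --window gives a cleaner embedding per chunk but raises the chance of mixing
# two speakers inside one chunk; --min-segment drops leftover chunks that are
# too short to be useful. Paths are placeholders.
diarization/extract_ivectors.sh --cmd "$train_cmd" --nj 8 \
  --window 1.5 --period 0.75 --min-segment 0.5 \
  exp/extractor data/test_seg exp/ivectors_test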

Armin Oliya

Mar 8, 2018, 5:43:46 PM
to kaldi-help
Thanks for the clear explanation, got it :)

Armin Oliya

Mar 16, 2018, 12:23:46 PM
to kaldi-help
Hi David, 


I just finished a few experiments based on your v2 recipe; it works really well on both studio-quality and CTS data, if I train the PLDA with in-domain Dutch data (100 hours, 2k speakers). I also tried PLDA adaptation/whitening on ~5 hours of test-domain-like data (similar to the sre16 recipe); I can't call it an improvement yet.


The mismatch between labels and segments still happens, though much less often, maybe 1-2 segments per 5 hours. They're quite short and usually occur at the end of recordings. Here are a few examples:

seg rec 13.54 13.83
seg rec 699.94 700.71

I used VAD instead of a full-blown SAD; in the end, extra silence gaps are nicely merged with the last speech segments, which is good enough for my use case, and it also saves me another round of nnet decoding in the SAD recipes.


A few questions, please:
  1. is the minimum segment length in the final rttm file the same as the window overlap of 0.75, and would you suggest increasing it if I'm interested in larger minimum segment durations?
  2. the current mfcc.conf is based on an 8k sample rate, and the pretrained x-vector nnet weights are sensitive to changing that, I guess. I use 16k in my environment (for decoding, e.g.) and would rather keep it like that, to avoid having multiple wav.scp files or to reuse the MFCC features for decoding. How do you suggest going about it?
  3. In my current ASR pipeline I diarize first (using LIUM) and use steps/online/nnet3/decode.sh to decode the resulting segments. Is it right to say that the final WER depends on the quality of the diarized segments, mainly because pure single-speaker segments are better represented by the language model?
  4. the nnet decoding time to extract x-vectors is notable, and I'm thinking about how I can fit it, performance-wise, into my end-to-end ASR pipeline.
    1. is there any performance to be gained by using a GPU for extracting x-vectors?
    2. how would you compare the decode time of i-vectors and x-vectors, and is there a tradeoff between extraction time and downstream error rates?
    3. is there a pretrained i-vector extractor that I can use to test and benchmark?

Thanks again for putting everything together :) 

David Snyder

Mar 16, 2018, 1:09:21 PM
to kaldi-help

I just finished a few experiments based on your v2 recipe; it works really well on both studio-quality and CTS data, if I train the PLDA with in-domain Dutch data (100 hours, 2k speakers). I also tried PLDA adaptation/whitening on ~5 hours of test-domain-like data (similar to the sre16 recipe); I can't call it an improvement yet.

Probably too little data for estimating the whitening matrix. I think you might want to use the test-domain-like data for centering, though. Normally you'll center to whatever data you're extracting embeddings from (center the PLDA training data to the PLDA training list, center your test data to some test-data-like dataset).
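
Concretely, centering just means computing a mean over the matching set and subtracting it before PLDA scoring, along these lines (binary names are the standard Kaldi ones; the directory names are placeholders):

# Mean of the set you want to center to (here, a test-domain-like set).
ivector-mean scp:exp/xvectors_testlike/xvector.scp \
  exp/xvectors_testlike/mean.vec

# Subtract that mean from the test embeddings before PLDA scoring.
ivector-subtract-global-mean exp/xvectors_testlike/mean.vec \
  scp:exp/xvectors_test/xvector.scp \
  ark:exp/xvectors_test/xvector_centered.ark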

is the minimum segment length in the final rttm file the same as the window overlap of 0.75, and would you suggest increasing it if I'm interested in larger minimum segment durations?

Matt worked on that, I'll get him to answer this.

the current mfcc.conf is based on an 8k sample rate, and the pretrained x-vector nnet weights are sensitive to changing that, I guess. I use 16k in my environment (for decoding, e.g.) and would rather keep it like that, to avoid having multiple wav.scp files or to reuse the MFCC features for decoding. How do you suggest going about it?

For now there aren't any options I'm aware of, besides training a new x-vector DNN on wideband data. It's going to require a lot of data, though.

  1. In my current asr pipeline i diarize first (using Lium) and use steps/online/nnet3/decode.sh to decode resulting segment. Is it right to say that the final wer is dependent on the quality of diarized segments, mainly because pure uni-speaker segments are better represented with language model?

I think this is a good question, but it's not something I've investigated. Perhaps someone else can comment on this.

  1. the nnet decoding time to extract x-vectors is notable, and I'm thinking about how I can fit it, performance-wise, into my end-to-end ASR pipeline.
    1. is there any performance to be gained by using a GPU for extracting x-vectors?
    2. how would you compare the decode time of i-vectors and x-vectors, and is there a tradeoff between extraction time and downstream error rates?
    3. is there a pretrained i-vector extractor that I can use to test and benchmark?

This xvector DNN was designed for speaker recognition. If I were training one for diarization, I'd downsize the model a bit, to make it faster.

You can certainly use GPUs to extract the x-vectors. I believe there's a --use-gpu option in the script extract_xvectors.sh.

The standard i-vector system is going to be somewhat faster, but I haven't measured it. If the i-vectors incorporate an ASR DNN to improve performance (e.g., bottleneck features or replacing GMM posteriors), it's going to be slower than most x-vector systems.

Regarding pretrained models: I think the current x-vector model is the only pretrained system we have for speaker recognition or diarization. We're open to uploading more things, but it's just a matter of getting the time to prepare the systems. At some point I want to upload an x-vector DNN trained on lots of wideband data--when that happens, it will probably be helpful for your data. I'll try to keep in mind what you said about wanting faster models, and upload several models (with different sizes/speeds) trained on the same data. Anyway, no timeline on when that will happen yet.

Matthew Maciejewski

Mar 16, 2018, 1:33:20 PM
to kaldi-help
  1. is the minimum segment length in the final rttm file the same as the window overlap of 0.75, and would you suggest increasing it if I'm interested in larger minimum segment durations?
Yes, the minimum segment length is defined by the overlap. I'm not sure increasing it is necessarily the correct thing to do if you're interested in larger minimum segment durations. You can kind of think of the sliding window method as quantizing the signal in time. Increasing the period of your sliding window is essentially lowering the sampling rate, so though the segment minimum increases, that would just be due to lower time resolution. In theory if your segments are longer than 0.75s, the algorithm should just never be assigning single windows to a unique speaker. If they're not, it means the ivectors (or xvectors) are not capturing the speaker very well.

I'm not going to say that you shouldn't decrease the frequency of the sliding window, because it's very possible that it will do what you want. But I also suggest maybe trying to increase the window length itself. Generally using a larger window is going to give higher-quality ivectors, but that comes at the cost of having a higher likelihood of getting multiple speakers in that window and producing a contaminated ivector, which is particularly important for short segments, since if the segment is short and the window is large, you may not have any "clean" ivectors of that speech. The 1.5s window we typically use is meant to balance wanting a small window to minimize the chance of multiple speakers in that window while also being large enough to collect enough speech frames to produce a reasonable ivector. If you have reason to believe there are no short segments in your data, it might be reasonable to increase the window size to try to produce higher-quality speaker representations. 

  1. In my current ASR pipeline I diarize first (using LIUM) and use steps/online/nnet3/decode.sh to decode the resulting segments. Is it right to say that the final WER depends on the quality of the diarized segments, mainly because pure single-speaker segments are better represented by the language model?
I haven't done any real experiments with this, but I did try an "oracle" experiment once on the AMI dataset where I ran an ASR system with and without speaker labels to see how much a diarization system could theoretically help, and although there were gains, they were small (around <1% absolute if I remember correctly). There are multiple ways in which a diarization system could help an ASR system, but how much it helps is going to depend on what your data set looks like and how sensitive your ASR system is to various issues that diarization can solve. 

—Matt

Arseniy Gorin

Mar 17, 2018, 4:00:46 AM
to kaldi-help
Hi, 

I think I can add something to what David and Matthew have said.
  1. the nnet decoding time to extract x-vectors is notable, and I'm thinking about how I can fit it, performance-wise, into my end-to-end ASR pipeline.
When I played with the x-vector model, I noticed that if you decode many short segments (such as in diarization), you may be spending way too much time on the compilation step.
In the following I assume you are using a chunk-size smaller than the default value (say 500-1000).

In this case, there are 3 optimizations in nnet3-xvector-compute that I've found useful:

1. The easiest one: increasing the default compiler cache capacity of 64.

- CachingOptimizingCompiler compiler(nnet, opts.optimize_config);
+ CachingOptimizingCompilerOptions compiler_config;
+ compiler_config.cache_capacity = <some_large_number>;
+ CachingOptimizingCompiler compiler(nnet, opts.optimize_config, compiler_config);

This speeds things up a lot by not re-compiling the model if a segment of the same chunk size has been processed in the past.
This makes sense if you process a large number of segments within a single call to nnet3-xvector-compute.

2. If you do not process large batches of features and are instead doing multiple calls to nnet3-xvector-compute, you can write a simple executable to precompute the cache on a range of segments of varying length.
This is like executing nnet3-xvector-compute on an ark file containing segments of sizes (min-chunk-size, min-chunk-size+1, ..., chunk-size-1, chunk-size), and then saving the compiler to a file using the following method:

compiler.WriteCache(ko.Stream(), binary_write);

you can then add an option in nnet3-xvector-compute specifying the cache file path and load it using

compiler.ReadCache(ki.Stream(), cache_binary_in); 

See an example in nnet3bin/nnet-discriminative-training.cc.
If the precomputed cache covers the whole range of segment sizes that you expect in nnet3-xvector-compute, the speed-up is really huge.

3. Finally, if you can guarantee that nnet3-xvector-compute will always use the precomputed cache and will never actually compile, you can parallelize the computation using TaskSequencer. This is not possible in the current implementation, as CachingOptimizingCompiler is not thread-safe. However, it seems to work OK if you are only reading from the cache and never writing to it. I may be missing something in this statement, so I'd rather consider this an unsafe optimization.

Daniel Povey

Mar 17, 2018, 6:02:43 PM
to kaldi-help
It would be good to get a PR for some of those changes (e.g. RE cache capacity) if they are fairly easily applicable to the currently existing binaries (and if David agrees that it makes sense).

I have created a PR to make CachingOptimizingCompiler thread safe:


It would be good if you could test it.



David Snyder

Mar 17, 2018, 6:10:09 PM
to kaldi-help
Yes, this makes sense to me. 

Arseniy, if you're already working with Dan on this topic, do you mind making a PR for your proposed change, at least for the first option?

Arseniy Gorin

Mar 18, 2018, 1:24:59 AM
to kaldi-help
David, sure, I'll send a PR for 1 and 2.
Dan, great news! I'll try to test it this week and let you know.

David Snyder

Mar 18, 2018, 1:39:37 AM
to kaldi-help
I wouldn't worry about adding the 2nd option unless it's reasonably simple and there's an immediate need for it. I think the 1st thing will be helpful enough.

Arseniy Gorin

Mar 18, 2018, 1:58:50 AM
to kaldi-help
It is simple, but you are right - it is more for a production setup and would likely find no application in the Kaldi scripts.
I'll add only the cache change then.
Thanks

Armin Oliya

Apr 1, 2018, 11:54:50 AM
to kaldi-help
Thank you all for your helpful comments.

@David, it would indeed be great if other models (i-vector / wideband) become available; if you need help with putting things together or testing, please let me know.
The --use-gpu option didn't yield a runtime improvement, and there actually seemed to be little load on the GPUs during x-vector decoding.
By the way, I think your current x-vector branch is stable and ready for a PR.

@Matt, thanks for your clear explanation, got it. I did a few tests of diarization vs. WER and found that accurate diarization (compared to my LIUM baseline) can help with up to 1% WER, and slightly more if I do speech activity detection before diarization. On the other hand, bad diarization can have a strong effect; for example, on one of my test sets I had only one speaker per recording, and when I forced diarization to produce two speakers, WER changed from 5% to 9%.


@Arseniy, I tried the first suggestion from your PR, with the cache size set to 1e4 and then 1e6, but got no difference in x-vector decoding wall time. I tested with two different test sets: 30 mins (90 files) and 1h 30mins (16 files).

Arseniy Gorin

Apr 2, 2018, 3:48:32 AM
to kaldi...@googlegroups.com
Interesting. The fact that there is no improvement means you don't have too many segments of different lengths. Then yes, this option will be useless.

By the way, if speed is really crucial for you, you may check the multi-threaded implementation of the extractor: https://github.com/kaldi-asr/kaldi/pull/2303
It will not be added to the master branch, but you can check it and let me know if it works for you.


Armin Oliya

Apr 2, 2018, 4:01:28 PM
to kaldi-help
Thanks, will try it next.

