Diarization setup tuning, data, etc.


François Hernandez

Feb 1, 2018, 12:19:10 PM
to kaldi-help
Hi,

I have some questions regarding an ivector diarization setup.
As I don't have access to SWBD / SRE data, I made my own recipe (based on the callhome_diarization scripts) that uses TED-LIUM data as the training set for the ivector extractor and PLDA scoring.
(automatically labeled with the rule 1 recording = 1 speaker, assuming it holds for >95% of TED recordings)
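Concretely, that labeling rule just maps every utterance of a recording to a single speaker label. A minimal sketch (the utterance naming here is illustrative, not TED-LIUM's exact scheme):

```python
def label_one_speaker_per_recording(utt_ids):
    """Assign every utterance the speaker label of its recording.

    Assumes utterance IDs of the form "<recording>-<segment>"; this
    naming is illustrative, not TED-LIUM's exact scheme.
    """
    return {utt: utt.rsplit("-", 1)[0] for utt in utt_ids}

utts = ["TalkA-0001", "TalkA-0002", "TalkB-0001"]
print(label_one_speaker_per_recording(utts))
# {'TalkA-0001': 'TalkA', 'TalkA-0002': 'TalkA', 'TalkB-0001': 'TalkB'}
```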

It works quite well in general, but it seems a bit too sensitive in some cases. By that I mean it sometimes splits a single speaker's turn into several parts, assigning different speaker labels when it shouldn't (with no noticeable change in volume, pitch, or anything else that might disturb it).

1) First, is there a publicly available (16kHz) evaluation set for diarization? Having some baselines to refer to would really help in tuning such a setup.

2) I reckon the TED-LIUM corpus is not the obvious choice for such a task, but it would be good to have a setup working with easily available data. Could the issue of some speakers being split up too much come from a lack of certain types of speakers in the data, e.g. more male than female, certain accents, or specific tones?

3) The callhome setup seems to take a whole lot of data from the SWBD and SRE corpora. Do we really need that much data, or would a large number of small samples (a few minutes each) from different speakers be enough for the PLDA part?

4) Could the ivector extractor itself be a plausible cause of such bad clustering for certain speakers?

5) Would playing with the ivector window size be of some help in tuning the setup?

6) Finally, a very long recording (over an hour) with a significant number of speakers (over 5) and some perturbations (noise, laughter, crosstalk) seems to really confuse the clustering part. Is that normal behaviour for the clustering approach? And would the PLDA threshold need different tuning for different types of files (length, number of speakers, etc.)?

That's a lot of questions, but I hope the answers will help others tackling the diarization topic.

Thanks in advance!
François

Daniel Povey

Feb 1, 2018, 5:50:22 PM
to kaldi-help
I have some questions regarding an ivector diarization setup.
As I don't have access to SWBD / SRE data, I made my own recipe (based on the callhome_diarization scripts) that uses TED-LIUM data as the training set for the ivector extractor and PLDA scoring.
(automatically labeled with the rule 1 recording = 1 speaker, assuming it holds for >95% of TED recordings)

I think the number of speakers in Tedlium may be too small for this to work very well.
The ivector extractor is quite sensitive to the amount of data; typically you want to train it on a huge amount of data if possible.

It works quite well in general, but it seems a bit too sensitive in some cases. By that I mean it sometimes splits a single speaker's turn into several parts, assigning different speaker labels when it shouldn't (with no noticeable change in volume, pitch, or anything else that might disturb it).

1) First, is there a publicly available (16kHz) evaluation set for diarization? Having some baselines to refer to would really help in tuning such a setup.

Not that I know of..
 
2) I reckon the TED-LIUM corpus is not the obvious choice for such a task, but it would be good to have a setup working with easily available data. Could the issue of some speakers being split up too much come from a lack of certain types of speakers in the data, e.g. more male than female, certain accents, or specific tones?

I think there may be too few speakers for this to be a good choice for training, at least.
 
3) The callhome setup seems to take a whole lot of data from the SWBD and SRE corpora. Do we really need that much data, or would a large number of small samples (a few minutes each) from different speakers be enough for the PLDA part?

Yes, I think it does make a significant difference.  At least in terms of training the ivector extractor; that tends to require a lot of data.  Sometimes it's better to train the PLDA on a smaller amount of in-domain data though.
 

4) Could the ivector extractor itself be a plausible cause of such bad clustering for certain speakers?

Possibly...  Anyway, diarization is hard.
 

5) Would playing with the ivector window size be of some help in tuning the setup?

You could try, but I'd first try reducing the number of Gaussians in the UBM, and the dimension of the ivectors.  You may not have enough data to train it right now. 

6) Finally, a very long recording (over an hour) with a significant number of speakers (over 5) and some perturbations (noise, laughter, crosstalk) seems to really confuse the clustering part.

Not sure what you mean here.. perhaps you mean it gives poor diarization performance for such files.  Diarization will obviously be harder with more speakers.
 
Is that normal behaviour for the clustering approach? And would the PLDA threshold need different tuning for different types of files (length, number of speakers, etc.)?

Probably, yes.  From what I've been told, I think you are supposed to tune the clustering threshold by hand depending on what type of data you have, and if it's very heterogeneous, different thresholds for different types of data may be best.  If you know the number of speakers per file and are allowed to use that information, I suspect there is some way to tell that to the clustering code, but I don't know the exact mechanism.
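To illustrate the two stopping criteria (tuned threshold vs. known speaker count), here is a toy sketch of average-linkage agglomerative clustering over already-extracted segment embeddings. This is plain numpy with cosine distances, not Kaldi's PLDA-based clustering code, so treat it as an illustration of the mechanism only:

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance between two embedding vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def ahc(embeddings, threshold=None, num_speakers=None):
    """Average-linkage agglomerative clustering of segment embeddings.

    Merging stops at a distance threshold (to be tuned per data type),
    or, when the number of speakers is known, continues down to exactly
    that many clusters.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def avg_dist(c1, c2):
        return float(np.mean([cosine_dist(embeddings[i], embeddings[j])
                              for i in c1 for j in c2]))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, a, b = min((avg_dist(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        if num_speakers is not None:
            if len(clusters) <= num_speakers:
                break
        elif d > threshold:
            break
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Convert cluster membership to per-segment speaker labels.
    labels = [0] * len(embeddings)
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels

# Two well-separated toy "speakers" in a 2-D embedding space.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(ahc(emb, num_speakers=2))   # [0, 0, 1, 1]
print(ahc(emb, threshold=0.5))    # [0, 0, 1, 1]
```

With heterogeneous data, the threshold variant is what needs per-file-type tuning; the num_speakers variant is the oracle case.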


Dan
 

That's a lot of questions, but I hope the answers will help others tackling the diarization topic.

Thanks in advance!
François

--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/cdf32d02-afc6-461f-b44a-919744c0f772%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Matthew Maciejewski

Feb 1, 2018, 6:52:33 PM
to kaldi-help
I essentially agree with everything Dan has said. Diarization is harder than people think, in my opinion. I do have some additional comments, however.

Unless I'm mistaken, both the AMI Meeting Corpus and VoxCeleb are freely available and are 16kHz, though they both will require some work to set up. VoxCeleb has a decent amount of data for training, but does not have ground truth for diarization. AMI can be used as a diarization task, but the microphones that capture all speakers are distant. I have summed the head-mounted microphones to a single audio track to do a synthetic near-field 16kHz diarization task in the past, however. It's also worth noting that I have found AMI too small for training as well.
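For anyone wanting to reproduce that synthetic near-field condition, the summing itself is just a sample-wise mix with some protection against clipping. A rough numpy sketch, assuming the headset tracks are already aligned, equal-length, and in float format in [-1, 1] (which is an assumption about your audio pipeline, not something AMI guarantees out of the box):

```python
import numpy as np

def mix_tracks(tracks):
    """Sum aligned, equal-length mono tracks into one near-field mix.

    Peak-normalizes the sum only if it would clip, assuming float
    samples in [-1, 1].
    """
    mix = np.sum(np.stack(tracks), axis=0)
    peak = np.max(np.abs(mix))
    if peak > 1.0:
        mix = mix / peak
    return mix

a = np.array([0.5, 0.5])
b = np.array([0.6, -0.5])
print(mix_tracks([a, b]))  # [1. 0.] -- sum [1.1, 0.0] scaled down by the 1.1 peak
```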

I can also vouch for the claim that the PLDA isn't necessarily as data-hungry. My best-performing AMI setups involve training the PLDA on only in-domain data. The ivector extractor is very data hungry, but the PLDA seems to be very sensitive to the type of data given to it.

Also, as you presumably know from the CallHome setup, the code does support a target number of speakers as Dan suspected. Ideally there should be a way to tune the clustering threshold, but we do not have any particularly good automatic method of detecting it. There are methods that people use, but it also is application-dependent to some degree. Unless the threshold is egregiously wrong, it is not going to have a huge effect on the system performance, however.

The situation you described is likely just a challenging one to cluster. Diarization systems perform best with a small number of speakers who each represent a relatively even share of the recording. Often, with many speakers, some of them speak only briefly, and those can be difficult to deal with even if you cluster using the oracle number of speakers.

François Hernandez

Feb 2, 2018, 6:04:55 AM
to kaldi-help
Thanks a lot for all your answers.

So, if I recap a bit, I could:
- train the data-hungry ivector extractor on a significant amount of unlabeled audio data (from our production corpora);
- compile a corpus of TED-LIUM speaker extracts plus some (lots) of our in-domain data, which we would label, to train the PLDA as well as possible. (Do you think a few minutes per speaker would be enough in that case?)

I was aware of the oracle option in the callhome setup; it may not be applicable to all of our use cases, but thanks for mentioning it.

The strange thing about the issue I mentioned earlier is that all speakers (6 female, 1 male in this specific example) were approximately evenly distributed, yet one or two of them ended up with totally messed-up segmentation: e.g., 3 different speakers found in a single turn.

Anyway, I'll try with more data and let you know.

Matthew Maciejewski

Feb 2, 2018, 11:21:24 AM
to kaldi-help
Yes, that sounds reasonable.

It is also possible that if your ivector extractor or PLDA were trained on data that was heavily male, the models would do poorly on female speech.

Also, there's a trick I found helpful with training the PLDA model for the AMI corpus that might help for you. I found that extracting 1-3 ivectors per utterance, and then considering each utterance to be a new speaker for the purpose of PLDA training improved the resulting PLDA model quite a bit. My best guess is that if you have a limited number of speakers, you can artificially increase the number of speakers by splitting them up, relying on the assumption that ivectors from temporally-local time segments will be more similar than ivectors from across the recording, even if they were spoken by the same speaker.
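In Kaldi terms, the trick boils down to rewriting the utt2spk mapping used for PLDA training so that each utterance becomes its own pseudo-speaker, with its 1-3 ivectors as that speaker's examples. A minimal sketch of the relabeling step (the chunk naming and "pseudo-" prefix are illustrative, not a real Kaldi data layout):

```python
def pseudo_speaker_utt2spk(utt_ids, chunks_per_utt=3):
    """Relabel PLDA training data so each utterance is its own pseudo-speaker.

    Each utterance contributes `chunks_per_utt` chunk IDs (one per
    extracted ivector), all sharing a speaker label unique to that
    utterance.
    """
    return {f"{utt}-chunk{c}": f"pseudo-{utt}"
            for utt in utt_ids
            for c in range(chunks_per_utt)}

print(pseudo_speaker_utt2spk(["utt1"], chunks_per_utt=2))
# {'utt1-chunk0': 'pseudo-utt1', 'utt1-chunk1': 'pseudo-utt1'}
```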

François Hernandez

Feb 2, 2018, 11:24:43 AM
to kaldi-help
Oh, that's very good to know. I had thought of that trick (which I use in other cases, for example to split short recordings across several jobs), but I feared it would mess up the clustering even more.

I'll try all of that. Thanks!

Lahiru Samarakoon

Feb 25, 2018, 9:17:41 PM
to kaldi...@googlegroups.com
Hi All,

I want to experiment with diarization, but the main problem is not having a sufficient amount of data. So I am planning to build a dataset by concatenating speech from TED-LIUM, AMI, etc. The main interest is in diarization of two-speaker conversations. This would be my first venture into diarization, so any advice on creating a dataset is highly appreciated.

Thank you,
Best,
Lahiru




David Snyder

Feb 25, 2018, 9:38:02 PM
to kaldi-help
I think AMI is a reasonable choice for an evaluation dataset. Some people publish on it, so you'll be able to find benchmarks to compare with.

If you're short on training resources, you might want to try Librispeech, to increase the amount of training data for the UBM and ivector extractor. 

Matthew Maciejewski

Feb 25, 2018, 10:10:38 PM
to kaldi-help
I have done diarization on AMI, both with the distant mic and by summing the headset mics.

A word of warning: though AMI makes for a good diarization evaluation, since it was transcribed from the individual headset mics, each file contains 4-5 speakers. It is not representative of the 2-speaker situation.



Lahiru Samarakoon

Feb 26, 2018, 8:16:40 AM
to kaldi...@googlegroups.com
Thanks guys, I'll look into the Librispeech corpus to add more data.
Matthew, do you know of an open-source Kaldi setup for the AMI data?


Matthew Maciejewski

Feb 26, 2018, 9:54:27 AM
to kaldi-help
I have a preliminary setup here:

It does not contain any of the tuning work I have put into the system, but it might be a useful starting point. It also currently covers only the single-distant-microphone condition, but the rest should be easy to adapt from the regular AMI ASR recipe.

Lahiru Samarakoon

Feb 26, 2018, 7:47:47 PM
to kaldi...@googlegroups.com
Thanks Matthew. This is very helpful.
