Speaker diarisation in Kaldi


samuel.g...@gmail.com

Sep 12, 2017, 9:29:24 AM
to kaldi-help
Hi All,

I'm completing my undergraduate thesis on speech detection and diarisation and have been playing around with some of the example scripts.

I am trying to set up an API to transcribe English speech tagged with speaker IDs, similar in structure to IBM's speech-to-text API (https://speech-to-text-demo.mybluemix.net/?cm_mc_uid=52725515916915048313143&cm_mc_sid_50200000=1505192320&cm_mc_sid_52640000=1505192320).

The primary focus of this project is the diarisation functionality, and it does not require extremely high accuracy by any means. I've looked into sre08 and sre10, and these appear to exhibit some of the required functionality. As I don't have access to LDC data, I'm considering finding another source and attempting to train a model on that.

I am after some advice as to whether you would recommend using Kaldi for this purpose. I've looked into LIUM (http://www-lium.univ-lemans.fr/diarization/doku.php/overview) and SPEAR (https://pythonhosted.org/bob.bio.spear/), which appear more limited in terms of performance but substantially easier to implement; maybe some of you have experience in this domain.

If you have any other comments or recommendations it would be much appreciated. Thanks. 


David Snyder

Sep 12, 2017, 2:32:39 PM
to kaldi-help
Hi Samuel,

Are you sure you're looking to do speaker diarization and not speaker recognition? Some of the tools you referenced are for speaker recognition and some are for diarization, so it's worth double checking that you are trying to solve the right problem. This tutorial gives an overview of speaker recognition, and some related problems if you are new to the terminology: http://people.csail.mit.edu/sshum/talks/ivector_tutorial_interspeech_27Aug2011.pdf

If your university works in NLP or speech processing, there's a good chance they have an LDC membership. They might already have some of the corpora you'd need to train or evaluate a system. You might want to look around more before concluding that you don't have any data.

As you've seen, Kaldi does have support for speaker recognition. However, there is no direct support for speaker diarization, though many of the algorithms you'd need to implement it are already there. Very broadly speaking, diarization looks like the following steps: 
  1. Speech activity detection to segment the audio into speech and nonspeech segments,
  2. Extract fixed-dimensional representations from the speech regions every 1-2 seconds,
  3. Cluster the representations and assign speaker labels to the audio segments those representations were extracted from,
  4. Refine the speaker segmentation at a more fine-grained level (e.g., frame-level).
We already have some recipes for step 1 in Kaldi (but again, you need some data to train the systems); look here: https://github.com/kaldi-asr/kaldi/blob/master/egs/babel/s5d/local/run_asr_segmentation.sh. I-vectors work well for steps 2-3. The clustering mechanisms exist in Kaldi, but there's nothing that puts them all together. I have a work-in-progress branch that does some of these steps: https://github.com/david-ryan-snyder/kaldi/blob/kaldi-diarization-v3/egs/callhome_diarization/v1/run.sh. Keep in mind that this is separate from the main Kaldi branch. Step 4 can be omitted if you're not too concerned with high performance.
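[Editor's note: to make step 3 concrete, here is a toy Python sketch of clustering segment representations (e.g., i-vectors) with greedy average-linkage agglomerative clustering under a cosine-distance stopping threshold. The function name and the threshold value are illustrative assumptions, not Kaldi's API; Kaldi's actual clustering code is in C++ and differs in detail.]

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity: small when vectors point the same way.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cluster_segments(embeddings, threshold=0.5):
    """Greedy average-linkage agglomerative clustering.

    Repeatedly merge the closest pair of clusters until the smallest
    average pairwise distance exceeds `threshold`. Returns one integer
    speaker label per input segment embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best, best_d = None, threshold
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average-linkage: mean distance over all cross-cluster pairs.
                d = np.mean([cosine_distance(embeddings[a], embeddings[b])
                             for a in clusters[i] for b in clusters[j]])
                if d < best_d:
                    best_d, best = d, (i, j)
        if best is None:   # no pair closer than the threshold: stop merging
            break
        i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    labels = [0] * len(embeddings)
    for speaker, members in enumerate(clusters):
        for seg in members:
            labels[seg] = speaker
    return labels
```

The stopping threshold effectively decides the number of speakers, which is the hard part in practice; it is usually tuned on held-out data, or replaced by clustering to a known speaker count when that is available.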

Best,
David

samuel.g...@gmail.com

Sep 12, 2017, 7:38:39 PM
to kaldi-help
Hi David, 

Thanks for the advice. I'm definitely looking into diarisation; however, I was thinking I might be able to use continuous samples of speaker recognition to perform diarisation, if that makes sense (this appears to be similar to the method you have suggested).

I'll contact my university and research what you have posted. 

Cheers,
Sam