Hi Samuel,
Are you sure you're looking to do speaker diarization and not speaker recognition? Some of the tools you referenced are for speaker recognition and some are for diarization, so it's worth double-checking that you're trying to solve the right problem. If you're new to the terminology, this tutorial gives an overview of speaker recognition and some related problems:
http://people.csail.mit.edu/sshum/talks/ivector_tutorial_interspeech_27Aug2011.pdf.
If your university works in NLP or speech processing, there's a good chance they have an LDC membership. They might already have some of the corpora you'd need to train or evaluate a system. You might want to look around more before concluding that you don't have any data.
As you've seen, Kaldi does have support for speaker recognition. However, there is no direct support for speaker diarization, though many of the algorithms you'd need to implement it are already there. Very broadly speaking, diarization looks like the following steps:
1. Speech activity detection, to segment the audio into speech and nonspeech regions,
2. Extraction of fixed-dimensional representations from the speech regions every 1-2 seconds,
3. Clustering of those representations, assigning speaker labels to the audio segments they were extracted from,
4. Refinement of the speaker segmentation at a finer granularity (e.g., frame level).
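To make step 3 concrete, here is a minimal sketch of the clustering stage. This is not Kaldi code: it assumes the fixed-dimensional representations from step 2 have already been extracted (random vectors stand in for them here) and uses scikit-learn's agglomerative clustering, which is one common choice for this step.

```python
# Sketch of step 3: cluster segment-level representations and
# assign a speaker label to each segment. Assumes embeddings were
# already extracted (step 2); random vectors stand in for them.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Stand-ins for embeddings from two different speakers.
spk_a = rng.normal(loc=0.0, scale=0.1, size=(10, 128))
spk_b = rng.normal(loc=1.0, scale=0.1, size=(10, 128))
embeddings = np.vstack([spk_a, spk_b])

# Agglomerative clustering into two speakers. In practice the number
# of speakers is usually unknown, so a distance threshold is often
# used to decide when to stop merging instead of a fixed cluster count.
clusterer = AgglomerativeClustering(n_clusters=2)
labels = clusterer.fit_predict(embeddings)
print(labels)  # one speaker label per 1-2 second segment
```

Each label then maps back to the audio segment its embedding came from, which is what gets refined in step 4.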
Best,
David