Hi,
There are a few different things you'll need for training, diarizing, and scoring the output.
Training:
To train the system, you will need speech activity detection (SAD) segmentation and speaker labels. The CallHome recipe uses SRE data that is not segmented, but it is clean enough that we run a basic energy VAD to generate segmentation (line 71 of run.sh); that step would not be needed if you already have real segmentation. We can get away with the energy VAD there because we know each recording contains only one speaker.
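To give a rough idea of what an energy VAD does, here is a simplified sketch (this is not Kaldi's compute_vad_decision.sh; the frame size, hop, and threshold values are made-up illustrations):

```python
import numpy as np

def energy_vad(samples, sample_rate=8000, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Mark each frame as speech (True) or silence (False) by log energy.

    A crude stand-in for an energy VAD: any frame whose log energy is
    within `threshold_db` of the loudest frame counts as speech.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]
    log_e = np.array([10 * np.log10(np.sum(f ** 2) + 1e-10) for f in frames])
    return log_e > log_e.max() + threshold_db

# Toy signal: 1 s of near-silence followed by 1 s of loud noise.
rng = np.random.default_rng(0)
sig = np.concatenate([0.001 * rng.standard_normal(8000),
                      0.5 * rng.standard_normal(8000)])
decisions = energy_vad(sig)
```

In the recipe, the per-frame decisions would then be merged into contiguous speech segments; the point here is only that a simple energy threshold suffices when the audio is clean.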
In terms of the training data directory, you should only need the following files:
wav.scp
segments
utt2spk and spk2utt
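For reference, here is a sketch of the per-line formats those files use (the recording, utterance, and speaker IDs below are made up), plus the utt2spk-to-spk2utt inversion that utils/utt2spk_to_spk2utt.pl performs:

```python
from collections import defaultdict

# Hypothetical examples of the per-line formats in a Kaldi data directory:
wav_scp_line  = "rec1 /data/audio/rec1.wav"     # <reco-id> <wav-path>
segments_line = "rec1-0001 rec1 0.50 3.20"      # <utt-id> <reco-id> <start-sec> <end-sec>
utt2spk = {                                     # one "<utt-id> <spk-id>" pair per line
    "rec1-0001": "spkA",
    "rec1-0002": "spkB",
    "rec1-0003": "spkA",
}

# spk2utt is just utt2spk inverted: one speaker per line, followed by
# all of that speaker's utterance IDs.
spk2utt = defaultdict(list)
for utt, spk in sorted(utt2spk.items()):
    spk2utt[spk].append(utt)

for spk, utts in sorted(spk2utt.items()):
    print(spk, " ".join(utts))
```

In practice you only write utt2spk by hand and let the Kaldi utility scripts generate spk2utt for you.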
It's also worth noting that the speaker labels are only used for PLDA training. The i-vector extractor is relatively data-hungry, so it can be beneficial to train it on extra data that isn't speaker-labeled, if you have it.
Diarizing:
If you just want to generate diarization output for some audio files, all you need is speech activity detection segmentation. In the CallHome recipe we just use the ground truth segmentation, but you could also run a SAD system first. The "models" page of the Kaldi website has a SAD model you could download and use, though I haven't tried that particular model myself.
Scoring:
If you want to score your diarization output (i.e., evaluate its performance), you'll need ground truth to score against: the time marks and speaker labels for all of the speech in your evaluation set. You'll also need to put that reference into an RTTM file, the format the NIST scoring tool uses; documentation on the format is available online. Finally, if you want to cluster to the oracle number of speakers rather than stopping at a threshold, you'll also need to generate a reco2num_spk file.
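As a sketch, both scoring inputs can be generated from a list of reference segments; the recording IDs, times, and speaker names below are made up, and each RTTM "SPEAKER" line carries a duration, not an end time:

```python
from collections import defaultdict

# Hypothetical reference segments: (reco-id, channel, start, end, speaker).
ref = [
    ("rec1", 1, 0.50, 3.20, "spkA"),
    ("rec1", 1, 3.20, 5.00, "spkB"),
    ("rec1", 1, 5.00, 7.10, "spkA"),
]

# One NIST RTTM "SPEAKER" line per segment; unused fields are "<NA>".
rttm_lines = [
    f"SPEAKER {reco} {chan} {start:.2f} {end - start:.2f} <NA> <NA> {spk} <NA> <NA>"
    for reco, chan, start, end, spk in ref
]

# reco2num_spk: one "<reco-id> <number-of-speakers>" line per recording.
speakers = defaultdict(set)
for reco, _, _, _, spk in ref:
    speakers[reco].add(spk)
reco2num_spk = {reco: len(spks) for reco, spks in speakers.items()}
```

The rttm_lines then go to the NIST scoring tool as the reference, and reco2num_spk feeds the clustering step when you want it to use the oracle speaker count.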
—Matt