Hi Amber,
The callhome diarization recipe uses oracle speech activity detection (SAD). I believe it gets that information from an existing reference file. Most of the literature on this corpus uses the oracle SAD marks, so that's what we do in this recipe.
Of course, in a real application you'll need to run your own SAD system in order to determine the speech/nonspeech boundaries.
So for the data you wish to diarize, you'll need the usual files you'd find in an ASR recipe: utt2spk, spk2utt, segments, wav.scp, etc. As in many of the ASR recipes, the utt2spk file is not really a mapping from utterance to speaker, but from a speech segment to a recording (and vice versa for spk2utt). If you're new to Kaldi, you'll probably want to run the data preparation in an ASR recipe that uses freely available data, to see what these files look like.
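To make the segment-to-recording mapping concrete, here is a rough Python sketch (not part of any Kaldi recipe) that derives utt2spk and spk2utt from a segments file. It assumes the standard Kaldi segments layout, <segment-id> <recording-id> <start> <end>; the function name and the recording/segment IDs are made up for illustration.

```python
from collections import defaultdict


def segments_to_maps(segment_lines):
    """Build utt2spk and spk2utt dicts from Kaldi-style segments lines.

    Each line is expected to look like:
        <segment-id> <recording-id> <start-sec> <end-sec>
    For diarization the "speaker" of a segment is just its recording.
    """
    utt2spk = {}
    spk2utt = defaultdict(list)
    for line in segment_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError("bad segments line: %r" % line)
        seg_id, rec_id = fields[0], fields[1]
        utt2spk[seg_id] = rec_id
        spk2utt[rec_id].append(seg_id)
    return utt2spk, dict(spk2utt)


# Hypothetical segments for two recordings (IDs are illustrative only).
example_segments = [
    "rec1-0000000-0000250 rec1 0.00 2.50",
    "rec1-0000310-0000700 rec1 3.10 7.00",
    "rec2-0000050-0000400 rec2 0.50 4.00",
]

utt2spk, spk2utt = segments_to_maps(example_segments)

# Print in utt2spk file format: "<segment-id> <recording-id>".
for seg_id in sorted(utt2spk):
    print(seg_id, utt2spk[seg_id])
```

Prefixing each segment ID with its recording ID, as in the example, keeps the two files consistently sorted, which Kaldi's data validation expects.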
Since you're using the VoxCeleb recipe, you already have a pretrained PLDA model, so I suggest just using that for now (of course, in the future you can probably improve on it with more in-domain data, but this should be a reasonable baseline). What you'll need to do is something like this:
1. Prepare a version of the data that doesn't know anything about speech segments. That means the utt2spk file for the data is just an identity mapping, from recording ID to recording ID.
2. Run a SAD system on the previous dataset. One solution is to use a pretrained SAD DNN, e.g., http://kaldi-asr.org/models/m4. You'll need to prepare a separate set of features for this DNN. Another option is simply using the energy-based VAD with some smoothing (you'll need to do that smoothing yourself, I think), and then running diarization/vad_to_segments.sh to get the segments themselves.
3. Now that you have a segmented version of the original dataset, you'll need to extract features for that. That means extracting MFCCs, and running the step at https://github.com/kaldi-asr/kaldi/blob/master/egs/callhome_diarization/v2/run.sh#L80. Bear in mind that the features in the callhome diarization recipe are different from the ones the VoxCeleb models were trained on (callhome is 8kHz and VoxCeleb is 16kHz), but it's not hard to adapt the callhome recipe to the wideband data; you just need to change the mfcc.config.
4. Now you can run the steps starting at stage 9.
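For the energy-based VAD option in step 2, the smoothing could look something like the Python sketch below. This is only one possible approach (a sliding-window majority vote), not what any Kaldi script does; the function names and the window size are my own assumptions, and reading and writing the Kaldi VAD archives is left out. It works on a plain list of per-frame 0/1 decisions.

```python
def smooth_vad(decisions, window=11):
    """Smooth per-frame 0/1 VAD decisions with a centered majority vote.

    `window` is an odd number of frames; the default of 11 is an
    arbitrary starting point and should be tuned on your data.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(decisions)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)  # window is truncated at the edges
        votes = sum(decisions[lo:hi])
        smoothed.append(1 if 2 * votes > (hi - lo) else 0)
    return smoothed


def frames_to_runs(decisions):
    """Return (start_frame, end_frame) pairs for runs of speech.

    end_frame is exclusive. Multiply by the frame shift to get seconds.
    """
    runs = []
    start = None
    for i, d in enumerate(decisions):
        if d and start is None:
            start = i
        elif not d and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(decisions)))
    return runs


# Toy example: a one-frame dropout inside speech gets filled in.
raw = [0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0]
smoothed = smooth_vad(raw, window=3)
print(frames_to_runs(raw))
print(frames_to_runs(smoothed))
```

A wider window removes more short dropouts and spurious blips, at the cost of blurring the true speech/nonspeech boundaries.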
Best,
David