Speaker diarization with x-vector

Amber Xiangli

Jun 21, 2018, 7:40:26 AM
to kaldi-help
Hi,

I intend to do speaker diarization on my own dataset using an x-vector model trained on VoxCeleb2. I found the callhome_diarization v2 recipe, but I didn't quite follow make_callhome.sh (I think it uses the fullref.rttm file to prepare the data...).

So, suppose I have an unlabelled .wav file that contains multiple speakers with no overlapping speech:
1. I think I still need spk2utt, utt2spk and wav.scp, but what will these files look like, considering there are no labels?
2. If I have these files prepared, then after the MFCC and VAD computation in stage 1, can I continue from stage 9 (PLDA scoring), provided that I already have a PLDA model trained on some in-domain data?

Thank you very much.

Regards,
Amber

David Snyder

Jun 21, 2018, 11:02:45 AM
to kaldi-help
Hi Amber,

The callhome diarization recipe uses oracle speech activity detection (SAD). I believe it gets that information from some existing reference file. Most of the literature on this corpus uses the oracle SAD marks, so that's what we do in this recipe.

Of course, in a real application you'll need to run your own SAD system in order to determine the speech/nonspeech boundaries. 

So for the data you wish to diarize, you'll need the usual files you'd find in an ASR recipe: utt2spk, spk2utt, segments, wav.scp, etc. As in many of the ASR recipes, the utt2spk file is not really a mapping from utterance to speaker, but from a speech segment to a recording (and vice versa for spk2utt). If you're new to Kaldi, you'll probably want to run the data preparation in an ASR recipe that uses freely available data, to see what these files look like.
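
As a rough illustration of the file layout described above, here is a minimal Python sketch that builds the lines of wav.scp, segments, utt2spk and spk2utt for one recording. The recording ID, audio path, segment times and the segment-ID naming scheme are all made up for the example; only the column layout (segments: segment ID, recording ID, start, end; utt2spk: segment ID, recording ID) follows the usual Kaldi data directory conventions.

```python
# Hypothetical recording and SAD output; none of these names come from the thread.
recordings = {"rec1": "/data/audio/rec1.wav"}
speech = {"rec1": [(0.00, 4.25), (5.10, 9.80)]}  # (start, end) in seconds


def segment_id(rec, start, end):
    # Recording ID as prefix, times in centiseconds, zero-padded so that
    # segment IDs sort in time order within a recording.
    return f"{rec}-{round(start * 100):07d}-{round(end * 100):07d}"


wav_scp = [f"{rec} {path}" for rec, path in sorted(recordings.items())]
segments, utt2spk, spk2utt = [], [], []
for rec in sorted(speech):
    utts = []
    for start, end in speech[rec]:
        utt = segment_id(rec, start, end)
        utts.append(utt)
        # segments: <segment-id> <recording-id> <start> <end>
        segments.append(f"{utt} {rec} {start:.2f} {end:.2f}")
        # utt2spk: the "speaker" column holds the recording ID, not a speaker.
        utt2spk.append(f"{utt} {rec}")
    # spk2utt: one line per recording, listing all of its segments.
    spk2utt.append(f"{rec} {' '.join(utts)}")

for name, lines in [("wav.scp", wav_scp), ("segments", segments),
                    ("utt2spk", utt2spk), ("spk2utt", spk2utt)]:
    print(f"== {name} ==")
    print("\n".join(lines))
```

Prefixing each segment ID with its recording ID keeps the files consistently sorted, which Kaldi's data directory checks generally expect.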

Since you're using the VoxCeleb recipe, you already have a pretrained PLDA model, so I suggest just using that for now (of course, in the future you can probably improve on it with more in-domain data, but this should be a reasonable baseline). What you'll need to do is something like this:

1. Prepare a version of the data that doesn't know anything about speech segments. That means the utt2spk file for the data is just an identity mapping, from recording ID to recording ID.

2. Run a SAD system on the previous dataset. One solution is to use a pretrained SAD DNN, e.g., http://kaldi-asr.org/models/m4. You'll need to prepare a separate set of features for this DNN. Another option is simply using the energy-based VAD with some smoothing (you'll need to do that smoothing yourself, I think), and then run diarization/vad_to_segments.sh to get the segments themselves.

3. Now that you have a segmented version of the original dataset, you'll need to extract features for it. That means extracting MFCCs and running https://github.com/kaldi-asr/kaldi/blob/master/egs/callhome_diarization/v2/run.sh#L80. Bear in mind that the features in the callhome diarization recipe are different from the ones the models in the voxceleb recipe were trained on (one is 8kHz and the other is 16kHz), but it's not hard to adapt the callhome recipe to wideband data; you just need to change the mfcc.conf.

4. Then extract x-vectors for these speech segments. Look at https://github.com/kaldi-asr/kaldi/blob/master/egs/callhome_diarization/v2/run.sh#L209 to see how it's done. 

5. Now you can run the steps starting at stage 9.
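
Step 1 in the list above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name, the audio paths, the output directory name and the choice of the file stem as recording ID are all hypothetical. With one entry per recording, utt2spk and spk2utt end up with identical contents.

```python
from pathlib import Path
import tempfile


def write_unsegmented_data_dir(wav_paths, out_dir):
    """Write wav.scp, utt2spk and spk2utt with one entry per recording."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Use the file stem as the recording ID and keep entries sorted by ID.
    recs = sorted((Path(p).stem, str(p)) for p in wav_paths)
    (out / "wav.scp").write_text("".join(f"{r} {p}\n" for r, p in recs))
    # Identity mapping: recording ID -> recording ID.
    identity = "".join(f"{r} {r}\n" for r, _ in recs)
    (out / "utt2spk").write_text(identity)
    (out / "spk2utt").write_text(identity)
    return [r for r, _ in recs]


# Example with made-up paths, written to a temporary directory.
demo_dir = Path(tempfile.mkdtemp()) / "mydata_unsegmented"
rec_ids = write_unsegmented_data_dir(
    ["/data/audio/meeting_b.wav", "/data/audio/meeting_a.wav"], demo_dir
)
print((demo_dir / "utt2spk").read_text())
```

The SAD step then produces a segments file on top of this directory, after which utt2spk maps each segment to its recording instead.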

Best,
David

Amber Xiangli

Jun 22, 2018, 6:48:41 AM
to kaldi-help
Dear David,

Thank you for your help and your patience. I checked out the ASpIRE SAD model and noticed it was trained on 8kHz data, while my data is 16kHz. Does this mean I cannot use the pretrained model and have to train one from scratch?

Regards,
Amber 

David Snyder

Jun 22, 2018, 9:39:01 AM
to kaldi-help
You don't need to retrain the SAD DNN. You just need to add --allow-downsample=true to the mfcc.conf you use for the SAD DNN. You'll need two mfcc.conf files to extract two sets of features: one for the SAD DNN and another for the x-vector DNN.

Bear in mind that the ASpIRE SAD model was trained on telephone data (I think), so its performance will probably not be optimal for the wideband data you plan to use it on. You'd probably get a better SAD model if you trained a new one, but that has its own challenges.
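
To make the two-config idea concrete, here is a small Python sketch that derives the SAD config from an existing 8 kHz config by adding the flag mentioned above. The helper name and the example config contents are placeholders, not the actual file shipped with the SAD model; the x-vector config is a separate file that stays at 16 kHz and is not touched here.

```python
def add_allow_downsample(conf_text):
    """Return the config text with --allow-downsample=true set exactly once."""
    lines = [
        ln for ln in conf_text.splitlines()
        if not ln.strip().startswith("--allow-downsample")
    ]
    lines.append("--allow-downsample=true")
    return "\n".join(lines) + "\n"


# Placeholder contents standing in for the SAD model's 8 kHz MFCC config.
sad_conf_8k = "--sample-frequency=8000\n--use-energy=false\n"
sad_conf_for_16k_audio = add_allow_downsample(sad_conf_8k)
print(sad_conf_for_16k_audio)
```

The sample frequency in the SAD config stays at 8000: the flag only tells the feature extractor to accept 16 kHz input and downsample it.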

Amber Xiangli

Jun 23, 2018, 7:17:34 PM
to kaldi-help
Dear David,

Sorry to bother you again. I followed your instructions on using SAD and obtained a set of segments. I then ran MFCC extraction and VAD computation using the configuration files in voxceleb/v2/conf, and used prepare_feats.sh (callhome_diarization/nnet3) to apply CMVN. After that I tried to perform diarization with extract_xvectors, but it gave the following error:

ERROR (nnet3-xvector-compute[5.4.114~2-fb54]:AcceptInput():nnet-compute.cc:556) Num-cols mismatch for input 'input': 23 in computation-request, 30 provided.

This is caused by a mismatch between the feature dimension the pretrained model expects (23) and the feature dimension of my input (30). I don't know where to specify or change feat-dim, but I think it is related to feats.scp. Can you help me with this problem?

Thank you very much.
Amber 
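
For readers hitting the same error: the feature dimension is set by the MFCC config used at extraction time, while feats.scp is only an index into the stored features. The Python sketch below reads --num-ceps from a config and compares it with the dimension the model expects. It is a rough check under assumptions: the config contents are hypothetical (chosen to reproduce the 30 in the error message), and it assumes nothing else, such as deltas, is appended to the MFCCs.

```python
def read_option(conf_text, name, default=None):
    """Return the value of --name=value from a Kaldi-style config, or default."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith(f"--{name}="):
            return line.split("=", 1)[1].strip()
    return default


# Hypothetical 16 kHz MFCC config that yields 30-dimensional features.
mfcc_conf = (
    "--sample-frequency=16000\n"
    "--num-mel-bins=30\n"
    "--num-ceps=30  # MFCC dimension\n"
)
model_input_dim = 23  # taken from the error message above

# 13 is the default number of cepstra when the option is not given.
feat_dim = int(read_option(mfcc_conf, "num-ceps", default=13))
if feat_dim != model_input_dim:
    print(f"Mismatch: features have {feat_dim} columns, "
          f"model expects {model_input_dim}.")
```

When the two numbers differ, either the features need to be re-extracted with a config that matches the model, or a model trained on features of that dimension has to be used.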


David Snyder

Jun 24, 2018, 2:58:36 PM
to kaldi-help

Amber Xiangli

Jun 26, 2018, 10:43:03 AM
to kaldi-help
Hi,
Thank you for helping. I used a model trained with an older version of Kaldi. Now the diarization pipeline works fine.

After diarization there is an RTTM file listing the segments and their corresponding speakers, and I would like to use this file for a future speaker recognition task. Is there a way to prepare utt2spk, spk2utt and wav.scp without cutting the audio file into actual segments (or to somehow combine the RTTM file with data preparation)?

Many thanks.
Amber
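
One possible approach to the question above, offered as a sketch and not as an answer from the thread: a Kaldi segments file refers to time ranges inside the recordings listed in wav.scp, so the audio does not need to be cut. The Python below converts RTTM SPEAKER lines into segments, utt2spk and spk2utt lines. The function name, the ID naming scheme and the example RTTM content are assumptions made for illustration; the existing wav.scp, keyed by recording ID, is reused unchanged.

```python
from collections import defaultdict


def rttm_to_data_dir_lines(rttm_text):
    """Turn RTTM SPEAKER lines into (segments, utt2spk, spk2utt) line lists."""
    segments, utt2spk = [], []
    for line in rttm_text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        # RTTM columns: type, file, channel, start, duration, ..., speaker label
        rec, label = fields[1], fields[7]
        start, dur = float(fields[3]), float(fields[4])
        end = start + dur
        # Diarization labels are only unique within a recording, so the
        # speaker ID is prefixed with the recording ID.
        spk = f"{rec}-{label}"
        utt = f"{spk}-{round(start * 100):07d}-{round(end * 100):07d}"
        segments.append(f"{utt} {rec} {start:.2f} {end:.2f}")
        utt2spk.append(f"{utt} {spk}")
    by_spk = defaultdict(list)
    for entry in sorted(utt2spk):
        utt, spk = entry.split()
        by_spk[spk].append(utt)
    spk2utt = [f"{spk} {' '.join(utts)}" for spk, utts in sorted(by_spk.items())]
    return sorted(segments), sorted(utt2spk), spk2utt


# Made-up RTTM content for a single recording with two speakers.
example_rttm = """\
SPEAKER rec1 0 0.00 4.25 <NA> <NA> 1 <NA> <NA>
SPEAKER rec1 0 5.10 4.70 <NA> <NA> 2 <NA> <NA>
SPEAKER rec1 0 10.00 2.50 <NA> <NA> 1 <NA> <NA>
"""
segments, utt2spk, spk2utt = rttm_to_data_dir_lines(example_rttm)
print("\n".join(segments))
```

Note that the speaker labels produced this way identify speakers only within one recording; linking the same person across recordings would need a separate step.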
