Ivector Diarization


Raihan Hossain

Aug 5, 2019, 12:58:05 PM
to kaldi-help
Hi,


1. What is actually meant by segment-based PLDA training (Fig. 1 of the paper)?

2. The paper finds that an external ivector extractor performs better than the original extractor. In that case, can I use the pretrained VoxCeleb or SITW ivector systems and get similar results? If so, should I use the UBM provided with the pretrained model, or train the UBM on my own?

3. Figure 3 shows results from training the PLDA on long-term ivectors. In this case, did you extract ivectors over 180/120 sec segments and then use those ivectors to train the PLDA model? What script-level changes do I need to make to extract longer-segment ivectors?


Best Regards,
Raihan

Matthew Maciejewski

Aug 6, 2019, 2:21:21 PM
to kaldi-help
Hi Raihan,

Regarding your first question, segment-based PLDA training just means that each segment (natural utterance) is treated as a unique speaker. For example, if someone speaks a couple of sentences, pauses for a few seconds, then says another sentence, we would consider the 2-3 ivectors from the first sentences to be one speaker, and the 1-2 from the later sentence to be a different speaker.

For your 2nd question, you would want to use the pre-trained model. That said, unless you have a particular reason to use ivectors, I'd strongly recommend using xvectors, which are neural-network-based embeddings, in contrast to the statistical ivectors. They can more or less be used interchangeably, but the performance of xvectors greatly exceeds that of ivectors, and we have pre-trained models available on the Kaldi website. I personally would not consider using anything besides a pre-trained xvector model for any speaker ID or diarization task.

For question 3, I'm not exactly sure how to replicate that functionality. I believe you need to change the code to concatenate feature vectors. I think the xvector extractor code actually does something along those lines automatically. Of course, if you download an xvector model it will include a good PLDA already as well.

Ultimately, as far as I know, for any speaker-ID-type application the best thing to do is to give your model as much data as possible, which is why using a pre-trained model is desirable (unless you have access to a tremendous amount of data yourself). You can usually squeeze some minor extra performance out of a PLDA model with some kind of in-domain adaptation, but that is not likely to compare to just throwing more data at it.

—Matt

Kaldi_new_uSER

Sep 23, 2019, 11:42:01 PM
to kaldi-help


On Tuesday, August 6, 2019 at 11:51:21 PM UTC+5:30, Matthew Maciejewski wrote:
Hi Raihan,

Regarding your first question, the segment-based PLDA training just means that each segment (natural utterance) is considered a unique speaker. As in, if someone speaks a couple sentences, then pauses for a few seconds, then says another sentence, we would consider the 2-3 ivectors from the first sentences to be one speaker, then the 1-2 from the other sentence to be a different speaker.


To simulate this, I modified the utt2spk (and spk2utt) files so that each utterance has a unique speaker, but it gave the following error during PLDA training:

ivector-compute-plda ark:exp/sdm1/ivectors_train_oraclespk/spk2utt 'ark:ivector-subtract-global-mean scp:exp/sdm1/ivectors_train_oraclespk/ivector.scp ark:- | transform-vec exp/sdm1/ivectors_dev/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |' exp/sdm1/ivectors_dev/plda
ivector-subtract-global-mean scp:exp/sdm1/ivectors_train_oraclespk/ivector.scp ark:-
transform-vec exp/sdm1/ivectors_dev/transform.mat ark:- ark:-
ivector-normalize-length ark:- ark:-
LOG (ivector-subtract-global-mean[5.5.37~1-107e]:main():ivector-subtract-global-mean.cc:76) Read 58331 iVectors.
LOG (ivector-subtract-global-mean[5.5.37~1-107e]:main():ivector-subtract-global-mean.cc:79) Norm of iVector mean was 0.72456
LOG (ivector-subtract-global-mean[5.5.37~1-107e]:main():ivector-subtract-global-mean.cc:108) Wrote 58331 mean-subtracted iVectors
LOG (transform-vec[5.5.37~1-107e]:main():transform-vec.cc:85) Applied transform to 58331 vectors.
LOG (ivector-normalize-length[5.5.37~1-107e]:main():ivector-normalize-length.cc:90) Processed 58331 iVectors.
LOG (ivector-normalize-length[5.5.37~1-107e]:main():ivector-normalize-length.cc:94) Average ratio of iVector to expected length was 1.25182, standard deviation was 0.152301
LOG (ivector-compute-plda[5.5.37~1-107e]:main():ivector-compute-plda.cc:109) Accumulated stats from 58331 speakers (0 with no utterances), consisting of 58331 utterances (0 absent from input).

ERROR (ivector-compute-plda[5.5.37~1-107e]:main():ivector-compute-plda.cc:117) No speakers with multiple utterances, unable to estimate PLDA.
How did you circumvent this issue?

Matthew Maciejewski

Sep 23, 2019, 11:51:57 PM
to kaldi-help
The error says that it is unable to estimate the PLDA because there are no speakers with multiple utterances. PLDA models are computed by jointly minimizing within-class covariance while maximizing between-class covariance. If there is only one sample per class, there is no way to compute a class covariance at all. You need to make sure there is at least one speaker with more than one utterance, preferably more. There is no way to avoid this constraint; it is inherent to how PLDA works.
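As a quick sanity check before running ivector-compute-plda, you can count how many speakers in your spk2utt have more than one utterance. A minimal sketch (the sample spk2utt contents here are made up for illustration):

```shell
# spk2utt lines look like: <speaker> <utt1> <utt2> ...
# so NF > 2 means the speaker has at least two utterances.
cat > spk2utt <<'EOF'
spkA utt1 utt2 utt3
spkB utt4
EOF

awk 'NF > 2 {n++} END {print n+0, "speakers with multiple utterances"}' spk2utt
# prints: 1 speakers with multiple utterances
```

If that count is zero, PLDA training will fail with exactly the error above.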

Kaldi_new_uSER

Sep 24, 2019, 12:43:01 AM
to kaldi-help
Thanks for the reply. Do you mean that, for segment-based PLDA training (as per the paper), the spk2utt file should look like this:

exp/sdm1/ivectors_train_oraclespk/spk2utt (539 lines)

AMI_EN2001a_SDM_FEO065 AMI_EN2001a_SDM_FEO065_0021133_0021442-00000000-00000300 AMI_EN2001a_SDM_FEO065_0021442_0022058-00000000-00000300 AMI_EN2001a_SDM_FEO065_0081159_0081631-00000000-00000300 AMI_EN2001a_SDM_FEO065_0089104_0089875-00000000-00000300 AMI_EN2001a_SDM_FEO065_0090130_0090775-00000000-00000300 AMI_EN2001a_SDM_FEO065_0090775_0091688-00000000-00000300 AMI_EN2001a_SDM_FEO065_0162071_0162292-00000000-00000221 ...

Or should there be more than 539 speakers?

Matthew Maciejewski

Sep 24, 2019, 11:01:02 PM
to kaldi-help
It's been a long time since I've looked at the code for this, but that's not correct. I believe your speaker labels should be things along the lines of AMI_EN2001a_SDM_FEO065_0021133_0021442, and then you need to make sure there are segments with more than one subsegment (i.e. in addition to 00000000-00000300 there's an additional subsegment like 00000425-00000725).

From an implementation point of view, it's easiest to generate the utt2spk file, then use the utt2spk_to_spk2utt.pl script to generate the spk2utt file. Basically you would take your list of utterances, copy them, and cut off the last two fields, i.e. something like:
AMI_EN2001a_SDM_FEO065_0021133_0021442-00000000-00000300 AMI_EN2001a_SDM_FEO065_0021133_0021442
AMI_EN2001a_SDM_FEO065_0021442_0022058-00000000-00000300 AMI_EN2001a_SDM_FEO065_0021442_0022058
AMI_EN2001a_SDM_FEO065_0090130_0090775-00000000-00000300 AMI_EN2001a_SDM_FEO065_0090130_0090775
...
Then just run utt2spk_to_spk2utt.pl to generate the spk2utt file, and verify that some "speakers" have more than one utterance.
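To make that concrete, here's a minimal sketch of the "cut off the last two fields" step with awk. The utterance IDs follow the example above, but the file names (utt.list, utt2spk) are placeholders; in a real setup the output would then go through utils/utt2spk_to_spk2utt.pl:

```shell
# Derive a per-segment "speaker" label by stripping the trailing
# -XXXXXXXX-XXXXXXXX subsegment fields from each utterance ID.
cat > utt.list <<'EOF'
AMI_EN2001a_SDM_FEO065_0021133_0021442-00000000-00000300
AMI_EN2001a_SDM_FEO065_0021133_0021442-00000425-00000725
AMI_EN2001a_SDM_FEO065_0021442_0022058-00000000-00000300
EOF

# utt2spk format: <utt> <spk>
awk '{spk = $1; sub(/-[0-9]+-[0-9]+$/, "", spk); print $1, spk}' utt.list > utt2spk
cat utt2spk
```

Here the first two subsegments map to the same segment-level "speaker", which gives PLDA the within-class pairs it needs.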

