Speaker Diarization

Yi Yang

Oct 14, 2019, 6:17:17 AM
to kaldi-help
Hi All,

I am trying out the callhome_diarization example for speaker diarization, referring to this link (https://github.com/kaldi-asr/kaldi/issues/2523#issuecomment-408935477), which gives the explanation below:

Diarized Speech
If everything went well, you should have a file called rttm in the directory $nnet_dir/xvectors_$name/plda_scores_threshold_${threshold}/. The 2nd column is the recording ID, the 4th column is the start time of a segment, and the 5th is its duration. The 8th column is the speaker label assigned to that segment.

SPEAKER mfcny 0 86.200 16.400 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 103.050 5.830 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 109.230 4.270 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 113.760 8.625 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 122.385 4.525 <NA> <NA> 2 <NA> <NA>
SPEAKER mfcny 0 127.230 6.230 <NA> <NA> 2 <NA> <NA>
SPEAKER mfcny 0 133.820 0.850 <NA> <NA> 2 <NA> <NA>
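
As a quick illustration (not part of the recipe), the fields described above can be pulled out of such a line by position. A minimal sketch in Python, assuming the standard RTTM layout:

def parse_rttm_line(line):
    # Split one RTTM SPEAKER line into the fields described above.
    fields = line.split()
    return {
        "recording_id": fields[1],     # 2nd column
        "start": float(fields[3]),     # 4th column: segment start (seconds)
        "duration": float(fields[4]),  # 5th column: segment duration (seconds)
        "speaker": fields[7],          # 8th column: speaker/cluster label
    }

print(parse_rttm_line("SPEAKER mfcny 0 86.200 16.400 <NA> <NA> 1 <NA> <NA>"))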


The explanation says that the 8th column of the rttm file is the speaker label/ID of the segment. From which part of the pipeline do the speaker IDs "1" and "2" come?

Regards,
YiYang

David Snyder

Oct 14, 2019, 10:31:30 AM
to kaldi-help
The speaker IDs come from the clustering stage of the diarization pipeline. They are arbitrary cluster labels, e.g., 1, 2, 3, etc. The same information appears in a file called "labels" in the same directory as the rttm file.

You can see how the labels and rttm files are created in stages 0 and 1 of the cluster.sh script: https://github.com/kaldi-asr/kaldi/blob/master/egs/callhome_diarization/v1/diarization/cluster.sh#L91
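
For illustration, a rough sketch of that conversion in Python (the recipe's RTTM-writing script does the real work, including merging adjacent segments from the same cluster), assuming the labels file holds "<segment-id> <cluster-id>" lines and segments has the usual Kaldi "<segment-id> <recording-id> <start> <end>" layout:

labels = {}  # segment-id -> cluster label
with open("labels") as f:
    for line in f:
        seg, cluster = line.split()
        labels[seg] = cluster

with open("segments") as f, open("rttm", "w") as out:
    for line in f:
        seg, reco, start, end = line.split()
        duration = float(end) - float(start)
        out.write("SPEAKER %s 0 %.3f %.3f <NA> <NA> %s <NA> <NA>\n"
                  % (reco, float(start), duration, labels[seg]))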

Yi Yang

Oct 15, 2019, 3:59:50 AM
to kaldi-help
Hi David,

Below is a sample of the data I prepared:
wav.scp
VL180810112737108 ./wavFile/VL180810112737108.wav

segments
spk001-VL180810112737108-001 VL180810112737108 0.00 0.76
spk001-VL180810112737108-002 VL180810112737108 1.03 1.92
spk002-VL180810112737108-001 VL180810112737108 1.92 2.86
spk002-VL180810112737108-002 VL180810112737108 7.11 10.47


utt2spk
spk001-VL180810112737108-001 spk001
spk001-VL180810112737108-002 spk001
spk002-VL180810112737108-001 spk002
spk002-VL180810112737108-002 spk002

spk2utt
spk001 spk001-VL180810112737108-001 spk001-VL180810112737108-002
spk002 spk002-VL180810112737108-001 spk002-VL180810112737108-002
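
As an aside, utt2spk and spk2utt here can be derived mechanically from the segments file, since the utterance IDs encode the speaker. Kaldi's own utils/utt2spk_to_spk2utt.pl handles the last step; the Python below is just a hypothetical stand-in for both:

from collections import defaultdict

# Assumes utterance IDs of the form <speaker>-<recording>-<n>,
# as in the segments file above.
spk2utt = defaultdict(list)
with open("segments") as f:
    for line in f:
        utt_id = line.split()[0]
        spk2utt[utt_id.split("-")[0]].append(utt_id)

with open("utt2spk", "w") as f:
    for spk, utts in sorted(spk2utt.items()):
        for utt in utts:
            f.write("%s %s\n" % (utt, spk))

with open("spk2utt", "w") as f:
    for spk, utts in sorted(spk2utt.items()):
        f.write("%s %s\n" % (spk, " ".join(utts)))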


I have used this data to run the speaker diarization example following this (https://github.com/kaldi-asr/kaldi/issues/2523#issuecomment-408935477) with the provided pretrained model.
The "labels" file that is generated shows different labels for spk001, such as 29 or 8.

My confusion is: how can the number shown in the "labels" or "rttm" files tell me whether a segment/utterance is spoken by spk001 or spk002?

Regards,
YiYang

David Snyder

Oct 15, 2019, 3:23:43 PM
to kaldi-help
For speaker diarization, the speaker IDs in your data directory should actually be the recording IDs.

The speakers identified by the clustering stage have no connection to the speaker IDs in your spk2utt file. Again, the speaker IDs there should really be recording IDs.
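
If you want to know which cluster corresponds to which of your reference speakers, you can work that out after the fact. A minimal sketch (evaluation-side bookkeeping, not something the Kaldi recipe does for you) maps each cluster to the reference speaker whose segments it overlaps most in time:

from collections import defaultdict

def overlap(a_start, a_end, b_start, b_end):
    # Length of temporal overlap between two segments, in seconds.
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def map_clusters(hyp, ref):
    # hyp: [(start, end, cluster)], ref: [(start, end, speaker)].
    # Returns {cluster: reference speaker with the most overlapped time}.
    time = defaultdict(lambda: defaultdict(float))
    for hs, he, cluster in hyp:
        for rs, re_, spk in ref:
            time[cluster][spk] += overlap(hs, he, rs, re_)
    return {c: max(spks, key=spks.get) for c, spks in time.items()}

hyp = [(0.00, 0.76, "1"), (1.92, 2.86, "2")]
ref = [(0.00, 0.76, "spk001"), (1.92, 2.86, "spk002")]
print(map_clusters(hyp, ref))  # {'1': 'spk001', '2': 'spk002'}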

Yi Yang

Oct 17, 2019, 5:37:06 AM
to kaldi-help
Hi David,

I prepared the data from a recording that has 24 segments spoken by 2 different speakers, and the speaker ID is the recording ID.

After I run diarization/cluster.sh:

If I include the reco2num-spk file, in which I set the number of speakers to 2, the rttm file I get is as below:
SPEAKER VL180810112737108 0   0.000   0.760 <NA> <NA> 1 <NA> <NA>
SPEAKER VL180810112737108 0   1.030   1.830 <NA> <NA> 1 <NA> <NA>
SPEAKER VL180810112737108 0   7.110   3.360 <NA> <NA> 1 <NA> <NA>
SPEAKER VL180810112737108 0  11.930   1.330 <NA> <NA> 1 <NA> <NA>
SPEAKER VL180810112737108 0  13.350   1.740 <NA> <NA> 1 <NA> <NA>
SPEAKER VL180810112737108 0  20.050   4.570 <NA> <NA> 1 <NA> <NA>
SPEAKER VL180810112737108 0  24.640   0.320 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0  25.730   1.020 <NA> <NA> 1 <NA> <NA>
SPEAKER VL180810112737108 0  27.950   4.210 <NA> <NA> 1 <NA> <NA>
SPEAKER VL180810112737108 0  32.190   0.260 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0  33.900 -13.895 <NA> <NA> 1 <NA> <NA>
SPEAKER VL180810112737108 0  20.005 -14.605 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0   5.500   0.560 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0   6.360   0.640 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0  11.300   0.300 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0  11.620   0.270 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0  16.290   0.230 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0  16.680   1.090 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0  18.180   0.250 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0  18.530   0.630 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0  19.260   0.710 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0  33.430   0.350 <NA> <NA> 2 <NA> <NA>

From this output, do the numbers 1 and 2 in the 8th column indicate which speaker spoke each segment?
That is, the output just labels the speakers in this recording as 1 and 2?

Then, without the reco2num-spk file, the rttm output I get from the same data is as below:
SPEAKER VL180810112737108 0   0.000   0.760 <NA> <NA> 8 <NA> <NA>
SPEAKER VL180810112737108 0   1.030   1.830 <NA> <NA> 8 <NA> <NA>
SPEAKER VL180810112737108 0   7.110   3.360 <NA> <NA> 8 <NA> <NA>
SPEAKER VL180810112737108 0  11.930   1.330 <NA> <NA> 8 <NA> <NA>
SPEAKER VL180810112737108 0  13.350   1.740 <NA> <NA> 8 <NA> <NA>
SPEAKER VL180810112737108 0  20.050   0.465 <NA> <NA> 1 <NA> <NA>
SPEAKER VL180810112737108 0  20.515   4.105 <NA> <NA> 8 <NA> <NA>
SPEAKER VL180810112737108 0  24.640   0.320 <NA> <NA> 7 <NA> <NA>
SPEAKER VL180810112737108 0  25.730   1.020 <NA> <NA> 8 <NA> <NA>
SPEAKER VL180810112737108 0  27.950   4.210 <NA> <NA> 8 <NA> <NA>
SPEAKER VL180810112737108 0  32.190   0.260 <NA> <NA> 7 <NA> <NA>
SPEAKER VL180810112737108 0  33.900 -13.895 <NA> <NA> 8 <NA> <NA>
SPEAKER VL180810112737108 0  20.005 -14.605 <NA> <NA> 9 <NA> <NA>
SPEAKER VL180810112737108 0   5.500   0.560 <NA> <NA> 2 <NA> <NA>
SPEAKER VL180810112737108 0   6.360   0.640 <NA> <NA> 9 <NA> <NA>
SPEAKER VL180810112737108 0  11.300   0.300 <NA> <NA> 9 <NA> <NA>
SPEAKER VL180810112737108 0  11.620   0.270 <NA> <NA> 3 <NA> <NA>
SPEAKER VL180810112737108 0  16.290   0.230 <NA> <NA> 4 <NA> <NA>
SPEAKER VL180810112737108 0  16.680   1.090 <NA> <NA> 9 <NA> <NA>
SPEAKER VL180810112737108 0  18.180   0.250 <NA> <NA> 5 <NA> <NA>
SPEAKER VL180810112737108 0  18.530   0.630 <NA> <NA> 9 <NA> <NA>
SPEAKER VL180810112737108 0  19.260   0.710 <NA> <NA> 9 <NA> <NA>
SPEAKER VL180810112737108 0  33.430   0.350 <NA> <NA> 6 <NA> <NA>

The highest number I get in this output is "9", so does that mean there are 9 different speakers in the recording?

Regards,
YiYang

David Snyder

Oct 17, 2019, 10:39:05 AM
to kaldi-help
Hi Yi Yang,

From this output, do the numbers 1 and 2 in the 8th column indicate which speaker spoke each segment?
That is, the output just labels the speakers in this recording as 1 and 2?

Yes. The value in column 8 is just a speaker label: an arbitrary numerical ID that distinguishes one speaker from the others in the recording.

The highest number I get in this output is "9", so does that mean there are 9 different speakers in the recording?

It's overestimating the number of speakers because the stopping criterion is not well tuned.
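
For intuition, here is a toy sketch of the two stopping modes (Kaldi clusters PLDA scores, not Euclidean distances, but the effect of the threshold is the same; this is not the recipe's actual code):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy "embeddings": two well-separated speakers, 10 segments each.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, size=(10, 2)),
               rng.normal(6.0, 1.0, size=(10, 2))])

Z = linkage(x, method="average")

# A threshold that is too tight fragments each true speaker into
# several clusters, like the labels 1-9 in your output above ...
tight = fcluster(Z, t=1.5, criterion="distance")
print(len(set(tight)), "clusters with a tight threshold")

# ... while fixing the cluster count (what reco2num-spk does in
# cluster.sh) recovers the two-speaker partition.
fixed = fcluster(Z, t=2, criterion="maxclust")
print(len(set(fixed)), "clusters when the count is given")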

To better understand what's going on, you might want to read the "Clustering Speakers" section here: https://github.com/kaldi-asr/kaldi/issues/2523#issuecomment-408935477

Best,
David

Yi Yang

Oct 21, 2019, 5:35:15 AM
to kaldi-help
Hi David,

I am currently working on a voice biometrics project, and I am looking for the most accurate text-independent (or text-dependent) speaker verification method/script in Kaldi. Could you advise on which to use?

I very much appreciate your wonderful work on Kaldi and the speaker diarization project, and I look forward to your kind advice on the above.

Thank you.
YiYang

David Snyder

Oct 21, 2019, 3:53:08 PM
to kaldi-help
Both speaker recognition and diarization use similar technology based on DNN embeddings that capture speaker characteristics.

For speaker recognition, look at the egs/sre16/v2 (narrowband) or egs/sitw/v2 (wideband) recipes. There are also pretrained models available at http://kaldi-asr.org/models.html.

But in general, good performance depends on your training data more than anything else. If you want an accurate system, you need lots of in-domain training data.
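
Once embeddings are extracted, verification itself reduces to scoring an enrollment embedding against a test embedding and thresholding. A minimal sketch using cosine similarity as a stand-in (the recipes actually score x-vectors with PLDA, and the threshold here is illustrative, not a Kaldi default):

import numpy as np

def cosine_score(enroll, test):
    # Cosine similarity between enrollment and test embeddings.
    return float(np.dot(enroll, test) /
                 (np.linalg.norm(enroll) * np.linalg.norm(test)))

# Hypothetical 512-dim x-vectors; in Kaldi these would come from
# extract_xvectors.sh, typically centered and length-normalized first.
enroll = np.random.randn(512)
test = np.random.randn(512)

threshold = 0.5  # tuned on held-out trials
print("accept" if cosine_score(enroll, test) > threshold else "reject")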