callhome_diarization - Data preparation


sandeep cb

May 1, 2018, 5:30:16 AM
to kaldi-help
Hi,

I am trying to do speaker diarization using the callhome_diarization recipe.
I have a set of conversational recordings.

How is data preparation done for this example?
I couldn't find any proper documentation for callhome_diarization.
I don't have access to the NIST SRE datasets.

Please help me in this regard.

Matthew Maciejewski

May 1, 2018, 3:06:13 PM
to kaldi-help
Hi,

There are a few different things you'll need for training, diarizing, and scoring the output.

Training:

To train the system, you will need speech activity detection segmentation and speaker labels. The CallHome recipe uses SRE data that is not segmented, but is clean enough that we run a basic energy VAD to generate segmentation (line 71 of the run.sh script), which would not be needed if you have actual segmentation. We can also do this because we know that each recording has only one speaker in it.

In terms of the training data directory, you should only need the following files:
wav.scp
segments
utt2spk and spk2utt
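
To illustrate how utt2spk and spk2utt relate: spk2utt is just the inverse mapping of utt2spk. Kaldi ships utils/utt2spk_to_spk2utt.pl for this, so the following is only a minimal Python sketch of the idea, assuming utt2spk lines of the form "<utterance-id> <speaker-id>". The function name and the example IDs are made up for illustration.

```python
from collections import defaultdict


def utt2spk_to_spk2utt(utt2spk_lines):
    """Invert utt2spk lines ("<utt-id> <spk-id>") into spk2utt lines."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        if not line.strip():
            continue  # skip blank lines
        utt, spk = line.split()
        spk2utt[spk].append(utt)
    # For plain ASCII IDs, Python's string sort gives the same order as
    # sorting with LC_ALL=C, which is what the Kaldi scripts expect.
    return [
        f"{spk} {' '.join(sorted(utts))}"
        for spk, utts in sorted(spk2utt.items())
    ]


if __name__ == "__main__":
    # Hypothetical IDs, only to show the shape of the files.
    example = [
        "spkA-rec1-0000000-0000081 spkA",
        "spkA-rec1-0000830-0001061 spkA",
        "spkB-rec2-0000000-0000500 spkB",
    ]
    print("\n".join(utt2spk_to_spk2utt(example)))
```

Each output line lists one speaker followed by all of that speaker's utterance IDs.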

It's also worth noting that the speaker labels are only used for the PLDA training. The ivector extractor is relatively data-hungry, so it can be beneficial to train it with extra data that isn't speaker labeled if you have it.

Diarizing:

If you want to just generate diarization output for some audio files, all you need is speech activity detection segmentation. In the CallHome recipe, we just use the ground truth segmentation, but you can also run a SAD system first. The "models" part of the Kaldi website has a SAD model you could download and use, though I personally have not tried that particular model.

Scoring:

If you want to score your diarization output (evaluate the performance), you'll need ground truth to score against. This means you'll need the time marks and speaker labels for all the speech in your evaluation set. You will also have to create an RTTM file, which is what the NIST scoring tool uses. The format of these files is documented online. Also, if you want to cluster based on the oracle number of speakers rather than according to a threshold, you will need to generate a reco2num_spk file.
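
As a rough sketch of what those two files look like, here is some Python that builds RTTM lines and reco2num_spk lines from a list of labeled segments. The assumptions here are mine: the input is a list of (recording-id, start, end, speaker) tuples, the channel field is written as 0, and the unused RTTM fields are filled with <NA>. The function names are hypothetical, and you should check the field layout against the RTTM documentation for your scoring tool before relying on it.

```python
def make_rttm_lines(segments):
    """Build RTTM SPEAKER lines.

    segments: iterable of (recording_id, start_sec, end_sec, speaker_label).
    """
    lines = []
    for reco, start, end, spk in sorted(segments):
        duration = end - start  # RTTM stores onset and duration, not end time
        lines.append(
            f"SPEAKER {reco} 0 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {spk} <NA> <NA>"
        )
    return lines


def make_reco2num_spk(segments):
    """Build "<recording-id> <number-of-speakers>" lines."""
    speakers = {}
    for reco, _start, _end, spk in segments:
        speakers.setdefault(reco, set()).add(spk)
    return [f"{reco} {len(spks)}" for reco, spks in sorted(speakers.items())]


if __name__ == "__main__":
    # Hypothetical ground truth for one two-speaker recording.
    truth = [
        ("rec1", 0.00, 0.81, "A"),
        ("rec1", 8.30, 10.61, "B"),
    ]
    print("\n".join(make_rttm_lines(truth)))
    print("\n".join(make_reco2num_spk(truth)))
```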


—Matt

sandeep cb

May 3, 2018, 9:08:11 AM
to kaldi-help
Thanks Matthew for explaining it in such detail.
I was able to get the segments from the SAD model provided by Kaldi (it was pretty good).

I am a bit confused about how to create the utt2spk and spk2utt files, since each utterance has more than one speaker.
Should I label each segment of the utterance with its speaker?

I also have another dataset where each utterance has only one speaker.
Should I use this data to get the segments and train a PLDA?

Which approach is best here, given that I will ultimately run diarization on the conversational utterances?

Matthew Maciejewski

May 7, 2018, 8:59:54 PM
to kaldi-help
For the training data, you will need speaker segmentation, not just speech activity detection. Each segment should have only one speaker, and the utt2spk file should contain the speaker label for that segment.

For training the ivector extractor, each utterance has to only have one speaker.
For training the PLDA, each utterance has to have only one speaker and you need to know the speaker's identity.

Depending on what data you have, you may be able to train the ivector extractor with utterances containing multiple speakers, but from a theoretical perspective that is incorrect, and I would only consider doing that if there was not enough data to train a reasonable model otherwise.

sandeep cb

May 16, 2018, 5:57:00 AM
to kaldi-help
I am sorry to ask you this, but I am not able to sort the new utt2spk file.
I tried many things, and nothing seems to work.

I tried something like this:
utt2spk (changed speaker ID):
test18-0000000-0000081 test18-1
test18-0000830-0001061 test18-2

utt2spk (changed utterance ID):
test18-1-0000000-0000081 test18-1
test18-1-0000830-0001061 test18-2

utt2spk (older):
test18-0000000-0000081 test18
test18-0000830-0001061 test18

wav.scp:
test18 /path/file.wav

Matthew Maciejewski

May 16, 2018, 2:44:27 PM
to kaldi-help
Can you clarify what you're trying to do and what you mean by "nothing seems to work"? Are you getting an error? The error should explain things to some extent.

Nothing about those new utt2spk files seems wrong. The speaker id should be a prefix of the utterance id, though, for sorting reasons, i.e. it should be something like:

test18-1-0000000-0000081 test18-1
test18-2-0000830-0001061 test18-2

And then of course, the rest of the data directory (for example the segments file) needs to match. You should probably read through the data directory part of this page. It will probably help. It is primarily for an ASR setup but it should be fairly obvious to tell what parts are not necessary.
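
To make the naming convention concrete, here is a small Python sketch that builds utterance IDs with the speaker ID as a prefix. I am assuming the two numeric fields in your IDs are the segment start and end in hundredths of a second, zero-padded to seven digits; the helper name is hypothetical.

```python
def make_utt_id(speaker_id, start_sec, end_sec):
    """Return "<speaker-id>-<start>-<end>" with times in centiseconds."""
    start_cs = int(round(start_sec * 100))
    end_cs = int(round(end_sec * 100))
    return f"{speaker_id}-{start_cs:07d}-{end_cs:07d}"


if __name__ == "__main__":
    labeled = [
        ("test18-1", 0.00, 0.81),
        ("test18-2", 8.30, 10.61),
    ]
    for spk, start, end in labeled:
        utt = make_utt_id(spk, start, end)
        # One utt2spk line per segment: "<utterance-id> <speaker-id>"
        print(utt, spk)
```

Because every utterance ID starts with its speaker ID, sorting by utterance ID also keeps each speaker's utterances together.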

—Matt

sandeep cb

May 17, 2018, 1:53:12 AM
to kaldi-help
Hi Matthew,

I mean that when I fix the data directory with the utils/fix_data_dir.sh script, the utterances are filtered down to zero:

utils/fix_data_dir.sh: file data/callhome1/utt2spk is not in sorted order or not unique, sorting it
utils/fix_data_dir.sh: file data/callhome1/segments is not in sorted order or not unique, sorting it
utils/fix_data_dir.sh: filtered data/callhome1/segments from 135 to 0 lines based on filter /tmp/kaldi.VxqW/recordings.
utils/fix_data_dir.sh: filtered data/callhome1/wav.scp from 10 to 0 lines based on filter /tmp/kaldi.VxqW/recordings.
fix_data_dir.sh: no utterances remained: not proceeding further

As you can see, my wav.scp has test18 as the utterance ID, but the utt2spk entries start with test18-1-, so I think that is the problem.

Thanks,
Sandeep

Matthew Maciejewski

May 17, 2018, 9:38:08 AM
to kaldi-help
You should definitely read the data section of the Kaldi docs page that I linked carefully; it will help with those problems.

For example, wav.scp does not contain utterance IDs, but rather recording IDs. The map between utterances and recordings is within the segments file. So, you should have files with lines something like this:

wav.scp:
test18 /path/to/file/test18.wav

segments:
test18-1-0000000-0000081 test18 0.00 0.81

utt2spk:
test18-1-0000000-0000081 test18-1

The files also have to be in sorted order, and the sort order matters, which is why the utterance ID needs to have the speaker ID as a prefix.
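
If it helps to see how the three files have to line up, here is a minimal Python sketch of a cross-file check. It is not what fix_data_dir.sh does internally, just an illustration of the relationships described above: every recording ID in segments must appear in wav.scp, every utterance in utt2spk must appear in segments, and each utterance ID should start with its speaker ID. The function name and messages are hypothetical.

```python
def check_data_dir(wav_scp, segments, utt2spk):
    """Check consistency of three data-dir files given as lists of lines.

    Returns a list of human-readable problems (empty if none were found).
    """
    problems = []
    # wav.scp: "<recording-id> <path or command>"
    reco_ids = {line.split()[0] for line in wav_scp if line.strip()}

    # segments: "<utterance-id> <recording-id> <start> <end>"
    seg_utts = set()
    for line in segments:
        if not line.strip():
            continue
        utt, reco, start, end = line.split()
        seg_utts.add(utt)
        if reco not in reco_ids:
            problems.append(
                f"segments: recording '{reco}' (utterance {utt}) is not in wav.scp"
            )
        if float(end) <= float(start):
            problems.append(f"segments: {utt} has end <= start")

    # utt2spk: "<utterance-id> <speaker-id>"
    utt_ids = []
    for line in utt2spk:
        if not line.strip():
            continue
        utt, spk = line.split()
        utt_ids.append(utt)
        if utt not in seg_utts:
            problems.append(f"utt2spk: {utt} is not in segments")
        if not utt.startswith(spk):
            problems.append(f"utt2spk: {utt} does not start with speaker {spk}")

    # For ASCII IDs, Python's default string sort matches LC_ALL=C order.
    if utt_ids != sorted(utt_ids):
        problems.append("utt2spk: not in sorted order")
    return problems


if __name__ == "__main__":
    wav_scp = ["test18 /path/to/file/test18.wav"]
    segments = ["test18-1-0000000-0000081 test18 0.00 0.81"]
    utt2spk = ["test18-1-0000000-0000081 test18-1"]
    for problem in check_data_dir(wav_scp, segments, utt2spk):
        print(problem)
```

With the example lines above the check reports nothing; if the second column of segments were test18-1 instead of test18, it would flag that recording as missing from wav.scp.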

—Matt

sandeep cb

May 17, 2018, 10:04:05 AM
to kaldi-help
Thank you so much. That worked. 