Online Diarization

Nathan Lindle

Aug 9, 2021, 1:04:04 PM
to kaldi-help
Hello -- I'm working on a system to do online diarization using xvectors.  I know that the problem is generally hard, but I have a few assumptions I can make to limit the scope.  

First of all, I know that the majority of the audio contains just 2 speakers, so I'm only looking to differentiate between those two.  I have also analyzed thousands of past conversations to come up with phrases that I am very confident will be said by one speaker but not the other.  This means that if I run speech recognition in parallel, I can compile samples of speech from each of the two speakers.  The diarization also doesn't have to kick in right away, so I can wait to build up these examples.

I trained the x-vector extractor using the callhome_diarization v2 recipe up to stage 7.  The only major difference (which I don't think matters much) is that I needed to use the Aspire SAD model instead of the recipe's VAD, so I had to tweak prepare_feats slightly.  I also changed the egs so that they are all the same size, which I set to 30 frames.

The way I have it set up currently, I start from the basic flow in online2-tcp-nnet3-decode-faster.cc.  I'm using a feature pipeline set up with the MFCC config from the x-vector training, plus a cmvn-config with these lines:
--cmn-window=300
--norm-vars=false

I then periodically check for frames that are ready from the feature pipeline and create 30-frame chunks, which I feed into the x-vector extractor (I used the same architecture and I'm extracting from tdnn6.affine, as suggested).  I do this with a hop of 10 frames.  During the "enrollment" period, I may register these x-vectors as belonging to silence, speaker 1, or speaker 2, as indicated by ASR.
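Roughly, the chunking and enrollment bookkeeping look like this (a Python/numpy sketch, not my real code; extract_xvector and asr_label_for are placeholders standing in for the actual nnet3 forward pass and the ASR-based labeling):

import numpy as np

CHUNK = 30   # frames per x-vector chunk
HOP = 10     # extract a new x-vector every 10 frames

def extract_xvector(feat_chunk):
    # placeholder for my real call: nnet3 forward pass up to tdnn6.affine
    return np.zeros(512)

enrollment = {"silence": [], "spk1": [], "spk2": []}

def process_ready_frames(feats, start, asr_label_for=None):
    """feats: (num_ready_frames, feat_dim) matrix from the feature pipeline.
    asr_label_for(t): optional callback mapping a frame index to
    'silence' / 'spk1' / 'spk2' during the enrollment phase."""
    xvecs = []
    t = start
    while t + CHUNK <= feats.shape[0]:
        xvec = extract_xvector(feats[t:t + CHUNK])
        xvecs.append((t, xvec))
        if asr_label_for is not None:
            label = asr_label_for(t)
            if label is not None:
                enrollment[label].append(xvec)
        t += HOP
    return xvecs, t   # t is the next chunk start to resume from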

I have also experimented with chunk sizes of 50 and 100 frames (with appropriately trained networks).

After I hit a threshold of example vectors in each of the 3 categories, I try to match each new x-vector to one of these groups.  I've been using cosine similarity to compare them.  I've tried k-nearest neighbors, and also averaging all the example vectors in each category to create a "center" and then comparing the new vector to those centroids.
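The centroid version of the matching step is basically this (numpy sketch; the enrollment lists are the labeled x-vectors collected above):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def build_centroids(enrollment):
    # average the labeled example x-vectors per class
    return {label: np.mean(np.stack(vecs), axis=0)
            for label, vecs in enrollment.items() if vecs}

def classify(xvec, centroids):
    # pick the class whose centroid has the highest cosine similarity
    scores = {label: cosine(xvec, c) for label, c in centroids.items()}
    return max(scores, key=scores.get), scores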

The system seems decent at recognizing silence, but it's erratic when someone is actually talking -- it detects transitions at the wrong times and misses real ones.

So, I'm wondering if anyone has suggestions for improvement.  I have a couple of ideas, but I don't know whether any of them have merit:
  • Maybe the CMN window is hurting the speech frames because silence isn't actually being filtered out like it was in training?  Could training without CMN work?
  • Since I have the sets of examples, maybe I could do some sort of PCA on them like in extract_xvectors.sh?  Or even just try subtracting the mean of the vectors with speech?  (See the sketch after this list.)
  • Would it be possible to retrain the x-vector extractor with the example data on the fly?  This seems excessive.
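For the second bullet, here is roughly what I have in mind (numpy sketch; the whitening step is only loosely in the spirit of what the diarization recipe does before PLDA scoring, and whether it is stable when estimated from so few enrollment examples is exactly what I'm unsure about):

import numpy as np

def train_normalizer(speech_xvecs):
    """speech_xvecs: (N, D) matrix of enrollment x-vectors with speech."""
    mean = speech_xvecs.mean(axis=0)
    centered = speech_xvecs - mean
    # optional PCA / whitening on the enrollment data
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(centered.shape[1])
    eigvals, eigvecs = np.linalg.eigh(cov)
    whiten = eigvecs / np.sqrt(eigvals)          # D x D whitening transform
    return mean, whiten

def normalize(xvec, mean, whiten=None):
    v = xvec - mean
    if whiten is not None:
        v = whiten.T @ v
    return v / (np.linalg.norm(v) + 1e-10)       # length-normalize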
I'm also curious if anyone else has ideas, critiques, or flaws to point out.  Thanks!
Nathan

nshm...@gmail.com

Aug 9, 2021, 7:10:18 PM
to kaldi-help
> I try to match each new x-vector into one of these groups. 

The UIS-RNN paper reports that k-means clustering gives 17% DER versus 10% DER for UIS-RNN.


> Or even just try subtracting the mean of the vectors with speech?

Proper vector normalization is important; getting it wrong is a common source of poor accuracy.

>  I also changed the egs so that they are all the same size, which I set to 30 frames.

Did you estimate the quality of your x-vectors after that change? What is the EER of your x-vector system? Have you looked at Google's UIS-RNN and LSTM-based work?

FULLY SUPERVISED SPEAKER DIARIZATION
https://github.com/google/uis-rnn
https://arxiv.org/pdf/1810.04719.pdf

LINKS: A HIGH-DIMENSIONAL ONLINE CLUSTERING METHOD
https://arxiv.org/pdf/1801.10123.pdf

From the corresponding paper:

Another minor but important trick is that, the speaker recognizer model used in [3] and [6] are trained on windows of size 1600ms, which causes performance degradation when we run inference on smaller windows. For example, in the diarization system, the window size is only 240ms. Thus we have retrained a new model “dvector V3” by using variable-length windows, where the window size is drawn from a uniform distribution within [240ms, 1600ms] during training.
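As a toy sketch of what that variable-length sampling looks like on the egs side (the frame shift and window range here are just the values from the quote, not your recipe's actual settings):

import numpy as np

FRAME_SHIFT_MS = 10
MIN_MS, MAX_MS = 240, 1600   # window range used for "dvector V3" in the paper

def sample_chunk(utt_feats, rng=np.random):
    """utt_feats: (num_frames, feat_dim). Returns a random-length chunk."""
    n_frames = utt_feats.shape[0]
    chunk_len = int(rng.uniform(MIN_MS, MAX_MS) / FRAME_SHIFT_MS)
    chunk_len = min(chunk_len, n_frames)
    start = rng.randint(0, n_frames - chunk_len + 1)
    return utt_feats[start:start + chunk_len]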

The EER for their d-vectors is also listed for comparison.

You can also check the recent NeMo implementation; it's not fully complete, but it implements a great idea of using word segmentation from ASR for speaker change points:

https://github.com/NVIDIA/NeMo/compare/online_diarization#diff-d1c2565f44828be0e173d20c95503f6b702cb4269af314f8bf356aa04632d27f

Nathan Lindle

Aug 10, 2021, 5:15:44 PM
to kaldi-help
Thank you very much for the thoughtful reply and the links.  I had been hoping not to stray too far from the Kaldi framework for simplicity's sake, but it was good to read more about UIS-RNN and consider their approach.  It does seem like improving the quality of the embeddings had some of the biggest impact.

> Did you estimate the quality of your x-vector after that?

Yes, the validation data from the x-vector trainer indicates that there is a very significant performance drop going from 1 s to 0.5 s to 0.3 s chunks.  The 100-frame vectors perform significantly better, but they of course miss shorter responses.

Regarding the window size mismatch -- yes, I think the original callhome diarization recipe had a minimum segment size of 2 seconds, which is why I wanted to retrain with a smaller window.  Perhaps the extractor would still benefit from variable sizes, though.

> it implements a great idea of using word segmentation from ASR for speaker change points

I think utilizing more information from ASR is another very interesting idea.  Perhaps I could extract x-vectors for individual words and limit speaker transitions to those word boundaries, as you're suggesting.  Obviously the variable window size training would be essential for that.
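To make that concrete, here's roughly what I'm picturing (just a sketch; word_xvector is a placeholder for extracting one x-vector over a word's frames, and classify / centroids are the matching pieces from my first post):

def diarize_words(words, centroids, classify, word_xvector):
    """words: list of (word, start_frame, end_frame) from the ASR output.
    Assigns a speaker per word, so changes can only happen at word boundaries."""
    labels = []
    prev = None
    for word, start, end in words:
        xvec = word_xvector(start, end)          # placeholder extraction call
        label, _ = classify(xvec, centroids)
        # crude smoothing: keep the previous speaker for very short words
        # to avoid spurious single-word flips
        if prev is not None and label != prev and end - start < 30:
            label = prev
        labels.append((word, label))
        prev = label
    return labels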

Thanks again!