Hello -- I'm working on a system to do online diarization using xvectors. I know that the problem is generally hard, but I have a few assumptions I can make to limit the scope.
First of all, I know that the majority of the audio has just 2 speakers, so I'm only looking to differentiate between those two. I have also analyzed thousands of past conversations to come up with some phrases that I am very confident will be said by one speaker but not the other. This means that if I run speech recognition in parallel, I can compile samples of speech from each of the 2 speakers. The diarization also doesn't have to kick in right away, so I can wait to build up these examples.
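To make that concrete, here's roughly the labeling step I have in mind for enrollment. This is just a Python sketch; TRIGGER_PHRASES, AsrSegment, and the example phrases are placeholders I made up for illustration, not anything from Kaldi:

```python
from dataclasses import dataclass

# Phrases I'm confident only one of the two speakers would say (made-up examples).
TRIGGER_PHRASES = {
    1: ["thank you for calling", "how can i help you"],   # speaker 1
    2: ["i'm calling about", "my account number is"],     # speaker 2
}

@dataclass
class AsrSegment:
    text: str          # recognized words, lowercased
    start_frame: int   # frame index where the segment starts
    end_frame: int     # frame index where the segment ends

def label_segment(seg: AsrSegment):
    """Return 1 or 2 if a trigger phrase identifies the speaker, else None."""
    for spk, phrases in TRIGGER_PHRASES.items():
        if any(p in seg.text for p in phrases):
            return spk
    return None

# Any xvector whose window falls inside a labeled segment gets added to that
# speaker's enrollment set; everything else waits until enough examples exist.
```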
I trained the xvector extractor using the callhome_diarization v2 recipe up to stage 7. The only major difference (which I don't think matters much) is that I needed to use the Aspire SAD model instead of the energy-based VAD, so I had to tweak prepare_feats slightly. I also changed the egs so that they are all the same size, which I set to 30 frames.
The way I have it set up currently, I'm starting from the basic flow found in online2-tcp-nnet3-decode-faster.cc. I'm using a feature pipeline set up with the MFCC config from the xvector training, plus a CMVN config with these lines:
--cmn-window=300
--norm-vars=false
I then periodically check for frames that are ready from the feature pipeline and create 30-frame chunks, which I feed into the xvector extractor (I used the same architecture and I'm extracting from tdnn6.affine, as suggested). I do this every 10 frames, so consecutive windows overlap. During the "enrollment" period, I may register these xvectors as belonging to silence, speaker 1, or speaker 2, as indicated by the ASR output.
I have also experimented with chunk sizes of 50 and 100 frames (each with an appropriately trained nnet).
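For reference, the chunking/timing logic above looks roughly like this as a Python sketch. Here extract_xvector is just a placeholder for the nnet3 forward pass up to tdnn6.affine, and the frame counts mirror my current setup:

```python
import numpy as np

WINDOW = 30   # frames per chunk fed to the extractor
SHIFT = 10    # start a new chunk every 10 frames, so windows overlap

def extract_xvector(chunk: np.ndarray) -> np.ndarray:
    """Placeholder for the nnet3 forward pass up through tdnn6.affine."""
    raise NotImplementedError

def sliding_xvectors(feats: np.ndarray, next_start: int):
    """Extract xvectors for every complete window not yet processed.

    feats: all CMVN'd MFCC frames ready so far, shape (num_frames, feat_dim).
    next_start: first window-start frame that hasn't been extracted yet.
    Returns (list of (start_frame, xvector), updated next_start) so the
    caller can poll the feature pipeline again later.
    """
    results = []
    while next_start + WINDOW <= feats.shape[0]:
        chunk = feats[next_start:next_start + WINDOW]
        results.append((next_start, extract_xvector(chunk)))
        next_start += SHIFT
    return results, next_start
```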
After I hit a threshold of example vectors in each of the 3 categories, I try to assign each new xvector to one of these groups. I've been using cosine similarity to compare them. I've tried k-nearest neighbors, and also averaging all the example vectors in each category to create a "center" and then comparing the new vector against those centers.
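Concretely, the two scoring variants look roughly like this (numpy sketch; `enroll` is assumed to map each category -- silence, speaker 1, speaker 2 -- to its list of enrollment xvectors):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def classify_by_centroid(x: np.ndarray, enroll: dict) -> str:
    """Compare x against the mean ("center") of each category's examples."""
    centroids = {c: np.mean(np.stack(vecs), axis=0) for c, vecs in enroll.items()}
    return max(centroids, key=lambda c: cosine(x, centroids[c]))

def classify_by_knn(x: np.ndarray, enroll: dict, k: int = 5) -> str:
    """Majority vote among the k enrollment vectors most similar to x."""
    scored = [(cosine(x, v), c) for c, vecs in enroll.items() for v in vecs]
    top_cats = [c for _, c in sorted(scored, reverse=True)[:k]]
    return max(set(top_cats), key=top_cats.count)
```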
The system seems decent at recognizing silence, but it's erratic when someone is actually talking -- it detects speaker transitions at the wrong times and misses real ones.
So, I'm wondering if anyone has any suggestions for improvement? I have a couple of ideas, but I don't know whether any of them have merit:
- Maybe the CMN window is messing with the speech because silence isn't actually being filtered out like it is in training? Could training without CMN work?
- Since I have the sets of examples, maybe I could do some sort of PCA on them, like in extract_xvectors.sh? Or even just try subtracting the mean of the speech vectors? (I've sketched what I mean after this list.)
- Would it be possible to retrain the xvector extractor with the example data on the fly? This seems excessive...
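On the PCA/mean-subtraction idea from the second bullet, this is the kind of preprocessing I'm picturing -- a numpy approximation fit on my enrollment speech vectors, not the actual Kaldi binaries the recipe uses:

```python
import numpy as np

def fit_whitener(enroll_speech: np.ndarray):
    """Fit mean subtraction + PCA whitening on enrollment speech xvectors.

    enroll_speech: (N, dim) matrix of the speaker-1 and speaker-2 examples
    (silence excluded). Returns (mean, transform) to apply to new vectors.
    Note: with fewer examples than dimensions this also reduces the dimension.
    """
    mean = enroll_speech.mean(axis=0)
    centered = enroll_speech - mean
    # PCA via SVD; whiten by scaling each principal direction to unit variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    stddev = s / np.sqrt(max(len(enroll_speech) - 1, 1))
    transform = vt / (stddev[:, None] + 1e-10)   # rows are whitened directions
    return mean, transform

def apply_whitener(x: np.ndarray, mean: np.ndarray, transform: np.ndarray):
    return transform @ (x - mean)

# Cosine scoring would then be done on apply_whitener(...) outputs, for both
# the enrollment centroids and each incoming xvector.
```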
I'm also curious whether anyone has other ideas, critiques, or flaws to point out. Thanks!
Nathan