So I guess the ordering of features/architectures that yield better results could be: i-vector -> no-i-vector method -> x-vector -> fusion.
- If I understand the results of the x-vector paper correctly, the extractor can still be applied to out-of-domain (e.g. non-English) data without any calibration? (Thanks for the pretrained model!)
- Is it right to say that a diarization method based on such embedding vectors would achieve the current practical state-of-the-art results?
- Is there a plan to use them instead of i-vectors in the current nnet3-based acoustic models?
- How does this method deal with music interleaved with speech (e.g. music on hold in customer service calls)? Will the clustering algorithm applied to i-/x-vectors (e.g. hierarchical clustering) do a good job of clustering music frames, or would it be better to first remove music with a recipe like bn_music_speech?
- Would it make more sense to adapt the sre_combined PLDA (sre16 recipe) to unlabelled in-domain data directly, or to train the PLDA on generic data first and then do the in-domain adaptation?
- Would there be situations where one should consider using SAD instead of VAD?
But these segments are carried all the way to PLDA scoring and clustering as-is, and the final labels are assigned to whole segments that contain more than one speaker.
- The final stage of clustering, diarization/make_rttm.py, fails with RuntimeError: Missing label for segment xxx (around 100 segments missing from the labels files). What could be the possible cause?
I'm not sure. I'll see if Matthew Maciejewski, who prepared that script, can answer this one.
Chances are it's the last one. Some of the extraction scripts will not produce output if the segment is too short (which is something that can come up in diarization). David can probably help you if it's that one.
About CMN and your suggestion to precalculate features before clustering: it's not something I'm familiar with; could you please point me toward which scripts to use?
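(For reference, precomputing features with sliding-window CMN boils down to something like the sketch below; the paths and the 300-frame window are assumptions on my part, not the exact recipe.)

    # Pre-apply sliding-window cepstral mean normalization and write the
    # normalized features to disk so later stages can read them as-is.
    apply-cmvn-sliding --norm-vars=false --center=true --cmn-window=300 \
      scp:data/test/feats.scp \
      ark,scp:data/test_cmn/feats.ark,data/test_cmn/feats.scp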
Got your point about SAD, but that leaves me with one practical question when the full diarization + transcription pipeline is used in production: there will be three nnet3 decoding stages involved: SAD, x-vector extraction, and online decoding (the actual transcription). I haven't looked into the nnet config details of the SAD and x-vector models, but could this mean roughly 3x the decoding time compared to running without diarization?
And since you mentioned VoxCeleb: is its volume and diversity of data comparable to the set you personally used for training x-vectors (srex, swbd, ...)?
Two example entries from the segments file (utterance-id recording-id start end), both under a second long:
seg rec 13.54 13.83
seg rec 699.94 700.71
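(A quick way to spot such short segments, assuming the standard Kaldi segments format of utt-id reco-id start end; the 0.5 s threshold is only an illustration.)

    # List segments shorter than 0.5 s, which extraction may silently drop.
    awk '($4 - $3) < 0.5 { print $1, $4 - $3 }' data/test/segments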
I just finished a few experiments based on your v2 recipe; it works really well on both studio-quality and CTS audio if I train the PLDA with in-domain Dutch data (100 hours, 2k speakers). I also tried PLDA adaptation/whitening with ~5 hours of test-domain-like data (similar to the sre16 recipe), but I can't call it an improvement yet.
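(For anyone trying the same thing: the unsupervised adaptation step in the sre16 recipe boils down to something like this sketch; the covariance scales and paths here are assumptions.)

    # Adapt an out-of-domain PLDA model to unlabelled in-domain vectors.
    ivector-adapt-plda --within-covar-scale=0.75 --between-covar-scale=0.25 \
      exp/xvectors_plda/plda \
      "ark:ivector-subtract-global-mean scp:exp/xvectors_indomain/xvector.scp ark:- |" \
      exp/xvectors_plda/plda_adapted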
Is the minimum segment length in the final RTTM file the same as the window overlap of 0.75? Would you suggest increasing it if I'm interested in larger minimum segment durations?
The current mfcc.conf is based on an 8k sample rate, and I guess the pretrained x-vector nnet weights are sensitive to changing that. I use 16k in my environment (e.g. for decoding) and would rather keep it that way, to avoid having multiple wav.scp files and to reuse the MFCC features for decoding. How do you suggest I go about it?
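(One low-friction option, assuming the originals are plain 16 kHz wav files, is to downsample on the fly through a pipe in wav.scp instead of keeping a second copy of the audio.)

    # Derive an 8 kHz wav.scp from a 16 kHz one; Kaldi runs the sox pipe
    # each time it reads a recording, so no extra audio files are written.
    awk '{printf "%s sox -t wav %s -r 8000 -t wav - |\n", $1, $2}' \
      data/test_16k/wav.scp > data/test_8k/wav.scp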
- In my current ASR pipeline I diarize first (using LIUM) and use steps/online/nnet3/decode.sh to decode the resulting segments. Is it right to say that the final WER depends on the quality of the diarized segments, mainly because pure single-speaker segments are better represented by the language model?
- The nnet decoding time to extract x-vectors is notable, and I'm thinking about how to fit it, performance-wise, into my end-to-end ASR pipeline.
- Is there any performance to be gained by using a GPU for extracting x-vectors (see the sketch after this list)?
- How would you compare the decode time of i-vectors and x-vectors, and is there a tradeoff between extraction time and downstream error rates?
- Is there a pretrained i-vector extractor that I can use to test and benchmark?
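(Regarding the GPU question above, a hypothetical invocation might look like the following; the --use-gpu flag and the paths follow the style of the sid/diarization extraction scripts and may differ in your branch.)

    # Extract x-vectors, requesting GPU-based nnet3 computation.
    diarization/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd" \
      --use-gpu true --nj 4 \
      exp/xvector_nnet_1a data/test exp/xvectors_test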
Interesting. The fact that there is no improvement means you don't have too many segments of different lengths, so yes, this option will be useless. By the way, if speed is really crucial for you, you may check the multi-threaded implementation of the extractor: https://github.com/kaldi-asr/kaldi/pull/2303. It will not be added to the master branch, but you can check it and let me know if it works for you.
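(For anyone who wants to try that PR without a merge, the usual GitHub pattern is to fetch the pull-request head into a local branch; the branch name below is arbitrary.)

    # Check out the multi-threaded extractor from PR #2303 locally.
    git fetch https://github.com/kaldi-asr/kaldi pull/2303/head:xvector-multithread
    git checkout xvector-multithread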
2018-04-01 18:54 GMT+03:00 Armin Oliya <armin...@gmail.com>:
Thank you all for your helpful comments.

@david, indeed it would be great if other models for i-vectors/wideband were available. If you need help with putting things together or testing, please let me know. The use-gpu option didn't yield a runtime improvement, and there actually seemed to be little load on the GPUs during x-vector decoding. By the way, I think your current x-vector branch is stable and ready for a PR.

@Matt, thanks for your clear explanation, got it. I did a few tests with diarization vs. WER and found that accurate diarization (compared to my LIUM baseline) can help with up to 1% WER, and slightly more if I do speech activity detection before diarization. On the other hand, bad diarization can have a strong effect; for example, on one of my test sets I had only one speaker per recording, and when I forced diarization to produce two speakers, the WER went from 5% to 9%.

@Arseniy, I tried the first suggestion from your PR, with the cache size set to 1e4 and 1e6, but got no difference in x-vector decode wall time. I tested with two different test sets of 30 mins (90 files) and 1h 30mins (16 files).
On Sunday, March 18, 2018 at 6:58:50 AM UTC+1, Arseniy Gorin wrote:
It is simple, but you are right: it is more for a production version and would likely find no application in the Kaldi scripts. I'll add only the cache then. Thanks.