Dear Kaldi Community,
I have been using Kaldi for two-way classification (such as Yes/No detection) and am trying to optimize the system using variations of the LibriSpeech recipe. I want to understand Kaldi's segmentation and time-alignment process in detail, especially during decoding. I ran two sets of experiments, using (1) actual transcriptions (sequential information preserved) and (2) dummy transcriptions (only one fake label for an entire file).
Using an identical acoustic model, I ran decoding on the test set with both the real and the dummy transcriptions.
Below I present the results for one of the words as sensitivity and specificity, both derived from the NIST scoring metric.
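To be explicit about what I mean by these two numbers, here is a minimal sketch assuming the usual definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)). The function name and the counts are made up for illustration; this is not the NIST scoring tool itself, only how I turn its hit/miss counts into percentages.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) in percent from confusion counts.

    tp/fn: target-word occurrences that were detected / missed.
    tn/fp: non-target occurrences that were correctly rejected / falsely detected.
    """
    sensitivity = 100.0 * tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = 100.0 * tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical counts, for illustration only (not from my experiments).
sens, spec = sensitivity_specificity(tp=8, fn=2, tn=15, fp=5)
print(f"Sensitivity: {sens:.2f} %")
print(f"Specificity: {spec:.2f} %")
```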
For the HMM-monophone system:
(a) with real transcriptions:
- Sensitivity: 55.28 %
- Specificity: 50.86 %
(b) with dummy transcriptions:
- Sensitivity: 41.86 %
- Specificity: 50.30 %
-> A clearly observable difference, but not as bad as for the systems below.
For the LDA-MLLT (monophone) system:
(c) with real transcriptions:
- Sensitivity: 91.48 %
- Specificity: 55.53 %
(d) with dummy transcriptions:
- Sensitivity: 33.92 %
- Specificity: 60.42 %
For the DNN system:
(e) with real transcriptions:
- Sensitivity: 79.92 %
- Specificity: 74.17 %
(f) with dummy transcriptions:
- Sensitivity: 4.83 %
- Specificity: 73.72 %
I am currently also running decoding with an RBM-trained model on the same data.
I wanted to understand Kaldi's automatic segmentation capabilities, and it looks like the systems other than the HMM-monophone one do very poorly at automatically segmenting the test data. The monophone system's performance is poor as well.
It looks like the system collects the sequence belonging to an utterance from a particular time interval (e.g. from the 20th to the 21st second), decodes it, and maps the hypothesis back to that original time interval. But in a real-world situation, how would a system know that a segment starts at the 20th second and ends at the 21st?
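To make my understanding concrete, here is a rough sketch of the mapping I think is happening. It assumes the standard Kaldi `segments` file layout (`<utterance-id> <recording-id> <segment-begin> <segment-end>`, times in seconds); the helper names, the utterance ID and the word timing are hypothetical, and this is not Kaldi's actual code.

```python
def parse_segments(lines):
    """Parse Kaldi-style segments lines into {utt_id: (rec_id, start, end)}."""
    segments = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        utt_id, rec_id, start, end = parts
        segments[utt_id] = (rec_id, float(start), float(end))
    return segments


def to_recording_time(segments, utt_id, rel_start, duration):
    """Shift a segment-relative word time back onto the recording's timeline."""
    rec_id, seg_start, seg_end = segments[utt_id]
    abs_start = seg_start + rel_start
    abs_end = min(abs_start + duration, seg_end)  # never run past the segment
    return rec_id, abs_start, abs_end


# Hypothetical segment covering the 20th to 21st second of recording "rec1".
segments = parse_segments(["rec1-0020-0021 rec1 20.0 21.0"])

# A word hypothesised 0.25 s into the segment, lasting 0.5 s.
print(to_recording_time(segments, "rec1-0020-0021", 0.25, 0.5))
```

If this is right, the segment boundaries always come from the `segments` file (that is, from the transcriptions) and never from the decoder, which would explain why the dummy-transcription runs look so different.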
If this is true, is there any way to decode with Kaldi so that it efficiently performs automatic segmentation of the input signal?
Thanks,
Vii