Kaldi automatic segmentation recipe

Vinit Shah

Feb 8, 2018, 12:42:59 AM

to kaldi-help

Dear Kaldi Community,

I have been using Kaldi to perform two-way classification (such as Yes/No detection) and have been trying to optimize the system using variations of the LibriSpeech recipe. I want to understand Kaldi's segmentation and time-alignment process in detail, especially during decoding. I ran two sets of experiments using (1) actual transcriptions (sequential information preserved) and (2) dummy transcriptions (only one fake label for an entire file).

Using an identical acoustic model, I ran decoding on both the real and the dummy test-set transcriptions.

I present the results here as sensitivity and specificity for one of the words, derived from the NIST scoring metric.
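As a minimal sketch of how these two figures are defined, assuming the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) rather than the exact procedure of the NIST scoring tool, which is not shown in this post. The function name and the counts are hypothetical and are not taken from the experiments.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) in percent from confusion counts.

    tp/fn: target-word events that were detected / missed.
    tn/fp: non-target events that were correctly rejected / falsely detected.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference event")
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts, for illustration only (not from the experiments above).
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=30, fp=20)
print(f"Sensitivity: {sens:.2f} %  Specificity: {spec:.2f} %")
```

How events are counted as hits or misses depends on the scoring tool's time-overlap rules, so numbers from this sketch are only comparable if the same counting rules are used.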

For HMM-monophone system:-

(a) with real transcriptions:

- Sensitivity: 55.28 %

- Specificity: 50.86 %

(b) with dummy transcriptions:

- Sensitivity: 41.86 %

- Specificity: 50.30 %

-> A clearly observable difference, but not as bad as in the later systems.

For LDA-MLLT (monophone) system:-

(c) with real transcriptions:

- Sensitivity: 91.48 %

- Specificity: 55.53 %

(d) with dummy transcriptions:

- Sensitivity: 33.92 %

- Specificity: 60.42 %

For DNN system:-

(e) with real transcriptions:

- Sensitivity: 79.92 %

- Specificity: 74.17 %

(f) with dummy transcriptions:

- Sensitivity: 4.83 %

- Specificity: 73.72 %

Currently, I am also running decoding with an RBM-trained model on the same data.

My aim was to understand Kaldi's automatic segmentation capabilities, and it looks like the systems other than HMM-monophone do very poorly at automatically segmenting the test data. The monophone system's performance is poor as well.

It looks like the system takes the frames of an utterance from a particular time interval (e.g. from the 20th to the 21st second), decodes them, and maps the hypothesis back to that same interval. But how would a system know that a segment starts at the 20th second and ends at the 21st in a real-world situation?

If this is true, is there any way of decoding that efficiently performs automatic segmentation of the input signal using Kaldi?
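For context, in a Kaldi data directory the segment boundaries are normally given up front in the `segments` file, one line per utterance in the form `<utterance-id> <recording-id> <segment-begin> <segment-end>`, rather than discovered by the decoder. The sketch below illustrates the mapping described above, under that assumption: a word time relative to the segment start is shifted by the segment's begin time to get a recording-level time. The helper names and the example IDs are hypothetical, and this is not Kaldi's own code.

```python
def parse_segments(lines):
    """Parse Kaldi-style `segments` lines: <utt-id> <rec-id> <begin> <end>."""
    segs = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        utt, rec, begin, end = parts
        segs[utt] = (rec, float(begin), float(end))
    return segs


def to_recording_time(segs, utt, rel_start, dur):
    """Map a segment-relative word time to recording-level (begin, end) times."""
    rec, seg_begin, seg_end = segs[utt]
    abs_start = seg_begin + rel_start
    abs_end = min(abs_start + dur, seg_end)  # clip to the segment boundary
    return rec, abs_start, abs_end


# Hypothetical segment covering 20.0 s to 21.0 s of recording "rec1".
segments = parse_segments(["rec1-0020-0021 rec1 20.0 21.0"])
# A word hypothesised 0.25 s into the segment, lasting 0.5 s.
result = to_recording_time(segments, "rec1-0020-0021", 0.25, 0.5)
print(result)
```

With a single dummy label per file there are no such per-utterance boundaries, so any segment times would have to come from a separate segmentation or voice-activity-detection step before decoding.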

Thanks,

Vii

Daniel Povey

Feb 8, 2018, 12:46:22 AM

to kaldi-help

Kaldi is a toolkit for speech recognition and other similar sequence tasks.

You should probably understand the basics of speech recognition to start with: try reading this

https://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf

I don't think running different experiments like that is going to give you very much insight into how these things work. And I certainly don't have time to interpret their results for you.
