Speech tools in Kaldi

Rémi Francis

Mar 31, 2016, 10:15:42 AM
to kaldi-help
Hi everyone,

I'd like to be able to do these speech tasks with Kaldi :
- Voice activity detection.
- Language/dialect recognition.
- Diarization.

Are there recommended recipes to follow for these tasks? I can't find this in the doc.

Best regards.

Jan Trmal

Mar 31, 2016, 10:34:37 AM
to kaldi-help
Yes, these are things we are working on, but they're not quite ready or available yet.
Dan can perhaps give a time scale.
y.

--
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ilya Platonov

Mar 31, 2016, 2:03:41 PM
to kaldi-help
Here is a good and simple VAD implementation https://github.com/sstepashka/VAD_C
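For what it's worth, the classic baseline that a simple VAD like that usually implements is per-frame log-energy thresholding. A minimal numpy sketch (the frame size and threshold are illustrative choices, not taken from that repository):

```python
import numpy as np

def energy_vad(samples, frame_len=400, threshold_db=-35.0):
    """Mark each frame as speech (True) or silence (False) by log energy.

    samples: 1-D float array in [-1, 1]; frame_len: samples per frame
    (400 = 25 ms at 16 kHz); threshold_db: cutoff relative to full scale.
    All defaults here are illustrative, not tuned.
    """
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Mean power per frame, floored to avoid log(0) on digital silence.
    power = np.maximum(np.mean(frames ** 2, axis=1), 1e-10)
    energy_db = 10.0 * np.log10(power)
    return energy_db > threshold_db

# Toy signal: 100 frames of near-silence followed by 100 frames of "speech".
rng = np.random.default_rng(0)
quiet = 1e-4 * rng.standard_normal(400 * 100)
loud = 0.1 * rng.standard_normal(400 * 100)
decision = energy_vad(np.concatenate([quiet, loud]))
```

In practice you would also smooth the frame decisions (e.g. merge short silence gaps) before cutting segments.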

Vijayaditya Peddinti

Mar 31, 2016, 2:06:44 PM
to kaldi-help, Vimal manohar
Vimal has been working on VAD recipes for Kaldi. See the PR

Daniel Povey

Mar 31, 2016, 2:11:35 PM
to kaldi-help
About voice activity detection: we don't have anything checked in exactly. Vimal is working on this, but I'm not sure of an ETA. What we did in ASPIRE was two passes of nnet2-online decoding: the first pass just decoded everything, we got iVectors only on the high-confidence speech, and then decoded everything a second time. Obviously this is slow. I'll try to speed up the process of checking something in; maybe in a couple of months we'll have something, but I'm not sure.
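To sketch the confidence-gating step of that two-pass scheme (the data layout and threshold below are hypothetical, not ASPIRE's actual format):

```python
# Hypothetical first-pass output: (start_sec, end_sec, word, confidence).
first_pass = [
    (0.00, 0.40, "hello", 0.95),
    (0.40, 0.70, "uh",    0.30),   # low confidence: likely noise
    (0.70, 1.20, "world", 0.88),
]

def confident_regions(hyps, min_conf=0.8):
    """Keep only the time regions whose first-pass confidence is high;
    iVector stats would then be accumulated over these regions only
    before the second decoding pass."""
    return [(s, e) for (s, e, _w, c) in hyps if c >= min_conf]

regions = confident_regions(first_pass)
# → [(0.0, 0.4), (0.7, 1.2)]
```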

About language id: we do already have scripts, see egs/lre07.
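For orientation, iVector-based language-ID systems of this kind extract one iVector per utterance and score it against per-language models; the recipe's actual classifier is more involved, so the following cosine-scoring sketch with made-up vectors is only a simplified stand-in:

```python
import numpy as np

def cosine_lid(test_ivec, lang_means):
    """Score a test iVector against per-language mean iVectors by
    cosine similarity and return the best-matching language.

    lang_means: dict mapping language name -> mean iVector (1-D array).
    A simplified stand-in for a trained classifier.
    """
    def unit(v):
        return v / np.linalg.norm(v)
    scores = {lang: float(unit(test_ivec) @ unit(m))
              for lang, m in lang_means.items()}
    return max(scores, key=scores.get), scores

# Toy 3-dim "iVectors" for two languages.
means = {"english": np.array([1.0, 0.0, 0.0]),
         "mandarin": np.array([0.0, 1.0, 0.0])}
best, scores = cosine_lid(np.array([0.9, 0.1, 0.0]), means)
```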

About diarization: that's something we have fairly immediate plans to do but it's a big subject, so we may not have something for a year or so.

Dan

Qingsong Liu

Mar 31, 2016, 3:11:06 PM
to kaldi...@googlegroups.com
By the way, is there a plan to do some speech synthesis tasks with Kaldi, something like using an LSTM as a mapping/regression model (linguistic features -> vocoder parameters)?
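The mapping I mean (per-frame linguistic features in, vocoder parameters out) can be sketched with a single recurrent layer; here is a toy numpy forward pass with made-up dimensions, using a plain Elman cell as a stand-in for the LSTM:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up dimensions: 30 linguistic features in, 64 hidden units,
# 25 vocoder parameters (e.g. mel-cepstra + F0) out per frame.
feat_dim, hidden_dim, out_dim = 30, 64, 25
W_xh = 0.1 * rng.standard_normal((hidden_dim, feat_dim))
W_hh = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
W_hy = 0.1 * rng.standard_normal((out_dim, hidden_dim))

def rnn_regress(features):
    """Map a (T, feat_dim) sequence of linguistic features to a
    (T, out_dim) sequence of vocoder parameters with a plain RNN
    (an LSTM would replace the tanh cell with gated updates)."""
    h = np.zeros(hidden_dim)
    outputs = []
    for x in features:
        h = np.tanh(W_xh @ x + W_hh @ h)  # recurrent state update
        outputs.append(W_hy @ h)          # linear regression head
    return np.stack(outputs)

params = rnn_regress(rng.standard_normal((100, feat_dim)))
```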

Qingsong
---------------------------------------------
Qingsong Liu
liuqs...@gmail.com
Univ. of Sci.& Tech. of China
----------------------------------------------

Jan Trmal

Mar 31, 2016, 3:20:21 PM
to kaldi-help, Matthew Aylett
There is Matthew Aylett's "idlak" project. I'm not sure about its current state.
y.

Matthew Aylett

Mar 31, 2016, 5:12:16 PM
to Jan Trmal, kaldi-help
Hi

Blaise Potard has been working on an end-to-end DNN speech synthesis system based on Idlak that we have christened Tangle. The output quality is pretty good (better than the HTS demo). The DNN setup is pretty simple but will act as a nice baseline for further work. We have just finished a paper on it for Interspeech.

We have a lot of tidying up to do in git etc. for the release, but it is coming along and is planned to be in place over the summer.

v best

Matthew

Qingsong Liu

Mar 31, 2016, 9:51:59 PM
to kaldi...@googlegroups.com, Jan Trmal
That's great, thanks.

Rémi Francis

Apr 5, 2016, 9:26:29 AM
to kaldi-help
Thanks for the inputs everyone.

On Thursday, 31 March 2016 19:03:41 UTC+1, Ilya Platonov wrote:
Here is a good and simple VAD implementation https://github.com/sstepashka/VAD_C

Thanks! Do you use it? It seems fairly basic; I wonder how well it performs.


On Thursday, 31 March 2016 19:06:44 UTC+1, Vijayaditya Peddinti wrote:
Vimal has been working on VAD recipes for Kaldi. See the PR

Yeah, it's quite a lot of commits, though. I had seen it a while ago and thought I'd wait for it to be merged properly rather than spend too much time figuring out what to use, but it doesn't seem to be moving in that direction very fast.


On Thursday, 31 March 2016 19:11:35 UTC+1, Dan Povey wrote:
About voice activity detection: we don't have anything checked in exactly. Vimal is working on this, but I'm not sure of an ETA. What we did in ASPIRE was two passes of nnet2-online decoding: the first pass just decoded everything, we got iVectors only on the high-confidence speech, and then decoded everything a second time. Obviously this is slow. I'll try to speed up the process of checking something in; maybe in a couple of months we'll have something, but I'm not sure.
 
Yeah, the point of the VAD would be not to have to transcribe any unnecessary audio, so it should preferably be considerably faster than the decode.


About language id: we do already have scripts, see egs/lre07.

Thanks, I'll have a look. There is no RESULTS.txt file; is there a benchmark anywhere?
 
About diarization: that's something we have fairly immediate plans to do but it's a big subject, so we may not have something for a year or so.

I see, thanks.

David Snyder

Apr 5, 2016, 3:16:48 PM
to kaldi-help
Hi Remi,

The results for the LRE07 script are at the bottom of the egs/lre07/v1/run.sh script. Let me know if you have any other questions about it.

Best,
David

Rémi Francis

Apr 7, 2016, 11:42:19 AM
to kaldi-help
Thanks, I'm not going to work on this right now, but this is good to keep in a corner of my mind.