Embedded speech recognition systems


anip sharma

Dec 1, 2016, 10:21:38 AM12/1/16
to kaldi-help

 
I have been trying to develop an embedded speech recognition system that is both language and speaker independent, and I am doing research on the topic. I wanted to ask whether anyone here could help. A few of the challenges we are facing are:
  1. Locating the endpoints of an utterance in a single speech signal (Voice Activity Detection).
  2. Making the system efficient and accurate without using artificial neural networks.
  3. The possibility of accurately differentiating between all of the phonemes based on one feature extraction process, or a combination of them.
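Challenge 1 (endpoint detection) is often first approached with short-time energy thresholding before anything more sophisticated. Below is a minimal sketch of that idea, assuming NumPy and a mono floating-point signal; the frame sizes and the -30 dB threshold are illustrative choices, not recommended values:

```python
import numpy as np

def detect_endpoints(signal, rate, frame_ms=25, hop_ms=10, threshold_db=-30.0):
    """Energy-based endpoint detection: mark frames whose short-time
    energy lies within threshold_db of the loudest frame, and return
    the sample indices spanning the first and last active frames."""
    frame = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    energy = np.array([
        np.sum(signal[i * hop:i * hop + frame] ** 2) for i in range(n_frames)
    ])
    energy_db = 10.0 * np.log10(energy + 1e-12)
    active = energy_db > energy_db.max() + threshold_db
    if not active.any():
        return None  # no speech-like activity found
    idx = np.flatnonzero(active)
    return idx[0] * hop, idx[-1] * hop + frame
```

A purely energy-based detector fails in noisy conditions; real systems add smoothing, hangover frames, and spectral features, but this captures the basic mechanism.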
Thanks.

Daniel Povey

Dec 1, 2016, 1:46:33 PM12/1/16
to kaldi-help
Even developing an embedded speech recognition system for a single specific domain of a specific language would be quite challenging (e.g. chess moves in English). Trying to do this in a language-independent way is impossible for a newbie, and probably out of reach even for a well-funded corporate research group. [It's not even clear that you can formulate the problem in a sensible way.]

You should set your sights a lot lower.
Dan

anip sharma

Dec 2, 2016, 4:14:55 AM12/2/16
to kaldi-help, dpo...@gmail.com
You have great experience in the field of speech recognition and processing, so I really think you can help me. That is why I am going to explain the problem statement in a little more detail.
The project intends to take sound as input (be it speech or any other sound) and process it to display a particular symbol for a particular sound, remaining language and speaker independent. Language independence stems from the fact that the system is not concerned with the meaning the input sound conveys, only its acoustics.
The system in no way has to be concerned with the meaning (or the text) associated with the input, just the sound.
Speaker independence is another feature I am looking for.
It would be a huge help if you could assist in any way (feasibility, directions, resources, etc.).

Daniel Povey

Dec 2, 2016, 3:06:24 PM12/2/16
to anip sharma, kaldi-help
What you are describing is very hard to do accurately. You'd have to train on data from a bunch of languages and normalize it to a common phone set (e.g. X-SAMPA). But this kind of project is suitable for someone who is already very experienced in speech recognition (and in embedded programming too). I don't think you'll be able to execute it.
Dan
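The phone-set normalization Dan mentions can be illustrated with a toy mapping from IPA symbols to their X-SAMPA equivalents. The symbols below are standard X-SAMPA, but the mapping shown covers only a handful of phones; a real cross-language inventory requires per-language linguistic work:

```python
# Toy illustration of mapping phone labels onto a shared X-SAMPA
# inventory. Real systems map every phone of every training language.
IPA_TO_XSAMPA = {
    "\u0283": "S",   # IPA ʃ, as in English "ship"
    "\u03b8": "T",   # IPA θ, as in English "think"
    "\u0259": "@",   # IPA ə, schwa
    "\u0251": "A",   # IPA ɑ, as in "father"
    "\u014b": "N",   # IPA ŋ, as in "sing"
}

def normalize_phones(ipa_phones):
    """Map a sequence of IPA phone symbols to X-SAMPA, passing
    unknown symbols through unchanged."""
    return [IPA_TO_XSAMPA.get(p, p) for p in ipa_phones]
```

Once every language's transcripts share one symbol set like this, acoustic data from different languages can in principle be pooled to train shared phone models.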

Danijel Korzinek

Dec 4, 2016, 4:00:43 AM12/4/16
to kaldi-help, dpo...@gmail.com
Maybe look for publications by Tanja Schultz; she did a lot of work on language-independent models. Also check out the GlobalPhone corpora they made.

Second, maybe Kaldi is not the best tool for this kind of research. You seem to want to remove much of the influence of the models we normally use in speech recognition, and Kaldi is built around those models. You should probably start with something small and simple and build up from there, rather than trying to deconstruct Kaldi, which is already a complicated system. Maybe start with some simple classification using ANNs (MLPs, RNNs) and GMMs and build your project from that.
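The "start with simple classification" suggestion can be sketched with one diagonal-covariance Gaussian per phone class over feature vectors (e.g. MFCCs), classifying each frame by maximum log-likelihood. This is a degenerate single-component GMM, shown here as a toy NumPy illustration rather than anything resembling a Kaldi recipe:

```python
import numpy as np

class DiagGaussianPhoneClassifier:
    """Fit one diagonal-covariance Gaussian per phone class and
    classify feature vectors by maximum log-likelihood."""

    def fit(self, features, labels):
        self.classes_ = np.unique(labels)
        self.means_ = np.array(
            [features[labels == c].mean(axis=0) for c in self.classes_])
        # Small variance floor avoids division by zero on constant dims.
        self.vars_ = np.array(
            [features[labels == c].var(axis=0) + 1e-6 for c in self.classes_])
        return self

    def predict(self, features):
        # Log-density of each frame under each class Gaussian,
        # dropped constants: -0.5 * (sum((x-mu)^2/var) + sum(log var)).
        diff = features[:, None, :] - self.means_[None, :, :]
        ll = -0.5 * ((diff ** 2 / self.vars_[None]).sum(-1)
                     + np.log(self.vars_).sum(-1)[None, :])
        return self.classes_[ll.argmax(axis=1)]
```

Swapping in a proper multi-component GMM (or an MLP) per class is the natural next step once this baseline works on real MFCC features.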