Websocket online decoding

ozi samur

May 1, 2016, 3:33:58 PM
to kaldi-help
Hi,

I built my GMM-HMM model, used this docker container (https://github.com/jcsilva/docker-kaldi-gstreamer-server), and used dictate.js to create a platform for recognizing my voice in a web environment. I set it up successfully, and when I evaluated the GMM-HMM model in Kaldi I saw about 5% WER, which was great; admittedly, my vocabulary is quite small (200 words).

But when I use the microphone to recognize my voice through dictate.js, gstreamer and Kaldi, the WER I see is much higher. The server returns responses quickly, while I am still speaking, and the results are almost all bad. I could not figure out exactly what the problem is. For comparison, with the CMU Sphinx live recognizer I get around 5-10% WER, which is really good, but I could not achieve the same in the Kaldi environment. It may not even be related to Kaldi; perhaps it is some misconfiguration of gstreamer, the decoding, or something like that.

Are there any other settings for the GMM-HMM yaml file beyond those specified in (https://github.com/alumae/kaldi-gstreamer-server/blob/master/sample_worker.yaml)?
Or any other settings for the microphone recognizer?

Thanks.

Daniel Povey

May 1, 2016, 3:38:13 PM
to kaldi-help
Perhaps you could try making audio recordings and recognizing them offline with Kaldi, to see if it's purely an online-recognition problem.

If the sampling rate is wrong (e.g. if you had to edit some config file to change the sampling rate) then the results are expected to be nonsense.

It might possibly be sensitive to the volume of the data, e.g. if it's very quiet or very loud it might not work well.
Dan
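
For concreteness, an offline decode of one recording could look something like this (a sketch only; the paths, conf/mfcc.conf and the exp/tri1 model directory are illustrative assumptions, not details from this thread):

    # resample to the rate the model was trained on (e.g. 16 kHz, mono)
    sox recording.wav -r 16000 -c 1 test.wav
    echo "utt1 test.wav" > wav.scp

    # compute features with the same config used in training, then add deltas
    compute-mfcc-feats --config=conf/mfcc.conf scp:wav.scp ark:- | \
      add-deltas ark:- ark:feats.ark

    # decode; with --word-symbol-table the best word sequence is logged per utterance
    gmm-latgen-faster --max-active=7000 --beam=13.0 --lattice-beam=6.0 \
      --word-symbol-table=exp/tri1/graph/words.txt \
      exp/tri1/final.mdl exp/tri1/graph/HCLG.fst ark:feats.ark ark:/dev/null

If training applied per-speaker CMVN, an apply-cmvn step would also be needed between the two feature commands.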

ozi samur

May 7, 2016, 2:55:07 AM
to kaldi-help, dpo...@gmail.com
I tried with the Kõnele Android application and it seems the same. But I noticed that if I speak for more than 3-4 seconds (3 or more words), it recognizes perfectly. When I say something short like "Yes I did", it returns an unexpected response. When I try the same thing in CMU Sphinx, it recognizes short sentences as well. I did not use any extra methods while training, just the same setup as in the digits tutorial, but if you could suggest a method or recipe for Kaldi to recognize especially short utterances, that would be perfect.

Thanks.

On Sunday, May 1, 2016 at 10:38:13 PM UTC+3, Dan Povey wrote:

Daniel Povey

May 7, 2016, 3:01:18 AM
to ozi samur, José Eduardo De Carvalho Silva, Tanel Alumäe, kaldi-help
It looks like you are using wrappers devised by Eduardo Silva (cc'd) based on the GStreamer stuff by Tanel Alumäe (also cc'd). I don't really know much about what specific things are being used in these -- is there online CMVN? What kind of model is expected, i.e. what features is it supposed to be trained on? You may not even be supplying the right kind of model -- I just don't know.

Dan
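
One quick way to check what features a GMM model expects is to inspect it directly (paths are illustrative; the dimension heuristics below assume a standard MFCC setup and are a rule of thumb, not a guarantee):

    gmm-info exp/tri1/final.mdl | grep 'feature dimension'
    # 39 dims usually means MFCC + delta + delta-delta features;
    # 40 dims usually means LDA+MLLT, in which case the online decoder
    # also needs the transform matrix:
    ls exp/tri1/final.mat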

jcs...@cpqd.com.br

May 9, 2016, 7:49:47 AM
to kaldi-help, ozis...@gmail.com, jcs...@cpqd.com.br, tanel....@phon.ioc.ee, dpo...@gmail.com
Hi,

Ozi, have you already analysed the audio that was saved when you tried online recognition? In the yaml file, the "out-dir" variable tells you where the recognized audio is saved (it's line 8 of the yaml file you pointed to). If possible, please try to recognize these audio files with the same models, but in offline mode. This is similar to something Daniel has already suggested, and may help you identify whether you are facing problems when recording the audio.

Another important point: in the yaml file you pointed to, there is a variable called silence-phones. You need to adjust it according to your model.

Eduardo
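
For reference, the fields Eduardo mentions sit in the worker yaml roughly like this (a sketch; the paths are placeholders, and the silence phone ids must be the integer ids from your own model, e.g. from data/lang/phones/silence.csl):

    decoder:
        model: models/yourmodel/final.mdl
        fst: models/yourmodel/HCLG.fst
        word-syms: models/yourmodel/words.txt
        silence-phones: "1:2:3:4:5"   # colon-separated ids of your silence phones
    out-dir: tmp                      # audio sent for recognition is saved here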

Tanel Alumäe

May 9, 2016, 10:10:24 AM
to kaldi...@googlegroups.com
How much training data do you have? I would recommend training online nnet2 models with i-vectors. We use them with our Kõnele Android application and they work very well, even with one-word utterances. However, we use 150+ hours of training data.

Tanel
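
For anyone taking this route, the standard online-nnet2 recipe in Kaldi looks roughly like this (an outline adapted from the egs/*/s5 local/online/run_nnet2.sh scripts; it assumes an existing triphone system exp/tri3 with alignments in exp/tri3_ali, and all arguments are illustrative):

    # train a diagonal UBM and an i-vector extractor on top of the GMM system
    steps/online/nnet2/train_diag_ubm.sh data/train 512 exp/tri3 exp/nnet2_online/diag_ubm
    steps/online/nnet2/train_ivector_extractor.sh data/train \
      exp/nnet2_online/diag_ubm exp/nnet2_online/extractor
    steps/online/nnet2/extract_ivectors_online.sh data/train \
      exp/nnet2_online/extractor exp/nnet2_online/ivectors_train

    # train the nnet2 model with the online i-vectors as an extra input
    steps/nnet2/train_pnorm_simple2.sh --online-ivector-dir exp/nnet2_online/ivectors_train \
      data/train data/lang exp/tri3_ali exp/nnet2_online/nnet

    # bundle the model, i-vector extractor and configs for online decoding;
    # the resulting directory is what the gstreamer server's nnet2 yaml
    # (use-nnet2: True) points at
    steps/online/nnet2/prepare_online_decoding.sh data/lang exp/nnet2_online/extractor \
      exp/nnet2_online/nnet exp/nnet2_online/nnet_online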