How to get SpeechDetected events faster with real-time audio?

Dexter Morgan

Mar 31, 2011, 3:26:13 PM

I have two questions regarding real-time speech events:

Question 1: How can we configure the Microsoft Speech Recognition
Engine to fire SpeechDetected events earlier?
Question 2: What is the proper way to feed real-time audio and receive
events like SpeechDetected in real-time? i.e., which SetInputTo*()
method should we use?

Background / Success Story so far:

We've had much success developing a C# application using Microsoft
Speech SDK 10.2 (using the managed Speech libraries:
Microsoft.Speech.*).

Our application is real-time based, so we need to receive events as
close to real-time as possible, particularly SpeechDetected.

Regarding the SetInputTo*() methods:

* At first, we were using SetInputToAudioStream(), but we observed
that the engine would not fire any events until it had read the
complete stream, which is unacceptable for a real-time application.
If there is a way to use this API and receive SpeechDetected events
before it reads the entire stream, then we are all ears as to how to
make this work; perhaps we just need to feed it a special kind of
SpeechAudioFormatInfo.

* As a "workaround", we have been using SetInputToWaveStream() and
passing in our own Stream object; our Stream feeds the engine a WAV
header plus real-time audio data, whenever the engine calls Read().
After some special modifications, this works pretty well; the engine
fires SpeechDetected events and other events in real-time, instead of
reading the entire stream ahead of time. However, it still fires
SpeechDetected slowly, about 750ms after the beginning of the
utterance.

For example, we have an utterance that has initial silence, then 500ms
of loud in-grammar speech, then end silence. By monitoring the
engine's calls to our Stream's Read() method, we can see that it reads
the full 500ms of speech and another 200ms of silence, and it still
won't fire the SpeechDetected event until it reads even more silence.
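
For anyone who wants to reproduce that kind of measurement, here is a rough sketch of a wrapper Stream that counts the bytes pulled through Read() and converts them to audio time. The class name ReadMonitorStream and its members are made up for illustration, and the byte rate is something you supply yourself (for example 16000 for 8 kHz 16-bit mono, which is only an assumption; the thread does not state the audio format).

```csharp
using System;
using System.IO;

// Hypothetical helper (not from the original post): wraps any Stream and
// tracks how much audio a consumer has pulled through Read().
public sealed class ReadMonitorStream : Stream
{
    private readonly Stream _inner;
    private readonly int _bytesPerSecond;   // assumed PCM byte rate
    private readonly int _headerBytes;      // bytes that are not audio
    private long _totalBytes;

    public ReadMonitorStream(Stream inner, int bytesPerSecond, int headerBytes)
    {
        _inner = inner;
        _bytesPerSecond = bytesPerSecond;
        _headerBytes = headerBytes;
    }

    // Audio time delivered so far, excluding the header bytes.
    public TimeSpan AudioConsumed
    {
        get
        {
            long audioBytes = Math.Max(0L, _totalBytes - _headerBytes);
            return TimeSpan.FromTicks(audioBytes * TimeSpan.TicksPerSecond / _bytesPerSecond);
        }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int n = _inner.Read(buffer, offset, count);
        _totalBytes += n;
        Console.Error.WriteLine("Read({0}) -> {1} bytes, {2:F0} ms of audio so far",
            count, n, AudioConsumed.TotalMilliseconds);
        return n;
    }

    // Everything else is passed straight through to the wrapped stream.
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return _inner.CanSeek; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { return _inner.Length; } }
    public override long Position
    {
        get { return _inner.Position; }
        set { _inner.Position = value; }
    }
    public override long Seek(long offset, SeekOrigin origin) { return _inner.Seek(offset, origin); }
    public override void Flush() { }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
```

Comparing the logged audio time against the wall-clock time at which SpeechDetected fires gives the kind of numbers quoted above. Note that the counter only adds up bytes returned by Read(), so it will over-count if the consumer seeks backwards and re-reads data.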

Any help in this matter would be appreciated!

Dexter Morgan

Mar 31, 2011, 4:45:43 PM

Update: The behavior I described for SetInputToAudioStream() isn't
completely accurate; it's similar to SetInputToWaveStream() and does
the following:
* It makes sure that the Stream is Seekable. (But real-time Streams
are not usually Seekable, so we have to "fake" this.)
* It asks for the length of the Stream. (But real-time Streams do not
have a length, so we have to "fake" this and pretend we have a long
length.)
* It Seek()s to the "end" of the Stream and attempts to read zero
bytes. (But real-time Streams do not have an "end", so we have to
"fake" this.)
* It reads the WAV header plus enough bytes to total 4096 bytes of
initial data (about 250ms). (We "fake" and provide silence here.)

The only way we have found to use the managed Speech API with
real-time streaming data is a custom Stream object with the "fake"
behavior above. This doesn't seem like the right solution, but we
don't see any other APIs or techniques for feeding real-time streaming
data.
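
Since the original code is not available, here is a minimal sketch of what such a Stream could look like, reconstructed only from the four behaviors listed above. Everything in it is an assumption rather than the code described in this thread: the class name FakeSeekableLiveStream, the PushAudio()/CompleteAudio() methods, the fake length value, and the choice of 16-bit PCM (so that zero bytes count as silence). It has not been run against the speech engine.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Text;

// Hypothetical sketch: a Stream that pretends to be a long, seekable WAV
// file while actually serving live PCM audio pushed in from another thread.
public sealed class FakeSeekableLiveStream : Stream
{
    private const int HeaderBytes = 44;             // canonical PCM WAV header
    private const int PreambleBytes = 4096;         // header + initial silence
    private const uint FakeDataBytes = 0x7FFFF000;  // pretend data chunk size
    private const long FakeLength = HeaderBytes + (long)FakeDataBytes;

    private readonly byte[] _preamble = new byte[PreambleBytes];
    private readonly BlockingCollection<byte[]> _live = new BlockingCollection<byte[]>();
    private byte[] _chunk;   // live chunk currently being drained
    private int _chunkPos;
    private long _position;

    // Assumes 16-bit PCM, where the zero bytes after the header are silence.
    public FakeSeekableLiveStream(int sampleRate, short channels)
    {
        const short bits = 16;
        short blockAlign = (short)(channels * bits / 8);
        var w = new BinaryWriter(new MemoryStream(_preamble));
        w.Write(Encoding.ASCII.GetBytes("RIFF"));
        w.Write(36u + FakeDataBytes);       // RIFF chunk size
        w.Write(Encoding.ASCII.GetBytes("WAVEfmt "));
        w.Write(16);                        // fmt chunk size
        w.Write((short)1);                  // format tag: PCM
        w.Write(channels);
        w.Write(sampleRate);
        w.Write(sampleRate * blockAlign);   // bytes per second
        w.Write(blockAlign);
        w.Write(bits);
        w.Write(Encoding.ASCII.GetBytes("data"));
        w.Write(FakeDataBytes);
        w.Flush();
    }

    // Called by the capture side as audio arrives.
    public void PushAudio(byte[] pcm) { if (pcm.Length > 0) _live.Add(pcm); }
    public void CompleteAudio() { _live.CompleteAdding(); }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Covers the zero-byte probe the engine makes after Seek(End).
        if (count == 0 || _position >= FakeLength) return 0;

        if (_position < PreambleBytes)
        {
            int n = (int)Math.Min(count, PreambleBytes - _position);
            Array.Copy(_preamble, (int)_position, buffer, offset, n);
            _position += n;
            return n;
        }

        // Live phase: block until audio arrives or the producer completes.
        while (_chunk == null || _chunkPos == _chunk.Length)
        {
            if (!_live.TryTake(out _chunk, -1)) return 0;   // -1 = wait forever
            _chunkPos = 0;
        }
        int m = Math.Min(count, _chunk.Length - _chunkPos);
        Array.Copy(_chunk, _chunkPos, buffer, offset, m);
        _chunkPos += m;
        _position += m;
        return m;
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        // Only the preamble can really be revisited; live audio is forward-only.
        _position = origin == SeekOrigin.Begin ? offset
                  : origin == SeekOrigin.Current ? _position + offset
                  : FakeLength + offset;
        return _position;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return true; } }      // "fake"
    public override bool CanWrite { get { return false; } }
    public override long Length { get { return FakeLength; } } // "fake"
    public override long Position { get { return _position; } set { _position = value; } }
    public override void Flush() { }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
```

An instance would be passed to SetInputToWaveStream() in place of a file stream, with a separate capture thread calling PushAudio() as audio arrives. Whether the engine accepts exactly this header and length is untested here, so treat the constants as starting points to adjust.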

ilmat...@gmail.com

Jun 19, 2017, 10:14:13 AM

Is there any chance you could share your code? I've been facing the same problem for two weeks with no results.

Dexter Morgan

Jun 29, 2017, 1:03:44 PM

Sorry, I don't have that code any more, but I have already described what you would need to do.