
Can I add dictation function to my app using MS Agent?


Shadow

Jun 25, 2004, 5:42:56 AM
Hi, All

I want to add something like "Dictate a text" to my app.
Can I do this using MS Agent?

tnx.

Merlin's Beard

Jun 25, 2004, 12:21:45 PM
Not really through the Microsoft Agent programming interface. Agent only
supports what is often referred to as Command and Control speech recognition
where the application must provide a "grammar" or the set of words to listen
for. If you are very creative with the grammar, you could roughly
approximate a limited dictation interface (that is, trying to anticipate
common input).
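The Command and Control idea can be sketched in a toy, engine-agnostic way: the application supplies a closed grammar of phrases, and only utterances matching a grammar entry are acted on (the phrases and command IDs below are invented for illustration; in a real deployment an engine such as SAPI does the acoustic matching against the grammar):

```python
# Toy sketch of Command and Control style recognition: the app
# defines a closed grammar, so anything outside it is simply not
# recognized. Phrases and command IDs here are hypothetical.

GRAMMAR = {
    "open file": "CMD_OPEN",
    "save file": "CMD_SAVE",
    "close window": "CMD_CLOSE",
}

def recognize(utterance):
    """Return the command ID if the utterance is in the grammar, else None."""
    return GRAMMAR.get(utterance.strip().lower())

print(recognize("Open File"))     # -> CMD_OPEN
print(recognize("dictate text"))  # not in the grammar -> None
```

The point of the sketch is the asymmetry: the grammar makes in-vocabulary input trivial to act on, while everything else is rejected outright rather than mis-transcribed.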

To use dictation, you would have to write code directly to the SAPI 4
interface. If you download the SAPI 4 SDK (from the Microsoft Speech site,
but don't try the SAPI 5 as it is incompatible), I think it includes an
ActiveX control that enables you to process dictation style input.

However, be forewarned: processing dictation input is not as easy or as nice as
you would want it to be. First, because dictation speech input is continuously
listening, it puts a heavy load on your PC's processor.
Second, because dictation engines have no limited grammar, they rely on word
associations based on large sets of data, often something like the Wall
Street Journal text. Other than the relative frequency of one word with
another or some rudimentary grammar rules, there is no context for the
engine to distinguish between similar-sounding words. So phrases like "wreck
a nice beach" and "recognize speech" are easily confused, because they are
acoustically the same and because the words logically fit together. You
would be amazed how ambiguous human speech is acoustically; we don't
realize it because we use many other contextual cues, from lip position and
emphasis to previous sentence context, to disambiguate what we hear.
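The word-association approach can be sketched with a toy bigram language model: given two acoustically similar hypotheses, pick the one whose word pairs are more probable in the training text (the probabilities below are made up for illustration, standing in for counts harvested from a corpus like the Wall Street Journal text):

```python
# Toy bigram language model: score each hypothesis by the sum of
# log probabilities of its adjacent word pairs. The probabilities
# are invented for illustration.

import math

BIGRAM_P = {
    ("recognize", "speech"): 0.02,
    ("wreck", "a"): 0.001,
    ("a", "nice"): 0.01,
    ("nice", "beach"): 0.002,
}

def lm_score(words):
    """Sum of log bigram probabilities; unseen pairs get a tiny floor."""
    return sum(math.log(BIGRAM_P.get(pair, 1e-6))
               for pair in zip(words, words[1:]))

h1 = ["recognize", "speech"]
h2 = ["wreck", "a", "nice", "beach"]
best = max([h1, h2], key=lm_score)  # -> h1, "recognize speech"
```

Note that the model only knows which words tend to follow which; it has no idea what the speaker meant, which is exactly the limitation described above.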

Dictation-style input is also more susceptible to errors because the audio
channel is open longer, allowing more potential disruption from noise or
background voices. While it may be easy for people to distinguish a voice
from other audio, SR engines don't do this nearly as well, because they rely
primarily on the audio signal and many sounds fall within the same frequency
range as the human voice. It is especially hard if the background audio is
another human voice. Turn on talk radio, or try to do speech input in a
typical office environment, and the recognition accuracy degrades greatly.

Hence dictation engines often need a good error-correction mechanism to work
well.

Further, dictation engines typically require the user to provide the
punctuation as part of the utterance.
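Spoken punctuation usually arrives as ordinary tokens in the transcript, which the application then post-processes. A minimal sketch (the token names are typical of dictation engines but hypothetical here):

```python
# Sketch: turn spoken punctuation tokens ("comma", "period", ...)
# into the characters they stand for, attaching them to the
# preceding word. Token names are illustrative, not any engine's API.

PUNCTUATION = {"period": ".", "comma": ",", "question mark": "?"}

def apply_punctuation(tokens):
    out = []
    for tok in tokens:
        if tok in PUNCTUATION:
            # Attach to the previous word with no leading space.
            out[-1] = out[-1] + PUNCTUATION[tok] if out else PUNCTUATION[tok]
        else:
            out.append(tok)
    return " ".join(out)

print(apply_punctuation(["hello", "comma", "world", "period"]))
# -> "hello, world."
```

This also hints at why dictation feels unnatural: the user has to narrate the punctuation aloud rather than just speak.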

Finally, dictation should not be assumed to be a more natural form of input.
Dictation is like writing in that it requires the speaker to organize their
thoughts and communicate in a straightforward, grammatical fashion. Few of us
speak that way naturally to each other. Conversational speech is full of
ungrammatical constructions, restarts, left-out words (omitted because
context supplies them), and often incomplete sentences, not to mention a
social behavior called "barge-in" where we often speak on top of the person
we are conversing with. Learning to dictate takes practice, and most users
don't have that skill.

All of this is why most successful deployments of speech use the Command &
Control approach where they limit the range of what the user can say at any
one time.

<Shadow> wrote in message news:1602077741.2...@eltech.com.ua...

Remy Lebeau

Jun 25, 2004, 2:27:58 PM

<Shadow> wrote in message news:1602077741.2...@eltech.com.ua...

> Can I do this using MS Agent?

No. MSAgent's Speech Recognition features are only for recognizing
pre-defined commands, not for general dictation. If you want that, then you
need to program to the SAPI interface directly instead.


Gambit


Bob (Almost Impeccable) [utf-42]

Jun 26, 2004, 3:14:28 AM
"Merlin's Beard" <nos...@thisaddress.com> wrote

> Processing dictation input is not as easy or nice as
> it seems like you would want it to be. First, because dictation speech
> input is continually listening it puts a heavy load on your PC's
> processor. Second, because dictation engines have no limited grammar,
> they rely on word associations based on large sets of data, often
> something like the Wall Street Journal text. Other than the relative
> frequency of one word with another or some rudimentary grammar rules,
> there is no context for the engine to distinguish between similar
> sounding words.

From the hip, probably wrong, damn the torpedoes:

"speech recognition coprocessor"

form factors -
USB
desktop pc 3.5 inch 'drive bay'
pda connectivity unit
eventual integration into tablets
battery likely required for portable form factors

- include audio processor
- stereophonic microphone input
- Analog to digital converters
- input volume adjustment

Software upgradable dictionary in flash ram
prefix/suffix matrix, plural analysis

- include sufficient ram to allow signal / spectrum analysis / NLP analysis
entirely within device
- software driver interface to return top 100 or top n sentence / word
probabilities and pass to further analysis software running on the local
processor
- return information to software as an ADO / RDO recordset or similar
collection with the probable word, part of speech, plural as a boolean,
score of probable compatibility with previous words within the sentence
- initial resource
http://www.humnet.ucla.edu/humnet/linguistics/people/schuh/lx001/Web_Assignments/Assig_02/02web_fdbk_02F.html

- footpedal or IRDA interface to activate/deactivate SR
- accept wireless microphone input
- update Microsoft Agent with appropriate drivers to accept text input and
pass on to programs
- value is incalculable for medical transcription community
- msn can benefit
- offer financing in conjunction with msn subscription to prorate or
otherwise reduce cost to consumer
- upgrade msn dashboard to function with device to offer added value
over aol
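The "recordset" of top-n candidates Bob describes could be sketched as a plain data structure; the field names below are invented to match the post, not any real driver API:

```python
# Sketch of the proposed driver interface result: each recognition
# candidate carries the probable word, its part of speech, a plural
# flag, and a compatibility score with the preceding words. All
# names are hypothetical, modeled on the wish list above.

from dataclasses import dataclass

@dataclass
class Candidate:
    word: str
    part_of_speech: str
    plural: bool
    compatibility: float  # score vs. previous words in the sentence

def top_n(candidates, n=100):
    """Return the n most compatible candidates, best first."""
    return sorted(candidates, key=lambda c: c.compatibility, reverse=True)[:n]

hyps = [
    Candidate("beach", "noun", False, 0.31),
    Candidate("speech", "noun", False, 0.84),
    Candidate("beaches", "noun", True, 0.12),
]
best = top_n(hyps, n=2)  # -> speech, then beach
```

The device would do the heavy signal and NLP analysis on board and hand software only this small ranked list, which is the division of labor the post is proposing.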

easier said than done
think outside the box
not my job
HELP

bob
