I am using VB6 at this time to program my applications.
I am also using Microsoft Agent 2.
I already have the Speech SDK versions 4, 5, and 5.1.
Thank You
So unless you address this to a company who already builds synthetic speech
technology or someone with some experience, it is unlikely you will find
anyone in this newsgroup who will be able to supply your request.
However, there are some alternatives:
1. Agent characters will lip-sync to recorded audio (.WAV files). So you
could use Windows Sound Recorder to record yourself (or someone else
speaking) and pass these files as arguments to the Speak statement. While
not as flexible as TTS (since the character can only say what has been
recorded), it can work in situations where the character's output is
pre-determined.
2. The SAPI 4 SDK has a facility for creating your own TTS "voice". You can
record .WAV files and associate them with specific words. These can be
compiled into a voice/mode that the Microsoft TTS engine will play. In Agent
you can specify that mode's ID (using TTSModeID) and try that. This is still
not as flexible as full TTS, but is a bit better than just recorded files.
It is like the effect that you can sometimes get with telephony-based
applications where the voice response is made up of combinations of
pre-recorded words, e.g. "you have" + "fifty" + "dollars" + "and" + "ten" +
"cents".
3. Finally, don't worry about TTS. Microsoft Office has shipped characters
"speaking" for years without any audible output. Text in the balloon is all
that is available for many languages.
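To make points 1 and 2 concrete, here is a minimal VB6 sketch, assuming an Agent control named Agent1 on the form, the Merlin character installed, and an illustrative file path; the mode GUID is a placeholder, not a real engine ID:

```
' Assumes: Agent control "Agent1" on the form, the Merlin character
' installed, and a recorded greeting at the path shown (illustrative).
Dim Merlin As Object

Private Sub Form_Load()
    Agent1.Characters.Load "Merlin", "merlin.acs"
    Set Merlin = Agent1.Characters("Merlin")
    Merlin.Show

    ' Point 1: lip-sync to a recorded .WAV passed as Speak's second
    ' (Url) argument; the text argument is skipped.
    Merlin.Speak , "C:\Sounds\greeting.wav"

    ' Point 2: select a custom SAPI 4 voice/mode by its mode ID.
    ' The GUID below is a placeholder for your compiled voice's ID.
    Merlin.TTSModeID = "{00000000-0000-0000-0000-000000000000}"
    Merlin.Speak "you have fifty dollars and ten cents"
End Sub
```

This is only a sketch of the approach described above; the character, paths, and GUID all need to be replaced with your own.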
Hope that helps.
"Willem Nagel" <acti...@absamail.co.za> wrote in message
news:3e9ad...@news1.mweb.co.za...
Thank you for replying.
1. If you can provide me with the details of a provider with an "Afrikaans"
engine - I will pay for it.
2. I've tried also SAPI 4 developing recording words - but I feel limited to
only the words that had been recorded.
3. I have utilized point 3 with "Afrikaans", but [is] in Afrikaans gets
pronounced completely differently; it sounds like (iz) in Afrikaans. It seems
that the agent keeps pronouncing everything with a very high English
voice.
At this time I feel that I need more info on how to use the "Microsoft
Linguistic Information Sound Editing Tool" and what to do with the
linguistic information that has been generated by it. Or how to utilize it.
More info on tools that might be available on your point one might also
help.
Willem
"Agent Fan" <A...@tsquared.com> wrote in message
news:3e9addae$1...@news.microsoft.com...
"Willem Nagel" <acti...@absamail.co.za> wrote in message
news:3e9b6...@news1.mweb.co.za...
> Hi Agent
>
> Thank you for replying.
> 1. If you can provide me with the details of a provider with an "Afrikaans"
> engine - I will pay for it.
AF: I don't know of any, and I suggest that you might find it very hard to
find one because of the work involved. As mentioned it takes A LOT of work
to support any given language. I know Microsoft doesn't support your
language (they only support a limited set of TTS engines and all the ones
posted on the Agent site they licensed from Lernout & Hauspie - who sold
their speech technology to someone). You could check IBM or just query your
favorite search engine. But do NOT be surprised if you don't find one. There
are MANY languages not currently supported with TTS.
> 2. I've tried also SAPI 4 developing recording words - but I feel limited to
> only the words that had been recorded.
AF: Yes, that is the limitation with recorded speech or the SAPI 4 voice
approach, but if you want spoken audio in Afrikaans, it may be your only
choice.
> 3. I have utilized point 3 with "Afrikaans", but [is] in Afrikaans gets
> pronounced completely differently; it sounds like (iz) in Afrikaans. It
> seems that the agent keeps pronouncing everything with a very high English
> voice.
AF: If you are getting spoken audio for output, then that is because you ARE
using an American English engine. First, Agent ONLY supports those languages
supported by Windows. I don't know if Afrikaans is specifically supported by
Windows or Agent. If it is then there must be some LanguageID setting for
that language and you must set the character's LanguageID property (and
install that language component/dll for support). BUT that will ONLY render
the text output in the balloon in that language. The fact that you are
getting what appears to sound like American English output suggests that you
are using the default setting for the character's LanguageID suggesting that
you are NOT setting the LanguageID at all. If there is an Afrikaans setting
(you will have to check the Agent website), then you will have to set that
value for LanguageID in your program/script code. However, that will mean
the character will not speak audibly any longer unless you have a compatible
TTS engine because setting the LanguageID will tell Agent to look for a TTS
engine that matches that language and if it cannot find one, the character
will not speak audibly.
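In VB6 terms, setting the character's LanguageID might look like the sketch below; &H436 is the Windows locale ID for Afrikaans, but whether Agent accepts it depends on an installed language component and a matching TTS engine, as described above:

```
' Assumes an Agent control "Agent1" with a character already loaded.
' &H436 is the Windows LCID for Afrikaans; if no Afrikaans TTS engine
' is found, the character shows balloon text but does not speak aloud.
Private Sub SetAfrikaansBalloon()
    Dim Merlin As Object
    Set Merlin = Agent1.Characters("Merlin")
    Merlin.LanguageID = &H436
    Merlin.Speak "Goeie dag"
End Sub
```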
> At this time I feel that I need more info on how to use the "Microsoft
> Linguistic Information Sound Editing Tool" and what to do with the
> linguistic information that has been generated by it. Or how to utilize it.
AF: This is NOT likely to help you unless you want to do a lot of work. The
tool you mention only adds some additional phonetic information to your
pre-recorded audio file and more importantly this is primarily set up for
American English. It uses an American English speech recognition engine to
try to come up with this additional phonetic information. So having it try
to parse Afrikaans is not likely to work.
You can use the tool to manually set phonetic information, but that will
take a lot of work and all that will give you is that the character may use
a couple more mouth animations when he/she lip-syncs. But you don't need to
use this tool to get basic lip-sync. Just record your voice speaking and
save to a .WAV file (using Windows Sound Recorder) and pass this filename as
the SECOND parameter in your Speak statement.
However, as you found with the SAPI 4 SDK, this means the character will
ONLY be able to speak what you record.
AF: I have one more suggestion that might help a little. You have a little
control of the output of the American English TTS engine if you use the \Map
tag. This allows you to send the speech engine something different from what
appears in the word balloon. So for example you could create something that
would display "Afrikaan" in the word balloon, but pass it quasi-phonetically
to the TTS engine as "ah free kon". To find out more about this check the
Agent programming docs on Speech Output tags and look for the \Map tag. Note
that since this tag takes string parameters, how you format your Speak
statement in your code will depend on your programming language. For
example, VB looks for the first two matching double quotes, so you have to
add more double quotes either explicitly or by concatenating the string with
quote character codes. See the introduction to the Speech Output tags for an
example.
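In VB6, where embedded quotes are doubled inside a string literal, the \Map example above might look like this sketch (the quasi-phonetic spelling is illustrative):

```
' Displays "Afrikaan" in the word balloon while sending the TTS
' engine the quasi-phonetic text "ah free kon". Note the doubled
' quotes required inside a VB string literal.
Private Sub SpeakMapped(ByVal Merlin As Object)
    Merlin.Speak "\Map=""ah free kon""=""Afrikaan""\"
End Sub
```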
As a temporary alternative you could use the TTS3000 Dutch engine (for SAPI 4
/MSAgent). That TTS has a somewhat Flemish (Belgian Flemish) accent.
That is the only option, because a SAPI 4 compliant Afrikaans TTS engine will
never be made.
MSAgent does not work with SAPI 5.1, so...........
I will check how it sounds.
Cybarber
"Willem Nagel" <acti...@absamail.co.za> schreef in bericht
news:3e9b6...@news1.mweb.co.za...
Cybarber
"Agent Fan" <A...@tsquared.com> schreef in bericht
news:3e9c313c$1...@news.microsoft.com...
> Creating a TTS engine for a language is not a trivial effort, which explains
> why you can only find about a dozen languages supported right now and even
> if you could, you'd probably find that you'd likely have to pay for it.
Ok. This sounds like more fun than prison, which is a non-trivial pursuit.
Perhaps you'd be willing to provide information or hints which supports
furthering one's education in this area.
> Building a TTS engine first takes writing code that knows how to generate
> phonemes and all their possible combinations, then you have to analyze the
> specific language and create the data used to represent all those
> combinations.
Phonemes and all of their combinations? Is this not unlike breaking down
all of the syllables for a given language, or am I thinking of something
different? Like you've got crazy 'ch' variations and whatnot? Do I have to
do this using SAPI? I think I read somewhere in an old naval document that
there are 47 phonemes for American English; does this sound right?
Data used to represent those combinations... so, a word is going to have a
collection of sounds, and that collection of sounds, we need to calculate
those for every word we'd like to represent through sound, and then we need
to make sounds and then play them back? And what we're doing is defining all
the words in both textual form and phonetic form and from there we know
which sounds to pick. Am I warm or cold here?
> Next you need to be able to parse a text string and interpret words to
> phonetic representations, which also includes how to handle things like
> capitalization, abbreviations, or other special word forms.
Ok. This is the NLP thing rearing its head again, right? Analyzing the
sentence structure, the placement of commas, exclamation points, etc.? My
thinking is that the speech engine should handle phonetic-to-audio
conversion, and that a component higher up decides what sounds are there,
but hey, what do I know. Let's see what can be done here.
> So unless you address this to a company who already builds synthetic speech
> technology or someone with some experience, it is unlikely you will find
> anyone in this newsgroup who will be able to supply your request.
Hey, it's Wednesday, I might as well give it a shot. You want to warn the
speech group people about me, or would it be more fun to allow me to just
pop in there and surprise em?
> Phonemes and all of their combinations? Is this not unlike breaking down
> all of the syllables for a given language, or am I thinking of something
> different?
Phonemes != syllables.
Phonemes are the actual individual sounds the human vocal cords are capable
of producing during speech, whereas syllables are groups of phonemes that
can be spoken in the same breath.
Take the word "hello", for example. It contains two syllables:
he-lo
But it contains 4 phonemes:
hu-eh-l-oh
> Do I have to do this using SAPI?
Not exactly. If you write your own TTS engine, then you have to implement
the backend that SAPI calls into, you don't write your backend to call SAPI
to do the work for you.
> I think I read somewhere in an old naval document that there
> are 47 phonemes for American English, does this sound right?
Somewhere around that. Standard English (minus any slang and accents and
such) and many other spoken languages can be described with the IPA
(International Phonetic Alphabet): http://www.arts.gla.ac.uk/IPA/fullchart.html
> so, a word is going to have a collection of sounds, and that
> collection of sounds, we need to calculate those for every word
> we'd like to represent through sound, and then we need to make
> sounds and then play them back?
Basically, yes. Some engines simply have a collection of pre-recorded sounds
and then calculate which ones need to be played in sequence. That doesn't
really take into account things like pauses between sounds, drawn-out sounds,
grammar, stress, pitch, or inflection, either. Those should be handled as well
to help ensure more natural-sounding speech.
> And what we're doing is defining all the words in both
> textual form and phonetic form and from there we
> know which sounds to pick. Am I warm or cold here?
To do that would take considerable resources. What you're suggesting is
basically a 1<->1 look-up table. Do you realize how many words are in the
English language, let alone other languages? That would be MASSIVE tables
to maintain.
Most decent engines do try to perform some kind of rudimentary
processing on the textual words and then dynamically make educated guesses
as to which sounds to produce. They're not perfect, but each language does
have a given set of basic rules as to how words and grammar should be
recognized.
> Ok. This is the NLP thing rearing its head again, right? Analyzing the
> sentence structure, the placement of commas, exclamation points, etc.?
Yup. A definite necessity for any decent engine.
Gambit
I would only add a few minor comments:
1. As far as NLP goes, I don't think TTS engines typically include much
true NLP. TTS engines mostly appear to scan for punctuation and
capitalization and then have simple rules, such as if you see "Dr." and the
former word is capitalized say "drive", but if the following word is
capitalized, say "doctor". I imagine the internal rules are much more
complex, but I don't think they are really breaking down syntax (which is
typically what traditional NLP code does) into nouns, verbs, prepositional
phrases, etc. If they were, you would probably get better prosody. Arguably
you could say the analysis done is a form of NLP, but it seems pretty thin to
me; frankly, I have never seen the inside code of a TTS engine, so I am
only relating what I know from talking to some folks.
2. TTS engines are typically proprietary code. I don't think anyone (or any
of the major vendors, of which there aren't many) publishes the source. But
you are correct that the general theory of how some of them do it is
generally known, in that, as you suggested, some use phonetic snippets of a
recorded speaker and computationally blend them, sometimes adding stress or
other acoustic details. Others basically take an audio signal and
acoustically warp it to form phonetic information.
3. You are absolutely correct that all SAPI does is provide an interface
(API) into the functions supported by the engine. SAPI has no inherent
information on how to build the synthesis technology itself. Anyone can
create their own programming interface (and I think IBM and others may have
alternate APIs for their engines). SAPI only helps provide a more consistent
way to program so developers can swap out engines without changing their
code.
4. Creating speech engines is a tough business. So you find very few
players building and offering this technology; IBM, Microsoft, AT&T (and
their various spin-offs), formerly L&H, Elan, some Japanese companies,
Nuance, SpeechWorks. And I think some of them even cross-license code.
I imagine that there may be some freeware things out there as well, but
since most speech engines take considerable work, you aren't likely to find
too many that are robust or well supported.
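The simple "Dr." rule described in point 1 could be sketched in VB6 roughly as follows; this is a toy illustration of the idea, not how any real engine implements it:

```
' Toy disambiguation of "Dr.": before a capitalized word it reads
' "doctor" ("Dr. Smith"); after a capitalized word it reads "drive"
' ("Mulholland Dr."); otherwise fall back to the commoner reading.
Private Function ExpandDr(ByVal PrevWord As String, _
                          ByVal NextWord As String) As String
    If NextWord Like "[A-Z]*" Then
        ExpandDr = "doctor"
    ElseIf PrevWord Like "[A-Z]*" Then
        ExpandDr = "drive"
    Else
        ExpandDr = "doctor"
    End If
End Function
```

A real engine's rule set would of course be far larger, but the shape (look at neighboring tokens, pick an expansion) is the same.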
Speech is so effortless for us to use, but is still very hard for a computer
to process, both on the input and output side.
Further, many people forget that our use of language goes far beyond just
acoustic recognition and generation. There are multiple parallel processes
involved in how we process language. Speech is still struggling with the most
basic, precision/accuracy, but even if you had 100% accuracy, there is still
the issue of meaning.
A great book on getting a perspective on all this is Hal's Legacy, by David
Stork. It is a compilation of chapters from noted researchers in the various
fields that would be necessary to produce something like the HAL 9000 and
why this future projection of technology has yet to become reality.
"Remy Lebeau" <gamb...@yahoo.com> wrote in message
news:encOkSGB...@TK2MSFTNGP11.phx.gbl...
Yeah, it's pretty amazing what one can do after having taken a rudimentary
community college speech class, hehe ;-)
> Speech is so effortless for us to use
Effortless? Considering that it takes several years for a typical human to
learn how to grasp the fundamentals of recognizing the language, let alone
how to then reproduce it ourselves :-) You'd think with all the
advancements in computer technology that they'd have figured out already how
to let a computer actually learn so it can do the same thing. But then, you
know what they say - the human mind is the most advanced computer ever
created. So I guess hardware computers still have a ways to go before they
can match the true processing power of the biological computer :-) If it
takes a human several years to learn a language, then I guess computers have
to take decades instead.
Gambit
With the internet bubble bursting, 9-11, and then the economy, Bill Gates
already lost half his wealth - and here comes Remy Lebeau to write plugins
for netscape and give away proprietary information. Will Microsoft survive?
> TTS engines mostly appear to scan for punctuation and
> capitalization and then have simple rules, such as if you see "Dr." and the
> former word is capitalized say "drive", but if the following word is
> capitalized, say "doctor".
In my evaluation, this is defunct. The engine should perform no
interpretation whatsoever. Do the vocal cords decide what to say, or does
the brain? There is no reason whatsoever for my throat to perform
interpretation. When I want to say Drive or Doctor, my gray matter will
provide the instructions. Think outside the voice box.
Now, 'Agent', on the other hand, is sort of a 'dude front end'; its Speak
methods might benefit from a degree of interpretation, but for the most
part, I wind up having to work around that stuff and resort to map tags and
phonetics anyway. Really, they should ditch that quasi-interpretation crap
(yes, it's crap) and put it in the right place, or somewhere it can be
turned on or off.
> I imagine the internal rules are much more
> complex, but I don't think they are really breaking down syntax (which is
> typically what traditional NLP code does) into nouns, verbs, prepositional
> phrases, etc.
That's what I was hoping you would elaborate on. Now I understand your take
on NLP. Structural analysis of the sentence to determine context and
subsequently appropriate enunciation. Thank you.
> If they were, you would probably get better prosody.
Prosody. Prose. Melody of speech. Natural colors of language. Terms you use
frequently. Does not fully compute. It would be greatly appreciated if you
would be willing to elaborate your interpretation of these phrases. They
sound very intuitive but present a very vague picture.
> 2. TTS engine is typically proprietary code. I don't think anyone (or any
> of the major vendors of which there aren't many) publish the source.
There are a few examples on the web; they are largely from defunct
operating systems or deprecated/ancient computer manufacturers. All of them
are cheesy.
> 4. Creating speech engines is a tough business. So you find very few
> players building and offering this technology; IBM, Microsoft, AT&T (and
> their various spin-offs), formerly L&H, Elan, some Japanese companies,
> Nuance, SpeechWorks. And I think some of them even cross-license code.
I get so bored sometimes. I might as well try. What you've offered is a good
basis to isolate on.
> I imagine that there may be some freeware things out there as well, but
> since most speech engines take considerable work, you aren't likely to find
> too many that are robust or well supported.
You are correct, I've looked around quite a bit, there isn't much in the way
of good free/cheap speech engines. The best low bandwidth
(non-concatenated-wave) speech engine is $5000 for an internet license, a
big barrier for the small guy.
> There are multiple parallel processes
> involved in how we process language. Speech is still struggling with the
> most basic, precision/accuracy, but even if you had 100% accuracy, there is
> still the issue of meaning.
It's a bit like peeling an onion. The further in you go, the more likely you
are to wind up crying. Meaning, while fundamental to recognition, is
superfluous for TTS. What needs to occur first is good phonetic-to-verbal
conversion. Next, inflection, evocation of emotional valence, and the
occasional phlegmy esophagus. Yes, you'd like the autonomous
verbalization of everything, but you've got to put it together in separate
pieces, not a convoluted rollercoaster of potential mistakes.
> A great book on getting a perspective on all this is Hal's Legacy, by David
> Stork. It is a compilation of chapters from noted researchers in the various
> fields that would be necessary to produce something like the HAL 9000 and
> why this future projection of technology has yet to become reality.
Another book proclaiming that it can't be done. Do we want to read about a
bunch of researchers explaining that it will require more golf and 4 hour
lunches before we can talk to our computer? That Stork is carrying the same
baby everyone else is. Never underestimate what a kid can do with enough
LEGOs.
Ask not what HAL 9000 can do for you, but what you can do for HAL 9000.
> In my evaluation, this is defunct. The engine should perform
> no interpretation whatsoever. Do the vocal cords decide what
> to say, or does the brain? There is no reason whatsoever for
> my throat to perform interpretation. When I want to say Drive
> or Doctor, my gray matter will provide the instructions. Think
> outside the voice box.
Hello - in computer terms, the TTS engine *is* the brain. There's nothing
else telling the engine what to do, because it's already at the top (or low,
however you look at it) level of the chain. It takes some input (aka,
"eyes" or "ears"), processes it (itself as the "brain"), then tells the
audio card ("vocals") what to do.
Gambit
"Remy Lebeau" <gamb...@yahoo.com> wrote in message
news:#o2IIocB...@TK2MSFTNGP11.phx.gbl...
AF: If you have some code to break down how words should be pronounced, go
for it. Just use the \Map tag. If you use SAPI directly there may be a more
efficient way to drive your own preferred pronunciation. But most of us don't
have higher level code to interpret context and pass on the appropriate
interpretation, so it's handy for the TTS engine to try to make basic
assumptions. I think the majority of people using TTS find its default
assumptions about pronunciation acceptable. It would take a heck of a lot
more work to pre-process spoken output if everyone had to do it themselves.
Sure
it would be great if that existed as some separate piece of code, but I am
not aware of anything like that. And I suspect even if it existed it would
still occasionally make mistakes as so much of pronunciation is linked to
understanding context.
Fundamentally, I suspect that current approaches to technology will only be
able to make incremental improvements in TTS and SR because the processing
model is so different than the way we do processing. A more interesting
approach is Lloyd Watts' work where he does auditory processing by modeling
more of the natural structures of the mammalian auditory apparatus. This
gives him a potentially better way to distinguish a voice from other voices
or sounds. Human speech processing really involves much more than just
phonetic matching.
(Look up McGurk effect in your favorite search engine.)
> > I imagine the internal rules are much more
> > complex, but I don't think they are really breaking down syntax (which is
> > typically what traditional NLP code does) into nouns, verbs, prepositional
> > phrases, etc.
>
> That's what I was hoping you would elaborate on. Now I understand your take
> on NLP. Structural analysis of the sentence to determine context and
> subsequently appropriate enunciation. Thank you.
AF: NLP typically does little to define context. Most NLP software only
identifies the parts of speech (nouns, verbs, etc.). Some may provide some
rudimentary semantic meaning (like which noun is the subject) or a special
format (like 8:00 a.m. appears to be time). Such things MAY help toward
some contextual understanding and appropriate pronunciation, but not
necessarily. Language understanding appears to involve a lot more including
some form of world knowledge. With the exception of Winograd's celebrated
SHRDLU, there have been few demonstrations of NLP technology that
demonstrate the combination. And Winograd appeared to abandon his work.
> >If they were, you would probably get better prosody.
>
> Prosody. Prose. Melody of speech. Natural colors of language. Terms you use
> frequently. Does not fully compute. It would be greatly appreciated if you
> would be willing to elaborate your interpretation of these phrases. They
> sound very intuitive but present a very vague picture.
>
AF: When we speak, it is typically not flat/monotone, but there are pitch
changes and pacing changes that we use to convey meaning. The simplest form
is that our voices typically rise pitch at the end of a sentence to signal
it is a question and decline to signal a statement. But there are all types
of subtle pitch, emphasis, and pace/rate cues we toss in. Listen to a
typical TTS engine, and with a few exceptions it is exceedingly monotone,
with little variation. That means it is not only unappealing to listen to
(boring), but it also fails to communicate those important cues for how to
interpret what is said.
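Agent's speech output tags do expose a little manual prosody. A hedged VB6 sketch using the documented \Pit (pitch in hertz) and \Spd (words per minute) tags, with illustrative values that should be checked against your engine's supported range:

```
' Crude manual prosody: slow the statement slightly, raise pitch for
' the question, then reset to the engine defaults with \Rst.
Private Sub SpeakWithProsody(ByVal Merlin As Object)
    Merlin.Speak "\Spd=150\That is a statement. " & _
                 "\Pit=260\Is that a question?\Rst\"
End Sub
```

This only approximates the rising question contour described above; the engine still speaks each span at a flat pitch.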
Tests that I have read about with regard to TTS suggest that the current
accuracy of engines is pretty good, that is, most people can understand what
is being spoken, despite the fact that the production often is uneven.
However, it is that unevenness that often makes the synthetic voice
unnatural (or annoying). Some words are very well spoken, while others are
slurred. This again relates to our expectations of prosody. Human speech is
more typically consistent in its quality. So when an engine goes from good
pronunciation of a few words, then slops over the next couple, it sounds
very dissonant. It's almost like our auditory apparatus attempts to tune to
the level of quality of the output. And when it keeps jumping around, we end
up more engaged in the process of hearing than in listening.
> If you use SAPI directly
> there may be a more efficient way to drive your own preferred pronunciation.
HEAD: That's useful to consider rather than reinventing a wheel. Thank you.
> But most of us don't have higher level code to interpret context and pass
> on the appropriate interpretation, so it's handy for the TTS engine to try
> to make basic assumptions. I think the majority of people using TTS find
> its default assumptions about pronunciation acceptable. It would take a
> heck of a lot more work to pre-process spoken output if everyone had to do
> it themselves. Sure it would be great if that existed as some separate
> piece of code, but I am not aware of anything like that. And I suspect even
> if it existed it would still occasionally make mistakes as so much of
> pronunciation is linked to understanding context.
HEAD: I'm seeing something in my head I'm not able to put to words or code
just yet. When time is opportune, perhaps I can demonstrate some of the
concepts. I can tell you get what I'm driving at, anyway. I think the SAPI
access may facilitate what I'm driving at, although I'm inclined to examine
things in my own way outside of what exists.
> Fundamentally, I suspect that current approaches to technology will only be
> able to make incremental improvements in TTS and SR because the processing
> model is so different than the way we do processing. A more interesting
> approach is Lloyd Watts' work where he does auditory processing by modeling
> more of the natural structures of the mammalian auditory apparatus.
HEAD: Another useful lead. I'm not sure if Mr. Watts believed this, but I've
a theory that better speech output will actually arise from hardware design
rather than solely software. James Earl Jones probably sounds a lot beefier
in person. Modeling replicas of human organs to replicate human capacities
certainly does appear to offer advantages. I imagine much better processing
would be required to implement these models.
> This gives him a potentially better way to distinguish a voice from other
> voices or sounds. Human speech processing really involves much more than
> just phonetic matching. (Look up McGurk effect in your favorite search
> engine.)
HEAD: I may be familiar with the McGurk effect. From vague memory of when I
was studying facial animation and lip syncing, I believe that this was a
situation where certain audible sounds, presented in conjunction with an
inappropriate visual of a mouth formation, would actually cause the observer
to perceive that a different sound was produced. That might be a different
'effect', however. I'll refresh my memory.
> AF: NLP typically does little to define context. Most NLP software only
> identifies the parts of speech (nouns, verbs, etc.). Some may provide some
> rudimentary semantic meaning (like which noun is the subject) or a special
> format (like 8:00 a.m. appears to be time). Such things MAY help toward
> some contextual understanding and appropriate pronunciation, but not
> necessarily. Language understanding appears to involve a lot more including
> some form of world knowledge. With the exception of Winograd's celebrated
> SHRDLU, there have been few demonstrations of NLP technology that
> demonstrate the combination. And Winograd appeared to abandon his work.
HEAD: SHRDLU. Gotcha. Will take a look. A shame that work was abandoned. So
many benefits could be derived from the advancement of these technologies.
There are a lot of lonely people out there who might like someone to talk
to.
HEAD: I appreciate that you took the time to elaborate and provide some
points of reference. Hopefully, everyone that is interested can benefit from
the understanding you've provided.