IF & voice technologies


Grankin Andrey

Jul 14, 2003, 1:48:44 PM
Looking at discussions about IF, its profitability, and its future, I
haven't seen any mention of voice-driven IF, and I wonder why.
I've seen one system that supports it ("Harp", if I'm not mistaken), and
some posts about it. Before I found them, I wrote an article about the
future of IF in which I argued that only speech recognition and voice
output can turn IF-writing into a viable business again (the article is
available in Russian only).
The possibilities are easy to imagine: the "target group" could be widely
expanded to include children, the elderly, the blind, and people who want
to save their eyes for something else.
Of course, speech recognition is still quite poor, and such games would
take more time to develop, because they would require sound
management or voice markup (<shout><volume=1>Don't
<volume=2><speed=75>do<speed=80> it!</shout>,
for example). Markup could be replaced by something similar to
"motion capture" technology for intonation, speed, and volume level.
But all those difficulties are much easier to overcome than making
IF popular again in its present form.
Can somebody explain the small interest in these technologies?
Why do none of the Inform, TADS, and Hugo interpreters support
even text-to-speech engines (or do they, and I just don't know)?
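As an illustration, markup like the example above could be parsed into prosody-annotated segments before being sent to a speech engine. This is a hypothetical tag format modeled on the post's example, not any existing standard:

```python
import re

# Hypothetical voice-markup parser: splits text containing <volume=N>,
# <speed=N>, and <shout> tags into plain-text segments, each annotated
# with the prosody state in effect when that segment is spoken.
TAG = re.compile(r"<(volume|speed)=(\d+)>|</?shout>")

def parse_voice_markup(markup):
    segments = []
    state = {"volume": 1, "speed": 100, "shout": False}
    pos = 0
    for m in TAG.finditer(markup):
        text = markup[pos:m.start()]
        if text:
            segments.append((text, dict(state)))
        if m.group(1):                       # <volume=N> or <speed=N>
            state[m.group(1)] = int(m.group(2))
        else:                                # <shout> or </shout>
            state["shout"] = not m.group(0).startswith("</")
        pos = m.end()
    if markup[pos:]:
        segments.append((markup[pos:], dict(state)))
    return segments
```

A TTS front end could then map each segment's volume and speed state onto whatever controls the engine actually exposes.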

--
Andrey Grankin

dgr...@cs.csbuak.edu

Jul 14, 2003, 3:29:01 PM
Grankin Andrey <gran...@mail.ru> wrote:
> Looking at discussions about IF, its profitability, and its future, I
> haven't seen any mention of voice-driven IF, and I wonder why.
> I've seen one system that supports it ("Harp", if I'm not mistaken), and
> some posts about it. Before I found them, I wrote an article about the
> future of IF in which I argued that only speech recognition and voice
> output can turn IF-writing into a viable business again (the article is
> available in Russian only).

I actually did this as a contracted job. Frotz was to be modified to
support voice input and output. The code for voice I/O is present in the
latest source, and the docs include some general discussion of the
speech-processing code. Unfortunately, the company that commissioned the
project walked out without paying and accused me of saying it was
impossible. The voice input part seems to work fine as long as you have a
complete list of words that the game is expected to hear. This list is
then converted into a dictionary for the voice recognition engine.
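The constrained-dictionary approach described here might look roughly like this sketch; the function name and the built-in system phrases are illustrative, not actual Frotz code:

```python
def build_recognizer_dictionary(game_words, system_phrases=("again", "undo")):
    """Collect every word the game's parser can accept into the kind of
    word list a constrained speech-recognition engine expects:
    lowercased, deduplicated, alphabetic-only, sorted."""
    vocab = set(w.lower() for w in game_words)
    vocab.update(system_phrases)
    # Recognition engines generally choke on punctuation tokens,
    # so keep only purely alphabetic entries.
    return sorted(w for w in vocab if w.isalpha())
```

The resulting list would then be compiled into whatever grammar or dictionary format the particular recognition engine expects.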

> The possibilities are easy to imagine: the "target group" could be widely
> expanded to include children, the elderly, the blind, and people who want
> to save their eyes for something else.
> Of course, speech recognition is still quite poor, and such games would
> take more time to develop, because they would require sound
> management or voice markup (<shout><volume=1>Don't
> <volume=2><speed=75>do<speed=80> it!</shout>,

[snip]

I determined that using markup for changing voice characteristics would
be a horribly messy hack for the Z-machine. Glulx seems much better suited
to this sort of thing. This is on my general todo list for when voice I/O
for Frotz is complete.


--
David Griffith

Lucian P. Smith

Jul 14, 2003, 6:08:43 PM
Grankin Andrey <gran...@mail.ru> wrote in <c51c0fb4.0307...@posting.google.com>:

: Can somebody explain the small interest in these technologies?

: Why do none of the Inform, TADS, and Hugo interpreters support
: even text-to-speech engines (or do they, and I just don't know)?

The reason you never hear about this is that computers blind people use
already have text-to-speech software that they use *with* an IF
interpreter like DOS Frotz. There are slight issues with the status bar
(it's not heard unless asked for specifically, or turned off entirely with
a command-line option), but in general, current interpreters already hook
fairly seamlessly into existing general text-to-speech software.

Voice input is a slightly different beast, but again, the technology is
growing so quickly that it's probably easier to hook a generic voice input
program to an existing IF interpreter than it is to meld the two together.
The potential advantage would be that (like the spell-checker in Nitfol)
you could auto-correct ambiguous input to match something in the game's
vocabulary.
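That auto-correction idea could be sketched with plain Levenshtein edit distance, snapping a misheard word to the game's vocabulary (a simplification: a real recognizer would score acoustics, not spelling):

```python
def edit_distance(a, b):
    # Classic Levenshtein distance via dynamic programming,
    # keeping only one row of the table at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def snap_to_vocabulary(word, vocab, max_dist=2):
    """Return the closest vocabulary word, or None if nothing is close."""
    best = min(vocab, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_dist else None
```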

If you want to chat with blind people about what they use, I suggest
starting with Audyssey, at audysseymagazine.org. A lot of them play text
adventures.

-Lucian

Rexx Magnus

Jul 15, 2003, 4:31:21 AM
On Mon, 14 Jul 2003 22:08:43 GMT, Lucian P. Smith scrawled:

>
> Voice input is a slightly different beast, but again, the technology is
> growing so quickly it's probably easier to hook a generic voice input
> program to an existing IF interpreter than it is to meld the two
> together. The potential advantage would be (like the spell-checker in
> Nitfol) you could auto-correct ambiguous input to match something in the
> game's vocabulary.
>

I've found that the voice input program that comes with the Microsoft
Speech SDK is actually quite good (and was a few years ago, too). It
will interface with just about anything, be it through APIs or by simply
pretending to be a keyboard.

The most annoying thing, though, is that speech synthesis doesn't appear
to have come along very far over the past few years.

--
UO & AC Herbal - http://www.rexx.co.uk/herbal

To email me, visit the site.

David Thornley

Jul 15, 2003, 2:13:06 PM
In article <Xns93B960F0C77...@130.133.1.4>,
Rexx Magnus <tras...@uk2.net> wrote:
>
>The most annoying thing is though, that speech synthesis hasn't appeared
>to come along very far over the past few years.
>
If I remember where it was a few years ago, it was quite adequate to
transform written text into monotonous speech, and it still is.

The problem is that most people find it difficult to listen to speech
without any sort of inflection, and there are really only two ways to
get inflection into computer-generated speech: either use a lot of
markup or have the computer understand what it is saying.


--
David H. Thornley | If you want my opinion, ask.
da...@thornley.net | If you don't, flee.
http://www.thornley.net/~thornley/david/ | O-

Rexx Magnus

Jul 15, 2003, 3:53:22 PM
On Tue, 15 Jul 2003 18:13:06 GMT, David Thornley scrawled:

> If I remember where it was a few years ago, it was quite adequate to
> transform written text into monotonous speech, and it still is.
>
> The problem is that most people find it difficult to listen to speech
> without any sort of inflection, and there's really only two ways to
> get inflection into computer-generated speech: either use a lot of
> markup or have the computer understand what it is saying.

Well, it wasn't so much that as the way most of the words were
pronounced. It's quite difficult for me to write things that sound right,
especially names or non-dictionary words, as the MS Speech SDK has an
American accent, which makes it hard to get words to come out correctly.
Most of the time, the synth voices sound like people talking with
a mouthful of plum stones, which doesn't help a lot either.

Christos Dimitrakakis

Jul 16, 2003, 7:51:22 AM

I think that the Speech SDK is based on simple formants for creating
sounds. This is an approach that has been in use since the WWII-era US
military radios.

More flexible models, such as Hidden Markov Models and neural networks,
might offer more realism. Probably the easiest thing to do is to train a
neural network as a look-ahead predictor, whose inputs would be the
current sample and a time index and whose output would be the predicted
next value. You can train it to predict a sound and then use it to
reproduce the same sound by hooking its output to its input and varying
the index input as fast as you'd like, so that the sound plays back at
different speeds. There are quite a lot of different algorithms for
synthesis, and their implementation is not particularly hard. However, the
problem is collecting and annotating data and then adapting your models
to it. That is a task that might take much longer than actually coding the
stuff...
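The look-ahead predictor described above can be sketched with a tiny two-layer network trained on a sine "sound" (NumPy, plain batch gradient descent; the network size, learning rate, and iteration count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: a short sine "sound". Inputs are (current sample,
# normalized time index); the target is the next sample.
t = np.linspace(0.0, 1.0, 200, endpoint=False)
wave = np.sin(2 * np.pi * 3 * t)
X = np.column_stack([wave[:-1], t[:-1]])   # inputs: (sample, index)
y = wave[1:]                               # target: the next sample

# One hidden tanh layer, trained by plain batch gradient descent.
W1 = rng.normal(0.0, 0.5, (2, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, (16, 1)); b2 = np.zeros(1)
losses = []
for _ in range(2000):
    h = np.tanh(X @ W1 + b1)
    pred = (h @ W2 + b2).ravel()
    err = pred - y
    losses.append(float(np.mean(err ** 2)))
    g_pred = (2.0 * err / len(y))[:, None]      # dLoss/dpred
    gW2 = h.T @ g_pred; gb2 = g_pred.sum(0)
    g_h = (g_pred @ W2.T) * (1.0 - h ** 2)      # backprop through tanh
    gW1 = X.T @ g_h; gb1 = g_h.sum(0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= 0.1 * g

# Free-running synthesis: feed the prediction back as the next input
# while stepping the time index.
sample, out = float(wave[0]), []
for idx in t[:-1]:
    h = np.tanh(np.array([sample, idx]) @ W1 + b1)
    sample = (h @ W2 + b2).item()
    out.append(sample)
```

Stepping `idx` through its range faster or slower changes the playback speed, which is the point of having the index as an input.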

--
Christos Dimitrakakis

David Thornley

Jul 16, 2003, 8:44:37 AM
In article <Xns93B9D47FB71...@130.133.1.4>,
Hmmm, we seem to have been listening to different synthesized speech.
I've heard synthesized speech in which the words were individually
very understandable, but when saying anything of length became
hard to understand and tiring to listen to.

Grankin Andrey

Jul 16, 2003, 3:45:48 PM
Rexx Magnus <tras...@uk2.net> wrote in message news:<Xns93B9D47FB71...@130.133.1.4>...

I sometimes listen to books before sleeping (an excellent soporific :).
I can say that the speech is quite pleasant with a good engine and API.
The API has a dictionary in which you can enter the correct
pronunciations of words.
Some people have made dictionaries of nearly 40,000 words, and with them
the program speaks much better (even in such a hard language as Russian!).
There must be many such dictionaries for different English dialects, and
you should look for them if the program pronounces words wrongly.

Daniel Dawson

Jul 19, 2003, 12:02:59 AM
You pick up and read article <Xns93B960F0C77...@130.133.1.4>,
written by Rexx Magnus <tras...@uk2.net>. It says:
>The most annoying thing is though, that speech synthesis hasn't appeared
>to come along very far over the past few years.

Have you ever heard the speech synthesis that was virtually built into the old
Amiga? Now, I don't listen to a lot of speech synthesis, but IMHO, the sound
was pretty clear. In that respect, nothing else I've heard sounds much
better. But again, I haven't heard a lot. What do you think?

--
| Email: Daniel Dawson <ddawson at icehouse.net> ifMUD: DanDawson |
| Web: http://www.icehouse.net/ddawson/ X-Blank: intentionally blank |



Josh Vanderhoof

Jul 19, 2003, 1:36:45 PM
dda...@nospam-icehouse.net (Daniel Dawson) writes:

> You pick up and read article <Xns93B960F0C77...@130.133.1.4>,
> written by Rexx Magnus <tras...@uk2.net>. It says:
> >The most annoying thing is though, that speech synthesis hasn't
> >appeared to come along very far over the past few years.
>
> Have you ever heard the speech synthesis that was virtually built
> into the old Amiga? Now, I don't listen to a lot of speech
> synthesis, but IMHO, the sound was pretty clear. In that respect,
> nothing else I've heard sounds much better. But again, I haven't
> heard a lot. What do you think?

SoftVoice (www.text2speech.com) is still selling the Amiga speech
synthesizer. The only one that sounds better to me is AT&T Natural
Voices (naturalvoices.att.com).

There is also Festival, which is free/open source.
(http://www.cstr.ed.ac.uk/projects/festival/)
It sounds more natural than the Amiga, but less clear.

Rexx Magnus

Jul 19, 2003, 5:12:33 PM
On Sat, 19 Jul 2003 04:02:59 GMT, Daniel Dawson scrawled:

> Have you ever heard the speech synthesis that was virtually built into
> the old Amiga? Now, I don't listen to a lot of speech synthesis, but
> IMHO, the sound was pretty clear. In that respect, nothing else I've
> heard sounds much better. But again, I haven't heard a lot. What do you
> think?

Yes, I much preferred the Amiga's speech synthesis. Quite why they took
the apps out of the OS from 3.0 onwards, I'll never know. They left the
libraries in, though.

Daniel Dawson

Jul 20, 2003, 4:50:58 AM
You pick up and read article <m37k6eju4y.fsf@y.z>, written by
Josh Vanderhoof <joshvan...@ml1.net>. It says:
>There is also Festival, which is free/open source.
>(http://www.cstr.ed.ac.uk/projects/festival/)
>It sounds more natural than the Amiga, but less clear.

Perhaps. I admit it's been a while since I listened to the Amiga speech, but I
do remember first hearing M$'s speech synthesis and immediately thinking how
typically unnatural it sounded.

Certainly Festival sounds more natural than that, although it has its problems,
such as pausing in the wrong places and even occasionally garbling its
pronunciation. For instance, I input the sentence "In my humble opinion, the
speech synthesis on the Amiga was quite clear." But it comes out sounding like
"In *my* humble opinion ... the speech synthesis ... n the *Miga* was quite
clear."

Then again, I'm using Debian, and you probably know how they like to test
things extensively before calling them 'stable'. Maybe Festival has improved
since this version?

Rexx Magnus

Jul 20, 2003, 6:21:17 AM
On Sat, 19 Jul 2003 17:36:45 GMT, Josh Vanderhoof scrawled:

>
> SoftVoice (www.text2speech.com) is still selling the Amiga speech
> synthesizer. The only one that sounds better to me is AT&T Natural
> Voices (naturalvoices.att.com).
>

The AT&T one is quite impressive!

Josh Vanderhoof

Jul 20, 2003, 3:13:42 PM
dda...@nospam-icehouse.net (Daniel Dawson) writes:

> Certainly Festival sounds more natural than that, although it has
> its problems, such as pausing in the wrong places and even
> occasionally garbling its pronunciation. For instance, I input the
> sentence "In my humble opinion, the speech synthesis on the Amiga
> was quite clear." But it comes out sounding like "In *my* humble
> opinion ... the speech synthesis ... n the *Miga* was quite clear."

You may have better luck saving the output in a file and then playing
that. I vaguely remember having a problem with Festival not being
fast enough for real-time output. It's been a while since I've used
it though.

Christos Dimitrakakis

Jul 21, 2003, 7:14:20 AM
Hm, examples on a small-vocabulary database of numbers with some systems I
have heard give a word error rate of around 10%. Some examples of wrong
sentences:

obtained: four eighteen two zero
desired: four eight two two zero

obtained: six three
desired: fifteen

obtained: seven two
desired: seventy two

obtained: oh three five nine
desired: oh two eight five nine

As you can see, some examples are pretty far off. This is a vocabulary of
only 33 words. Results tend to get worse with larger vocabularies.
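For reference, figures like the 10% above are word error rate: the word-level edit distance (substitutions, insertions, deletions) between the obtained and desired sentences, divided by the number of words in the desired sentence. A sketch:

```python
def word_error_rate(obtained, desired):
    """Word-level Levenshtein distance divided by the length of the
    desired (reference) sentence."""
    hyp, ref = obtained.split(), desired.split()
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (h != r)))  # substitution
        prev = cur
    return prev[-1] / len(ref)
```

On the first example above ("four eighteen two zero" vs. "four eight two two zero"), one substitution plus one deletion over five reference words gives a WER of 0.4.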

I have not seen any stand-alone recognition systems, but I imagine they
must have some kind of vocabulary/dictionary so that they won't just
output garbage. That might make it a bit difficult for the IF author to
add words that would normally not be recognised by the system (e.g. the
word 'Frotz'). I guess there are also systems that try to construct
general rules about possible words rather than explicit vocabularies.
These would be preferable in an IF setting.

Also, perhaps it is not entirely possible to get rid of the prompt text,
since it would provide feedback to the player as to what was
actually understood by the program. This is essential, since speech
recognition is not so good.


--
Christos Dimitrakakis

a. deubelbeiss

Jul 24, 2003, 7:05:26 PM
Christos Dimitrakakis <oleth...@oohay.com> wrote in message news:<pan.2003.07.16.13....@oohay.com>...
[...]
> I think that the [MS] Speech SDK is based on simple formants for creating
> sounds. This is an approach that has been in use since the WWII US
> military radios.

The MS homepage didn't want to tell me anything specific about what
exactly the SDK is, but the only Microsoft speech synthesizer I'm
aware of (called Whistler, included in SAPI 4.0 and later) is a
concatenation job, not a formant synth.
http://research.microsoft.com/srg/ssproject.aspx
describes the one I mean.

Of course, you never know. Maybe they accidentally developed two.

Also, Warning: I don't really know much about speech synthesis beyond
a few buzzwords.

Christos Dimitrakakis

Jul 25, 2003, 9:57:35 AM

I think concatenation means modelling diphones, i.e. the transitions from
one phone to the next. As for formant models, even very simple ones work
quite well. MS Research publications on speech synthesis are varied and
include research into many different ways to do the job. There are
formant models and the related frequency-domain models, and there are
also filtering models (which can also be translated into frequency-domain
models, I guess).
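A formant model of the kind mentioned here can be sketched very simply: a pitch-period impulse train excites one second-order resonator per formant. The formant frequencies below are rough textbook values for an "ah" vowel, purely illustrative:

```python
import numpy as np

def formant_vowel(formants, duration=0.3, pitch=110.0, rate=8000):
    """Minimal formant-synthesis sketch: a glottal pulse train excites
    one second-order resonator per formant frequency."""
    n = int(duration * rate)
    # An impulse train at the pitch period approximates the glottal source.
    source = np.zeros(n)
    source[::int(rate / pitch)] = 1.0
    out = np.zeros(n)
    for freq in formants:
        r = 0.97                             # pole radius (bandwidth)
        theta = 2 * np.pi * freq / rate      # pole angle (center frequency)
        y = np.zeros(n)
        for i in range(n):
            y[i] = (source[i]
                    + 2 * r * np.cos(theta) * (y[i - 1] if i > 0 else 0.0)
                    - r * r * (y[i - 2] if i > 1 else 0.0))
        out += y
    return out / np.max(np.abs(out))         # normalize to [-1, 1]

# Rough formants for "ah": F1 ~ 700 Hz, F2 ~ 1200 Hz.
ah = formant_vowel([700.0, 1200.0])
```

Writing `ah` to a .wav file at 8 kHz gives a buzzy but recognizable vowel; real formant synthesizers add per-formant bandwidth and amplitude control and a proper glottal waveform.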

I could not download any of the papers as all my clicks resulted in a
copyright notice by the IEEE/ICSA, telling me I could look but not copy.
After that, nothing was actually downloaded. So, fuck them, I'm not going
to waste any more of my time on those losers.


--
Christos Dimitrakakis

Christos Dimitrakakis

Jul 25, 2003, 10:01:23 AM
On Fri, 25 Jul 2003 15:57:35 +0200, Christos Dimitrakakis wrote:

> I could not download any of the papers as all my clicks resulted in a
> copyright notice by the IEEE/ICSA, telling me I could look but not copy.
> After that, nothing was actually downloaded. So, fuck them, I'm not
> going to waste any more of my time on those losers.
>

(The pages contain a working link to the .pdf, which you can snip
out and use, but the horrible copyright notices make me want to puke.)

--
Christos Dimitrakakis

Al Terer

Jul 26, 2003, 4:18:13 PM
On 14 Jul 2003 10:48:44 -0700, gran...@mail.ru (Grankin Andrey)
wrote:

>Looking at discussions about IF, its profitability, and its future, I
>haven't seen any mention of voice-driven IF, and I wonder why.

<snip>


>I wrote an article about the future
>of IF in which I argued that only speech recognition and voice output
>can turn IF-writing into a viable business again (the article is
>available in Russian only).

<snip>


>Why do none of the Inform, TADS, and Hugo interpreters support
>even text-to-speech engines (or do they, and I just don't know)?

Most people on this thread seem to be focused on computer generated
speech, but what about something like books on tape, or even radio
dramas? The descriptions, messages, and dialog could be pre-recorded
by voice actors. Music and sound effects could also be present. Of
course, this would take a lot more work than an ordinary adventure,
and would almost have to be some sort of collaboration.

Imagine: you open the game file, and hear haunting music. A hollow
voice says, "Welcome to Adventure."
"Go north," you tell your computer. And so on...

I'd buy it.


RHanke

Jul 27, 2003, 10:17:51 AM
AT> Most people on this thread seem to be focused on computer generated
AT> speech, but what about something like books on tape, or even radio
AT> dramas? The descriptions, messages, and dialog could be pre-recorded
AT> by voice actors. Music and sound effects could also be present. Of
AT> course, this would take a lot more work than an ordinary adventure,
AT> and would almost have to be some sort of collaboration.
AT>
AT> Imagine: you open the game file, and hear haunting music. A hollow
AT> voice says, "Welcome to Adventure."
AT> "Go north," you tell your computer. And so on...

Hi,

nice to see somebody mentioning this. It has been my dream since back
in my schooldays to eliminate all visual components and do an all-audio
game (but back then, with 8-bit 4-voice Paula sound and a 7 MHz
processor: nope, Sir).

I've been working for some time on a cross-platform system that is
eventually supposed to make exactly what you mentioned possible.

It's a little motivation boost to see somebody dreaming of this too.
I've got some hopes for it, as some of my friends (who would never play
text adventures) actually like the idea, too. But it's definitely a
*much* more difficult task than having a computer read out text in
a computer-generated voice, which should probably be the next step.

Thanks!

And now you may go ahead and call me crazy for even trying ...

Harry

Jul 27, 2003, 1:03:30 PM
On Sat, 26 Jul 2003 15:18:13 -0500, Al Terer <ter...@yahoo.com> made
the world a better place by saying:

>
>Most people on this thread seem to be focused on computer generated
>speech, but what about something like books on tape, or even radio
>dramas? The descriptions, messages, and dialog could be pre-recorded
>by voice actors. Music and sound effects could also be present. Of
>course, this would take a lot more work than an ordinary adventure,
>and would almost have to be some sort of collaboration.
>
>Imagine: you open the game file, and hear haunting music. A hollow
>voice says, "Welcome to Adventure."
>"Go north," you tell your computer. And so on...
>
>I'd buy it.

It sounds like a great way to play a game. It would need considerable
investment, though: you need some pretty good voice actors, and that
is not something the gaming industry is well known for. (There are
exceptions like MGS and MGS2, but most games have crappy voices.)

Still, the idea of a 'radio play' in which you play the lead is very,
very cool.
-------------------------
"Hey, aren't you Gadget?"
"I was."

http://www.haha.demon.nl
(To send e-mail, remove SPAMBLOCK from address)
