misc/RFC: TTS+MIDI, phonetics

cr88192

Jul 4, 2009, 1:27:28 PM

well, I am posting this where I think it may be relevant...

basically, this was part of a misc idea that came up, and I went and beat
together the code for it (AKA: I don't expect it to amount to much).

the idea was that I would combine together a speech synthesizer/TTS engine
and a MIDI synth, and see if I could get much "interesting" from it (such as
combining music and a synth'ed voice, singing TTS, ...).

in general, it was created by mashing together 2 pieces of code I had
written before, for which I had noticed some internal similarity:
a TTS-engine / speech synth (where mostly I had used diphone, but had
experimented some with formant);
a MIDI synth, where in my case I had used wavetable synth.

the TTS engine had had some of the usual front-end machinery, such as text
normalization, phonetic dictionary handling/lookup, ..., so I kept this.

the MIDI synth is, well, a MIDI synth...

combining them, however, forced a good deal of alteration to the machinery
for both.
particularly, many pieces of functionality from the TTS engine (such as
"voices") were absorbed into the MIDI synth, and wavetables/patches are
essentially relative to the voice, ...

however, the MIDI synth still plays midi-files, as before.
as is, the voice patches override GM patches, but I am likely to move the
voice patches to bank 2 (banks 0 and 1 being GM and GM2).

the TTS frontend has been reworked mostly so that it produces short MIDI
fragments, which basically rework the phonetic information into a stream of
MIDI commands (the frontend has control over matters such as voice frequency
and timing, ...).

these commands mostly work in terms of a voice-derived wavetable, and AFAIK
the process is a variant of formant synthesis, although I don't actually
simulate the voice signal (mostly I use loops derived from various vowel and
consonant sounds, as well as a few non-looping patches).

mostly this is because it is a lot easier to get a convincing 'ah' or 'eh'
by deriving it from an actual voice, and by using several recorded
frequencies in an attempt to cover the vocal range (similar to how multiple
recordings of an instrument at different notes are used internally in the
wavetable...).
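
as a rough sketch of the "several recorded frequencies" idea (not the actual synth code; the sample table, its names and pick_sample are made up for illustration), picking the nearest recording and working out how much to resample it might look like:

```python
import math

# Hypothetical table: recorded pitch in Hz -> name of the looped sample.
AH_SAMPLES = {110.0: "ah_110", 220.0: "ah_220", 440.0: "ah_440"}

def pick_sample(target_hz, samples):
    """Return (sample_name, playback_rate) for the recording nearest in pitch."""
    # Compare on a log scale so that distance is measured in musical intervals.
    nearest = min(samples, key=lambda hz: abs(math.log2(target_hz / hz)))
    # Resampling by this ratio shifts the recording to the target pitch.
    return samples[nearest], target_hz / nearest

name, rate = pick_sample(261.63, AH_SAMPLES)
print(name, round(rate, 3))
```

the closer the recording is to the target, the smaller the resampling ratio, which is the usual reason for keeping several recordings per sound.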

nothing prevents me from using purely synthetic voices, only that I don't
see as much need at present...

diphthongs are currently synthed, but this doesn't sound very good, and I
have doubts about using recorded diphthongs (mostly timing/frequency
issues...). however, I am not sure of a good mechanism to synthesize them
(simply blending between the adjacent sounds is not very good...).

at this point, mostly still battling basic comprehensibility issues...

otherwise, it may be worth noting that for composing the MIDI, I am using a
textual representation of the command-stream, mostly as this is a little
easier to compose (via sprintf/...) than would be a binary representation.
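
to illustrate what composing a textual command-stream could look like (the line syntax here, "note_on ch=.. key=.. vel=.. dt=..", is invented for this example and is not the syntax my synth actually reads), using printf-style formatting:

```python
def note_event(kind, channel, key, velocity, delta):
    """Format one command as a line of text (hypothetical syntax)."""
    return "%s ch=%d key=%d vel=%d dt=%d" % (kind, channel, key, velocity, delta)

def phoneme_to_commands(channel, program, key, ticks):
    """Emit program-change, note-on and note-off lines for one sound."""
    return [
        "program ch=%d num=%d dt=0" % (channel, program),
        note_event("note_on", channel, key, 100, 0),
        note_event("note_off", channel, key, 0, ticks),
    ]

print("\n".join(phoneme_to_commands(0, 3, 60, 120)))
```

the point is only that each command is one formatted string, so the frontend never has to deal with variable-length deltas or status bytes.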

an issue though is that of how to best represent a combination of text and
MIDI information (for the input).

one possibility is to just use an odd syntax to just sort of "stuff in" MIDI
commands, but this seems not very good. another uncertainty is how to best
represent commands to the voice (such as "speak in this particular note",
"speak at this rate", ...).

I guess another uncertainty is the issue commonly seen in singing, where
people will sing part of a word at one note, and then sing another part at
another note, ...

as-is, breaking up a word like this would confuse the dictionary, and to
address this would require representing the words in phonetic form, ...

another issue:
for the phonetic form, is the IPA really necessary?... (internally, I don't
use IPA, rather a customized ASCII-based notation, vaguely similar to SAMPA
but currently without non-letter chars, and in many places different as I
didn't know about SAMPA originally...).

actually, personally I would rather change the notation some (reorganizing
some of the letters, ...), but the main issue I guess is that I would have
to rework my dictionary (may be a worthwhile tradeoff, in the past it would
have been more difficult), ...

I guess a partial issue is what is the most "ideal" notation for phonetic
transcriptions?...
(part of my "ideal" I guess is the avoidance of non-ASCII characters, and
preferably avoidance of any special characters as well...).

current thinking:
a-z: typical "base sounds"
A-Z: typical "alternate sounds"
ax-zx (excluding xx): additional alternate sounds, or, as a case-insensitive
alternate to upper-case forms (for example, in filenames, ...).
Ax-Zx: yet more alternate sounds
aX-zX: yet more
AX-ZX: yet more

this allows 155 sounds (156 if 'xx' is counted), although... as is I had yet
to exceed the prior limit of 52 (lower+upper case), though this is probably
because I am generally being far less precise than the IPA?...
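
a quick way to check the counts for the groups listed above (this assumes that only the lower-case 'xx' is excluded, and that each group otherwise covers all 26 letters):

```python
import string

lower = list(string.ascii_lowercase)
upper = list(string.ascii_uppercase)

groups = {
    "a-z": lower,
    "A-Z": upper,
    "ax-zx": [c + "x" for c in lower if c != "x"],  # 'xx' excluded
    "Ax-Zx": [c + "x" for c in upper],
    "aX-zX": [c + "X" for c in lower],
    "AX-ZX": [c + "X" for c in upper],
}

total = sum(len(g) for g in groups.values())
print(total)                    # all six groups together
print(len(lower) + len(upper))  # plain lower+upper case only
```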

52 sounds could be done with:
a-z
A-Z | ax-zx

this would be in contrast to my current notation, which uses 'x' as a prefix
(for a similar purpose):
xa-xz, xA-xZ, ... (and in which xa!=A, as is, I have to use an alternate
notation in filenames, ...).

the other major changes would be reorganizing some of the letter assignments
(from my current notation) to be more "traditional"...

(actually, I may use SAMPA partly as a template, trying mostly to add an
alternate notation, AKA: without special symbols and more flexible WRT case,
mostly so that it is safer to mix with file names, and with other syntactic
elements which may also need to use these non-letter characters...).

it is uncertain if it should remain as a mixed-case notation, or be forced
into being a case-insensitive notation. my current bias is to keep it as
case-sensitive, but allow certain alternate forms, mostly for file-naming
(forcing a full case-insensitive notation is likely to just make things
ugly...).
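
a sketch of what the file-naming alternate form could look like, using the "ax-zx as a stand-in for upper-case" idea from the list above. this is an assumption-heavy illustration: it only handles single-letter symbols, and it refuses 'x'/'X' as sounds, since 'x' is used as the marker and 'xx' is excluded. the function names are made up.

```python
def to_filename_form(phon):
    """Map a mixed-case phonetic string to a lower-case-only form."""
    out = []
    for ch in phon:
        if ch in "xX":
            # 'x' is reserved as the marker here, and 'xx' is excluded.
            raise ValueError("'x'/'X' cannot be mangled in this sketch")
        out.append(ch.lower() + "x" if ch.isupper() else ch)
    return "".join(out)

def from_filename_form(name):
    """Inverse mapping: a letter followed by 'x' becomes upper case."""
    out, i = [], 0
    while i < len(name):
        if i + 1 < len(name) and name[i + 1] == "x":
            out.append(name[i].upper())
            i += 2
        else:
            out.append(name[i])
            i += 1
    return "".join(out)

print(to_filename_form("DIsIz"))
print(from_filename_form(to_filename_form("DIsIz")))
```

the mapping round-trips, so the case-sensitive form can stay the "real" one and the mangled form only shows up where case can't be trusted.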

or such...

--
BGB: Hobbyist Programmer (Specialty: 3D, Compilers, VMs)
Site: http://cr88192.dyndns.org/

luserXtrog

Jul 5, 2009, 3:31:04 AM

On Jul 4, 12:27 pm, "cr88192" <cr88...@hotmail.com> wrote:

> I guess of uncertainty is the issue commonly seen in singing things, where
> people will sing part of a word at one note, and then sing another part at
> another note, ...

Could you use the bender for this?

--
lxt

[Jongware]

Jul 5, 2009, 6:32:42 AM

"cr88192" <cr8...@hotmail.com> wrote in message
news:h2o3e2$f08$1...@news.albasani.net...

> another issue:
> for the phonetic form, is the IPA really necessary?... (internally, I don't
> use IPA, rather a customized ASCII-based notation, vaguely similar to SAMPA
> but currently without non-letter chars, and in many places different as I
> didn't know about SAMPA originally...).

If you ever want to distribute this as a useful application (who knows!), proper
IPA support would be nice. SAMPA is a poor man's approximation of the IPA set,
and it has a fair set of strange decisions ... however, it /can/ be entered with
any keyboard.
And, as always, you are free to decide on a scheme for yourself.
The con is that you cannot /mix/ these approaches -- your own scheme could
suddenly pop up and mess up an IPA phrase. Perhaps you could prefix each phrase
with a unique identifier:
"=hElo world"
where the '=' indicates using your private system.

As for needing more than the standard set of a..z/A..Z, SAMPA proves (for me :-)
that throwing in even more ASCII characters for each unique sound doesn't really
help. Perhaps you can get by with multi-character strings, although you should
try to avoid 'ax', 'ex', 'ox' for sounds that have nothing to do with 'echh' --
"th" is easier to parse as a soft theta than "tx". All you need to do is find
a way to incorporate multi-character phonemes /without/ having them pop up
inadvertently :-) -- bracket them? (E.g., "[th]eta".) Do you have a list of
problematic phonemes?

Interesting project!

[Jw]

cr88192

Jul 5, 2009, 11:23:58 AM

"luserXtrog" <mij...@yahoo.com> wrote in message
news:cad33091-64c4-4c64...@s31g2000yqs.googlegroups.com...

bender?...

not sure which feature this is exactly (not sure of any MIDI command with
that name...).

part of the problem though is that commands tend to be represented
sequentially, and it would be problematic to represent a command in the
middle of the word without breaking up the word.

I guess potentially a kind of prefix command could be used, but then the
question would be "where in the word to change the note?".

one idea I partly thought up is this:
a word ends with '-', which indicates a word break.

^C4 merr- ^D4 ily ^E4 they ^A3 went ^G4 a- ^F4 long ^C4 their ^E4 way

so, then it can join and look up the word, and try to guess where to break
it again in the phonetic transcription...

^C4 *mer ^D4 *ily ...

this transformation could be a little awkward though, as my TTS frontend is
essentially structured around a stack machine...
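
a minimal sketch of the joining step (the function name is made up, and this only does the joining; guessing where to re-break the phonetic transcription is left out):

```python
def parse_sung_text(text):
    """Group '^note fragment' pairs into words; a trailing '-' continues a word."""
    words, parts, note = [], [], None
    for tok in text.split():
        if tok.startswith("^"):
            note = tok[1:]              # remember the note for the next fragment
            continue
        more = tok.endswith("-")
        parts.append((note, tok.rstrip("-")))
        if not more:
            # Word complete: the joined spelling is what the dictionary would see.
            words.append(("".join(frag for _, frag in parts), parts))
            parts = []
    if parts:
        # Input ended on a dangling '-': keep what we have.
        words.append(("".join(frag for _, frag in parts), parts))
    return words

line = "^C4 merr- ^D4 ily ^E4 they ^A3 went ^G4 a- ^F4 long ^C4 their ^E4 way"
for word, parts in parse_sung_text(line):
    print(word, parts)
```

each word comes out with its whole spelling (for the dictionary lookup) plus the list of (note, fragment) pairs, which is the information needed later to decide where the note changes.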

cr88192

Jul 5, 2009, 12:11:21 PM

"[Jongware]" <IdontW...@hotmail.com> wrote in message
news:23d84$4a508143$3ec348e5$25...@news.chello.nl...

> "cr88192" <cr8...@hotmail.com> wrote in message
> news:h2o3e2$f08$1...@news.albasani.net...
>> another issue:
>> for the phonetic form, is the IPA really necessary?... (internally, I don't
>> use IPA, rather a customized ASCII-based notation, vaguely similar to SAMPA
>> but currently without non-letter chars, and in many places different as I
>> didn't know about SAMPA originally...).
>
> If you ever want to distribute this as a useful application (who knows!),
> proper IPA support would be nice. SAMPA is a poor man's approximation of the
> IPA set, and it has a fair set of strange decisions ... however, it /can/ be
> entered with any keyboard.
> And, as always, you are free to decide for a scheme for yourself.
> The con is that you cannot /mix/ these approaches -- your own scheme could
> suddenly pop up and mess up an IPA phrase. Perhaps you could prefix each
> phrase with a unique identifier:
> "=hElo world"
> where the '=' indicates using your private system.
>

ok.
typically I have used '*' for phonetic fragments, maybe I could use '*' for
my notation, and '[...]' for SAMPA?...

*helowerld
*DIsIzqfrexz

of course, this would mean either supporting both in my backend (duplicated
code/effort), or doing a transcription...

(however, a transcription approach could also be made to handle IPA, where
it would be transcribed...).

note that, without brackets, my TTS engine tends to assume it is a normal
word, either looking it up in the dictionary or trying to invoke phonics
magic...

> As for needing more than the standard set of a..z/A..Z, SAMPA proves (for
> me :-) that throwing in even more ASCII characters for each unique sound
> doesn't really help. Perhaps you can get by with multi-character strings,
> although you should try to avoid 'ax', 'ex', 'ox' for sounds that have
> nothing to do with 'echh' -- "th" is easier to parse as a soft theta than
> "tx". All you need to do is finding a way to incorporate multi-character
> phonemes /without/ having them pop up unadvertently :-) -- bracket them?
> (F.e., "[th]eta" wise) Do you have a list of problematic phonemes?
>

ok, in my newer notation ax/ex/ox/... ended up being assigned to diphthongs
(I freed up A/E/I/O/U for use as vowels, which before had contained both
vowels and diphthongs).

'Ax'/'Ex'/... could be used for diphthongs instead, but I had used 'ax'/...
for this.
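
for what it is worth, splitting a string in this notation back into symbols is simple if one assumes a greedy rule, namely that an 'x' or 'X' after a letter always belongs to that letter (this is a sketch under that assumption, not my actual parser):

```python
def tokenize(phon):
    """Split a phonetic string into symbols: a letter plus an optional x/X suffix."""
    symbols, i = [], 0
    while i < len(phon):
        if not phon[i].isalpha():
            raise ValueError("unexpected character: %r" % phon[i])
        # Greedy rule: a following 'x' or 'X' always belongs to this letter.
        if i + 1 < len(phon) and phon[i + 1] in "xX":
            symbols.append(phon[i:i + 2])
            i += 2
        else:
            symbols.append(phon[i])
            i += 1
    return symbols

print(tokenize("DIsIzqfrexz"))
```

with the earlier "*DIsIzqfrexz" example (minus the '*'), the 'ex' comes out as a single symbol and everything else as single letters.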

at first, I figured I could make diphthongs implicit, but then realized a
bigger problem:
I would need a notation to indicate when not to use diphthongs.

q/Q is "redefined" in my notation as a vowel (allowing 12 base vowels, as
well as 12 "extended" vowels, several of which are used as diphthongs).

as is, I currently have about 10 base vowels (I started with 8, but with
thinking came up with 2 more...).

'x' (in SAMPA) has been moved to 'K'.

under the current scheme, "soft theta" (I assume 'voiced th' is meant by
this) is 'D'.

in my case, words are either pure phonetic or pure textual. partial
bracketing is not done as this would confuse the current processing
machinery...

I decided on keeping the system proper as case-sensitive, and essentially
use a mangling hack to map it to a case-insensitive form.

I guess the major alternative is to continue using my prior notation
externally (essentially, a variant of the cmudict/Festival notation...).

> Interesting project!
>

maybe, just something random in my case...

Richard Heathfield

Jul 6, 2009, 12:47:12 AM

cr88192 said:
> "luserXtrog" <mij...@yahoo.com> wrote...

> "cr88192" wrote:
>
>>> I guess of uncertainty is the issue commonly seen in singing
>>> things, where people will sing part of a word at one note, and
>>> then sing another part at another note, ...
>>
>> Could you use the bender for this?
>
> bender?...
>
> not sure which feature this is exactly (not sure of any MIDI
> command with that name...).

Presumably he is referring to pitch-bend. But I don't think that's
your problem. It seems to me that your problem is one of clean
syntax design, or at least you hint as much in your OP. Pitch-bend
may or may not be helpful as a solution to the problem of
representing variable intonation, but it isn't going to solve your
syntax problem for you.

<snip>

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Forged article? See
http://www.cpax.org.uk/prg/usenet/comp.lang.c/msgauth.php
"Usenet is a strange place" - dmr 29 July 1999

cr88192

Jul 6, 2009, 1:12:06 AM

"Richard Heathfield" <r...@see.sig.invalid> wrote in message
news:FcednfaRaruhHczX...@bt.com...

> cr88192 said:
>> "luserXtrog" <mij...@yahoo.com> wrote...
>> "cr88192" wrote:
>>
>>>> I guess of uncertainty is the issue commonly seen in singing
>>>> things, where people will sing part of a word at one note, and
>>>> then sing another part at another note, ...
>>>
>>> Could you use the bender for this?
>>
>> bender?...
>>
>> not sure which feature this is exactly (not sure of any MIDI
>> command with that name...).
>
> Presumably he is referring to pitch-bend. But I don't think that's
> your problem. It seems to me that your problem is one of clean
> syntax design, or at least you hint as much in your OP. Pitch-bend
> may or may not be helpful as a solution to the problem of
> representing variable intonation, but it isn't going to solve your
> syntax problem for you.
>

yeah, pretty much...

there was the idea of using MIDI lyric events (and a binary MIDI input), but
the problem here is how to key the lyrics to the music (apart from assuming
extra data be included, but this would IMO defeat the point of lyric
events...).

however, one syntax idea that came to mind is to allow using '-' as a word
break, such that a word break could be given, and notes changed.

^C4 merr- ^D4 ily ...
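
turning the '^C4' style note names into MIDI key numbers is the easy part. a sketch, assuming the convention where C4 (middle C) is key 60 (some software uses C3 for this instead), and with sharps/flats added as a guess at what the syntax might eventually need:

```python
import re

SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_to_midi(name):
    """Convert a name such as 'C4' or 'F#3' to a MIDI key number."""
    m = re.fullmatch(r"([A-G])([#b]?)(-?\d+)", name)
    if m is None:
        raise ValueError("bad note name: %r" % name)
    letter, accidental, octave = m.groups()
    key = SEMITONES[letter] + {"": 0, "#": 1, "b": -1}[accidental]
    # Assumes the convention in which C4 (middle C) is key 60.
    key += 12 * (int(octave) + 1)
    if not 0 <= key <= 127:
        raise ValueError("note out of MIDI range: %r" % name)
    return key

print([note_to_midi(n) for n in ["C4", "D4", "E4", "A3", "G4", "F4"]])
```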

another issue I realize now is one of timing:
not only does one care about the rate and frequency of the words, but also
about when the words are said.

this opens up yet another set of awkward design issues (such as the possible
need for timestamps, ...).

so, yes, the "combined whole" is starting to look a little more complex than
either MIDI or TTS by themselves...

some of the issues could be addressed with certain features I had thought
up, such as asynchronous MIDI-stream joiners, but timestamps are an issue in
their own right.

a very simple trick though could be to add explicit breaks along
quarter-note boundaries, where a command is given that serves to re-align
the TTS engine to the next note.

however, this leaves an issue of what to do if/when a synthed fragment goes
over a note, where likely having it take 2-notes would not be the intended
result (potentially throwing the lyrics out of sync with the beat, ...).
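
a small sketch of the re-align idea and of the overrun problem (the 480 ticks per quarter note and the fragment lengths are made-up numbers, and schedule is a hypothetical helper):

```python
PPQ = 480  # ticks per quarter note (an assumed resolution)

def next_boundary(tick, ppq=PPQ):
    """Round a tick position up to the next quarter-note boundary."""
    return -(-tick // ppq) * ppq

def schedule(fragment_lengths, ppq=PPQ):
    """Place each fragment on a boundary; flag those longer than one note."""
    placed, tick = [], 0
    for length in fragment_lengths:
        start = next_boundary(tick, ppq)
        placed.append((start, length, length > ppq))
        tick = start + length
    return placed

for start, length, overrun in schedule([300, 480, 600, 200]):
    print(start, length, "OVERRUN" if overrun else "ok")
```

here the 600-tick fragment runs past its note, so the fragment after it lands two boundaries later rather than one, which is exactly the "out of sync with the beat" case.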

(it probably doesn't help much that I don't really know "music theory"
either...).

and so on...

Fred Bloggs

Jul 31, 2009, 7:36:30 AM

"cr88192" <cr8...@hotmail.com> wrote in message
news:h2o3e2$f08$1...@news.albasani.net...

> well, I am posting this where I think it may be relevant...
>
> basically, this was part of a misc idea that came up, and I went and beat
> together the code for it (AKA: I don't expect it to amount to much).
>
>
> the idea was that I would combine together a speech synthesizer/TTS engine
> and a MIDI synth, and see if I could get much "interesting" from it (such
> as combining music and a synth'ed voice, singing TTS, ...).
>
>

It may put you off perhaps, but you could model your MIDI plus text input on
what these guys are doing commercially:
http://www.soundsonline-europe.com/Symphonic-Choirs-PLAY-Edition-pr-EW-182.html
This is MIDI notes plus a special text input program. It works well.
There are also three tutorials on YouTube that show how to use it.

SysExJohn.