How to use phonemes and other v1beta1 features in google-text-to-speech


Lena Maria Schimmel

Sep 28, 2021, 6:21:08 AM9/28/21
to Google Cloud Developers

I'm trying to use the SSML <phoneme> tag. The current documentation looks like it should just work (see here and here). However, there used to be a page which stated that this is a v1beta1 feature. It's a 404 now, but there's an archived version. So my first question is whether <phoneme> is v1beta1-only, or whether it has been back-ported to v1.

I'm using the Java client library to access the service. The documentation does not explicitly state it, but I guess to use v1beta1 I just have to change all my imports, e.g. from
import com.google.cloud.texttospeech.v1.TextToSpeechClient;
to
import com.google.cloud.texttospeech.v1beta1.TextToSpeechClient;
and that should work?
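For anyone comparing: the two client packages map onto REST endpoints that differ only in the version segment of the URL, so the same experiment can be sketched without the client library at all. This is a minimal sketch against the public text:synthesize REST endpoint; the voice name, the MP3 encoding, and the body-building helper are illustrative, and a real call would additionally need an OAuth bearer token or API key.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class PhonemeRequest {
    // The v1 and v1beta1 REST endpoints differ only in the version segment;
    // the Java client library mirrors this with the package name
    // (com.google.cloud.texttospeech.v1 vs ...v1beta1).
    static final String V1BETA1 =
        "https://texttospeech.googleapis.com/v1beta1/text:synthesize";

    // Builds the JSON body for a synthesize call. The field names
    // (input.ssml, voice.languageCode, voice.name, audioConfig.audioEncoding)
    // follow the public REST reference for text:synthesize.
    static String body(String ssml, String languageCode, String voiceName) {
        return "{"
            + "\"input\":{\"ssml\":\"" + ssml.replace("\"", "\\\"") + "\"},"
            + "\"voice\":{\"languageCode\":\"" + languageCode
            + "\",\"name\":\"" + voiceName + "\"},"
            + "\"audioConfig\":{\"audioEncoding\":\"MP3\"}"
            + "}";
    }

    public static void main(String[] args) {
        String ssml = "<speak><phoneme alphabet=\"ipa\" "
            + "ph=\"ˌmænɪˈtoʊbə\">manitoba</phoneme></speak>";
        // Build (but do not send) the request; authentication is omitted here.
        HttpRequest req = HttpRequest.newBuilder(URI.create(V1BETA1))
            .header("Content-Type", "application/json; charset=utf-8")
            .POST(HttpRequest.BodyPublishers.ofString(
                body(ssml, "en-US", "en-US-Wavenet-D")))
            .build();
        System.out.println(req.uri());
    }
}
```

If the tag behaves differently between the two URLs with an otherwise identical body, that would answer the back-port question directly.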

I also tried to use the <phoneme> tag on the demo page. As pointed out by this StackOverflow question (which sadly never got an answer), the demo page accesses the v1beta1 service URL, but strips out some SSML tags like <voice>. I can confirm that <phoneme> is also removed before the request is sent to the server.

Whereas <voice> works with my Java client, <phoneme> still does not. In the synthesized speech, only the text content of the element is spoken. For example, this input:

<speak>As you can hear, <phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">this tag is ignored</phoneme>.</speak>

is spoken as: "As you can hear, this tag is ignored."


Lena Maria Schimmel

Sep 29, 2021, 4:43:44 AM9/29/21
to Google Cloud Developers
After some experimentation, it seems that the <phoneme> tag does work when I use the Java client library, but only for en-US voices:

Working as expected:
  • en-US
Not working (text is synthesized without an error message, but the <phoneme> tag is ignored as described in the previous post):
  • en-GB
  • en-AU
  • en-IN
  • es-ES
  • es-US
  • de-DE
  • fr-FR
  • ko-KR
  • it-IT
  • (I didn't bother to test all available languages)
I would think that all these languages should support <phoneme>, because they are all listed on the "Supported phonemes and levels of stress" page. For most of those languages I tried both the Standard and Wavenet variants of at least one voice.

Lena Maria Schimmel

Sep 29, 2021, 5:30:53 AM9/29/21
to Google Cloud Developers
Sorry, everyone, for monologizing. I found out that the feature is, indeed, working as documented. But I think the documentation could or should be improved.

Most of the time, I tested with the example <phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme> from the documentation. I did not expect that the example ˌmænɪˈtoʊbə would fail by design with just about any language other than en-US, as it contains phonemes which are only supported in en-US. It would be nice if this were stated clearly on the documentation page. It's quite a bit of effort to go through it phoneme by phoneme, looking each one up in the table manually, just to find out that none of en-GB, en-AU, and en-IN has the "o" or "" phoneme.

I also did not expect that using an unsupported phoneme would fail the way it does. I understand that the API has no way to report warnings, since SynthesizeSpeechResponse does not have any field to put them in, and that falling back to the text content of the <phoneme> element is a best-effort way to produce useful audio that does not confuse the end user. From the perspective of a software developer, or of the person who authors the SSML content, it's not optimal and not on par with what we are used to from compilers, interpreters, linters, etc., which give us detailed feedback on why something does not work as expected.
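Since the service gives no feedback, the manual table lookup could be automated as a client-side pre-check before sending the request. A sketch of that idea follows; the per-language phoneme inventories here are illustrative partial sets, not the real tables from the "Supported phonemes and levels of stress" page, which would have to be transcribed for this to be useful.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PhonemeCheck {
    // Illustrative, partial inventories; the authoritative per-language tables
    // are on the "Supported phonemes and levels of stress" documentation page.
    static final Map<String, Set<String>> SUPPORTED = Map.of(
        "en-US", Set.of("m", "æ", "n", "ɪ", "t", "oʊ", "b", "ə"),
        "en-GB", Set.of("m", "æ", "n", "ɪ", "t", "b", "ə") // no "oʊ" diphthong
    );
    static final Set<String> STRESS = Set.of("ˈ", "ˌ");

    // Greedy longest-match segmentation of an IPA string (two-character
    // symbols first); returns the symbols the language's table does not list.
    static List<String> unsupported(String ph, String language) {
        Set<String> table = SUPPORTED.get(language);
        List<String> missing = new ArrayList<>();
        int i = 0;
        while (i < ph.length()) {
            String two = i + 2 <= ph.length() ? ph.substring(i, i + 2) : "";
            String one = ph.substring(i, i + 1);
            if (table.contains(two)) {
                i += 2;
            } else if (table.contains(one) || STRESS.contains(one)) {
                i += 1;
            } else {
                missing.add(one);
                i += 1;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The documentation example passes for en-US but not for en-GB.
        System.out.println(unsupported("ˌmænɪˈtoʊbə", "en-US")); // prints []
        System.out.println(unsupported("ˌmænɪˈtoʊbə", "en-GB")); // prints [o, ʊ]
    }
}
```

An empty result means every symbol in the ph attribute is in the table; a non-empty result flags exactly the phonemes the service would otherwise silently ignore.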