Apply the same voice design principles that you use when constructing a typical Alexa response. Be brief, speak and write naturally, prompt with guidance for the user, use conversation markers, and so forth. See the Alexa Design Guide.
The lang tag can be used on its own or nested in the voice tag to control how Amazon Polly voices speak. Use the lang tag with a corresponding voice of the same language for the best results, as shown here. See lang tag.
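A minimal sketch of this pairing, assuming the German Amazon Polly voice "Vicki" (the voice name and phrasing here are illustrative):

```xml
<speak>
    Here is how it sounds in German:
    <voice name="Vicki">
        <lang xml:lang="de-DE">Hallo, wie geht es dir?</lang>
    </voice>
</speak>
```

Because the voice and the xml:lang value refer to the same language, the tagged phrase is rendered with native-sounding pronunciation.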
Alexa skill developers have a limit of 10,000 characters for a text-to-speech (TTS) response in their skill. With 10,000 characters, you can generate approximately 10 minutes of continuous audio stream with Amazon Polly and Alexa voices for use in Alexa skills. However, responses should generally be brief for the best customer experience. See the one-breath test in the Alexa Design Guide.
Optionally, adjust for acoustic differences among Alexa and Amazon Polly voices. Because they are distinct voices, they can vary in pitch, rate, timbre, and volume. You can compensate for these differences with SSML tags such as prosody; consider using them to provide a customer experience consistent with the use cases in your Alexa skill.
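For example, a prosody wrapper can tone down a voice that sounds louder or faster than your skill's default voice (the voice name and the specific values here are illustrative, not recommendations):

```xml
<speak>
    <voice name="Matthew">
        <prosody volume="-2dB" rate="90%">
            Welcome back. Let's pick up where you left off.
        </prosody>
    </voice>
</speak>
```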
You can use any of the supported Amazon Polly voices in your Alexa responses, for part or all of the response. Be mindful of the customer experience if you combine voices from different locales in your skill responses.
When your skill returns a response to a request, you provide text that the Alexa service converts to speech. Alexa automatically handles normal punctuation, such as pausing after a period or speaking a sentence that ends in a question mark as a question.
However, sometimes you might want additional control over how Alexa generates the speech from the text in your response. For example, you might want a longer pause within the speech, or you might want Alexa to read a string of digits as a standard telephone number. The Alexa Skills Kit provides this type of control with Speech Synthesis Markup Language (SSML) support.
SSML is a markup language that provides a standard way to mark up text for the generation of synthetic speech. The Alexa Skills Kit supports a subset of the tags defined in the SSML specification. For the list of supported tags, see Supported SSML Tags.
To use SSML, construct your output speech with the supported SSML tags. When you send a response from your service, you must indicate that the speech is in SSML rather than plain text. If you construct the JSON response directly, provide the marked-up text in the outputSpeech property and set the type to SSML instead of PlainText. Use the ssml property instead of text for the marked-up text:
In the JSON output for the SSML, either escape quotation marks within the output, or use an appropriate mix of single and double quotation marks. The following example wraps the response in double quotation marks and uses single quotation marks for attributes.
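A minimal sketch of such a response (the speech text is illustrative):

```json
{
  "outputSpeech": {
    "type": "SSML",
    "ssml": "<speak>This is spoken with <emphasis level='strong'>emphasis</emphasis>.</speak>"
  }
}
```

The JSON string uses double quotation marks, so the SSML attribute values inside it use single quotation marks and need no escaping.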
If you use Alexa Presentation Language (APL) for audio, you can use the Speech component to render SSML. Set the content property to the SSML text, enclosed in speak tags. Set the contentType property to SSML.
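A sketch of a Speech component configured this way (the surrounding APL for audio document structure is omitted, and the speech text is illustrative):

```json
{
  "type": "Speech",
  "contentType": "SSML",
  "content": "<speak>Hello from <emphasis level='moderate'>APL for audio</emphasis>.</speak>"
}
```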
You can combine most supported tags with each other to apply multiple effects to the speech. For instance, this example uses both the amazon:emotion and say-as tags. This tells Alexa to speak the entire string in an "excited" voice, and to speak the provided number as individual digits:
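A sketch of that combination (the sentence and number are illustrative):

```xml
<speak>
    <amazon:emotion name="excited" intensity="medium">
        Congratulations! Your confirmation code is
        <say-as interpret-as="digits">12345</say-as>.
    </amazon:emotion>
</speak>
```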
Applies different speaking styles to the speech. The styles are curated text-to-speech voices that use different variations of intonation, emphasis, pausing, and other techniques to match the speech to the content. For example, the news style makes Alexa's voice sound like what you might expect to hear in a TV or radio newscast, and was built primarily for customers to listen to news articles and other news-based content.
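In Alexa SSML, a style such as news is applied with the amazon:domain tag, sketched here with an illustrative headline:

```xml
<speak>
    <amazon:domain name="news">
        In tonight's top story, scientists have announced a breakthrough in battery technology.
    </amazon:domain>
</speak>
```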
The amazon:emotion tag causes Alexa to express emotion when speaking. The emotion effects are useful for stories, games, news updates, and other narrative content. For instance, in a game, you might use the "excited" emotion for correct answers and the "disappointed" emotion for incorrect answers.
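A sketch of the "disappointed" emotion in a game response (the wording is illustrative):

```xml
<speak>
    <amazon:emotion name="disappointed" intensity="medium">
        Sorry, that's not the right answer. Better luck next time.
    </amazon:emotion>
</speak>
```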
The audio tag lets you provide the URL for an MP3 file that the Alexa service can play. Use the audio tag to embed short, pre-recorded audio within your response. For example, you could include sound effects alongside your text-to-speech responses, or provide a response that uses a voice associated with your brand.
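A sketch of an audio tag mixed with text-to-speech; the URL is a placeholder for an HTTPS-hosted MP3 that you control:

```xml
<speak>
    Welcome to the quiz.
    <audio src="https://example.com/sounds/intro.mp3"/>
    Here is your first question.
</speak>
```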
The MP3 files you use to provide audio must be hosted on an endpoint that uses HTTPS. The endpoint must provide an SSL certificate signed by an Amazon-approved certificate authority. Many content hosting services provide this. For example, you could host your files at a service such as Amazon Simple Storage Service (Amazon S3) (an Amazon Web Services offering).
You aren't required to authenticate the requests for the audio files. Therefore, you must not include any customer-specific or sensitive information in these audio files. For example, building a custom MP3 file in response to a user's request, and including sensitive information within the audio, isn't allowed.
For optimal performance, Amazon recommends that you host your MP3 files for SSML responses in close proximity to where your skill is hosted. For example, if the Lambda function for your skill is hosted in the US West (Oregon) region, you will get better performance if you upload your MP3s to a US West (Oregon) S3 bucket.
Alexa supports SSML tags that point to HTTP Live Streaming (HLS) streams, provided that the audio data conforms to the listed specifications. Due to the streaming approach that Alexa uses, there is no benefit to using HLS streams instead of statically served MP3 files. Furthermore, unlike with statically served MP3 files, an SSML response that contains an HLS stream that violates the 240-second duration limit fails silently: the playback stops before the limit is hit, no error message is generated on the customer device, and the skill doesn't receive an error request. If your skill uses SSML responses that contain HLS streams, take particular care to test the audio returned in its responses.
Use the lang tag to specify the language model and rules to speak the tagged content as if it were written in the language specified by the xml:lang attribute. Words and phrases in other languages usually sound better when enclosed in the lang tag. This is useful for short phrases in other languages, such as the names of restaurants or shops.
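A sketch of the lang tag around a short foreign-language phrase (the restaurant name is illustrative):

```xml
<speak>
    My favorite restaurant is
    <lang xml:lang="fr-FR">Chez Marie</lang>.
</speak>
```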
Alexa adapts the pronunciation to use the sounds available in the original language of the skill, so it might not sound exactly like a native speaker. To achieve a more natural voice than what you get with the lang tag alone, use the lang tag and the voice tag together. With the voice tag, you can select a voice customized for a specific language. Make sure that the language of the tagged text matches the xml:lang attribute, and that the voice is specific to the language of the text.
With the lang tag, Alexa uses French pronunciation with sounds available in English for a "French-like" pronunciation. A perfect French pronunciation would include a uvular trill (/R/) in the word "adore." The French-like English pronunciation achieved with the lang tag uses the corresponding /r/ sound instead.
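A sketch of the phrase discussed above, with the French words wrapped in a lang tag (the surrounding sentence is illustrative):

```xml
<speak>
    She said <lang xml:lang="fr-FR">j'adore</lang> when she saw the painting.
</speak>
```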
These symbols provide full coverage for the sounds of Arabic (SA). Other languages require symbols that are not in this list and are therefore not supported; using such symbols in Arabic (SA) skills is discouraged, because it may result in suboptimal speech synthesis.
These symbols provide full coverage for the sounds of English (AU). Other languages require symbols that are not in this list and are therefore not supported; using such symbols in English (AU) skills is discouraged, because it may result in suboptimal speech synthesis.
These symbols provide full coverage for the sounds of English (Canada). Other languages require symbols that are not in this list and are therefore not supported; using such symbols in English (Canada) skills is discouraged, because it may result in suboptimal speech synthesis.
These symbols provide full coverage for the sounds of English (India). Other languages require symbols that are not in this list and are therefore not supported; using such symbols in English (India) skills is discouraged, because it may result in suboptimal speech synthesis.
These symbols provide full coverage for the sounds of English (UK). Other languages require symbols that are not in this list and are therefore not supported; using such symbols in English (UK) skills is discouraged, because it may result in suboptimal speech synthesis.
These symbols provide full coverage for the sounds of English (US). Other languages require symbols that are not in this list and are therefore not supported; using such symbols in English (US) skills is discouraged, because it may result in suboptimal speech synthesis.
These symbols provide full coverage for the sounds of French (CA). Other languages require symbols that are not in this list and are therefore not supported; using such symbols in French (CA) skills is discouraged, because it may result in suboptimal speech synthesis.
These symbols provide full coverage for the sounds of French (FR). Other languages require symbols that are not in this list and are therefore not supported; using such symbols in French (FR) skills is discouraged, because it may result in suboptimal speech synthesis.
These symbols provide full coverage for the sounds of German. Other languages require symbols that are not in this list and are therefore not supported; using such symbols in German skills is discouraged, because it may result in suboptimal speech synthesis.
These symbols provide full coverage for the sounds of Hindi (IN). Other languages require symbols that are not in this list and are therefore not supported; using such symbols in Hindi (IN) skills is discouraged, because it may result in suboptimal speech synthesis.
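These symbol inventories are what you reference from the ph attribute of the phoneme tag. A sketch for an English (US) skill, using the commonly cited alternative pronunciations of "pecan" in IPA:

```xml
<speak>
    You say <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>,
    I say <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
```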