Voice Elements uses the Microsoft Speech Platform for Text-To-Speech (TTS) and speech recognition. Many languages are supported, and a license to use Microsoft Speech for TTS and speech recognition is included with your Windows OS license.
Microsoft stopped development on the Microsoft Speech Platform in 2012. Instead of processing text-to-speech (TTS) or speech recognition (SR) on premises, Microsoft now steers its customers toward its cloud services on Azure. Those services, and other similar cloud services, can provide excellent SR and TTS and can work in conjunction with the Voice Elements platform. However, since there is no charge for the Microsoft Speech Platform, we continue to support it as our default facility for TTS and SR.
You should have this installed on your server in order to perform speech recognition within Voice Elements. Voice Elements supports the Microsoft Speech Platform, provided you use Microsoft-compatible grammar files. These are easy to create using the methods outlined in this article: Create Microsoft Speech Compatible Grammar Files
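As a point of reference, the sketch below writes a minimal Microsoft-compatible (SRGS 1.0) yes/no grammar to disk using Python. The file name and phrase list are hypothetical; the linked article covers the recommended authoring methods in detail.

```python
# Minimal sketch: write a Microsoft-compatible (SRGS 1.0) yes/no grammar to disk.
# The file name and phrases are illustrative only.
srgs_grammar = """<?xml version="1.0" encoding="utf-8"?>
<grammar version="1.0" xml:lang="en-US" root="yesno" tag-format="semantics/1.0"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="yesno" scope="public">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>
"""

with open("YesNo.grxml", "w", encoding="utf-8") as f:
    f.write(srgs_grammar)
```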
The Microsoft Speech Platform relies on language packs to provide speech recognition for different languages; it supports 18 languages and accents. You can download some of the more popular languages using the links below. For additional options, please contact Inventive Labs Technical Support.
The SDK is the toolkit provided by Microsoft for working with the Microsoft Speech Platform. All of this functionality is built into Voice Elements, so you do not need to install the SDK unless you would like to use it to create Microsoft-compatible grammar files.
After completing the steps above, you can enable speech recognition for the Play and PlayTTS methods by setting SpeechRecognitionEnabled to true before calling them. For information on how to use the speech recognition demo application for testing, see Test Speech Recognition with Voice Elements.
Please note that SpeechRecognitionNumberOfPorts should be set to a number no greater than the number of Speech Recognition Ports for which you are licensed. You can check your license entitlements in the Voice Elements Dashboard.
Neural Text-to-Speech (Neural TTS), part of Speech in Azure Cognitive Services, enables you to convert text to lifelike speech for more natural user interactions. Neural TTS powers a wide range of scenarios, from audio content creation to natural-sounding voice assistants, for customers all over the world. For example, the BBC, Progressive, and Motorola Solutions are using Azure Neural TTS to develop conversational interfaces for their voice assistants in English-speaking locales. Swisscom and Poste Italiane are adopting neural voices in French, German, and Italian to interact with their customers in the European market. Hongdandan, a non-profit organization, is using neural voices in Chinese to make its online books audible for blind people in China.
By September 2020, we had extended Neural TTS to support 49 languages/locales with 68 voices. At the same time, we continue to receive customer requests for more voice choices and broader language support globally.
Today, we are excited to announce that Azure Neural TTS has extended its global support to five new languages in public preview: Maltese, Lithuanian, Estonian, Irish, and Latvian. At the same time, the Neural TTS Container is now generally available for customers who want to deploy neural voice models on premises to meet specific security requirements.
Five new voices and languages are introduced to the Neural TTS portfolio. They are: Grace in Maltese (Malta), Ona in Lithuanian (Lithuania), Anu in Estonian (Estonia), Orla in Irish (Ireland) and Everita in Latvian (Latvia). These voices are available in public preview in three Azure regions: EastUS, SouthEastAsia and WestEurope.
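As a quick illustration, the sketch below synthesizes speech with one of the new preview voices using the Azure Speech SDK for Python. The short voice name mt-MT-GraceNeural is an assumption based on the names above (check the voice list for your region before relying on it), and the key is a placeholder.

```python
# Sketch: synthesize speech with one of the new preview voices via the Azure
# Speech SDK for Python (pip install azure-cognitiveservices-speech).
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_SPEECH_KEY",   # placeholder
    region="westeurope",              # one of the preview regions listed above
)
# Assumed short name for "Grace in Maltese (Malta)"; verify against the voice list.
speech_config.speech_synthesis_voice_name = "mt-MT-GraceNeural"

# audio_config=None keeps the synthesized audio in memory instead of playing it.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
result = synthesizer.speak_text_async("Merħba! Kif inti?").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print(f"Synthesized {len(result.audio_data)} bytes of audio.")
```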
Building on LRSpeech and the multilingual, multi-speaker transformer TTS model (called UNI-TTS), we designed an offline model-training pipeline and an online inference pipeline for low-resource TTS. Three key innovations contribute to the significant agility gains of this approach.
First, by leveraging parallel speech data (paired speech audio and transcripts) collected during speech recognition development, the LR-UNI-TTS training pipeline greatly reduces the data required to refine the base model for the new language. Previously, high-quality multi-speaker parallel data was critical to extending TTS to a new language. TTS speech data is more difficult to collect because it must be clean, the speaker carefully selected, and the recording process well controlled to ensure high audio quality.
Second, by applying cross-lingual speaker transfer with the UNI-TTS pipeline, we are able to leverage existing high-quality data in a different language to produce a new voice in the target language. This saves the effort of finding a new professional speaker for each new language. Traditionally, high-quality parallel speech data in the target language is required, and voice design, voice talent selection, and recording can easily take months.
Lastly, the LR-UNI-TTS approach uses characters instead of phonemes as the input feature to the models, whereas a high-resource TTS pipeline usually includes a multi-step text analysis module that turns text into phonemes and takes a long time to build.
Specifically, at the offline training stage we leveraged a few hundred hours of speech recognition data to further refine the UNI-TTS model. This helps the base model learn more prosody and pronunciation patterns for the new locales. Speech recognition data is usually collected in everyday environments on PCs or mobile devices, unlike TTS data, which is normally collected in professional recording studios. Although SR data can be of much lower quality than TTS data, we have found that LR-UNI-TTS benefits from it effectively.
With this approach, the high-quality parallel data in the new language that is usually required for TTS voice training becomes optional. If such data is available, it can be used as the target voice in the new language. If it is not, we can choose a suitable speaker from an existing but different language and transfer that voice into the new language through the cross-lingual speaker transfer-learning capability of UNI-TTS.
At runtime, a lightweight text analysis component preprocesses the text input with sentence separation and text normalization. Compared to the text analysis component of high-resource language pipelines, this module is greatly simplified; for instance, it does not include the pronunciation lexicon or letter-to-sound rules used for high-resource languages. The lightweight text analysis component outputs normalized text characters. During this process, we also reuse the text normalization rules from speech recognition development, which significantly reduces the overall cost.
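The toy sketch below only illustrates what such a lightweight, character-based front end does (sentence separation plus crude text normalization); it is not Microsoft's actual module, and the normalization rules shown are invented for the example.

```python
import re

# Toy illustration only: split sentences, spell out single digits, and emit
# characters (not phonemes) for a character-based TTS model to consume.
NUMBER_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def lightweight_front_end(text):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())          # sentence separation
    normalized = []
    for sentence in sentences:
        # crude text normalization: expand single digits into words
        sentence = re.sub(r"\d", lambda m: f" {NUMBER_WORDS[m.group()]} ", sentence)
        sentence = re.sub(r"\s+", " ", sentence).strip().lower()
        normalized.append(list(sentence))                          # character sequence
    return normalized

print(lightweight_front_end("Room 3 is ready. See you soon!"))
```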
Similar to other TTS voices, the quality of the low-resource voices created in the new languages is measured using Mean Opinion Score (MOS) tests and intelligibility tests. MOS is a widely recognized scoring method for evaluating speech naturalness. In MOS studies, participants rate speech characteristics such as sound quality, pronunciation, speaking rate, and articulation on a 5-point scale, and an average score is calculated for the report. An intelligibility test measures how intelligible a TTS voice is: judges listen to a set of TTS samples and mark the words that are unintelligible to them. The intelligibility rate is the percentage of intelligible words among the total number of words tested (i.e., the number of intelligible words / the total number of words tested * 100%). Normally, a usable TTS engine needs to reach an intelligibility rate above 98%.
* Note: MOS scores are subjective and not directly comparable across languages. The MOS of the mt-MT voice is relatively lower but reasonable in this case, considering that the human recordings used as the training data for this voice also received a lower MOS.
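Put into code, the two metrics described above reduce to a simple average and a percentage; the scores and word counts in this sketch are made up for illustration.

```python
# Sketch of the two evaluation metrics; example numbers are invented.
def mean_opinion_score(ratings):
    """Average of 1-5 naturalness ratings collected from MOS participants."""
    return sum(ratings) / len(ratings)

def intelligibility_rate(intelligible_words, total_words):
    """Percentage of words the judges could understand: intelligible / total * 100."""
    return intelligible_words / total_words * 100.0

print(mean_opinion_score([4.5, 4.0, 4.5, 5.0, 4.0]))   # 4.4
print(intelligibility_rate(1973, 2000))                 # 98.65 -- clears the 98% bar
```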
LR-UNI-TTS has paved the way for us to extend Neural TTS to more languages for global users more quickly. Most excitingly, LR-UNI-TTS can potentially be applied to help preserve languages that are disappearing in the world today, as pointed out in the guiding principles of XYZ-code.
With the five new languages released in public preview, we welcome user feedback as we continue to improve the voice quality. We are also interested in partnering with passionate people and organizations to create TTS for more languages. Contact us (mstts[at]microsoft.com) for more details.
Together with the preview of these five new languages, we are happy to share that the Neural TTS Container is now generally available. With the Neural TTS Container, developers can run speech synthesis with the most natural digital voices in their own environment to meet specific security and data governance requirements. Learn more about how to install the Neural TTS Container and visit the Frequently Asked Questions on Azure Cognitive Services Containers.
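Once a container is running locally (following the install documentation linked above), the Speech SDK can be pointed at it instead of the cloud endpoint. In the sketch below, the localhost port and the voice name are assumptions for illustration; use the values from your own container deployment.

```python
# Sketch: synthesize against a locally hosted Neural TTS container by giving the
# SDK a host URL instead of a subscription key and region.
import azure.cognitiveservices.speech as speechsdk

# Assumed port; match the port you mapped when starting the container.
container_config = speechsdk.SpeechConfig(host="http://localhost:5000")
# Assumed voice; use a voice that is actually included in your container image.
container_config.speech_synthesis_voice_name = "en-US-AriaNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=container_config, audio_config=None)
result = synthesizer.speak_text_async("Hello from an on-premises container.").get()
print(result.reason)
```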
The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech SDK or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.
In general, all versions of the API have been designed so that a software developer can write an application to perform speech recognition and synthesis using a standard set of interfaces, accessible from a variety of programming languages. In addition, it is possible for third-party companies to produce their own speech recognition and text-to-speech engines, or adapt existing engines, to work with SAPI. In principle, as long as these engines conform to the defined interfaces, they can be used instead of the Microsoft-supplied engines.