
Speech Synthesis Demo

Melva Souder

Dec 5, 2023, 9:12:30 PM
Speech recognition involves receiving speech through a device's microphone, which is then checked by a speech recognition service against a list of grammar (basically, the vocabulary you want to have recognized in a particular app). When a word or phrase is successfully recognized, it is returned as a result (or list of results) as a text string, and further actions can be initiated based on that result.

To show simple usage of Web speech recognition, we've written a demo called Speech color changer. When the screen is tapped/clicked, you can say an HTML color keyword, and the app's background color will change to that color.




As mentioned earlier, Chrome currently supports speech recognition with prefixed properties, so at the start of our code we include these lines to feed the right objects to Chrome, and to any future implementations that might support the features without a prefix:
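
A minimal sketch of those lines, assuming the standard webkit prefix that Chrome uses:

    // Use the prefixed constructors where the unprefixed ones are unavailable (Chrome).
    const SpeechRecognition =
      window.SpeechRecognition || window.webkitSpeechRecognition;
    const SpeechGrammarList =
      window.SpeechGrammarList || window.webkitSpeechGrammarList;
    const SpeechRecognitionEvent =
      window.SpeechRecognitionEvent || window.webkitSpeechRecognitionEvent;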

The next thing to do is define a speech recognition instance to control the recognition for our application. This is done using the SpeechRecognition() constructor. We also create a new speech grammar list to contain our grammar, using the SpeechGrammarList() constructor.
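
As a sketch, with the color list shortened here for illustration, creating the two instances and filling the grammar list might look like this:

    // The grammar is written in JSGF format; the real demo lists many more colors.
    const colors = ['aqua', 'blue', 'green', 'red', 'yellow'];
    const grammar = `#JSGF V1.0; grammar colors; public <color> = ${colors.join(' | ')};`;

    const recognition = new SpeechRecognition();
    const speechRecognitionList = new SpeechGrammarList();
    speechRecognitionList.addFromString(grammar, 1); // weight of 1 for this grammar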

We then add the SpeechGrammarList to the speech recognition instance by setting it to the value of the SpeechRecognition.grammars property. We also set a few other properties of the recognition instance before we move on:
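
A sketch of that setup, with property values along the lines of what a single-shot color demo would use:

    recognition.grammars = speechRecognitionList;
    recognition.continuous = false;     // stop listening after one result
    recognition.lang = 'en-US';
    recognition.interimResults = false; // only report final results
    recognition.maxAlternatives = 1;    // one best guess per result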

After grabbing references to the output and the HTML element (so we can output diagnostic messages and update the app background color later on), we implement an onclick handler so that when the screen is tapped/clicked, the speech recognition service will start. This is achieved by calling SpeechRecognition.start(). The forEach() method is used to output colored indicators showing what colors to try saying.
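
A sketch of that wiring, assuming an element with class "output" for diagnostic messages and one with class "hints" for the color indicators (those class names are illustrative, not necessarily the demo's markup):

    const diagnostic = document.querySelector('.output'); // assumed diagnostic element
    const bg = document.querySelector('html');
    const hints = document.querySelector('.hints');       // assumed hints element

    // Show colored indicators for the colors the user can try saying.
    let colorHTML = '';
    colors.forEach((color) => {
      colorHTML += `<span style="background-color:${color};"> ${color} </span>`;
    });
    hints.innerHTML = `Tap or click, then say a color. Try: ${colorHTML}`;

    // Start the recognition service when the screen is tapped/clicked.
    document.body.onclick = () => {
      recognition.start();
      diagnostic.textContent = 'Ready to receive a color command.';
    };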

Once the speech recognition is started, there are many event handlers that can be used to retrieve results and other pieces of surrounding information (see the SpeechRecognition events). The most common one you'll probably use is the result event, which is fired once a successful result is received:
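
A sketch of such a handler; the event's results list holds SpeechRecognitionResult objects, each of which holds SpeechRecognitionAlternative objects with a transcript and a confidence value:

    recognition.onresult = (event) => {
      // With maxAlternatives = 1 there is a single alternative per result.
      const color = event.results[0][0].transcript;
      diagnostic.textContent = `Result received: ${color}.`;
      bg.style.backgroundColor = color;
      console.log(`Confidence: ${event.results[0][0].confidence}`);
    };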

The last two handlers cover the cases where speech was recognized that wasn't in the defined grammar, or where an error occurred. The nomatch event is supposed to handle the first case, although note that at the moment it doesn't seem to fire correctly; it just returns whatever was recognized anyway:
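
Those two handlers might be sketched like this:

    recognition.onnomatch = () => {
      diagnostic.textContent = "I didn't recognize that color.";
    };

    recognition.onerror = (event) => {
      diagnostic.textContent = `Error occurred in recognition: ${event.error}`;
    };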

To show simple usage of Web speech synthesis, we've provided a demo called Speak easy synthesis. This includes a set of form controls for entering text to be synthesized, and setting the pitch, rate, and voice to use when the text is uttered. After you have entered your text, you can press Enter/Return to hear it spoken.
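
A sketch of the synthesis side, assuming a form with a text field, pitch and rate sliders, and a voice <select> (the selectors here are illustrative, not the demo's actual markup):

    const synth = window.speechSynthesis;
    const form = document.querySelector('form');
    const textInput = document.querySelector('.txt');
    const voiceSelect = document.querySelector('select');
    const pitch = document.querySelector('#pitch');
    const rate = document.querySelector('#rate');

    form.onsubmit = (event) => {
      event.preventDefault(); // pressing Enter/Return submits the form
      const utterance = new SpeechSynthesisUtterance(textInput.value);
      // Match the selected <option> text to one of the available voices.
      const chosenName = voiceSelect.selectedOptions[0].textContent;
      utterance.voice = synth.getVoices().find((v) => v.name === chosenName);
      utterance.pitch = Number(pitch.value);
      utterance.rate = Number(rate.value);
      synth.speak(utterance);
    };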

In the final part of the handler, we include a pause event to demonstrate how SpeechSynthesisEvent can be put to good use. When SpeechSynthesis.pause() is invoked, this returns a message reporting the character number and the character that the speech was paused at.
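
A sketch of such a handler, attached to the utterance inside the submit handler sketched above:

    utterance.onpause = (event) => {
      // charIndex is the position in the text where speech was paused.
      const char = event.utterance.text.charAt(event.charIndex);
      console.log(
        `Speech paused at character ${event.charIndex} of "${event.utterance.text}", which is "${char}".`
      );
    };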



Deliver a better voice experience for customer service with voicebots on Dialogflow that dynamically generate speech, instead of playing static, pre-recorded audio. Engage with high-quality synthesized voices that give callers a sense of familiarity and personalization.

Fine-tune synthesized speech audio to fit your scenario. Define lexicons and control speech parameters such as pronunciation, pitch, rate, pauses, and intonation with Speech Synthesis Markup Language (SSML) or with the audio content creation tool.

The Web Speech API adds voice recognition (speech to text) and speech synthesis (text to speech) to JavaScript. The post briefly covers the latter, as the API recently landed in Chrome 33 (mobile and desktop). If you're interested in speech recognition, Glen Shires had a great writeup a while back on the voice recognition feature, "Voice Driven Web Apps: Introduction to the Web Speech API".

In my Google I/O 2013 talk, "More Awesome Web: features you've always wanted" (www.moreawesomeweb.com), I showed a Google Now/Siri-like demo of using the Web Speech API's SpeechRecognition service with the Google Translate API to auto-translate microphone input into another language:

Unfortunately, it used an undocumented (and unofficial) API to perform the speech synthesis. Well, now we have the full Web Speech API to speak back the translation! I've updated the demo to use the synthesis API.

This demo presents the LPCNet architecture that combines signal processing and deep learning to improve the efficiency of neural speech synthesis. Neural speech synthesis models like WaveNet have recently demonstrated impressive speech synthesis quality. Unfortunately, their computational complexity has made them hard to use in real-time, especially on phones. As was the case in the RNNoise project, one solution is to use a combination of deep learning and digital signal processing (DSP) techniques. This demo explains the motivations for LPCNet, shows what it can achieve, and explores its possible applications.

Back in the 70s, some people started investigating how to model speech. They realized they could approximate vowels as a series of regularly spaced (glottal) impulses going through a relatively simple filter that represents the vocal tract. Similarly, unvoiced sounds can be approximated by white noise going through a filter.
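
As a rough, self-contained sketch of that source-filter idea (purely illustrative, and not taken from any codec's or LPCNet's actual code), in JavaScript to match the snippets above:

    // Toy source-filter synthesis: an excitation signal passed through an
    // all-pole (LPC) filter that stands in for the vocal tract.
    function synthesizeFrame(lpcCoeffs, gain, pitchPeriod, voiced, numSamples) {
      const out = new Float32Array(numSamples);
      for (let n = 0; n < numSamples; n++) {
        // Voiced: regularly spaced glottal impulses; unvoiced: white noise.
        const excitation = voiced
          ? (n % pitchPeriod === 0 ? gain : 0)
          : gain * (2 * Math.random() - 1);
        // All-pole filter: y[n] = e[n] - sum over k of a[k] * y[n - k - 1]
        let y = excitation;
        for (let k = 0; k < lpcCoeffs.length; k++) {
          if (n - k - 1 >= 0) y -= lpcCoeffs[k] * out[n - k - 1];
        }
        out[n] = y;
      }
      return out;
    }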

The result is intelligible speech, but by no means good quality. With a lot of effort, it's possible to create slightly more realistic models, leading to codecs like MELP and codec2. This is what codec2 sounds like on the same sample:

WaveNet can synthesize speech with much higher quality than other vocoders, but it has one important drawback: complexity. Synthesizing speech requires tens of billions of floating-point operations per second (GFLOPS). This is too high for running on a CPU, but modern GPUs are able to achieve performance in the TFLOPS range, so no problem, right? Well, not exactly. GPUs are really good at executing lots of operations in parallel, but they're not very efficient on WaveNet for two reasons:

The WaveRNN architecture (by the authors of WaveNet) comes in to solve problem 1) above. Rather than using a stack of convolutions, it uses a recurrent neural network (RNN) that can be computed in fewer steps. The use of an RNN with sparse matrices also reduces complexity, so that we can consider implementing it on a CPU. That being said, its complexity is still on the order of 10 GFLOPS, which is about two orders of magnitude more than typical speech processing algorithms (e.g. codecs).

In the results above, the complexity of the synthesis ranges between 1.5 GFLOPS for the network of size 61 and 6 GFLOPS for the network of size 203, with the medium-sized network of size 122 having a complexity of around 3 GFLOPS. This is easily within the range of what a single smartphone core can do. Of course, the lower the better, so we're still looking at ways to further reduce the complexity for the same quality.

Comparing the speech synthesis quality of LPCNet with that of WaveRNN+. This demo will work best with a browser that supports Ogg/Opus in HTML5 (Firefox, Chrome and Opera do), but if Opus support is missing the file will be played as FLAC, WAV, or high bitrate MP3.