Neural Text to Speech (Neural TTS), a powerful speech synthesis capability of Cognitive Services on Azure, enables you to convert text to lifelike speech that is close to human parity. Since its launch, we have seen it widely adopted in a variety of scenarios by many Azure customers, from customer service bots at companies such as BBC and Poste Italiane, to audio content creation scenarios like Duolingo.
Voice quality, which includes the accuracy of pronunciation, the naturalness of prosody such as intonation and stress patterns, and the fidelity of the audio, is the key reason customers are migrating from traditional TTS voices to neural voices. Today we are glad to share that we have upgraded our Neural TTS voices with a new-generation vocoder, called HiFiNet, which results in much higher audio fidelity while significantly improving synthesis speed. This is particularly beneficial to customers whose scenarios rely on high-fidelity audio or long interactions, including video dubbing, audiobooks, and online education materials.
Our recent updates to Azure Neural TTS voices include a major upgrade of the vocoder. Voice fidelity has improved significantly, and audio quality defects such as glitches and small noises are largely reduced. Our tests show that this new vocoder generates audio with no audible quality loss compared to the recordings in the training data (more details are introduced later). In addition, it can synthesize speech much faster than our previous version of the product. All of these benefits come from a new-generation neural vocoder, called HiFiNet.
The vocoder is a major component of speech synthesis, or text-to-speech. It turns an intermediate representation of the audio, called acoustic features, into an audible waveform. A neural vocoder is a vocoder built with deep learning networks, and it is a critical module of Neural TTS.
In the Azure TTS system, neural voice models are trained on human voice recordings using deep learning networks. As part of the training, a vocoder is built with the goal of generating high-quality audio output close to the original recordings in the training data. At the same time, it needs to run fast enough to produce at least 24,000 samples per second, i.e., a sampling rate of 24 kHz, which is the default sampling rate of Azure Neural TTS voice models.
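As a back-of-the-envelope illustration of that real-time requirement, the sketch below computes how many waveform samples a vocoder must emit for a clip of a given length at the 24 kHz default rate. The function name and numbers are illustrative, not part of any Azure API.

```python
SAMPLE_RATE_HZ = 24_000  # default sampling rate of Azure Neural TTS voices


def samples_needed(duration_seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of waveform samples required for a clip of the given length."""
    return int(duration_seconds * sample_rate_hz)


# A 10-second clip at 24 kHz requires 240,000 samples, so a vocoder that
# keeps up with real time must produce at least 24,000 samples each second.
print(samples_needed(10.0))  # 240000
```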
Leveraging state-of-the-art research on vocoders, we designed the training pipeline for HiFiNet, the new-generation Neural TTS vocoder, and applied it to create neural voice models in Azure Neural TTS. This pipeline is built with one simple goal: produce machine-generated waveforms (synthetic speech) that are indistinguishable from the original waveforms (human recordings), at high speed.
To understand the benefits of HiFiNet, we conducted a number of tests across several dimensions, all of which yielded positive results. Our tests show that the HiFiNet vocoder significantly improves the audio quality of Neural TTS voice output compared to our previous version of the product.
CMOS (Comparative Mean Opinion Score) is a well-accepted method in the speech industry for comparing the voice quality of two TTS systems. A CMOS test is similar to an A/B test: participants listen to pairs of audio samples generated by the two systems and give their subjective opinions on how A compares to B. Normally, in one test we recruit 30-60 anonymous testers with qualified language expertise to evaluate around 50 pairs of audio samples side by side. The result is reported as the CMOS gap, which measures the average difference in opinion score between the two systems. If the absolute value of the CMOS gap is >=0.1, one system is reported as better than the other. If the absolute value of the CMOS gap is >=0.2, we say one system is significantly better than the other.
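The CMOS gap described above can be sketched in a few lines: each listener scores a pair on a comparative scale (here assumed to be -3..+3, positive meaning system A is preferred), and the gap is the mean of those scores. The rating scale and the sample scores are hypothetical; the interpretation thresholds follow the text.

```python
from statistics import mean


def cmos_gap(scores: list[float]) -> float:
    """Average comparative opinion score across all rated pairs."""
    return mean(scores)


def interpret(gap: float) -> str:
    """Apply the thresholds from the text to a CMOS gap."""
    magnitude = abs(gap)
    if magnitude >= 0.2:
        return "one system is significantly better"
    if magnitude >= 0.1:
        return "one system is better"
    return "no clear preference"


# Hypothetical listener scores for pairs of samples (positive = A preferred).
ratings = [0.5, 0.0, 1.0, -0.5, 0.5, 1.0]
gap = cmos_gap(ratings)
print(round(gap, 3), "->", interpret(gap))  # 0.417 -> one system is significantly better
```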
In general, the audio quality, and especially the fidelity, is clearly improved. On average, across all languages, the HiFiNet vocoder achieves a CMOS gain higher than 0.2 compared to the previous vocoder, which means the improvement is audible to users.
In particular, HiFiNet is also more robust than the previous vocoder. Audio defects are largely reduced in the waveforms it generates. Our tests show that with the previous production vocoder, testers could hear about 10 defects, such as beeps, clicks, and fidelity loss, in 100 test samples. Although most of them were not obvious, they can still be annoying when they keep happening in a long audio clip or a multi-round voice interaction. With HiFiNet, these defects are no longer reported under the same test procedure with the same test sets.
In addition, we conducted tests comparing the quality of human recordings with that of computer-generated audio from HiFiNet. To make the comparison more accurate and more focused on the vocoder itself, we used acoustic features extracted directly from human recordings rather than TTS-predicted acoustic features, so that the acoustic differences are controlled and only the vocoder is evaluated in the CMOS tests. Participants were asked to score pairs of generated waveforms and human recordings. Our result shows the CMOS gap of the HiFiNet audio compared to human recordings is -0.05, which means the difference is hardly audible and the audio quality is on par.
With output at a 24 kHz sampling rate, on an M60 GPU, and with a carefully optimized CUDA implementation, the vocoder RTF (real-time factor) is around 0.01, which means HiFiNet can generate 10 seconds of audio in 0.1 seconds. This is almost 3x the speed of our previous production vocoder.
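The real-time factor mentioned above is simply synthesis time divided by the duration of the audio produced; values below 1.0 mean faster than real time. A minimal sketch, with the 10-second example from the text:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.
    An RTF below 1.0 means the system runs faster than real time."""
    return synthesis_seconds / audio_seconds


# 10 seconds of audio generated in 0.1 seconds of compute, as in the text.
rtf = real_time_factor(synthesis_seconds=0.1, audio_seconds=10.0)
print(round(rtf, 4))  # 0.01
```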
Currently, Azure Neural TTS supports sampling rates up to 24 kHz, with 68 neural voice models available. In some highly sophisticated scenarios like audio dubbing, higher-fidelity output such as a 48 kHz sampling rate makes a world of difference.
The snippet below from an audio spectrum shows the difference between a 48 kHz sampling rate and 24 kHz. Audio at a 48 kHz sampling rate has a higher frequency response range, which preserves more of the fine details and nuances of the sound. Such a high sampling rate creates challenges for both voice quality and inference speed.
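The wider frequency range follows from the Nyquist theorem: a digital recording can only represent frequencies up to half its sampling rate. A small sketch of that relationship, with the two rates discussed here:

```python
def nyquist_limit_hz(sample_rate_hz: int) -> float:
    """Highest frequency representable at the given sampling rate
    (the Nyquist frequency: half the sampling rate)."""
    return sample_rate_hz / 2


# 24 kHz audio can carry content up to 12 kHz; 48 kHz audio up to 24 kHz,
# which is why the higher rate preserves more high-frequency detail.
print(nyquist_limit_hz(24_000))  # 12000.0
print(nyquist_limit_hz(48_000))  # 24000.0
```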
In our exploration, HiFiNet handles both challenges well. According to our experiments, the HiFiNet vocoder can be trained at a 48 kHz sampling rate to achieve even higher quality with reasonable inference speed.
I'm a pro user, and have recently wanted to expand my skills into more detailed audio editing. I appreciate that my project must conform to a particular project frame rate, but now that we are all mostly working in digital, and are able to export audio straight to recording facilities, is there any way for us to edit audio without having to stick to frame based resolution?
In other words, while I accept I must edit video at 25fps, etc, there doesn't seem to be any need to do this for audio. Ideally, I want to edit video at the frame rate, and audio at the 'bit' level (like on Pro Tools, or any other audio waveform editing software).
I don't know. But I think it's about time, as the idea of audio and video needing to be married by timecode seems obsolete. I can't see any reason for keeping audio frame-based. Unless it's a technical impossibility to separate the two.
The Audacity sample rate in the lower-left window has to match the sample rate of the Zoom. You can change either one. If Audacity is running at the video sample rate of 48000 Hz and the Zoom is running at 44100 Hz, the show may play back at the wrong pitch. The normal Audacity rate is 44100 Hz, the audio CD standard. Consult the Zoom instructions on how to change it to match.
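The pitch problem above is easy to quantify: playing audio back at a different sample rate than it was recorded at scales every frequency (and the playback speed) by the ratio of the two rates. A small sketch of that calculation:

```python
def pitch_ratio(playback_rate_hz: int, recording_rate_hz: int) -> float:
    """Factor by which every frequency is scaled when audio recorded at one
    sample rate is interpreted at another (also scales playback speed)."""
    return playback_rate_hz / recording_rate_hz


# A 44100 Hz recording interpreted as 48000 Hz plays about 8.8% sharp and fast.
ratio = pitch_ratio(48_000, 44_100)
print(round(ratio, 4))  # 1.0884
```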
In this mode, Teams supports a 32 kHz sampling rate at 128 kbps when network bandwidth allows. The internal audio processing is optimized for reproducing music with high fidelity. When network bandwidth is insufficient, the bitrate can drop to as low as 48 kbps while Teams still produces good-quality audio.
You're also given the option to turn off echo cancellation, noise suppression, and gain control when the environment is professionally managed, e.g., high-quality headphones are used without audio feedback, the environment has low background noise, and the microphone input level is managed optimally.
Digital audio is a representation of sound recorded in, or converted into, a digital signal. During the analog-to-digital conversion process, the amplitudes of an analog sound wave are captured at a specified sample rate and bit depth and converted into data that computer software can read.
The main difference between sound and digital audio is that digital audio is a series of amplitude values used to reconstruct the original analog sound wave, whereas analog sound is a continuous signal with infinite amplitude values at any one point in time. Digital audio is like playing connect-the-dots, whereas real sound is the full original image.
The analog-to-digital conversion process, which involves sampling and quantization, is very similar to the way cameras capture video. A video camera reconstructs a continuous moment in time by capturing many consecutive images per second, called frames. The higher the frame rate, the smoother the movie. In digital audio, an analog-to-digital converter captures thousands of audio samples per second at a specified sample rate and bit depth to reconstruct the original signal. The higher the sample rate and bit depth, the higher the audio resolution.
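The sampling-and-quantization process above can be sketched in a few lines: a continuous sine wave is sampled at a given rate, and each amplitude is rounded to the nearest level the chosen bit depth can represent. The function and parameter values are illustrative.

```python
import math


def sample_and_quantize(freq_hz: float, sample_rate_hz: int,
                        bit_depth: int, duration_s: float) -> list[int]:
    """Return integer sample values of a sine tone at the given resolution."""
    max_level = 2 ** (bit_depth - 1) - 1  # e.g. 32767 for 16-bit audio
    n_samples = int(duration_s * sample_rate_hz)
    return [round(max_level * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz))
            for n in range(n_samples)]


# One millisecond of a 1 kHz tone at 24 kHz / 16-bit yields 24 samples:
# the "dots" from which playback hardware reconstructs the continuous wave.
samples = sample_and_quantize(1000.0, 24_000, 16, 0.001)
print(len(samples))  # 24
```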
Signals above the Nyquist frequency are not recorded properly by analog-to-digital converters (ADCs); they are mirrored back across the Nyquist frequency, introducing artificial frequencies in a process called aliasing.
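The mirroring can be sketched numerically: for an input frequency between the Nyquist frequency and the sample rate, the recorded alias lands at `sample_rate - input_frequency`. A minimal illustration (valid for inputs below the sample rate):

```python
def aliased_frequency_hz(input_hz: float, sample_rate_hz: int) -> float:
    """Frequency actually recorded by an ADC, for inputs below the sample rate."""
    nyquist = sample_rate_hz / 2
    if input_hz <= nyquist:
        return input_hz  # at or below Nyquist: captured faithfully
    return sample_rate_hz - input_hz  # mirrored back across the Nyquist frequency


# At a 48 kHz sample rate (Nyquist = 24 kHz), a 30 kHz tone aliases to 18 kHz,
# while a 10 kHz tone is recorded as-is.
print(aliased_frequency_hz(30_000.0, 48_000))  # 18000.0
print(aliased_frequency_hz(10_000.0, 48_000))  # 10000.0
```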
To prevent aliasing, analog-to-digital converters are often preceded by low-pass filters that remove frequencies above the Nyquist frequency before the audio reaches the converter. This prevents unwanted ultrasonic frequencies in the original audio from causing aliasing. Early filters could taint the audio, but this problem has been minimized as better technology has been introduced.