Speech can be an effective, natural, and enjoyable way for people to interact with your Windows applications, complementing, or even replacing, traditional interaction experiences based on mouse, keyboard, touch, controller, or gestures.
Speech-based features such as speech recognition, dictation, speech synthesis (also known as text-to-speech or TTS), and conversational voice assistants (such as Cortana or Alexa) can provide accessible and inclusive user experiences that enable people to use your applications when other input devices might not suffice.
Build speech-enabled apps on the modern platform for Windows 10 (and later) applications and games, on any Windows device (including PCs, phones, Xbox, HoloLens, and more), and publish them to the Microsoft Store.
The Microsoft Speech Platform is used by Voice Elements for Text-To-Speech (TTS) and for Speech Recognition. Many languages are supported. A license to use Microsoft Speech for TTS and Speech Recognition is included with your Windows OS license.
Microsoft stopped development on the Microsoft Speech Platform in 2012. Instead of processing text-to-speech (TTS) or speech recognition (SR) on-premises, Microsoft now steers its customers toward its cloud services on Azure. Those services, and other similar cloud services, can provide excellent SR and TTS and can work in conjunction with the Voice Elements platform. However, since there is no charge for the Microsoft Speech Platform, we continue to support it as our default facility for TTS and SR.
You should have this installed on your server in order to perform speech recognition functions within Voice Elements. Voice Elements has full support for the Microsoft Speech Platform, as long as you use Microsoft-compatible grammar files. These are easy to create using the methods outlined in this article: Create Microsoft Speech Compatible Grammar Files
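For reference, Microsoft-compatible grammar files follow the W3C SRGS XML format. A minimal sketch might look like the following (the rule name and phrases are placeholders, not taken from the article):

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Minimal SRGS grammar: recognizes the words "yes" or "no". -->
<grammar version="1.0" xml:lang="en-US" root="yesNo"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="yesNo" scope="public">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>
```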
The Microsoft Speech Platform relies on different language packs in order to provide speech recognition capabilities for different languages. Microsoft Speech Platform supports 18 different languages and accents. You can download some of the more popular languages using the links below. For additional options, please contact Inventive Labs Technical Support.
The SDK is the toolkit provided by Microsoft for working with the Microsoft Speech Platform. All of this functionality is built into Voice Elements, so you will not need to have the SDK installed unless you would like to use it to create Microsoft-compatible grammar files.
After completing the steps above, you can enable speech recognition by setting SpeechRecognitionEnabled to true before calling the Play and PlayTTS methods. For information on how to use the speech recognition demo application for testing, see Test Speech Recognition with Voice Elements.
Please note that SpeechRecognitionNumberOfPorts should be set to a number that is equal to or less than the number of Speech Recognition Ports for which you are licensed. You can check your license entitlements in the Voice Elements Dashboard.
The challenge I ran into was how to create the SSML content that was needed. SSML, which stands for Speech Synthesis Markup Language, is a standard for specifying how speech should be spoken. Examples of this would be:
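For instance, a short SSML fragment such as the one below tells the engine where to pause, how to interpret a token, and how to shape prosody (the element values here are illustrative):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Your total is <say-as interpret-as="currency">$42.50</say-as>.
  <break time="500ms"/>
  <prosody rate="slow" pitch="+10%">Please confirm your order.</prosody>
</speak>
```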
The section highlighted in GREEN is our locale. The BLUE section contains our service name mapping. The locale must always be paired with the service name mapping from the same row. The double quotes are equally important.
In this blog post, we will share the story about how we discovered a critical stack corruption bug that has existed in Windows for more than 20 years (CVE-2023-36719). The bug was found in a core Windows OS library which is used by countless software products but is most notably reachable from within the sandbox of all Chromium-based browsers, exposing a sandbox escape which is reachable from the web via JavaScript.
We will walk through how we chose the target and the method that led us to discover this vulnerability, as well as some thoughts on how it could lead to browser and full system compromise.
In line with our goal to make Microsoft Edge the most secure browser for Windows, we decided to target a platform-specific feature in the Web Speech API, which offers speech synthesis functionality to aid user experience and accessibility and ends up using the internal Microsoft Speech API.
The IPC functionality is implemented by Mojo, a collection of runtime libraries integrated into Chromium that provides a platform-agnostic abstraction of common primitives, such as message passing, shared memory, and platform handles.
In this case, the JavaScript call initiated by the web page will result in an IPC call from the renderer process to the SpeechSynthesis Mojo Interface, which is implemented by the browser process: speech_synthesis.mojom - Chromium Code Search
Effectively, calling the speak JavaScript API crosses a privilege boundary, and any bug that we trigger in the browser code may allow an attacker to gain code execution outside of the context of the sandbox.
As emphasized before, a bug in this code might lead to a potential browser sandbox escape exploit. Moreover, being reachable directly from JavaScript, this would be one of those rare cases in which a sandbox escape could be achieved without compromising the Renderer process.
Now that we have our target, we wanted to begin hunting for exploitable bugs. For our first approach, we decided to go for black box fuzzing to get something running as quickly as possible and get feedback on how to iterate later.
We found out that, overall, the best choice for our case was Jackalope, a fuzzer developed by Google Project Zero which supports coverage-guided black box fuzzing without much overhead and comes with a grammar engine which is especially useful to fuzz our SSML/SAPI parser.
Since we did not have any experience with the COM framework or how to interact with the ISpVoice service, we asked Bing Chat to write a harness for us; it sped up our development and worked flawlessly.
However, we noticed that the coverage measured by Jackalope (Offsets) plateaued within just a few days, as you can see from the comparison screenshot, and consequently the corpus size also stopped growing.
The bug in question is reachable through the function ISpVoice::Speak (SAPI 5.3), which takes as input an XML string that supports both tags specific to the Microsoft SAPI format and SSML (Speech Synthesis Markup Language), the standard for speech synthesis used by many other engines.
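As an illustration of the legacy dialect, SAPI 5 XML mixes control tags such as rate, silence, pitch, and spell directly into the text to be spoken; a sketch of such input (attribute values are illustrative) might be:

```xml
<rate absspeed="2">
  This sentence is read slightly faster.
  <silence msec="300"/>
  <pitch absmiddle="5">This one at a higher pitch.</pitch>
  The engine spells out <spell>SAPI</spell>.
</rate>
```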
Trying to reproduce the bug on a vulnerable version of sapi.dll results in a crash due to an invalid rip value and a completely corrupted stack trace. This immediately points to a possible stack buffer overflow.
Our overflow allows us to write additional XMLATTRIB structures past the end of the XMLTAG.Attrs array. Serializing the members of this structure means we can effectively write content in chunks of 24 bytes as shown:
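To make the 24-byte granularity concrete, here is a small Python sketch. The field names, ordering, and padding are our assumptions for illustration; only the 24-byte record size comes from the analysis above.

```python
import struct

# Hypothetical 64-bit layout of one attribute record (names assumed):
#   value pointer (8 bytes) + 32-bit length + 32-bit eAttr enum + 8 pad bytes
XMLATTR = struct.Struct("<Qii8x")
assert XMLATTR.size == 24

def oob_write_extent(extra_attrs: int) -> int:
    """Bytes corrupted past XMLTAG.Attrs when extra records are serialized."""
    return extra_attrs * XMLATTR.size

print(oob_write_extent(2))  # two out-of-bounds attributes overwrite 48 bytes
```

Each additional parsed attribute advances the out-of-bounds write by exactly one record, which is what makes the corruption controllable in 24-byte chunks.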
This is exactly what happened (accidentally) with our first PoC: the 11th attribute, PART, gets parsed into an XMLATTR structure with an eAttr value higher than 10. That value overwrites NumAttrs, and then the next (12th) attribute directly overwrites the saved rip without touching the stack cookies.
Since this bug allows us to corrupt almost anything we want on the stack, even in deeper stack frames, we could look for targets to overwrite other than the saved rip. There are indeed some object pointers on the stack that we could overwrite, but we need to keep some limitations in mind.
This makes it very hard (impossible?) to craft valid pointers within the controlled content. Moreover, apart from null bytes, there are other invalid byte sequences that will not be accepted as valid wide strings and would break the parsing early.
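The pointer problem is easy to see with a quick sketch: a canonical 64-bit user-mode address always contains zero bytes, and because the input is consumed as a NUL-terminated wide string, those zeros end the string early (the address below is made up):

```python
# A made-up canonical user-mode address; its top two bytes are zero.
addr = 0x00007FFD12345678
raw = addr.to_bytes(8, "little")

# The parser reads the input as UTF-16 (2-byte units), so any 0x0000
# unit acts as a terminator before the full pointer can be embedded.
units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, 8, 2)]
print(0 in units)  # True: the wide string would terminate early
```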
Alternatively, we could overwrite a value in the stack that is aligned with our controlled len field. In this case, since len is of type int, only the least significant 4 bytes will be overwritten, leaving the most significant part untouched.
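The effect of such an aligned partial overwrite can be sketched in a few lines (both values below are made up):

```python
# A made-up 8-byte stack value aligned with the overflowed len field:
original = 0x00007FFDDEADBEEF

# len is a 32-bit int, so only the low dword of the target is replaced:
controlled_len = 0x41414141
corrupted = (original & 0xFFFFFFFF00000000) | (controlled_len & 0xFFFFFFFF)

print(hex(corrupted))  # high dword untouched, low dword attacker-controlled
```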
After the initial quick success, and given the necessity of fixing the first bug to keep fuzzing further, we decided to spend some time figuring out how to recompile our target library sapi.dll with ASan (AddressSanitizer) and code coverage instrumentation to test more effectively.
Having ASan enabled lets us catch potential heap errors that would otherwise have been missed by the PageHeap sanitizer (which we used during our first iteration). It also gives detailed information about stack corruption right away, dramatically reducing debugging time in case we found more of those.
To make our fuzzer simpler to deploy on other infrastructures such as OneFuzz, we decided to use Detours to hook LoadLibraryEx and replace the DLL path loaded from the system registry with the path of our instrumented DLL instead of modifying the registry.
In this blog post, we have shown how we discovered and reported a critical bug on Windows that was reachable from the browser using the Web Speech API. We have explained how we used the SAPI XML grammar to craft malicious inputs that could trigger memory corruption in the sapi.dll library, which runs unsandboxed in the browser process.
Bugs like the one discussed here are particularly rare since, when exploited, they lead to full compromise of the browser process without the need for additional bugs (normally you would need a chain of at least two bugs to gain control over the browser process).
However, some browsers took slightly different approaches that might spare them from this bug. Firefox, for instance, does use sapi.dll, but it strips away XML tags from the input, at the cost of losing all the features that the SSML standard provides.