Contact emails
ev...@google.com

Explainer
https://github.com/WebAudio/web-speech-api/pull/122
Specification
https://webaudio.github.io/web-speech-api

Summary
This feature adds on-device speech recognition support to the Web Speech API, allowing websites to ensure that neither audio nor transcribed speech are sent to a third-party service for processing. Websites can query the availability of on-device speech recognition for specific languages, prompt users to install the necessary resources for on-device speech recognition, and choose between on-device or cloud-based speech recognition as needed.
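A rough sketch of how a site might combine these capabilities. The method and mode names used here (onDeviceWebSpeechAvailable, installOnDeviceSpeechRecognition, "ondevice-only") are the ones discussed later in this thread and may still change as the spec PR evolves; since the real API only exists in the browser, a small mock stands in for the SpeechRecognition surface:

```javascript
// Mock stand-in for the browser's speech recognition surface.
// In a real page these would hang off SpeechRecognition / webkitSpeechRecognition.
const installedPacks = new Set(["en-US"]);

const speechRecognition = {
  // Resolves to true if an on-device language pack is installed for `lang`.
  async onDeviceWebSpeechAvailable(lang) {
    return installedPacks.has(lang);
  },
  // In a real browser this would prompt the user and download the pack;
  // resolves to true on success.
  async installOnDeviceSpeechRecognition(lang) {
    installedPacks.add(lang);
    return true;
  },
};

async function chooseMode(lang) {
  // Prefer on-device recognition so neither audio nor transcripts leave
  // the machine; fall back to cloud-based recognition otherwise.
  if (await speechRecognition.onDeviceWebSpeechAvailable(lang)) {
    return "ondevice-only";
  }
  const installed = await speechRecognition.installOnDeviceSpeechRecognition(lang);
  return installed ? "ondevice-only" : "cloud-only";
}

chooseMode("de-DE").then((mode) => console.log(mode)); // "ondevice-only" with this mock
```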
Blink component
Blink>Speech

Search tags
speech, recognition, local, offline, on-device

TAG review
None

TAG review status
Pending

Risks
Interoperability and Compatibility
None
Gecko: Positive. Discussed at TPAC 2024 with representatives from Mozilla, including Paul Adenot.
WebKit: Positive. Discussed at TPAC 2024 with representatives from Apple, including Eric Carlson.
Web developers: Positive. Commonly requested feature. Examples:
https://webwewant.fyi/wants/55/
https://github.com/WebAudio/web-speech-api/issues/108
https://stackoverflow.com/questions/49473369/offline-speech-recognition-in-browser
https://www.reddit.com/r/html5/comments/8jtv3u/offline_voice_recognition_without_the_webspeech/
Other signals:

WebView application risks
Does this intent deprecate or change behavior of existing APIs, such that it has potentially high risk for Android WebView-based applications?
None
Debuggability
None
Will this feature be supported on all six Blink platforms (Windows, Mac, Linux, ChromeOS, Android, and Android WebView)?
No. Initially supported on Windows, Mac, and Linux, with ChromeOS support to follow.
Is this feature fully tested by web-platform-tests?
No
Flag name on about://flags
None

Finch feature name
InstallOnDeviceSpeechRecognition, OnDeviceWebSpeechAvailable, OnDeviceWebSpeech

Requires code in //chrome?
False

Estimated milestones
Shipping on desktop: 135

Anticipated spec changes
Open questions about a feature may be a source of future web compat or interop issues. Please list open issues (e.g. links to known github issues in the project for the feature specification) whose resolution may introduce web compat/interop risk (e.g., changing to naming or structure of the API in a non-backward-compatible way).
https://github.com/WebAudio/web-speech-api/pull/122

Link to entry on the Chrome Platform Status
https://chromestatus.com/feature/6090916291674112?gate=4683906480340992

This intent message was generated by Chrome Platform Status.
--
You received this message because you are subscribed to the Google Groups "blink-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blink-dev+...@chromium.org.
To view this discussion visit https://groups.google.com/a/chromium.org/d/msgid/blink-dev/677c7f0e.2b0a0220.2e82a8.01f6.GAE%40google.com.
Adding to Yoav’s feedback about the spec:
I also wonder if this should have a TAG review, especially given the privacy/fingerprinting implications of websites being able to query which on-device models are available.
-- Dan Clark
> * Are the resources downloaded partitioned per top-level site? What should typical download sizes be?

This depends on the browser--for Chrome on Windows/Mac/Linux, there's only one instance of each on-device speech recognition language pack, and each language pack is ~60MB. The spec doesn't necessarily dictate how the downloads are handled, only that websites should be allowed to trigger a download (or request a download) of a language.

> Links to the minutes would be helpful. Filing official positions would be even better.

> Why not? Is it tested otherwise?

Oops, I forgot to check that box. This feature is testable by web-platform-tests.

> It’s implied that installOnDeviceSpeechRecognition() happens synchronously. Making this a blocking call seems problematic since it could involve a fetch and a download. I’d expect it to return a Promise (https://www.w3.org/TR/design-principles/#promises). And onDeviceWebSpeechAvailable should probably also be async since it could involve reading data from disk.

Totally agree--the implementation of those two APIs in Chrome returns Promises. I'll make sure the spec reflects this.

> The SpeechRecognitionMode "ondevice-only" value is only defined by a comment in the IDL stating that it “Returns an error if on-device speech recognition is not available”. What specifically returns an error? SpeechRecognition.start() doesn’t return any value, and in other error conditions the behavior is to fire a SpeechRecognitionErrorEvent. Also, what should the behavior be if SpeechRecognitionMode is changed after start() has already been called?

Ah yeah, I'll update that comment to clarify that it fires a SpeechRecognitionErrorEvent. Updating the SpeechRecognitionMode after start() has been called has no effect on the existing session. This is consistent with how other SpeechRecognition attributes work (e.g. lang, maxAlternatives, etc.). This isn't explicitly stated anywhere in the spec, so I'll file a spec issue to clarify this as well.

As for mitigating privacy and fingerprinting risks, we've been collaborating with the team building the Translator API feature, which also has the ability to download and detect language packs. Because the risks between these two features are nearly identical, on-device speech recognition language pack downloads will follow the same pattern and use the same permissions UI as on-device translation language packs. Here are some helpful links:

Privacy Design Doc
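Under the assumptions in this reply -- both new methods return Promises, and an "ondevice-only" failure fires a SpeechRecognitionErrorEvent rather than returning a value from start() -- the described behavior could be sketched with a minimal mock like this (names and error codes are illustrative, not normative):

```javascript
// Minimal stand-in for SpeechRecognition, modeling only the error-reporting
// behavior discussed above: start() returns no value, and failures surface
// asynchronously through an error handler, the way other recognition errors do.
class MockSpeechRecognition {
  constructor(availableLangs) {
    this.available = new Set(availableLangs); // installed on-device packs
    this.mode = "cloud-only";
    this.lang = "en-US";
    this.onerror = null;
    this.onstart = null;
  }
  start() {
    // Errors are reported asynchronously via an event, never thrown or returned.
    queueMicrotask(() => {
      if (this.mode === "ondevice-only" && !this.available.has(this.lang)) {
        this.onerror?.({ error: "language-not-supported" }); // hypothetical code
      } else {
        this.onstart?.();
      }
    });
  }
}

const rec = new MockSpeechRecognition(["en-US"]);
rec.mode = "ondevice-only";
rec.lang = "fr-FR"; // no on-device pack for this language in the mock
rec.onerror = (e) => console.log("error event:", e.error);
rec.start(); // fires the error handler asynchronously, not a thrown exception
```

Changing `rec.mode` after `start()` would, per the answer above, leave the in-flight session untouched, matching how `lang` and `maxAlternatives` behave today.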
On Tue, Jan 7, 2025 at 9:50 PM Evan Liu <ev...@google.com> wrote:

> * Are the resources downloaded partitioned per top-level site? What should typical download sizes be?
>
> This depends on the browser--for Chrome on Windows/Mac/Linux, there's only one instance of each on-device speech recognition language pack and each language pack is ~60MB. The spec doesn't necessarily dictate how the downloads are handled, only that websites should be allowed to trigger a download (or request a download) of a language.

This seems like it'd require at the very least some extra considerations as part of the Privacy & Security section of the spec. It would also be good to make that explicitly an implementation-defined decision.

+Domenic Denicola, who's been working on similar privacy models related to translations and can potentially advise you on the best path there.
On 1/7/25 3:49 PM, 'Evan Liu' via blink-dev wrote:
> As for mitigating privacy and fingerprinting risks, we've been collaborating with the team building the Translator API feature, which also has the ability to download and detect language packs. Because the risks between these two features are nearly identical, on-device speech recognition language pack downloads will follow the same pattern and use the same permissions UI as on-device translation language packs. Here are some helpful links:
>
> Privacy Design Doc
Should we update the Privacy considerations in the spec to describe these risks?
Adding to Yoav’s feedback about the spec:
- It’s implied that installOnDeviceSpeechRecognition() happens synchronously. Making this a blocking call seems problematic since it could involve a fetch and a download. I’d expect it to return a Promise (https://www.w3.org/TR/design-principles/#promises). And onDeviceWebSpeechAvailable should probably also be async since it could involve reading data from disk.
- The SpeechRecognitionMode "ondevice-only" value is only defined by a comment in the IDL stating that it “Returns an error if on-device speech recognition is not available”. What specifically returns an error? SpeechRecognition.start() doesn’t return any value, and in other error conditions the behavior is to fire SpeechRecognitionErrorEvent. Also, what should the behavior be if SpeechRecognitionMode is changed after start() has already been called?
I also wonder if this should have a TAG review, especially given the privacy/fingerprinting implications of websites being able to query which on-device models are available.
Have you written web platform tests for it? Have a link?
> Should we update the Privacy considerations in the spec to describe these risks?
This needs an async API, likely with a streams design.
> Privacy Design Doc
I don't think that's a link..
> I also wonder if this should have a TAG review, especially given the privacy/fingerprinting implications of websites being able to query which on-device models are available.
As a TAG member, I think a TAG review would probably result in useful feedback for this API. Please do send one.
So are you OK with adding unprefixing to this intent (or if you prefer, a new one that this is blocked on)?
It would be helpful if you wrote a short explainer.
We are looking for the spec and WPTs to match the implementation before approving.
One more question, it looks like the latest spec has not been published to the gh-pages branch yet. Can you please make sure that your changes are visible here?
It would be nice to speak with someone privately, as I may be able to add some additional insight.
> So are you OK with adding unprefixing to this intent (or if you prefer, a new one that this is blocked on)?

Yeah, I think that's a great idea! I'm also in favor of tracking usage of the prefixed version with the goal of possibly dropping it entirely in the future.

> It would be helpful if you wrote a short explainer.

I've sent out a PR adding an explainer for on-device speech recognition: https://github.com/WebAudio/web-speech-api/pull/133
> We are looking for the spec and WPTs to match the implementation before approving.

I've sent out a PR updating the spec to match the WPTs, which return Promises that resolve to booleans for the two new methods: https://github.com/WebAudio/web-speech-api/pull/132
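The actual WPTs live in the linked PR; purely as an illustration, a check that the two new methods return Promises resolving to booleans might look like the sketch below (a hypothetical mock stands in for the browser API, since these methods only exist in the browser):

```javascript
// Hypothetical mock of the two new methods; the real tests target the
// browser's SpeechRecognition implementation.
const mock = {
  onDeviceWebSpeechAvailable: async (lang) => lang === "en-US",
  installOnDeviceSpeechRecognition: async (_lang) => true,
};

// Asserts the WPT-style contract: the call returns a Promise,
// and that Promise resolves to a boolean.
async function checkReturnsBooleanPromise(fn, lang) {
  const result = fn(lang);
  if (!(result instanceof Promise)) throw new Error("expected a Promise");
  const value = await result;
  if (typeof value !== "boolean") throw new Error("expected a boolean");
  return value;
}

checkReturnsBooleanPromise(mock.onDeviceWebSpeechAvailable, "en-US")
  .then((v) => console.log("available:", v)); // available: true
```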
> One more question: it looks like the latest spec has not been published to the gh-pages branch yet. Can you please make sure that your changes are visible here?

Dominique Hazael-Massieux is currently working on this--the change should be auto-published once this PR is merged: https://github.com/WebAudio/web-speech-api/pull/129