Intent to Remove: speechSynthesis.speak without user activation


Charles Harrison

Sep 13, 2018, 12:04:07 PM
to blink-dev, mlam...@chromium.org, Dominic Mazzoni

Primary eng (and PM) emails

cshar...@chromium.org


Link to “Intent to Deprecate” thread

Deprecated in M70, slated for removal in M71.

https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/XpkevOngqUs/-x9POpCMAQAJ


Summary

After deprecation, the plan is to cause speechSynthesis.speak to immediately fire a “not-allowed” error if specific autoplay rules are not satisfied. This will align it with other audio APIs in Chrome. Briefly, we will only allow speak() to succeed if the current frame, or any of its ancestors, has ever had user activation.
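A minimal sketch of what the new failure mode looks like from a page's point of view: the utterance's "error" event fires with `event.error === "not-allowed"` when speak() runs before any user activation. A tiny mock of the browser globals is included so the sketch runs anywhere; in a real page, `SpeechSynthesisUtterance` and `speechSynthesis` come from the browser, and the exact internals here are an illustration of this intent, not Chrome's implementation.

```javascript
// Mock of the browser globals, so this sketch is runnable outside a browser.
class SpeechSynthesisUtterance {
  constructor(text) {
    this.text = text;
    this.onerror = null;
  }
}

const speechSynthesis = {
  userActivated: false, // stand-in for the frame's user-activation state
  speak(utterance) {
    if (!this.userActivated && utterance.onerror) {
      utterance.onerror({ error: "not-allowed" }); // behavior after removal
    }
  },
};

// Returns the utterance so callers can inspect whether it was blocked.
function trySpeak(text) {
  const u = new SpeechSynthesisUtterance(text);
  u.blocked = false;
  u.onerror = (event) => {
    if (event.error === "not-allowed") u.blocked = true;
  };
  speechSynthesis.speak(u);
  return u;
}
```

A page can use the error handler to fall back to showing a play button instead of speaking silently failing.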


Tentative WPT: link


Demo link:

https://cr.kungfoo.net/speech/immediately-speak.html


Motivation

The SpeechSynthesis API is being actively abused on the web. As other autoplay avenues are closed off, abuse is moving to the Web Speech API, which doesn't follow autoplay rules.


Investigations show that this is primarily happening on Android, via full page ads or fake system warnings.


Interoperability and Compatibility Risk

Compat risk is medium on Android and low on all other platforms (based on a combination of UKM and UseCounter data).


Edge: No signals

Firefox: No signals

Safari: This change will match Safari on iOS

Web developers: No signals


Alternative implementation suggestion for web developers

Web developers should use a play button if they want to use speak().
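A sketch of that recommended pattern, with illustrative names (the real `speechSynthesis` call is shown as a comment since it only exists in a browser): gate all speech behind a gesture, and trigger the first utterance from the play button's click handler.

```javascript
// Gate speech behind a user gesture; any click on the page grants activation.
let userHasInteracted = false;
const spoken = []; // records what was spoken, for illustration

function say(text) {
  if (!userHasInteracted) return false; // would fire "not-allowed" anyway
  // In a browser: speechSynthesis.speak(new SpeechSynthesisUtterance(text));
  spoken.push(text);
  return true;
}

// In a real page: playButton.addEventListener("click", onPlayClicked);
function onPlayClicked() {
  userHasInteracted = true; // the click itself grants activation
  say("Hello from the play button");
}
```

After the first gesture, subsequent say() calls in the same page load go through without further interaction.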


Usage information from UseCounter

Our UseCounters are only available as of M69, which recently hit stable.

Chromestatus page: https://www.chromestatus.com/metrics/feature/timeline/popularity/2473

Broken out by platform, it is clear that only Android has significant counts.

Android: ~0.05% of PageVisits

Non-Android: ~0.0008% of PageVisits


Android has a concerning amount of breakage, so we dug into UKM data to find out what kinds of sites will break. See internal go/speech-autoplay-deprecation-data for the exact queries used.


  1. ~55% of page visits were to origins not known by Google’s web crawlers and thus were not included in the analysis.

  2. I also manually went through ~30 popular origins that autoplay speak():

    1. I was able to find a full URL to the site for about 65% of them

    2. Of those 65%, almost all of the URLs pointed to full-page ads or very sketchy software install prompts.

    3. ~12% of those ~20 origins had URLs that spoke to me when loaded (generally telling me to install the software being advertised)

You can find the list internally here.


Entry on the feature dashboard

https://www.chromestatus.com/features/5687444770914304


Rick Byers

Sep 13, 2018, 12:23:49 PM
to Charles, blink-dev, Mounir Lamouri, Dominic Mazzoni, Glen Shires, sole...@google.com
Thanks for the analysis, LGTM1.

IIRC there's no Chrome team which owns the speech APIs at the moment, but /cc a few folks who have been involved in speech API discussions in the past in case there's some risk / use-case we're missing here. But still, I assume a site could always use <iframe allow=autoplay> if they need to opt-in to a frame speaking without a user gesture (same as for any other audio), so I suspect this to be a pretty minor risk on top of all the other autoplay work.

Rick
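The `<iframe allow=autoplay>` opt-in mentioned above could look like the following sketch: a parent page delegates the autoplay permission to a frame so the frame can speak without a gesture. The frame URL is hypothetical, and `document` is mocked minimally so the sketch runs outside a browser; in a page the real DOM is used.

```javascript
// Minimal mock of the DOM, so this sketch is runnable outside a browser.
const document = {
  createElement: (tag) => ({ tagName: tag.toUpperCase() }),
  body: {
    children: [],
    appendChild(el) { this.children.push(el); return el; },
  },
};

const frame = document.createElement("iframe");
frame.src = "https://tts-widget.example/reader.html"; // hypothetical URL
frame.allow = "autoplay"; // delegates autoplay capability to this frame
document.body.appendChild(frame);
```

The equivalent static markup would simply be `<iframe src="..." allow="autoplay"></iframe>` in the parent document.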


--
You received this message because you are subscribed to the Google Groups "blink-dev" group.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/blink-dev/CADjAqN6Te6twYs3V-W5ZzeCREFS1jRLup9j-UvmO-E%3DyO1JA-A%40mail.gmail.com.

Dominic Mazzoni

Sep 13, 2018, 12:31:11 PM
to Rick Byers, Charles, blink-dev, Mounir Lamouri, Glen Shires, sole...@google.com
My understanding is that speech recognition is what's currently unowned; correct me if I'm wrong.

This is for speech synthesis. It's understaffed but the accessibility team is supporting it and hoping to put more resources into it in the future. We fully support this change, as abuse of this API is potentially a big problem and this seems like a good balance that won't impact legitimate use.

Philip Jägenstedt

Sep 13, 2018, 1:21:52 PM
to Glen Shires, Dominic Mazzoni, Rick Byers, Charles Harrison, blink-dev, Mounir Lamouri, Fredrik Solenberg
LGTM2

Yep, I've set the spec up with the usual modern tooling (Bikeshed and auto-publishing) and have tried to fix some trivial things. I'm not really working on the implementation, though.

Thank you also for the spec work and updating the tests here:

Aside: As you can see, this test suite is not very green. For Chrome, I think it's because there's some very global state involved, and in manual testing you quickly get the API into an unusable state where it does nothing. The renamed test in https://github.com/web-platform-tests/wpt/pull/12992 repros this easily so when that's merged I'll file a bug.

On Thu, Sep 13, 2018 at 7:05 PM Glen Shires <gsh...@google.com> wrote:
Philip Jägenstedt provides support for speech recognition in Web Speech API.
I continue to support server-side.

Glen Shires

Sep 13, 2018, 1:27:44 PM
to Dominic Mazzoni, Philip Jägenstedt, rby...@chromium.org, cshar...@chromium.org, blink-dev, mlam...@chromium.org, Fredrik Solenberg
Philip Jägenstedt provides support for speech recognition in Web Speech API.
I continue to support server-side.


Daniel Bratell

Sep 14, 2018, 4:56:49 AM
to Glen Shires, 'Philip Jägenstedt' via blink-dev, Philip Jägenstedt, Dominic Mazzoni, Rick Byers, Charles Harrison, Mounir Lamouri, Fredrik Solenberg
LGTM3

/Daniel



--
/* Opera Software, Linköping, Sweden: CEST (UTC+2) */

Rick Byers

Sep 14, 2018, 8:16:45 AM
to Dominic Mazzoni, Charles, blink-dev, Mounir Lamouri, Glen Shires, Fredrik Solenberg
On Thu, Sep 13, 2018 at 12:31 PM Dominic Mazzoni <dmaz...@chromium.org> wrote:
My understanding is that speech recognition is what's currently unowned, correct me if I'm wrong.

This is for speech synthesis. It's understaffed but the accessibility team is supporting it and hoping to put more resources into it in the future. We fully support this change, as abuse of this API is potentially a big problem and this seems like a good balance that won't impact legitimate use.

Makes sense, sorry for the misrepresentation!
 
So, from the accessibility team's perspective, any concerns with this breaking change? E.g. any use cases you know will break, outreach that should be done, or anything else? It definitely seems like something we need to do, but it's still valuable to have an idea of the cost and think about whether we should try to mitigate it at all.

Dominic Mazzoni

Sep 14, 2018, 12:25:33 PM
to Rick Byers, Charles, blink-dev, Mounir Lamouri, Glen Shires, Fredrik Solenberg
On Fri, Sep 14, 2018 at 5:16 AM Rick Byers <rby...@chromium.org> wrote:
On Thu, Sep 13, 2018 at 12:31 PM Dominic Mazzoni <dmaz...@chromium.org> wrote:
My understanding is that speech recognition is what's currently unowned, correct me if I'm wrong.

This is for speech synthesis. It's understaffed but the accessibility team is supporting it and hoping to put more resources into it in the future. We fully support this change, as abuse of this API is potentially a big problem and this seems like a good balance that won't impact legitimate use.

Makes sense, sorry for the mis-representation!
 
So, from the accessibility team's perspective, any concerns with this breaking change? E.g. any use cases you know will break, outreach that should be done, or anything else? It definitely seems like something we need to do, but it's still valuable to have an idea of the cost and think about whether we should try to mitigate it at all.

There are a number of Chrome extensions that offer speech and those will be unaffected. There are some educational apps for Chrome OS - those that are packaged apps will be unaffected, and for web-based educational software, we don't see any reason that requiring user interaction would be a burden. We have a lot of contacts in this space and we'll be sure to reach out proactively and closely monitor this space.

merli...@gmail.com

Nov 19, 2018, 12:04:52 PM
to blink-dev, rby...@chromium.org, cshar...@chromium.org, mlam...@chromium.org, gsh...@google.com, sole...@google.com
I have web-based educational software (vocabulary learning) that requires auto-speak functionality. The tool needs to auto-play a word when it appears so the student hears it at least once. While we do have them click a button if they wish to hear it again, it is more user-friendly for it to auto-play.

I propose that there should be a way to tell Chrome / speechSynthesis that auto-playing is acceptable for a particular session. The flow for this would be:

1. Go to website
2. Click a button
3. Button sends flag to Chrome - "hey auto-speak is okay for this website now"
4. Auto speak can now be used
5. Leave website
6. Leaving sends flag to Chrome - "hey auto-speak is off again"

merli...@gmail.com

Nov 19, 2018, 12:04:52 PM
to blink-dev, mlam...@chromium.org, dmaz...@chromium.org
A few questions on this (I only just learned about it)

1. If speechSynthesis.speak is used without the user's consent, but then they later click a button to activate it, will it be activated or is it blocked permanently?
2. Is it possible for the user to give consent for future uses of speechSynthesis.speak?

My context for #2 is that I have a study tool which allows users to learn vocabulary. Part of this tool is that it auto plays the word when it appears. When they choose to start the learning session, they are consenting for audio to play for that session. Is this possible?

If #2 is not possible, then I propose that it be made possible because there are likely other instances of similar use where consent is given previously.

Charles Harrison

Nov 19, 2018, 1:39:02 PM
to merli...@gmail.com, blin...@chromium.org, mlam...@chromium.org, Dominic Mazzoni

Hey merlinpatt, some answers inline.
On Mon, Nov 19, 2018 at 11:04 AM <merli...@gmail.com> wrote:
A few questions on this (I only just learned about it)

1. If speechSynthesis.speak is used without the user's consent, but then they later click a button to activate it, will it be activated or is it blocked permanently?
Subsequent calls to speak() should succeed after the page receives activation. You can test this on https://cr.kungfoo.net/speech/immediately-speak.html
The page immediately tries to speak(), but there is a button that should work afterwards.
 
2. Is it possible for the user to give consent for future uses of speechSynthesis.speak?

My context for #2 is that I have a study tool which allows users to learn vocabulary. Part of this tool is that it auto plays the word when it appears. When they choose to start the learning session, they are consenting for audio to play for that session. Is this possible?

If #2 is not possible, then I propose that it be made possible because there are likely other instances of similar use where consent is given previously.
As far as I understand it, this use-case should be supported.
Currently, autoplay allows same-domain navigations to persist activation, so something like this should work:
1. Start on another page on the same domain (cr.kungfoo.net).
2. Click "immediately speak", which navigates to https://cr.kungfoo.net/speech/immediately-speak.html.
3. The page should be allowed to speak immediately.
 

Merlin Patterson (they / them)

Nov 19, 2018, 2:53:26 PM
to cshar...@chromium.org, blin...@chromium.org, mlam...@chromium.org, dmaz...@chromium.org
I may be misunderstanding your example but that seems to be auto-speaking once after clicking to a new page.

In my app, a word will be auto-played multiple times in a session. Will multiple auto-plays be allowed?

If it helps, here's what my app does roughly.

1. User logs in
2. They click "study vocabulary" for a set of words
3. They get taken to a new page with a React App
4. In the React App, the words go through multiple levels and appear on screen.
5. As each word appears, that word gets auto-played. This means auto-play will occur at least once per word.

Charles Harrison

Nov 19, 2018, 3:13:44 PM
to merli...@gmail.com, blin...@chromium.org, mlam...@chromium.org, Dominic Mazzoni
Hey Merlin,
In a given page load, once speak() is allowed it should remain allowed for the lifetime of that page. You only need one gesture.

Additionally, activation is persisted across new page loads on the same domain, so calls to speak() on subsequent page loads should be allowed as long as there was ever a gesture on that domain.

In your example, I think you should be fine as long as the link to "study vocabulary" and the vocabulary React App are on the same domain. You should be getting deprecation warnings in M70 and failures in M71 if this isn't the case.
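The flow above (one gesture per page load, then unrestricted speak() for that page) could be handled in Merlin's app with a queue-and-flush sketch like this. Function names are illustrative, and the real `speechSynthesis` call is shown as a comment since it only exists in a browser.

```javascript
// Queue auto-play requests until the page has seen its one required gesture
// (e.g. the "study vocabulary" click), then flush the queue. After that,
// every speak() in the session goes through directly.
const pendingWords = [];
const playedWords = []; // records what was spoken, for illustration
let hasGesture = false;

function autoSpeakWord(word) {
  if (!hasGesture) {
    pendingWords.push(word); // hold until the first user gesture
    return "queued";
  }
  // In a browser: speechSynthesis.speak(new SpeechSynthesisUtterance(word));
  playedWords.push(word);
  return "spoken";
}

// In a real page, call this from any click/keydown handler.
function onFirstGesture() {
  hasGesture = true;
  while (pendingWords.length) autoSpeakWord(pendingWords.shift());
}
```

If the "study vocabulary" click and the React app are on the same domain, the gesture from that click should carry over, and the queue would only ever hold words briefly (or not at all).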

pwo...@gmail.com

Apr 28, 2019, 9:58:20 PM
to blink-dev, mlam...@chromium.org, dmaz...@chromium.org

It crossed my mind that the HID scanner triggers the event that should speak by sending "keystrokes" into the input control. That should look like user activation to the browser, so whatever is wrong is not the fault of this change. My apologies, although I still think it should be a setting.

pwo...@gmail.com

Apr 28, 2019, 9:58:20 PM
to blink-dev, mlam...@chromium.org, dmaz...@chromium.org
This should be controlled by a setting. I use Chromium on Raspberry Pis to accept input from HID barcode scanners. There is no screen, mouse or keyboard. Users configure operator, workstation and application by scanning QR codes at the start of a session. I was using speech synth to prompt for session config in response to scanning of a (non-config) barcode prior to session configuration.

M70 broke this. 

"I can't think of a use-case" is not the same as "there are no use cases".

