
Intent to ship: Web Speech API


Marcos Caceres

Oct 6, 2019, 9:55:23 PM
to
As of October 11th, the Emerging Technologies Team intend to turn "Web Speech API" on by default in *Nightly only* on Mac, Windows, and Linux. It has been developed behind the "media.webspeech.recognition.*" and "media.webspeech.synth" preference.

Other UAs shipping this or intending to ship it are Chrome and Safari (speech synth only, not recognition).

Bug to turn on by default: https://bugzilla.mozilla.org/show_bug.cgi?id=1244237

This feature was previously discussed in this "Intent to prototype" thread:
https://groups.google.com/d/msg/mozilla.dev.platform/uM3NzS3hKkk/KsWBbf0BRIEJ

What's new since 2014?:

- The updated implementation more closely aligns with Chrome's implementation - meaning we get better interop across significant sites.
- adds the ability to do speech recognition on a media stream.
- speech is processed in our cloud servers, not on device.
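
For anyone who hasn't used the recognition side before, here's a minimal sketch of the page-side usage (the shape below follows the published spec and Chrome's prefixed implementation; it's purely illustrative, not a description of Gecko internals):

    // Feature-detect the unprefixed and Chrome-prefixed constructors.
    const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    const recognition = new Recognition();
    recognition.lang = "en-US";
    recognition.interimResults = false;

    recognition.onresult = (event) => {
      // The page only ever receives transcripts (plus confidence scores), never raw audio.
      const transcript = event.results[0][0].transcript;
      console.log("Heard:", transcript);
    };
    recognition.onerror = (event) => console.error("Recognition error:", event.error);

    // Calling start() is what triggers the microphone permission prompt.
    recognition.start();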

Henri Sivonen

Oct 7, 2019, 4:55:00 AM
to Marcos Caceres, dev-platform
On Mon, Oct 7, 2019 at 5:00 AM Marcos Caceres <mcac...@mozilla.com> wrote:
> - The updated implementation more closely aligns with Chrome's implementation - meaning we get better interop across significant sites.

What site can one try to get an idea of what the user interface is like?

> - speech is processed in our cloud servers, not on device.

What should one read to understand the issues that lead to this change?

--
Henri Sivonen
hsiv...@mozilla.com

Gijs Kruitbosch

Oct 7, 2019, 5:10:50 AM
to Marcos Caceres
On 07/10/2019 02:55, Marcos Caceres wrote:
> - speech is processed in our cloud servers, not on device.

Is this the case for both recognition and synthesis? It's not clear
from this concise description.

Also, hasn't window.speechSynthesis been shipped before now? It's used
from e.g. reader mode's "narrate" functionality, and has been for quite
a while, including on release...

~ Gijs

Jonathan Kew

Oct 7, 2019, 5:32:18 AM
to dev-pl...@lists.mozilla.org
On 07/10/2019 09:53, Henri Sivonen wrote:
> On Mon, Oct 7, 2019 at 5:00 AM Marcos Caceres <mcac...@mozilla.com> wrote:

>> - speech is processed in our cloud servers, not on device.
>
> What should one read to understand the issues that lead to this change?

+1. This seems like a change of direction which has *huge* implications
for issues like availability (the feature doesn't work if my device is
offline?), privacy (my device is sending microphone input to the
cloud?), and cost (how much of my expensive metered data does this
gobble up?) that need to be openly considered and discussed.

The original "Intent to prototype" seemed to be about an entirely
device-local feature, which means it had fundamentally different
characteristics.

Thanks,

JK

Marcos Caceres

Oct 8, 2019, 4:09:42 AM
to
(Apologies for top-posting. I've asked the folks from ET to reply to the questions - Andre said he will respond soon! I was just helping them post the Intent, but I'm personally not involved with the implementation so I can't answer these really good questions... I'm just helping with our process stuff :)).

Marcos Caceres

Oct 8, 2019, 10:35:48 PM
to
On Monday, October 7, 2019 at 12:55:23 PM UTC+11, Marcos Caceres wrote:
> As of October 11th, the Emerging Technologies Team intend to turn "Web Speech API" on by default in *Nightly only* on Mac, Windows, and Linux. It has been developed behind the "media.webspeech.recognition.*" and "media.webspeech.synth" preference.
>

Note that because this is only being pref'ed on in Nightly, it should be considered a kind of "intent to experiment". This is to allow the ET team to get a better understanding of what needs to be fixed to get better interop and what needs to be fixed in the spec. Concerns with the current spec are outlined in:

https://github.com/mozilla/standards-positions/issues/170

Collaboration with Google folks is ongoing to address some of those at the spec level.

Andre Natal

Oct 12, 2019, 5:29:55 AM
to dev-pl...@lists.mozilla.org, Janice Von Itter
Hello everyone,

Sorry for the delay, but besides the patch itself we were working on an FAQ
to address all the questions raised in this thread, along with others we got
from other teams.

We tried to capture everything here [1], so please if you don't see your
question addressed in this document, just give us a shout either here in
the thread or directly.

Also see below the actual phab [2] and the bug [3] for more information.

[1]
https://docs.google.com/document/d/1BE90kgbwE37fWoQ8vqnsQ3YMiJCKJSvqQwa463yCN1Y/edit?ts=5da0f63f#


[2] https://phabricator.services.mozilla.com/D26047

[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1248897

Thanks,

Andre

Gijs Kruitbosch

Oct 12, 2019, 10:23:27 AM
to Andre Natal, dev-pl...@lists.mozilla.org, Janice Von Itter
The document says "The WebSpeech API allows websites to enable speech
input within their experiences." and that it is emphatically NOT
"Text-to-speech/narration".

This doesn't correspond to the original email here:

> As of October 11th, the Emerging Technologies Team intend to turn "Web
> Speech API" on by default in *Nightly only* on Mac, Windows, and Linux. It
> has been developed behind the "media.webspeech.recognition.*" and
> "media.webspeech.synth" preference.

What is the "synth" part if not speech synthesis ie TTS ?

~ Gijs

tom...@gmail.com

Oct 12, 2019, 10:38:32 AM
to
The link to the FAQ is posted in the public group, in a thread meant for audiences outside MoCo. Please consider opening the doc so it is readable to everyone, or at least copying the questions that already have answers (those you consider "done") into a reply to this thread.

Thanks
Tomislav


On Saturday, October 12, 2019 at 11:29:55 AM UTC+2, Andre Natal wrote:
> Sorry for the delay, but besides the patch itself we were working on an FAQ
> to address all the questions raised in this thread, along with others we got
> from other teams.
>
>
> [1] https://docs.google.com/document/d/1BE90kgbwE37fWoQ8vqnsQ3YMiJCKJSvqQwa463yCN1Y/edit?ts=5da0f63f#

Andre Natal

Oct 12, 2019, 6:14:48 PM
to tom...@gmail.com, dev-pl...@lists.mozilla.org, Janice Von Itter
The doc should be open now, please let us know if you still can't access it.

Fabrice Desre

Oct 12, 2019, 7:53:23 PM
to dev-pl...@lists.mozilla.org
Hi André :)

The links to the last 3 docs seem to not be publicly accessible:
- Are you adding voice commands to Firefox?
-> mana is not public.

What’s next?
-> private google doc.

Have a question not addressed here?
-> private slack channel.

On 10/12/19 3:14 PM, Andre Natal wrote:
> The doc should be open now, please let us know if you still can't access it.
>
> On Sat, Oct 12, 2019, 4:40 PM <tom...@gmail.com> wrote:
>

Andre Natal

Oct 12, 2019, 7:59:45 PM
to Gijs Kruitbosch, dev-pl...@lists.mozilla.org, Janice Von Itter
I believe there was a slight misunderstanding. The current work is on the
recognition part of the API only. The synthesis part landed a
while ago, and is already enabled by default. You can find some
documentation here:

https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API
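
For clarity, the synthesis side is the standard speechSynthesis / SpeechSynthesisUtterance interface documented there, e.g.:

    // Speak a short phrase using whichever voices the browser exposes.
    const utterance = new SpeechSynthesisUtterance("Hello from Firefox");
    utterance.lang = "en-US";
    window.speechSynthesis.speak(utterance);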

On Sat, Oct 12, 2019 at 4:23 PM Gijs Kruitbosch <gkrui...@mozilla.com>
wrote:
--
Thanks,

Andre

Andre Natal

Oct 12, 2019, 8:07:30 PM
to Fabrice Desre, dev-pl...@lists.mozilla.org, Janice Von Itter
Hi Fabrice,

Thanks for letting us know. I'm cc'ing jvo here so she can open them.


On Sun, Oct 13, 2019, 1:53 AM Fabrice Desre <fab...@desre.org> wrote:

> Hi André :)
>
> The links to the last 3 docs seem to not be publicly accessible:
> - Are you adding voice commands to Firefox?
> -> mana is not public.
>
> What’s next?
> -> private google doc.
>
> Have a question not addressed here?
> -> private slack channel.
>
> On 10/12/19 3:14 PM, Andre Natal wrote:
> > The doc should be open now, please let us know if you still can't access
> it.
> >
> > On Sat, Oct 12, 2019, 4:40 PM <tom...@gmail.com> wrote:
> >

Henri Sivonen

Oct 14, 2019, 4:39:50 AM
to Andre Natal, dev-platform, Janice Von Itter
On Sat, Oct 12, 2019 at 12:29 PM Andre Natal <ana...@mozilla.com> wrote:
> We tried to capture everything here [1], so please if you don't see your
> question addressed in this document, just give us a shout either here in
> the thread or directly.
...
> [1]
> https://docs.google.com/document/d/1BE90kgbwE37fWoQ8vqnsQ3YMiJCKJSvqQwa463yCN1Y/edit?ts=5da0f63f#

Thanks. It doesn't address the question of what the UI in Firefox is
like. Following the links for experimenting with the UI on one's own
leads to https://mdn.github.io/web-speech-api/speech-color-changer/,
which doesn't work in Nightly even with prefs flipped.

(Trying that example in Chrome shows that Chrome presents the
permission prompt as a matter of sharing the microphone with
mdn.github.io as if this was WebRTC, which suggests that mdn.github.io
decides where the audio goes. Chrome does not surface that, if I
understand correctly how this API works in Chrome, the audio is
instead sent to a destination of Chrome's choosing and not to a
destination of mdn.github.io's choosing. The example didn't work for
me in Safari.)

--
Henri Sivonen
hsiv...@mozilla.com

Andre Natal

Oct 14, 2019, 7:56:20 PM
to Henri Sivonen, dev-platform, Janice Von Itter
Hi Henri,

the API isn't available in Nightly yet since the code hasn't been fully
reviewed or merged yet. You can follow its progress here [1] and here [2].
If you want to try it before it's merged, just apply the patch [2] to an
updated gecko-dev branch, switch on the media.webspeech.recognition.enable
and media.webspeech.recognition.force_enable flags, and browse to that page
again.

Regarding the UI, yes, the experience will be exactly the same in our case:
the user will get a prompt asking for permission to open the microphone
(I've attached a screenshot below [3]), but in our case the audio will be
sent to the endpoint set in the media.webspeech.service.endpoint pref,
which the user will be allowed to change (unlike Chrome). If that's unset,
it will be sent to Mozilla's own server, which is the default in the code.
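
For convenience, those flips expressed as a user.js sketch for a local build with the patch applied (the endpoint URL below is only a placeholder to illustrate the pref, not a real service):

    // Enable the recognition API in the patched build.
    user_pref("media.webspeech.recognition.enable", true);
    user_pref("media.webspeech.recognition.force_enable", true);
    // Optionally point recognition at your own service; if left unset,
    // audio goes to the Mozilla server that is the default in the code.
    // user_pref("media.webspeech.service.endpoint", "https://speech.example.org/stt");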

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1248897#c18
[2] https://phabricator.services.mozilla.com/D26047
[3]
https://www.dropbox.com/s/fkyymiyryjjbix5/Screenshot%202019-10-14%2016.13.49.png?dl=0
--
Thanks,

Andre

Andre Natal

Oct 14, 2019, 8:02:39 PM
to dev-platform
I changed the subject of this thread to properly fit the current intent.

We also moved the FAQ from Google Docs to the Mozilla wiki [1] with more
current and updated info. I've just added more content about offline
recognition and how to use DeepSpeech for those interested.

Let's just use that wiki as the main source of info about this work.

Thanks

Andre

[1] https://wiki.mozilla.org/Web_Speech_API_-_Speech_Recognition
--
Thanks,

Andre

Henri Sivonen

Oct 15, 2019, 6:27:07 AM
to Andre Natal, dev-platform, Janice Von Itter
On Tue, Oct 15, 2019 at 2:56 AM Andre Natal <ana...@mozilla.com> wrote:
> Regarding the UI, yes, the experience will be exactly the same in our case: the user will get a prompt asking for permission to open the microphone (I've attached a screenshot below [3])
...
> [3] https://www.dropbox.com/s/fkyymiyryjjbix5/Screenshot%202019-10-14%2016.13.49.png?dl=0

Since the UI is the same as for getUserMedia(), is the permission bit
that gets stored the same as for getUserMedia()? I.e. if a site
obtains the permission for one, can it also use the other without
another prompt?

If a user understands how WebRTC works and what this piece of UI meant
for WebRTC, this UI now represents a different trust decision on the
level of principle. How intentional or incidental is it that this
looks like a getUserMedia() use (audio goes to where the site named in
the dialog decides to route it) instead of surfacing to the user that
this is different (audio goes to where the browser vendor decides to
route it)?

--
Henri Sivonen
hsiv...@mozilla.com

Johann Hofmann

Oct 16, 2019, 7:40:52 AM
to Henri Sivonen, Andre Natal, dev-platform, Janice Von Itter
Putting on my hat as one of the people maintaining our permissions UI, I
generally agree with Henri that it would be nice to have a slightly
different UI for this use case: as far as I can see, the presented origin
does not in fact get access to the user's microphone, and it's a bit
unclear what "Remember this decision" actually does. It makes no sense to
set the "microphone" permission on that site, in the same way that it makes
no sense to derive from a permanent "microphone" permission for some site
that the user intends to submit their voice data to a third party. I feel
like this feature needs to store a separate permanent permission.

A perfect permissions UX may not be achievable or intended for an MVP of
this feature, so I would recommend at least hiding the checkbox (to avoid
setting the "microphone" permission) and prompting every time until a
better solution can be found.

Let me know if you need any help with that :)

Johann

Daniel Veditz

Oct 16, 2019, 1:44:18 PM
to Johann Hofmann, Henri Sivonen, Andre Natal, dev-platform, Janice Von Itter
On Wed, Oct 16, 2019 at 4:40 AM Johann Hofmann <jhof...@mozilla.com> wrote:

> as far as I can see the presented origin does not in fact get access to
> the user's microphone


The site doesn't get raw audio, but does get text representing what the
browser thinks it heard. It's the same kind of privacy risk as raw audio
for most people (though less opportunity for creative abuses like trying to
track what TV show you're watching).

-Dan Veditz




> and it's a bit
> unclear what "Remember this decision" actually does. It makes no sense to
> set the "microphone" permission on that site, in the same way that it makes
> no sense to derive from a permanent "microphone" permission for some site
> that the user intends to submit their voice data to a third party. I feel
> like this feature needs to store a separate permanent permission.
>
> A perfect permissions UX may not be achievable or intended for an MVP of
> this feature, so I would recommend at least hiding the checkbox (to avoid
> setting the "microphone" permission) and prompting every time until a
> better solution can be found.
>
> Let me know if you need any help with that :)
>
> Johann
>
> On Tue, Oct 15, 2019 at 12:27 PM Henri Sivonen <hsiv...@mozilla.com>
> wrote:
>
> > On Tue, Oct 15, 2019 at 2:56 AM Andre Natal <ana...@mozilla.com> wrote:
> > > Regarding the UI, yes, the experience will be exactly the same in our
> > case: the user will get a prompt asking for permission to open the
> > microphone (I've attached a screenshot below [3])
> > ...
> > > [3]
> >
> https://www.dropbox.com/s/fkyymiyryjjbix5/Screenshot%202019-10-14%2016.13.49.png?dl=0
> >
> > Since the UI is the same as for getUserMedia(), is the permission bit
> > that gets stored the same as for getUserMedia()? I.e. if a site
> > obtains the permission for one, can it also use the other without
> > another prompt?
> >
> > If a user understands how WebRTC works and what this piece of UI meant
> > for WebRTC, this UI now represents a different trust decision on the
> > level of principle. How intentional or incidental is it that this
> > looks like a getUserMedia() use (audio goes to where the site named in
> > the dialog decides to route it) instead of surfacing to the user that
> > this is different (audio goes to where the browser vendor decides to
> > route it)?
> >
> > --
> > Henri Sivonen
> > hsiv...@mozilla.com

Johann Hofmann

Oct 17, 2019, 11:19:02 AM
to Daniel Veditz, Henri Sivonen, Andre Natal, dev-platform, Janice Von Itter
Right, I can see the threat model being similar, but technically we're
marrying two separate things under the same UI and more importantly the
same permission name. This will instantly cause trouble once one of the two
features changes its requirements or behavior in a way that's incompatible
with the other, which I think is not unimaginable here.

Hence my recommendation to avoid using the same permission name right now
and using a separate UI as soon as that can be prioritized.