Intent to Implement and Ship: WebAudio: Add buffering/latency hint via playbackCategory


Raymond Toy

Dec 3, 2015, 7:18:03 PM
to blink-dev

Contact emails

hong...@chromium.org, rt...@chromium.org


Spec

http://webaudio.github.io/web-audio-api/

http://webaudio.github.io/web-audio-api/#BaseAudioContext

http://webaudio.github.io/web-audio-api/#idl-def-AudioContextPlaybackCategory



Summary

Add an optional property bag argument to the AudioContext constructor to specify the playback category, allowing the developer to give a hint about the desired buffering/latency.
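
For illustration, a minimal sketch of the proposed usage (the dictionary member and category names follow the spec draft linked above; the exact shape is still subject to change):

    // Hint that this context is for media playback, so a larger
    // buffer (and higher latency) is acceptable.
    var ctx = new AudioContext({ playbackCategory: 'playback' });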


Motivation

Currently, WebAudio uses the lowest latency possible for the audio device, for the best interactive behavior. However, for some use cases such as media playback, this causes unnecessary power and/or CPU utilization. The playbackCategory is a hint from the developer that such low latency is not required, allowing the developer to trade off latency for power/CPU. Chrome will make the actual selection internally based on the category.


Also, currently WebAudio's low latency can interfere with WebRTC, causing glitches in some cases. The "balanced" category allows WebAudio to interoperate with WebRTC without introducing glitches.



Interoperability and Compatibility Risk

Compatibility risk is low because this is a backward-compatible change. Old applications will still get the lowest latency, as will new ones if the playback category is not specified.


Interoperability risk is moderate because the actual latency used is left up to the browser to determine. But this is true today, even without the playback category; the actual latency has never been specified.


Ongoing technical constraints

No technical constraints, but choosing the correct buffering for each category may need some tweaking for the various platforms.


Will this feature be supported on all six Blink platforms (Windows, Mac, Linux, Chrome OS, Android, and Android WebView)?

Yes.

OWP launch tracking bug

http://crbug.com/564276



Link to entry on the feature dashboard

https://www.chromestatus.com/features/5678699475107840


Requesting approval to ship?

Yes


Philip Jägenstedt

Dec 4, 2015, 3:33:08 AM
to Raymond Toy, blink-dev
What will the 3 playbackCategory states actually map to? Is it simply three different constants for the buffer length?

How will "Balance audio output latency and stability/power consumption" work, and what does "stability" refer to?


Hongchan Choi

Dec 4, 2015, 12:00:29 PM
to Philip Jägenstedt, Raymond Toy, blink-dev
> What will the 3 playbackCategory states actually map to? Is it simply three different constants for the buffer length?

That is correct.

How will "Balance audio output latency and stability/power consumption" work

This 'balanced' category is for use cases that do not require the lowest possible buffer size, but still need reasonable audio latency for realtime communication (i.e. WebRTC). The actual latency can vary between browser vendors.

> and what does "stability" refer to?

I believe this is sort of a hand-wavy expression of 'no-glitch-in-the-audio-stream', so it means a larger buffer size, but I think we need to clarify this in the spec.

Raymond Toy

Dec 4, 2015, 12:07:53 PM
to Hongchan Choi, Philip Jägenstedt, blink-dev
On Fri, Dec 4, 2015 at 9:00 AM, Hongchan Choi <hong...@chromium.org> wrote:
> What will the 3 playbackCategory states actually map to? Is it simply three different constants for the buffer length?

That is correct.

It's a bit more than that.  It should cause the audio device (and hardware) to call back less often (but with request for more data) so it can sleep longer when the category is not "interactive". 


How will "Balance audio output latency and stability/power consumption" work

This 'balanced' category is for use cases that do not require the lowest possible buffer size, but still need reasonable audio latency for realtime communication (i.e. WebRTC). The actual latency can vary between browser vendors.

> and what does "stability" refer to?

I believe this is sort of a hand-wavy expression of 'no-glitch-in-the-audio-stream', so it means a larger buffer size, but I think we need to clarify this in the spec.

Yeah.  Blame the editors for not catching that because I don't really know what that is really supposed to mean, but Hongchan is probably correct in his interpretation. 

Philip Jägenstedt

Dec 4, 2015, 2:12:08 PM
to Raymond Toy, Hongchan Choi, blink-dev
On Fri, Dec 4, 2015 at 6:07 PM, 'Raymond Toy' via blink-dev <blin...@chromium.org> wrote:


On Fri, Dec 4, 2015 at 9:00 AM, Hongchan Choi <hong...@chromium.org> wrote:
> What will the 3 playbackCategory states actually map to? Is it simply three different constants for the buffer length?

That is correct.

It's a bit more than that.  It should cause the audio device (and hardware) to call back less often (but with request for more data) so it can sleep longer when the category is not "interactive". 


How will "Balance audio output latency and stability/power consumption" work

This 'balanced' category is for use cases that do not require the lowest possible buffer size, but still need reasonable audio latency for realtime communication (i.e. WebRTC). The actual latency can vary between browser vendors.

> and what does "stability" refer to?

I believe this is sort of a hand-wavy expression of 'no-glitch-in-the-audio-stream', so it means a larger buffer size, but I think we need to clarify this in the spec.

Yeah.  Blame the editors for not catching that because I don't really know what that is really supposed to mean, but Hongchan is probably correct in his interpretation.

So, you have just sent this Intent to Implement (and Ship), but is this really the API you would like to implement? It seems to me that an API that maps directly to how you will implement it would be better, i.e. an API where you say that it's OK to have x seconds of total delay, defaulting to zero, and you also have a way to see what the actual delay is going to be, as the UA may adjust it to fit in some min/max bounds. Picking three values of x to correspond to three vague labels with no way of telling what the resulting delay is doesn't sound like fun to work with.

Philip 

Raymond Toy

Dec 4, 2015, 2:35:13 PM
to Philip Jägenstedt, Hongchan Choi, blink-dev, Chris Wilson
+cwilso, in case he's not reading blink-dev

All good and valid comments.


In a nutshell, we originally proposed a float value to specify the actual buffer size, which the UA could adjust if needed to meet whatever internal requirements.  This was frowned upon and people wanted something more descriptive and less precise.


Philip Jägenstedt

Dec 4, 2015, 3:00:31 PM
to Raymond Toy, Hongchan Choi, blink-dev, Chris Wilson
Thanks! In that thread, Chris Wilson is making the argument I would. The only criticism that I find convincing is where Jer Noble (Apple) says "setting the buffer size to a high value could conceivably be counterproductive (to power consumption) on certain UAs." However, I think we can solve this:

If you say new AudioContext({ acceptableLatency: 10 }) or similar and 10 seconds would be detrimental for power consumption, then clamp it to the optimal latency for the platform, and have a new AudioContext.actualLatency or similar to report the clamped value. Would this match how you intend to implement the current API proposal under the hood?
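
In pseudo-code, the UA-side logic I have in mind would be something like this (acceptableLatency/actualLatency are just the hypothetical names from this paragraph, and the bounds are made up):

    // Sketch: clamp the developer's request to what works on this device.
    function chooseActualLatency(acceptableLatency, platformMin, platformMax) {
      return Math.min(Math.max(acceptableLatency, platformMin), platformMax);
    }
    // e.g. chooseActualLatency(10, 0.003, 1.0) === 1.0,
    // which would then be reported via ctx.actualLatency.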

As with imageSmoothingQuality, I'm quite skeptical about vague hints, and I think it's worth avoiding if at all possible. In this case it does seem possible, because internally "interactive" maps to the lowest reasonable latency and "playback" to the highest reasonable latency, so a UA could clamp to that same range with a more explicit API.

There are references to TPAC discussion in that thread, so if there's extra context I'm missing, please clue me in :)

Philip

Raymond Toy

Dec 4, 2015, 3:55:12 PM
to Philip Jägenstedt, Hongchan Choi, blink-dev, Chris Wilson
On Fri, Dec 4, 2015 at 12:00 PM, Philip Jägenstedt <phi...@opera.com> wrote:
On Fri, Dec 4, 2015 at 8:35 PM, 'Raymond Toy' via blink-dev <blin...@chromium.org> wrote:
+cwilso, in case he's not reading blink-dev

On Fri, Dec 4, 2015 at 11:12 AM, Philip Jägenstedt <phi...@opera.com> wrote:
On Fri, Dec 4, 2015 at 6:07 PM, 'Raymond Toy' via blink-dev <blin...@chromium.org> wrote:


On Fri, Dec 4, 2015 at 9:00 AM, Hongchan Choi <hong...@chromium.org> wrote:
> What will the 3 playbackCategory states actually map to? Is it simply three different constants for the buffer length?

That is correct.

It's a bit more than that.  It should cause the audio device (and hardware) to call back less often (but with request for more data) so it can sleep longer when the category is not "interactive". 


How will "Balance audio output latency and stability/power consumption" work

This 'balanced' category is for use cases that do not require the lowest possible buffer size, but still need reasonable audio latency for realtime communication (i.e. WebRTC). The actual latency can vary between browser vendors.

> and what does "stability" refer to?

I believe this is sort of a hand-wavy expression of 'no-glitch-in-the-audio-stream', so it means a larger buffer size, but I think we need to clarify this in the spec.

Yeah.  Blame the editors for not catching that because I don't really know what that is really supposed to mean, but Hongchan is probably correct in his interpretation.

So, you have just sent this Intent to Implement (and Ship), but is this really the API you would like to implement? It seems to me that an API that maps directly to how you will implement it would be better, i.e. an API where you say that it's OK to have x seconds of total delay, defaulting to zero, and you also have a way to see what the actual delay is going to be, as the UA may adjust it to fit in some min/max bounds. Picking three values of x to correspond to three vague labels with no way of telling what the resulting delay is doesn't sound like fun to work with.


All good and valid comments.


In a nutshell, we originally proposed a float value to specify the actual buffer size, which the UA could adjust if needed to meet whatever internal requirements.  This was frowned upon and people wanted something more descriptive and less precise.

Thanks! In that thread, Chris Wilson is making the argument I would. The only criticism that I find convincing is where Jer Noble (Apple) says "setting the buffer size to a high value could conceivably be counterproductive (to power consumption) on certain UAs." However, I think we can solve this:

If you say new AudioContext({ acceptableLatency: 10 }) or similar and 10 seconds would be detrimental for power consumption, then clamp it to the optimal latency for the platform, and have a new AudioContext.actualLatency or similar to report the clamped value. Would this match how you intend to implement the current API proposal under the hood?

At the time, this was exactly how I was thinking we would implement it.  Clamp to reasonable values, but otherwise take whatever was suggested if we could. (This was only a thought experiment; I never looked into the actual implementation details.)  

Although it's not stated anywhere, I think we were also going to provide some useful hints on "good" values for typical use cases.  Something like use 0 for lowest latency (clamped to the lowest), 10-20 ms for webrtc/communications-type applications, and maybe 1 sec for music playback.


As with imageSmoothingQuality, I'm quite skeptical about vague hints, and I think it's worth avoiding if at all possible. In this case it does seem possible, because internally "interactive" maps to the lowest reasonable latency and "playback" to the highest reasonable latency, so a UA could clamp to that same range with a more explicit API.

There are references to TPAC discussion in that thread, so if there's extra context I'm missing, please clue me in :)

The TPAC meeting minutes are here.  (Search for latency).  The discussion was basically picking names for the categories.

Although it's already been decided in the group, nothing prevents us from opening another issue on this with your objections.  There are issues where we've gone back and forth many times with completely opposite decisions each time. :-)

Philip Jägenstedt

Dec 4, 2015, 4:19:47 PM
to Raymond Toy, Hongchan Choi, blink-dev, Chris Wilson
I'm really interested to hear what you would ideally like to implement, and if you think it's worth revisiting the issue in the Web Audio WG. I'm not involved in the WG, and wouldn't want to create a fuss if that would delay the time to ship and you and other Web Audio experts don't think it's important.

If you're on the fence about it, you don't need any LGTMs to experiment a bit with the implementation to see what kind of API would work. If you have an implementation ready to ship that makes technical sense on all of our platforms, that should weigh pretty heavily in the Web Audio WG, or so I would hope.

Philip

Harald Alvestrand

Dec 7, 2015, 1:13:19 AM
to Philip Jägenstedt, Raymond Toy, Hongchan Choi, blink-dev, Chris Wilson
In the very similar discussion we had between the W3C Audio WG and the W3C Media Capture TF, we ended up with a "constraint" on latency, expressed in milliseconds.

By tying into the constraints general mechanism, this allows the requester to specify:

- "I MUST have latency lower than this"
- "I MUST have latency higher than this" (altough it's hard to see where that's useful)
- "I think this is a good latency for my app, give me what you think is reasonable"
- "Give me this exact latency, or fail" (not likely to be successful... too much uncertainty).

We found use cases for both the first and the third case, and some handwaving that indicated people might actually use the second.

It would be nice to find that the Web Audio extension to do the same thing has compatible semantics.
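
For reference, a sketch of what those cases could look like with the Media Capture constraints syntax (the latency constrainable property is a double in seconds in that spec; the values here are only illustrative):

    // "I MUST have latency lower than this" (at most 50 ms):
    navigator.mediaDevices.getUserMedia({ audio: { latency: { max: 0.05 } } });
    // "I think this is a good latency for my app, give me what's reasonable":
    navigator.mediaDevices.getUserMedia({ audio: { latency: { ideal: 0.02 } } });
    // "Give me this exact latency, or fail":
    navigator.mediaDevices.getUserMedia({ audio: { latency: { exact: 0.02 } } });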





Olga Sharonova

Dec 10, 2015, 4:17:53 AM
to blink-dev, phi...@opera.com, rt...@google.com, hong...@chromium.org, cwi...@google.com
From the perspective of Chrome audio issues we are dealing with (that require mixing audio from different sources), these two look like something we'd like to have:

- "I MUST have latency lower (or not higher) than this"
- "I think this is a good latency for my app, give me what you think is reasonable"

We'd like to avoid returning anything like "AudioContext.actualLatency", since it may change dynamically within the constraints provided by the user.

Olga

Philip Jägenstedt

Dec 10, 2015, 5:20:55 AM
to Olga Sharonova, blink-dev, Raymond Toy, Hongchan Choi, Chris Wilson
So, I imagined that you would have to know the actual latency in order to know when the samples you're currently producing with a ScriptProcessorNode will actually reach the output, in case you're synthesizing some music with no interactivity (thus no need for low latency) but want to synchronize some graphics with it. Now I see there's a playbackTime attribute on AudioProcessEvent; maybe that's what you would use?

Raymond Toy

Jan 11, 2016, 1:21:03 PM
to Philip Jägenstedt, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
Based on the comments in this thread, and especially how MediaCapture specifies latency, I will bring this up in the next WebAudio teleconf (this week).  I think we should make WebAudio match MediaCapture in this aspect.

Raymond Toy

Jan 14, 2016, 5:27:56 PM
to Philip Jägenstedt, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
In today's WebAudio teleconf, it was decided to keep things as they are.  The variety of devices out there with differing latencies makes it really hard to do something meaningful and understandable with explicit latency numbers.  It was pointed out that the latency for MediaCapture is a different issue that is used to constrain the device selection, which is different from what WebAudio wants for the playbackCategory.


Henrik Grunell

Jan 20, 2016, 8:28:09 AM
to Raymond Toy, Philip Jägenstedt, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
To be clear here, does "keep things as they are" mean we're going for the playback category approach?

/Henrik

Raymond Toy

Jan 21, 2016, 12:21:25 PM
to Henrik Grunell, Philip Jägenstedt, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
On Wed, Jan 20, 2016 at 5:27 AM, Henrik Grunell <gru...@chromium.org> wrote:
To be clear here, does "keep things as they are" mean we're going for the playback category approach?

Sorry for not being more explicit.  Yes, the consensus on the call was to keep the enum for the playback category.  An explicit numerical latency value was rejected because of the huge variety of devices out there.

Raymond Toy

Jan 22, 2016, 1:33:55 PM
to Henrik Grunell, Philip Jägenstedt, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
Can we get a decision on this intent? Or directions on what needs to be fixed?

Philip Jägenstedt

Jan 25, 2016, 10:03:37 AM
to Raymond Toy, Henrik Grunell, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
Are there any minutes from the teleconf? It was clear at the outset of this discussion that there are devices with differing latencies, but how does this translate to concrete implementation difficulties? Chromium will run across many different devices, and will have to somehow map the three enum values to some use of the underlying APIs, which presumably do not represent this with a matching 3-value enum. What extra information will Chromium have internally that web developers cannot be trusted to use wisely?

Other API owners, please weigh in :)

Raymond Toy

Jan 25, 2016, 1:24:32 PM
to Philip Jägenstedt, Henrik Grunell, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
The meeting minutes are here:  https://www.w3.org/2016/01/14-audio-minutes.html; search for issue/692, about halfway through.  It's not very well captured.  The notes say I said it wasn't very practical; that's true, but I was just repeating why the enum was chosen way back when.  I think the best discussion of the choice is actually in the issue comments themselves: https://github.com/WebAudio/web-audio-api/issues/348.

For desktop, webaudio has always chosen the lowest latency value, which is more or less a fixed, pre-determined number.  For OSX, we choose a latency of 128 samples. (Well, it was upped to 256 for webrtc a while ago.)  For Linux, it's 512, based on experimentation.  For Windows, it's 10 ms when using WASAPI, because that appeared to be the lowest we could get without glitching.  The current plan was to continue to use these values for the lowest (interactive) setting.  For communications, we would choose something appropriate for webrtc (around 10-20 ms?).  For playback/high latency, we'd choose something larger, which we haven't determined yet.
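
To summarize the interactive (lowest-latency) numbers above in one place (these are just the current Chrome values from this paragraph; the balanced/playback mappings are still undetermined):

    // Current Chrome callback buffer sizes for the interactive path.
    var interactiveBufferFrames = {
      mac: 128,      // upped to 256 for webrtc a while ago
      linux: 512,    // chosen by experimentation
      windows: 480,  // 10 ms under WASAPI, assuming 48 kHz
    };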

For Android, this is much more complicated.  If the device supports the latency API, we would ask for the lowest latency.  Unfortunately, we don't actually use it but round it up to some higher value.  At the time, a Galaxy Nexus would report a latency of 144 samples, but we couldn't actually run webaudio glitch-free at that value.  I think we clipped the value to 256 or so.  For some devices like the original Nexus 7, there is no low-latency path, and we'd query for the optimum buffer size, which turned out to be something like 3072 samples (70 ms at 44.1 kHz).  For such devices, I do not know how we'd map the enums to anything that can match the user's expectations.


Philip Jägenstedt

Jan 25, 2016, 10:02:40 PM
to Raymond Toy, Henrik Grunell, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
Thanks Raymond,

The teleconf minutes are indeed very sparse, so going back to the issue thread, I think that Jer's comment captures it well:
 
Even so, the script is not equipped with enough information to make an informed decision. This API seems to suggest that "moar latency == moar efficiency", when that relationship is definitely not linear and may not even be true.
Here's what I bet would happen if this API was standardized: your average WebAudio using page author would tune the latency value for his favorite browser and device, regardless of what affect that value has on the performance of other browsers and other devices.
Instead, with a more declarative API, each UA could pick a latency value which fits the local maxima for performance while meeting the general requirements for the selected "class" of playback.

Is it true that "the script is not equipped with enough information to make an informed decision" and if so, what information is lacking?

That "your average WebAudio using page author would tune the latency value for his favorite browser" is a risk, but if the requested latency is clamped to the range that actually works well on the specific device, wouldn't it work at least as well as the current defaults?

My concern continues to be that we're making the web platform less capable and more "magical" than native platforms. Since there are already differing latency limits on the supported platforms, some determined by experimentation, why can't those be made the lower clamping limits for an API with numeric latency?

I'll poke other API owners to weigh in.

Harald Alvestrand

Jan 26, 2016, 10:20:49 AM
to Philip Jägenstedt, Raymond Toy, Henrik Grunell, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
My instinct says that there will be two classes of users:

- Those who don't care very much. They'll toss random values at the API, and be happy with what they get when it works well for them.

- Those who care a lot, and understand what a specific delay means. The first question they'll ask is "what delay will low/medium/high mean on THIS platform, and what API do I query to figure that out?"

For neither category do I see that the low/medium/high style of setting has any advantage over a numerical setting.
For the last category, they will also want to ask "so what delay did I actually get?". No matter what the API for asking for a delay category looks like.

Dimitri Glazkov

Jan 26, 2016, 11:54:30 AM
to Harald Alvestrand, Philip Jägenstedt, Raymond Toy, Henrik Grunell, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
It sounds like this discussion has diverged from the original "intent to ship" logistics and progressed into a design discussion that should probably be conducted in a spec bug?

:DG< 

Raymond Toy

Jan 26, 2016, 2:11:45 PM
to Dimitri Glazkov, Harald Alvestrand, Philip Jägenstedt, Henrik Grunell, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
I will try to bring this up again on this Thursday's webaudio teleconference.

But yes, it seems the discussion is probably now best done on the webaudio issues tracker.

Philip Jägenstedt

Jan 26, 2016, 11:30:21 PM
to Dimitri Glazkov, Harald Alvestrand, Raymond Toy, Henrik Grunell, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
Yes, this has certainly gone more into design review than we should hope for on blink-dev. I have no interest in blocking useful work or bending the design to my will, but I would like to understand why a hint-based API is the best the web platform can have here, if in fact it is. I've posted on the GitHub issue, summarizing my doubts and asking for clarification.

It might be useful to start implementing to inform the spec discussion, and that (as usual) doesn't require any LGTM.

Raymond Toy

Mar 25, 2016, 3:06:42 PM
to Philip Jägenstedt, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
We discussed this item at our teleconf yesterday.  The general idea is to keep the categories but also allow the user to specify the actual latency.  It's up to the browser to do something useful with the number.  An additional attribute is added to the context to indicate the actual latency chosen by the browser.

I have an action item to submit a pull request to the spec to show what the spec will actually say.

I'll update this thread and the intent once the final text is agreed upon.

Raymond Toy

May 16, 2016, 3:57:52 PM
to Philip Jägenstedt, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, Olga Sharonova, blink-dev, Hongchan Choi, Chris Wilson
The text has landed and you can find it here, in general, and more specifically the meaning of latencyHint and the actual baseLatency.

I think this captures the general desire to allow hints (AudioContextLatencyCategory) while also allowing advanced users to specify the desired latency numerically.
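
Concretely, the landed text allows either form, something like this (a quick sketch; the double value is in seconds):

    // Category hint: let the browser pick an appropriate buffer size.
    var ctx1 = new AudioContext({ latencyHint: 'playback' });
    // Numeric hint for advanced users: request roughly 20 ms of latency.
    var ctx2 = new AudioContext({ latencyHint: 0.02 });
    // The latency the browser actually chose is exposed here:
    console.log(ctx2.baseLatency);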

We've also been discussing the implementation details with olka@, who has suggested that the "balanced" and "playback" categories are "soft" in that they are hints which can change depending on what the browser might be doing at the time. We didn't consider this possibility in the spec proposal. This wouldn't be a problem except that we now expose the actual latency achieved via the attribute AudioContext.baseLatency.  We're not sure how to signal the user, if necessary, if the latency has changed.

Olga Sharonova

May 18, 2016, 12:24:22 PM
to Raymond Toy, Philip Jägenstedt, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
baseLatency definition says: "This represents the number of seconds of processing latency incurred by the AudioContext in handling audio through the graph. It does not include any additional latency that might be caused by any other processing between the output of the AudioDestinationNode and the audio hardware."
And latencyHint definition says: "Identify the type of playback, which affects tradeoffs between audio output latency and power consumption. [...] The actual latency used is given by AudioContext's baseLatency attribute."

But "tradeoffs between audio output latency and power consumption" happen exactly "between the output of the AudioDestinationNode and the audio hardware".

So, is the statement "we now expose the actual latency achieved via the attribute AudioContext.baseLatency" really correct?
Or do we still only expose the WebAudio graph latency, and the latency hint interpretation by the browser (which is done at the lower layers under/after the graph) does not affect that exposed latency in any way?
What does "the actual latency" actually mean in the latencyHint definition?

Olga

Raymond Toy

May 18, 2016, 3:57:23 PM
to Olga Sharonova, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
I think this issue is fairly confusing, even to me, because "latency" means many different things.

There are at least two kinds of latency in webaudio.  The first, given by baseLatency, is the latency introduced by how the graph is processed.  The graph is traversed, producing 128 frames of samples for each traversal.  However, on some systems like Linux, we traverse the graph 4 times to produce 512 frames at once.  These are buffered and output at the right time, and the system can be idle until more data is needed.  This should save some power and possibly also allow some parts of the audio system to be in a low-power mode. (I think; I'm not 100% sure on this part because it's outside of webaudio.)

On Android, the graph might get traversed even more times, possibly 16 or more.

Let's assume 16 consecutive traversals.  Consider a user interaction like pressing a key to play a note.  If we've already generated those 16 buffers (about 43 ms of data at 48 kHz), the note might have been delayed by up to 43 ms if the key were pressed just after generating all the data.  This is the latency captured by baseLatency.  (I think the example in the spec seems wrong and needs to be fixed.  There used to be double buffering within webaudio itself, but I'm not sure of that anymore either; code has changed a bit in this area since I last looked at it.)  If this block processing were not done, the note would have been delayed by at most 128 frames (2.7 ms); this would be a very nice interactive system.  The processing that introduces 43 ms of delay makes for a pretty terrible interactive system, especially for music.
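
Spelling out the arithmetic (numbers from the example above):

    var framesPerTraversal = 128; // one render quantum
    var traversals = 16;          // consecutive traversals per callback
    var sampleRate = 48000;
    // Worst case: the key is pressed just after all 16 buffers were generated.
    var worstCaseDelay = framesPerTraversal * traversals / sampleRate;
    // 2048 / 48000 ~= 0.0427 s, i.e. about 43 ms.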

The other latency in the system is how long it takes from the time a block of audio was generated to the time it's actually heard.  This includes all of the delays introduced by any HW between what webaudio has produced until it reaches the speaker.  Some Android devices have a DSP in the audio chain that can't be turned off which also adds like 10ms.  If you are using bluetooth speakers, you get a bunch more delay. This is the delay/latency that the other intent for the timestamp is trying to capture.  Developers want to be able to synchronize the visuals on the screen with the sounds you hear.  That needs to include all of these delays to get things synchronized.

Does this make sense?  I find these quite confusing too and confuse myself all the time.


Hongchan Choi

May 18, 2016, 4:21:56 PM
to Raymond Toy, Olga Sharonova, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Chris Wilson
Thanks Raymond for clarifying the situation nicely. I agree with the explanation, but would like to add one more point.

The "other latency", aka "outputLatency", is not possible from my perspective. We can do our best to convey the estimated system-induced latency, that is only the estimate and I am afraid that the number would be useless. Thus this will lead to a pile of complaints from developers. For example, what if the bluetooth speaker has some unknown DSP path in the box? What if the multichannel audio interface has its own DSP mixer in the signal path after Chrome audio stack? What if the spatial headphone has a special algorithm to process additional 3D effect that takes at least 60ms? There is no way to evaluate the precise latency up to the reproduction unit without actually measuring the latency acoustically.

One can argue that "there is a platform API for this so we can do that!", but what we get is just some random number from the system that we cannot verify. I don't think we can say to developers, "it's not our fault, because we get this from the OS."

Olga Sharonova

May 19, 2016, 8:10:47 AM
to Raymond Toy, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
Thank you, Raymond.

As I understand it, the audio going through the graph to AudioDestination is processed somehow, and each node can introduce a delay as well. Does baseLatency include that time?

My impression is that this definition of baseLatency relies on the fact that WebAudio has some partial knowledge of what system rebuffering down the road looks like. As Hongchan said, the actual system can have multiple layers of rebuffering and other processing delays, and baseLatency represents only the rebuffering taking place at the edge of WebAudio and audio rendering logic. Is this understanding correct?

If the actual latency is (baseLatency + X), where X is unknown, how is it actually used to synchronize the visuals on the screen with the sounds?


Olga

Raymond Toy

May 19, 2016, 9:24:22 AM
to Olga Sharonova, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
On Thu, May 19, 2016 at 5:10 AM, Olga Sharonova <ol...@google.com> wrote:
Thank you, Raymond.

As I understand it, the audio going through the graph to AudioDestination is processed somehow, and each node can introduce a delay as well. Does baseLatency include that time?

No, although people have asked for it. There is an issue filed so that the spec should at least say how much latency can be introduced by each node, but even that can be hard.  Consider a biquad filter whose cutoff frequency is being automated.  That dynamically changes the delay.  Cycles in the graph make things very complicated too.

So baseLatency is supposed to be the inherent delay (if any) between the time the graph has been rendered and the time it gets sent out to the browser, more or less.

My impression is that this definition of baseLatency relies on the fact that WebAudio has some partial knowledge of what system rebuffering down the road looks like. As Hongchan said, the actual system can have multiple layers of rebuffering and other processing delays, and baseLatency represents only the rebuffering taking place at the edge of WebAudio and audio rendering logic. Is this understanding correct?

Yes. 

If the actual latency is (baseLatency + X), where X is unknown, how is it actually used to synchronize the visuals on the screen with the sounds?

This is where the output time stamp comes in.  It produces an estimate of baseLatency + X and matches that with Performance.Now time.   Knowing these two values, the user can adjust the timing of the visuals to the audio time.  This is explained in great (and complicated) detail in webaudio issue 12.
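
As a rough sketch, the synchronization would use that timestamp something like this (getOutputTimestamp() is the shape being discussed in that issue; treat the names as tentative):

    var ts = ctx.getOutputTimestamp();
    // ts.contextTime: context time of the sample now reaching the output.
    // ts.performanceTime: the performance.now() value at which it is heard.
    // Rough estimate of the total output latency (baseLatency + X):
    var outputLatencyEstimate = ctx.currentTime - ts.contextTime;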

I agree with Hongchan about the difficulty in determining X.  However Paul Adenot (from Mozilla) mentioned at the last F2F that Mozilla has done some work and they do get good estimates.  However, he also mentioned that while some bluetooth headsets return correct values, some totally lie so that they're not even close to the correct time. I don't know how to solve that problem.

Olga Sharonova

May 19, 2016, 9:52:40 AM
to Raymond Toy, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson

So baseLatency is supposed to be the inherent delay (if any) between the time the graph has been rendered and the time it gets sent out to the browser, more or less.

How does our decision to move rebuffering out of the AudioDestination node into the browser rendering logic affect this statement? Will baseLatency always be 128, or what do we name baseLatency then?
 
If the actual latency is (baseLatency + X), where X is unknown, how is it actually used to synchronize the visuals on the screen with the sounds?

This is where the output time stamp comes in.  It produces an estimate of baseLatency + X and matches that with Performance.Now time.   Knowing these two values, the user can adjust the timing of the visuals to the audio time.  This is explained in great (and complicated) detail in webaudio issue 12.

Thanks for the reference, it's a lot of stuff to process :)
Isn't estimate of (baseLatency + X) the same as estimating just some Y which does not require baseLatency knowledge? Why would the user need to know baseLatency itself? 

Raymond Toy

May 19, 2016, 10:05:44 AM
to Olga Sharonova, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
On Thu, May 19, 2016 at 6:51 AM, Olga Sharonova <ol...@google.com> wrote:

So baseLatency is supposed to be the inherent delay (if any) between the time the graph has been rendered and the time it gets sent out to the browser, more or less.

How does our decision to move rebuffering out of the AudioDestination node into the browser rendering logic affect this statement? Will baseLatency always be 128, or what do we name baseLatency then?

Yeah, we need to look into this more. I'm pretty sure webaudio used to have a double buffer internally, so if the callback size was 2048 (Linux), the buffer was therefore 4096 frames and you had a latency of 2048 frames. Of course it's been improved and is now 512. With the FIFO that we now have, I think the audio latency is actually 128, but the "interactive event" latency is still 512, because the response to an event like a key press is delayed by up to 512 frames since we process the graph for 512 frames all at once instead of doing 4 graphs of 128 separated in time.

 
If the actual latency is (baseLatency + X), where X is unknown, how is it actually used to synchronize the visuals on the screen with the sounds?

This is where the output time stamp comes in.  It produces an estimate of baseLatency + X and matches that with Performance.Now time.   Knowing these two values, the user can adjust the timing of the visuals to the audio time.  This is explained in great (and complicated) detail in webaudio issue 12.

Thanks for the reference, it's a lot of stuff to process :)
Isn't estimate of (baseLatency + X) the same as estimating just some Y which does not require baseLatency knowledge? Why would the user need to know baseLatency itself?

This is how the output time stamp is spec'd.  It's just Y.  But internally we would need to compute Y using baseLatency.  I think.

If baseLatency is the interactive event delay, the user may want to know.

We should probably ask Mozilla and Microsoft how they're interpreting baseLatency.

And I need to look through our own implementation to figure out what our latency components actually are.

Olga Sharonova

May 19, 2016, 10:21:39 AM
to Raymond Toy, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson

Thanks for the reference, it's a lot of stuff to process :)
Isn't estimate of (baseLatency + X) the same as estimating just some Y which does not require baseLatency knowledge? Why would the user need to know baseLatency itself?

This is how the output time stamp is spec'd.  It's just Y.  But internally we would need to compute Y using baseLatency.  I think.

If baseLatency is the interactive event delay, the user may want to know.

If this value can't be used directly by the user, why would the user care about its change?
The output time stamp changes dynamically with baseLatency changes, right? Isn't this knowledge enough?


We should probably ask Mozilla and Microsoft how they're interpreting baseLatency.
 
Yes, that would be very useful to know, agree.

Raymond Toy

May 19, 2016, 10:44:48 AM
to Olga Sharonova, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
On Thu, May 19, 2016 at 7:20 AM, Olga Sharonova <ol...@google.com> wrote:

Thanks for the reference, it's a lot of stuff to process :)
Isn't estimate of (baseLatency + X) the same as estimating just some Y which does not require baseLatency knowledge? Why would the user need to know baseLatency itself?

This is how the output time stamp is spec'd.  It's just Y.  But internally we would need to compute Y using baseLatency.  I think.

If baseLatency is the interactive event delay, the user may want to know.

If this value can't be used directly by the user, why would the user care about its change?

I think the intent was that if the user can request a latency and the browser can use a different value, we should at least inform the user what was actually used by the browser.  I can imagine the user changing his graph or processing based on this value. I don't actually know what, though.
 
The output time stamp changes dynamically with baseLatency changes, right? Isn't this knowledge enough?

Yes, the output time stamp is allowed to dynamically change.  I think it's expected to if you switch output devices (headphones vs speakers).  This might be enough. 

Raymond Toy

Nov 30, 2016, 12:42:54 PM
to Olga Sharonova, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
Time to revive this intent to implement and ship.  (Should we continue on this thread or create a new one?)

The spec has been updated (a while ago) to include both a category enum and an explicit (double) value.

We (well, andrew.m...@soundtrap.com, really) have started to implement this.

Recall that this is not the outputTimeStamp issue, which is a separate (but related) concept.

This is really intended to allow control of the internal buffering in WebAudio to allow for low-latency music, medium latency for communications like webrtc, and higher latency suitable for lower-power music playback.




Raymond Toy

Nov 30, 2016, 12:57:08 PM
to Olga Sharonova, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
Just realized there's an issue with the spec on this.  These are currently defined for a BaseAudioContext.  But these don't make sense for an OfflineAudioContext, so they should be defined only for an AudioContext.


On Wed, Nov 30, 2016 at 9:42 AM, Raymond Toy <rt...@google.com> wrote:
Time to revive this intent to implement and ship.  (Should we continue on this thread or create a new one?)

The spec has been updated (a while ago) to include both a category enum and an explicit (double) value.

We (well, andrew.macpherson@soundtrap.com, really) have started to implement this.

Chris Harrelson

Dec 6, 2016, 8:34:42 PM
to Raymond Toy, Olga Sharonova, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
Ok. We'll wait for that to be resolved then?

On Wed, Nov 30, 2016 at 12:57 PM, 'Raymond Toy' via blink-dev <blin...@chromium.org> wrote:
Just realized there's an issue with the spec on this.  These are currently defined for a BaseAudioContext.  But these don't make sense for an OfflineAudioContext, so they should be defined only for an AudioContext.

On Wed, Nov 30, 2016 at 9:42 AM, Raymond Toy <rt...@google.com> wrote:
Time to revive this intent to implement and ship.  (Should we continue on this thread or create a new one?)

The spec has been updated (a while ago) to include both a category enum and an explicit (double) value.

We (well, andrew.m...@soundtrap.com, really) have started to implement this.

Raymond Toy

Dec 15, 2016, 1:50:29 PM
to Chris Harrelson, Olga Sharonova, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
Just an update.

In today's WG teleconf, it was generally agreed that the latency items should be moved to AudioContext from BaseAudioContext.  However, only three people were on the call. There's a pull request for this change, but it has not yet been approved to merge.

Raymond Toy

Jan 10, 2017, 4:04:56 PM
to Chris Harrelson, Olga Sharonova, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
The pull request to move the baseLatency stuff from the BaseAudioContext to the AudioContext has just landed.

I think we're set, spec-wise.

Chris Harrelson

Jan 12, 2017, 1:17:23 PM
to Raymond Toy, Olga Sharonova, Dimitri Glazkov, Harald Alvestrand, Henrik Grunell, blink-dev, Hongchan Choi, Chris Wilson
LGTM1 as long as all the concerns mentioned earlier in the thread are resolved (which sounds like the case!).