Intent to Implement: Media Session


Philip Jägenstedt

Jun 8, 2015, 11:26:06 AM
to blink-dev

Contact emails

da...@opera.com

phi...@opera.com


Spec

https://mediasession.spec.whatwg.org/ (edited by ri...@opera.com)


Summary

Enable web developers to control platform-level audio focus and customize the UI in the lock screen and notification area.

The Media Session spec covers a lot of ground, but in this first round of implementation we're focused on the subset which will allow web developers to customize the platform UI, as we believe this is what will be most interesting in web apps for music or podcast-type content. Getting their feedback early will help shape the next steps.
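To make the scope concrete, here is a rough sketch of how a page might use the API under the current draft. Treat the names (the MediaSession constructor and the media element's session attribute) as assumptions; they are still under discussion and may well change:

    // Illustration only; names follow the current draft and may change.
    var session = new MediaSession();

    var audio = document.querySelector('audio');
    audio.session = session;  // the element participates in this session

    // Playing the element activates the session, which in turn requests
    // platform audio focus and surfaces the lock screen / notification UI.
    audio.play();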

Motivation

Multiple mobile browsers, including iOS Safari, Opera for Android and Samsung's S browser, have UI features to help the user find which tab is currently playing, showing some basic information like a title and a play/pause button. Chrome for Android also has a notification for tabs playing audio, but with no playback controls.

These are non-web-exposed heuristics with very limited control afforded to the developer. When native apps play audio, they're able to integrate with the platform UI in a much richer way, providing metadata like artist/album/title and artwork.


Compatibility Risk

The biggest concern we have with media session is the ability to implement it on a wide range of platforms, summarized in "figure out the coupling between audio focus/session, audio playback and remote control events." Jer Noble from Apple clarified some things about iOS, and David built an iOS test app to reveal some interesting things. 


In the current spec, a media session has one or more participating media elements, and it's the act of playing a media element that activates the session and thus requests platform audio focus and UI. We're confident that this is a model that can be supported on both Android and iOS.


We are leaving the door open to activating a media session with no media element, but it seems clear that interoperable handling of metadata needs playing media, and this is precisely our current focus.


Ongoing technical constraints

There are many issues, large and small, which are currently under discussion. Implementation will help resolve many of these issues.


Will this feature be supported on all six Blink platforms (Windows, Mac, Linux, Chrome OS, Android, and Android WebView)?

No, we are focused on Android in this first implementation round.


OWP launch tracking bug

https://crbug.com/497735


Link to entry on the feature dashboard

https://www.chromestatus.com/features/5639924124483584


Note in particular that implementation has already begun in WebKit, and there is interest from Mozilla.

Requesting approval to ship?

No, we will implement behind a runtime flag.

Domenic Denicola

Jun 8, 2015, 12:31:42 PM
to Philip Jägenstedt, blink-dev
From: blin...@chromium.org [mailto:blin...@chromium.org] On Behalf Of Philip Jägenstedt

> The biggest concern we have with media session is the ability to implement it on a wide range of platforms, summarized in "figure out the coupling between audio focus/session, audio playback and remote control events." Jer Noble from Apple clarified some things about iOS, and David built an iOS test app to reveal some interesting things. 
>
> In the current spec, a media session has one or more participating media elements, and it's the act of playing a media element that activates the session and thus requests platform audio focus and UI. We're confident that this is a model that can be supported on both Android and iOS.
>
> We are leaving the door open to activating a media session with no media element, but it seems clear that interoperable handling of metadata needs playing media, and this is precisely our current focus.

My main concern with both the spec and implementation is that they are heading down the wrong path with respect to Blink's goal of a layered platform. Building a new feature that is intimately tied to HTMLMediaElement prevents us from making HTMLMediaElement a simple declarative layer on top of more primitive underlying platform pieces. I am concerned about both the implementation-side technical debt this creates and the specification-side extensibility debt.

From my understanding of the media session spec, the reason it's so intimately tied to media elements is because you need some manifestation of "playing audio" to map to iOS's AVAudioSession, so that media sessions can only be tied to playing audio. In my opinion, *an actual HTML element* is not a good representation of playing audio; that is just a declarative wrapper. Privileging built-in <audio> and <video> elements over, say, <custom-audio> or web audio or web MIDI seems very future-hostile, even if you can hack around some of these by manually creating a shim <audio> and trying to redirect output to it.
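For concreteness, such a shim might look roughly like the following; the session attribute is the draft API and is assumed here, while the rest is standard web audio:

    // Rough sketch of the shim hack: route a web audio graph into a
    // MediaStream and play it through a dummy <audio> element, purely so
    // that the element can own the session.
    var ctx = new AudioContext();
    var source = ctx.createOscillator();            // stand-in for a real graph
    var dest = ctx.createMediaStreamDestination();
    source.connect(dest);
    source.start();

    var shim = document.createElement('audio');
    shim.srcObject = dest.stream;       // or URL.createObjectURL(dest.stream)
    shim.session = new MediaSession();  // draft attribute, assumed
    shim.play();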

Instead, a better model than tying yourself to HTML elements would be web audio's AudioContext.

It's long been noted [1] that media elements should be layered on top of web audio. The Media Session spec does a lot of work to come up with a rational ontology relating media sessions and media elements, including noting how only one session is active, and there's the idea of a top-level browsing context having a media session. This seems like exactly the kind of work that [1] was asking for with regard to AudioContexts:

> - Can a media element be connected to multiple AudioContexts at the same time?
>
> [...]
>
> That leaves a few open issues for which we don't currently have suggestions but believe the WG should address:
>
> - What AudioContext do media elements use by default?
> - Is that context available to script? Is there such a thing as a "default context"?

I think it would be extremely helpful to the platform if the Media Session API were willing to recast itself in terms of web audio contexts instead of in terms of <audio> and <video> elements, and thus take on these questions regarding the relation between media elements, audio contexts, and media sessions.

[1]: https://github.com/w3ctag/spec-reviews/blob/master/2013/07/WebAudio.md#layering-considerations

Chris Wilson

Jun 8, 2015, 1:54:14 PM
to Domenic Denicola, Philip Jägenstedt, blink-dev
+1.

An even better model would be to tie this to a device/audio "worker" - i.e. the underlying, not-yet-defined piece that needs to sit underneath an AudioContext as well, and that just defines access to an audio output device and the audio stream feeding it. (AudioContext sits on top of this, but has a lot of audio processing library stuff that's well above the layer of "I have a device and need to keep feeding it audio bits".)
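Purely to illustrate the shape - none of these names exist, and this is a hypothetical sketch of the layering rather than a concrete API proposal:

    // Entirely hypothetical: a low-level audio output primitive that a
    // session could attach to, with AudioContext layered on top of it.
    var output = new AudioOutputDevice();            // hypothetical primitive
    output.session = new MediaSession();             // focus/UI attaches at this layer
    var ctx = new AudioContext({ output: output });  // hypothetical layering option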

Philip Jägenstedt

Jun 9, 2015, 11:15:16 AM
to Chris Wilson, Domenic Denicola, blink-dev
Hi Domenic, Chris,

The premise here is that a media session needs to be connected to the audio-producing objects in the platform, and that beginning to use the audio output device is what's required to transition a session from idle to active. As it stands, we have two kinds of audio-producing objects: HTMLMediaElement and AudioContext (or AudioDestinationNode). Our thinking is that we should simply put HTMLMediaElement and AudioContext on equal footing with respect to MediaSession, and I've filed an issue for that.
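To illustrate what that equal footing could look like - the AudioContext side is exactly what the issue is about, so take its attribute name as a placeholder:

    // Sketch of the "equal footing" idea; the AudioContext attachment is
    // not specced, so its attribute name is a placeholder.
    var session = new MediaSession();

    var audio = document.querySelector('audio');
    audio.session = session;   // HTMLMediaElement participation (current draft)

    var ctx = new AudioContext();
    ctx.session = session;     // AudioContext participation (proposed, hypothetical)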

However, I don't think allowing web audio to use media sessions gets to the core of your concern, so let's look at the layering of media elements.

The implementations of HTMLMediaElement in both Blink/WebKit and Presto have a media player as the next layer down, implemented with FFmpeg or Android's MediaPlayer in Chromium. The media player does almost all of the work for the media element, except for handling the poster image, <video controls> UI, text track rendering and some tidbits like autoplay and loop logic. For the sake of argument, we could have HTMLMediaElement.player expose a MediaPlayer object and use new MediaPlayer() for sans-HTML media playback.

The next layer down is a media filter graph, where you feed a demuxer with data from the network, connect the demuxer to audio/video decoders and get decoded audio/video frames out. Finally, you would need to send your decoded audio to the audio output device. AudioContext revolves around blocks of 128 samples and low-latency processing, but most media frameworks would play an audio file with up to seconds of buffering. Maybe it could be made to work, but it's really just the audio output part of AudioContext you want.

Anyway: should we require that MediaSession integrates with only one primitive, and that the way to connect a media element to a media session is some form of mediaElement.player.something.session = new MediaSession()? That would actually be awesome, but would add many years where the web lags behind native platforms when it comes to audio focus handling. As I see it, if we make MediaSession work with HTMLMediaElement and AudioContext today, that session getter and setter can be defined to forward to the underlying layer if that is ever exposed.
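Concretely, that forwarding could be as simple as something like the following; every name below is hypothetical, since nothing like a player layer is exposed today:

    // Hypothetical: if a lower-level player object were ever exposed, the
    // element-level accessor could simply delegate to it.
    Object.defineProperty(HTMLMediaElement.prototype, 'session', {
      get: function () { return this.player.session; },
      set: function (session) { this.player.session = session; }
    });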

Thoughts?

Domenic Denicola

Jun 10, 2015, 5:46:03 PM
to Philip Jägenstedt, Chris Wilson, blink-dev
From: Philip Jägenstedt [mailto:phi...@opera.com]

> The next layer down is a media filter graph, where you feed a demuxer with data from the network, connect the demuxer to audio/video decoders and get decoded audio/video frames out.

Yeah, web audio doesn't have streaming decoding yet... Working on it... :)

> Finally, you would need to send your decoded audio to the audio output device. AudioContext revolves around blocks of 128 samples and low-latency processing, but most media frameworks would play an audio file with up to seconds of buffering. Maybe it could be made to work, but it's really just the audio output part of AudioContext you want.

It does seem like it's at least closer to the right layering. Something to investigate in the future, at the very least.

> Anyway: should we require that MediaSession integrates with only one primitive, and that the way to connect a media element to a media session is some form of mediaElement.player.something.session = new MediaSession()? That would actually be awesome, but would add many years where the web lags behind native platforms when it comes to audio focus handling. As I see it, if we make MediaSession work with HTMLMediaElement and AudioContext today, that session getter and setter can be defined to forward to the underlying layer if that is ever exposed.
>
> Thoughts?

Given that Chris is saying that indeed audio worker is a better representation of the output device, and that's still in progress, this plan seems reasonable for now---we can't block on audio worker.

However, I'd really like to see the spec updated to make the conceptual layering much clearer, instead of throwing "media elements" around everywhere as the magic sauce that brings the sessions to life. It seems like [1] has already done that in large part, with only a few media element references remaining, which is truly great.

I also do think your proposal [2] to integrate with AudioContext better is important even for the initial implementation. Maybe we won't have perfect layering in v1, but it'd be great not to throw authors off of a cliff when they move from <audio> to Web Audio.

Thanks very much for working through my concerns on this. It does appear that between the work you all are doing on media session and Chris et al. are doing on web audio, we're slowly converging on something quite nice :)

[1]: https://github.com/whatwg/mediasession/commit/1c5824e1b7dd5eda5c1046d611a8688c2d7604b4
[2]: https://github.com/whatwg/mediasession/issues/48

Philip Jägenstedt

Jun 11, 2015, 10:55:50 AM
to Domenic Denicola, Chris Wilson, blink-dev
Many thanks for your feedback, Domenic. Many of these issues have been
lingering for a long time, and it's just as well that we deal with
them early on, in particular how to deal with Web Audio. For anyone
taking an interest, there are a few hot issues that are particularly
relevant to layering, "magic" and coupling between media session and
its participants:
https://github.com/whatwg/mediasession/issues/41
https://github.com/whatwg/mediasession/issues/48
https://github.com/whatwg/mediasession/issues/49
https://github.com/whatwg/mediasession/issues/50

I neglected to do so in the initial email, but I would also like to
thank Mounir and Anton, who came to visit us in Gothenburg to work on
this, and who've been doing some foundational work on the Chromium
side.

Philip