Intent to Update TextTrackCue and Add VTTCue

Glenn Adams

Aug 9, 2013, 6:54:21 PM
to blin...@chromium.org
Contact email:
gl...@skynav.com

Summary:
The following IDL interface will be added:
* interface VTTCue
The following IDL members will be moved from TextTrackCue to VTTCue:
* attribute DOMString vertical;
* attribute boolean snapToLines;
* attribute (long or AutoKeyword) line;
* attribute long position;
* attribute long size;
* attribute DOMString align;
* attribute DOMString text;
* DocumentFragment getCueAsHTML(); 
The following WebVTT Regions extension IDL attribute will likewise be moved from TextTrackCue to VTTCue:
* attribute DOMString regionId
Two new IDL enumerations are introduced:

* enum DirectionSetting { "" /* horizontal */, "rl", "lr" };
* enum AlignSetting { "start", "middle", "end", "left", "right" };
Two of the above IDL attributes when moved to VTTCue change their declared type as follows:

* attribute DirectionSetting vertical;
* attribute AlignSetting align;
The existing TextTrackCue constructor is retained to instantiate generic cues containing raw text content without WebVTT semantics. A new VTTCue constructor is introduced to instantiate a WebVTT sub-type instance with existing WebVTT semantics for text, getCueAsHTML(), etc.

Motivation:
Updates relative to HTML5 CR1 [1] in the HTML5 CR ED [2], HTML5.1 ED [3], and WebVTT ED [4] specifications have introduced a WebVTT-specific sub-type of TextTrackCue called VTTCue, wherein WebVTT-specific members are defined. This change is intended to open the way for the definition of additional sub-types based on other text track formats (e.g., TTML). As a consequence, the existing WebVTT-specific members on TextTrackCue have been moved to VTTCue.

In addition, the TextTrackCue.getCueAsHTML operation has been moved to VTTCue in consideration of the fact that the translation from the underlying text track format (in this case WebVTT) to HTML is text track specific, and other text track formats will define different semantics for such translation (if a translation to HTML is provided by other formats at all).

Compatibility Risk:
Moderate for existing users of the TextTrackCue API; otherwise, None.

For programmatic users of the existing TextTrackCue API, the proposed change will likely break JS client code that explicitly constructs a TextTrackCue instance, since after this change the existing constructor will yield a generic, non-WebVTT cue instance rather than a VTTCue instance. Such code will need to be modified to use the VTTCue constructor instead.

This is essentially a substantive technical change to the HTML5 CR1 [1], and may result in a new CR2 being published. Other substantive technical changes to CR1 are expected, so this is unlikely to be the only such change.

At present, there appears to be no UseCounter for determining the statistics of usage of this constructor in the wild.

The changes to the declared type of the vertical and align IDL attributes, from DOMString to enum, introduce no compatibility risk provided that one of the defined enumeration values is used when setting the attribute. However, if an attempt is made to set one of these attributes to an undefined value (i.e., one not in the appropriate enum), then, according to WebIDL [5], the assignment will be silently ignored. This differs from the existing text in CR1 [1], which calls for a SyntaxError exception to be raised when an undefined value is set.
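
The difference between the two setter behaviours can be sketched roughly as follows. This is a minimal TypeScript model and not Blink code: `AlignSettingDraft`, `CueAlignModel`, and both setter methods are hypothetical names invented for illustration, the enum values are those listed in the summary above, and the built-in `SyntaxError` merely stands in for the exception CR1 describes.

```typescript
// Enum values as listed in the summary above; all identifiers here are hypothetical.
const ALIGN_VALUES = ["start", "middle", "end", "left", "right"] as const;
type AlignSettingDraft = (typeof ALIGN_VALUES)[number];

// Type guard: true only for one of the defined enumeration values.
function isAlignSetting(value: string): value is AlignSettingDraft {
  return (ALIGN_VALUES as readonly string[]).indexOf(value) !== -1;
}

class CueAlignModel {
  // Arbitrary initial value for this model, not a claim about the spec default.
  align: AlignSettingDraft = "start";

  // Enum-typed attribute (WebIDL behaviour): undefined values are silently ignored.
  setAlignAsEnum(value: string): void {
    if (isAlignSetting(value)) {
      this.align = value;
    }
  }

  // DOMString-typed attribute (CR1 behaviour): undefined values raise an error.
  // The built-in SyntaxError stands in for the exception CR1 describes.
  setAlignAsString(value: string): void {
    if (!isAlignSetting(value)) {
      throw new SyntaxError("invalid align value: " + value);
    }
    this.align = value;
  }
}
```

Under this model, code that relied on catching an exception to detect a bad value would no longer be notified after the change; it would have to read the attribute back and compare.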

Additional Discussion:
In order to better segregate text track format specific functionality, it is anticipated that the WebVTT specific source files will be moved into a new sub-directory: core/html/track/vtt.

Additional information can be found in tracking Issue 270340 [6].

Adam Barth

Aug 9, 2013, 7:29:38 PM
to Glenn Adams, blink-dev
Do you plan to stage these changes behind a runtime flag?  It sounds like the kind of thing that might be doable entirely in one CL...

Adam

Glenn Adams

Aug 9, 2013, 7:34:35 PM
to Adam Barth, blink-dev
On Fri, Aug 9, 2013 at 5:29 PM, Adam Barth <aba...@chromium.org> wrote:
Do you plan to stage these changes behind a runtime flag?

It is possible, though probably easier not to use a flag. What is your preference?
 
It sounds like the kind of thing that might be doable entirely in one CL...

Yes.

Adam Barth

Aug 9, 2013, 7:36:56 PM
to Glenn Adams, blink-dev
It's probably better to do as one CL.  I presume the other implementations are making this change as well.

Glenn Adams

Aug 9, 2013, 7:40:37 PM
to Adam Barth, blink-dev
On Fri, Aug 9, 2013 at 5:36 PM, Adam Barth <aba...@chromium.org> wrote:
It's probably better to do as one CL.  I presume the other implementations are making this change as well.

I'm also making that presumption, but it wouldn't hurt to investigate their plans.

Adam Barth

Aug 9, 2013, 7:42:33 PM
to Glenn Adams, blink-dev
On Fri, Aug 9, 2013 at 4:40 PM, Glenn Adams <gl...@skynav.com> wrote:
I'm also making that presumption, but it wouldn't hurt to investigate their plans.

Ok.  LGTM once you've double-checked with other implementers.

Silvia Pfeiffer

Aug 9, 2013, 8:07:24 PM
to Adam Barth, Glenn Adams, blink-dev, Victor Carbune
On Sat, Aug 10, 2013 at 9:42 AM, Adam Barth <aba...@chromium.org> wrote:
Ok.  LGTM once you've double-checked with other implementers.


(Feedback as editor:)
The spec changes were made as a reaction to format flexibility needs for text tracks. I'm expecting WebKit and IE to follow (though you won't get a commitment from them, of course). Not so sure about Firefox at this stage, since they only want to support WebVTT.

(Feedback as committer:)
I suggest keeping the region-specific changes for a second CL because region is behind a flag right now, IIUC (Victor can confirm), and there is also a change from TextTrackRegion to VTTRegion to implement.

Silvia.

Glenn Adams

Aug 9, 2013, 8:12:56 PM
to Silvia Pfeiffer, Adam Barth, blink-dev, Victor Carbune
Thanks. I'm not proposing any changes regarding TextTrackRegion/VTTRegion in this proposed modification, particularly since the definition of these region-related interfaces is not yet finalized. I expect to follow up with another proposal for region-related mods after this initial mod is effected. In other words, I believe they are sufficiently orthogonal that they can be addressed separately (particularly since the region APIs are behind a feature flag).

Victor Carbune

Aug 10, 2013, 6:58:36 AM
to Glenn Adams, Silvia Pfeiffer, Adam Barth, blink-dev
As you mentioned, everything region-related should be indeed
orthogonal, as it's still behind a compile-time flag (I'm working now
to get the ifdefs removed, I guess it's one of the few ones
remaining). Also from a launch perspective, vtt regions will be
experimental for a while, so better to keep them separate.

Thanks for working on this!

Victor

Philip Jägenstedt

Aug 12, 2013, 7:36:11 AM
to Glenn Adams, blin...@chromium.org, Ian Hickson
Hi Glenn,

You say that you're going to move .text from TextTrackCue to VTTCue,
but also that "The existing TextTrackCue constructor is retained to
instantiate generic cues containing raw text ..."

For those not aware, a spec fork has appeared [1] since Silvia and Ian
disagree on this. Compare WHATWG [2] and W3C [3].

Which spec do you intend to follow? I don't think that keeping the
TextTrackCue constructor and text property makes a lot of sense after
TextTrackCue has been stripped of its WebVTT semantics. As far as I
can tell, a TextTrackCue created by script can't be rendered at all,
since it doesn't have any rendering rules. In other words, I think the
WHATWG spec makes more sense here.

[1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=22903
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/the-video-element.html#texttrackcue
[3] http://www.w3.org/html/wg/drafts/html/master/embedded-content-0.html#texttrackcue

Philip

Silvia Pfeiffer

Aug 12, 2013, 9:33:35 AM
to Philip Jägenstedt, Glenn Adams, blin...@chromium.org, Ian Hickson
On Mon, Aug 12, 2013 at 9:36 PM, Philip Jägenstedt <phi...@opera.com> wrote:
Hi Glenn,

You say that you're going to move .text from TextTrackCue to VTTCue,
but also that "The existing TextTrackCue constructor is retained to
instantiate generic cues containing raw text ..."

For those not aware, a spec fork has appeared [1] since Silvia and Ian
disagree on this. Compare WHATWG [2] and W3C [3].

Which spec do you intend to follow?

I'm letting Glenn speak for himself here.

But I wanted to add a comment to the below.
 
I don't think that keeping the
TextTrackCue constructor and text property makes a lot of sense after
TextTrackCue has been stripped of its WebVTT semantics. As far as I
can tell, a TextTrackCue created by script can't be rendered at all,
since it doesn't have any rendering rules. In other words, I think the
WHATWG spec makes more sense here.

A TextTrackCue created by script would use the browser to manage events and data, but not to render. All cues of kind=metadata are like that, in fact, and don't use the browser for rendering. Thus, cues of kind=metadata would more naturally be created as a TextTrackCue by a JS developer than as a VTTCue, where all the other attributes of the object are irrelevant to the JS developers.
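
As a rough illustration of that metadata use case, the sketch below models cues that carry only timing and raw text, with script deciding what to do with the text. It is a self-contained TypeScript model rather than the browser API: `GenericCueLike`, `activeCuesAt`, and `payloadsAt` are hypothetical names, and the JSON payload format is an assumption chosen for the example.

```typescript
// Hypothetical stand-in for a generic cue: timing plus raw text, no rendering rules.
interface GenericCueLike {
  startTime: number;
  endTime: number;
  text: string;
}

// Return the cues whose interval contains `time`, roughly what a track's
// list of active cues would hold at that playback position.
function activeCuesAt(cues: GenericCueLike[], time: number): GenericCueLike[] {
  return cues.filter((cue) => cue.startTime <= time && time < cue.endTime);
}

// Script-side handling: interpret each active cue's text as JSON (an assumed format).
function payloadsAt(cues: GenericCueLike[], time: number): unknown[] {
  return activeCuesAt(cues, time).map((cue) => JSON.parse(cue.text));
}
```

Nothing in this model needs position, alignment, or any other WebVTT setting, which is the point being made about kind=metadata cues.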

Silvia.

Glenn Adams

Aug 12, 2013, 10:37:34 AM
to Philip Jägenstedt, blink-dev, Ian Hickson
On Mon, Aug 12, 2013 at 5:36 AM, Philip Jägenstedt <phi...@opera.com> wrote:
Hi Glenn,

You say that you're going to move .text from TextTrackCue to VTTCue,
but also that "The existing TextTrackCue constructor is retained to
instantiate generic cues containing raw text ..."

For those not aware, a spec fork has appeared [1] since Silvia and Ian
disagree on this. Compare WHATWG [2] and W3C [3].

Which spec do you intend to follow? I don't think that keeping the
TextTrackCue constructor and text property makes a lot of sense after
TextTrackCue has been stripped of its WebVTT semantics. As far as I
can tell, a TextTrackCue created by script can't be rendered at all,
since it doesn't have any rendering rules. In other words, I think the
WHATWG spec makes more sense here.

Apologies, but my original summary of changes was in error regarding the text attribute. It will stay on TextTrackCue as indicated in the W3C HTML5 spec [1].


Regarding whether it makes sense or not to keep the generic constructor and text attribute, both [2] and [3] describe legitimate (and implemented) use cases for doing so, particularly for allowing JS client code to use these features to access, render, and manage non-WebVTT cue content exposed in a raw text format. I am also aware that, in the TV/STB device space, these generic features are currently being used to directly access various non-WebVTT cue content, such as MPEG-2 PSI, CEA-608, and so on.


It is also worth noting the existing implementation of an early attempt at a generic cue [4][5], which exposes a text attribute by subclassing the existing WebVTT flavor of cue and then voiding it of its WebVTT semantics, which seems a rather odd and roundabout way to achieve this. The fact that this early generic cue was implemented in this fashion indicates to me that placing the text attribute on VTTCue (only) and not retaining it on TextTrackCue leads to a convoluted implementation, and not a better one.

Glenn Adams

Aug 12, 2013, 11:48:59 AM
to Adam Barth, blink-dev
On Fri, Aug 9, 2013 at 5:42 PM, Adam Barth <aba...@chromium.org> wrote:
Ok.  LGTM once you've double-checked with other implementers.

I've received feedback from Mozilla folks who say:

"We can switch to the new api quickly, but it will be some months before we have support of all the rendering rules."

"I think that will make transition easier for developers."

"We see no reason not to implement WebVTT as described in the new spec."

I've also inquired with Microsoft and Apple, but have not received any response yet.

G.


Aaron Colwell

Aug 12, 2013, 2:30:35 PM
to Glenn Adams, Adam Barth, blink-dev
LGTM. I am happy to review the code for you when it becomes available. I'm assuming that existing applications will simply need to check for the presence of VTTCue to determine which constructor they should use. Correct? I just want to make sure they have a sane path for supporting older versions of Chrome that contain the old behavior.

Aaron

Glenn Adams

Aug 12, 2013, 6:43:07 PM
to Aaron Colwell, Adam Barth, blink-dev
On Mon, Aug 12, 2013 at 12:30 PM, Aaron Colwell <acol...@chromium.org> wrote:
LGTM. I am happy to review the code for you when it becomes available. I'm assuming that existing applications will simply need to check for the presence of VTTCue to determine which constructor they should use. Correct?

Yes, that's a reasonable approach, which I'll document at the appropriate location.
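
One way the presence check could look is sketched below. It is only an illustration under stated assumptions: `CueConstructor`, `CueScope`, and `createCaptionCue` are hypothetical names, and the scope object is passed in explicitly so the sketch stays self-contained instead of reading browser globals directly.

```typescript
// Constructor shape shared by both cue types: (startTime, endTime, text).
type CueConstructor<T> = new (startTime: number, endTime: number, text: string) => T;

// The subset of a global scope this sketch inspects (hypothetical type name).
interface CueScope<T> {
  VTTCue?: CueConstructor<T>;
  TextTrackCue?: CueConstructor<T>;
}

// Prefer VTTCue where it exists. Older engines expose only TextTrackCue,
// whose constructor there still carries WebVTT semantics.
function createCaptionCue<T>(
  scope: CueScope<T>,
  startTime: number,
  endTime: number,
  text: string
): T {
  const Ctor = scope.VTTCue || scope.TextTrackCue;
  if (!Ctor) {
    throw new Error("no text track cue constructor available");
  }
  return new Ctor(startTime, endTime, text);
}
```

In a page, the scope argument would presumably be the window object, so the same call yields a WebVTT cue on engines both before and after the change.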

Philip Jägenstedt

Aug 21, 2013, 6:44:48 AM
to Glenn Adams, blink-dev, Ian Hickson
On Mon, Aug 12, 2013 at 4:37 PM, Glenn Adams <gl...@skynav.com> wrote:
>
> On Mon, Aug 12, 2013 at 5:36 AM, Philip Jägenstedt <phi...@opera.com>
> wrote:
>>
>> Hi Glenn,
>>
>> You say that you're going to move .text from TextTrackCue to VTTCue,
>> but also that "The existing TextTrackCue constructor is retained to
>> instantiate generic cues containing raw text ..."
>>
>> For those not aware, a spec fork has appeared [1] since Silvia and Ian
>> disagree on this. Compare WHATWG [2] and W3C [3].
>>
>> Which spec do you intend to follow? I don't think that keeping the
>> TextTrackCue constructor and text property makes a lot of sense after
>> TextTrackCue has been stripped of its WebVTT semantics. As far as I
>> can tell, a TextTrackCue created by script can't be rendered at all,
>> since it doesn't have any rendering rules. In other words, I think the
>> WHATWG spec makes more sense here.
>
>
> Apologies, but my original summary of changes was in error regarding the
> text attribute. It will stay on TextTrackCue as indicated in the W3C HTML5
> spec [1].
>
> [1]
> http://www.w3.org/TR/2012/CR-html5-20121217/embedded-content-0.html#text-track-api

Thanks for clarifying!

> Regarding whether it makes sense or not to keep the generic constructor and
> text attribute, both [2] and [3] describe legitimate (and implemented) use
> cases for doing so, particularly for allowing JS client code to use these
> features to access, render, and manage non-WebVTT cue content exposed as in
> a raw text format. I am also aware that, in the TV/STB device space, that
> use is currently being made of these generic features to directly access
> various non-WebVTT cue content, such as MPEG-2 PSI, CEA-608, and so on.
>
> [2] https://www.w3.org/Bugs/Public/show_bug.cgi?id=21851
> [3] https://www.w3.org/Bugs/Public/show_bug.cgi?id=22903

We already have WebVTT metadata cues for script-controlled rendering,
do we really need two ways? Is it the memory overhead of the extra
fields of VTTCue that we want to avoid?

As for formats other than WebVTT and TTML I have no specific
knowledge, but I don't understand why the generic TextTrackCue is the
appropriate interface for those formats, as opposed to the
format-specific interface, with the non-renderable data stuck in a
.data property or some such.

If the generic TextTrackCue is suitable for scripted and
non-renderable cues from non-WebVTT formats then maybe we should
consider using them for WebVTT metadata tracks as well, although that
is certainly not the preferred solution since I think letting the
parser always output cues of the same type (even for non-WebVTT) makes
more sense.

> It is also worth noting the existing implementation of an early attempt at a
> generic cue [4][5], which does expose a text attribute by means of
> subclassing the existing WebVTT flavor of cue, and then voiding it of its
> WebVTT semantics: a rather odd and roundabout way to achieve this it seems.
> The fact that this early generic cue was implemented in this fashion
> indicates to me that placing the text attribute on VTTCue (only) and not
> retaining it on TextTrackCue leads to a convoluted implementation, and not a
> better one.
>
> [4]
> https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/html/track/TextTrackCueGeneric.h
> [5]
> https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/html/track/TextTrackCueGeneric.cpp

I admit zero knowledge of the internals of the WebKit/Blink
implementation, and will leave it to the reviewers to comment on the
best way to implement whichever spec they think Blink should follow.

As long as the reviewers are aware that there is a spec fork and
disagreement about the best path forward, I have nothing more to add
here. In the end both specs will have to align with what is
implemented, so the power is completely with the reviewers.

Philip

Glenn Adams

Aug 21, 2013, 9:23:06 AM
to Philip Jägenstedt, blink-dev, Ian Hickson
On Wed, Aug 21, 2013 at 4:44 AM, Philip Jägenstedt <phi...@opera.com> wrote:
 
As for formats other than WebVTT and TTML I have no specific
knowledge, but I don't understand why the generic TextTrackCue is the
appropriate interface for those formats, as opposed to the
format-specific interface, with the non-renderable data stuck in a
.data property or some such.

See [1], Section 5.1.4 for an example of a generic use of TextTrackCue unrelated to WebVTT and TTML, where a specific use of the text attribute is prescribed.

 

Ian Hickson

Aug 21, 2013, 1:45:36 PM
to Glenn Adams, Philip Jägenstedt, blink-dev
Note that this is because that spec is broken. It should be introducing a
new interface, just like WebVTTCue, it shouldn't be using the
intentionally abstract TextTrackCue.

We shouldn't be messing up HTML and WebKit to support a spec that's just
doing things wrong in the first place, IMHO.

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.

Elliott Sprehn

Aug 21, 2013, 1:56:16 PM
to Ian Hickson, Glenn Adams, Philip Jägenstedt, blink-dev
Can we hold off on this change until we can get some agreement between the specs? I'd hate for us to end up in a world where we have half of each of the specs since that's really harmful to developers.

What's the opinion of other browser vendors too? We shouldn't be following the HTMLWG spec if other vendors are going to follow the WHATWG spec.

Glenn Adams

Aug 21, 2013, 2:13:14 PM
to Elliott Sprehn, Ian Hickson, Philip Jägenstedt, blink-dev
On Wed, Aug 21, 2013 at 11:56 AM, Elliott Sprehn <esp...@chromium.org> wrote:
Can we hold off on this change until we can get some agreement between the specs? I'd hate for us to end up in a world where we have half of each of the specs since that's really harmful to developers.

What's the opinion of other browser vendors too? We shouldn't be following the HTMLWG spec if other vendors are going to follow the WHATWG spec.

That seems a larger question than presented by this case in point. Mozilla has indicated to me they intend to implement the changes outlined here as indicated in the HTMLWG specs.

My own opinion is that W3C specifications should take precedence if there is an inconsistency.

Glenn Adams

Aug 21, 2013, 2:19:40 PM
to Ian Hickson, Philip Jägenstedt, blink-dev
On Wed, Aug 21, 2013 at 11:45 AM, Ian Hickson <i...@hixie.ch> wrote:
On Wed, 21 Aug 2013, Glenn Adams wrote:
> On Wed, Aug 21, 2013 at 4:44 AM, Philip Jägenstedt <phi...@opera.com>wrote:
> >
> > As for formats other than WebVTT and TTML I have no specific
> > knowledge, but I don't understand why the generic TextTrackCue is the
> > appropriate interface for those formats, as opposed to the
> > format-specific interface, with the non-renderable data stuck in a
> > .data property or some such.
>
> See [1], Section 5.1.4 for an example of a generic use of TextTrackCue
> unrelated to WebVTT and TTML, where a specific use of the text attribute
> is prescribed.
>
> [1] http://www.cablelabs.com/specifications/CL-SP-HTML5-MAP-I02-120510.pdf

Note that this is because that spec is broken. It should be introducing a
new interface, just like WebVTTCue, it shouldn't be using the
intentionally abstract TextTrackCue.

Since TextTrackCue was previously defined in both the WHATWG and HTMLWG versions of the spec to include the text attribute, it was entirely reasonable to define a generic use of the text attribute.
 
We shouldn't be messing up HTML and WebKit to support a spec that's just
doing things wrong in the first place, IMHO.

It may also be argued that removing the generic use of the text attribute is the "wrong" thing to do. If what the cited MPEG mapping spec does is wrong, then it is only wrong on a retroactive basis after a non-backward compatible change was made.

It shouldn't be necessary to introduce a new interface to use TextTrackCue as a generic cue with a text attribute. The HTMLWG appears to agree with this point.

Glenn Adams

Aug 21, 2013, 2:25:23 PM
to Elliott Sprehn, Ian Hickson, Philip Jägenstedt, blink-dev
On Wed, Aug 21, 2013 at 12:13 PM, Glenn Adams <gl...@skynav.com> wrote:


On Wed, Aug 21, 2013 at 11:56 AM, Elliott Sprehn <esp...@chromium.org> wrote:
Can we hold off on this change until we can get some agreement between the specs? I'd hate for us to end up in a world where we have half of each of the specs since that's really harmful to developers.

What's the opinion of other browser vendors too? We shouldn't be following the HTMLWG spec if other vendors are going to follow the WHATWG spec.

That seems a larger question than presented by this case in point. Mozilla has indicated to me they intend to implement the changes outlined here as indicated in the HTMLWG specs.

My own opinion is that W3C specifications should take precedence if there is an inconsistency.

I should add that "this change", namely, implementing VTTCue as described in this thread, makes no change with respect to the TextTrackCue.text attribute as currently implemented in Blink/WebKit. That is, the current implementations (WK, Blink, FF, IE) support this attribute on the generic cue interface type.

So what Ian is arguing for, i.e., moving text attribute to a sub-type of TextTrackCue, is not being considered by this issue. It could be accomplished as a follow-on change in the future, but it is technically independent of the changes proposed in this thread.

Ian Hickson

Aug 21, 2013, 3:20:04 PM
to Glenn Adams, Elliott Sprehn, Philip Jägenstedt, blink-dev
On Wed, Aug 21, 2013 at 11:25 AM, Glenn Adams <gl...@skynav.com> wrote:
>
> I should add that "this change", namely, implementing VTTCue as described in
> this thread, makes no change with respect to the TextTrackCue.text attribute
> as currently implemented in Blink/Webkit. That is, the current
> implementations (WK, Blink, FF, IE) support this attribute on the generic
> cue interface type.

That's highly misleading. Firefox, for example, simply hasn't split
TextTrackCue at all — they don't have VTTCue. So it's not "the generic
cue interface type", it's really VTTCue, with the old name. As far as
I can tell, the same is true in Blink/WebKit.

--
Ian Hickson

Philip Jägenstedt

Aug 22, 2013, 6:50:16 AM
to Glenn Adams, Elliott Sprehn, Ian Hickson, blink-dev
On Wed, Aug 21, 2013 at 8:25 PM, Glenn Adams <gl...@skynav.com> wrote:

> So what Ian is arguing for, i.e., moving text attribute to a sub-type of
> TextTrackCue, is not being considered by this issue. It could be
> accomplished as a follow-on change in the future, but it is technically
> independent of the changes proposed in this thread.

Not really, it's not possible to split TextTrackCue into
TextTrackCue+VTTCue without deciding which of the interfaces gets the
text property and which have a constructor. It's very likely that
whatever path Chromium follows is going to be enough to cause other
implementors and the spec to follow suit, so I wouldn't count on any
follow-on changes being possible.

Philip

Silvia Pfeiffer

Aug 23, 2013, 5:10:22 PM
to Ian Hickson, Glenn Adams, Philip Jägenstedt, blink-dev

On Wed, Aug 21, 2013 at 8:44 PM, Philip Jägenstedt <phi...@opera.com> wrote:

> It is also worth noting the existing implementation of an early attempt at a
> generic cue [4][5], which does expose a text attribute by means of
> subclassing the existing WebVTT flavor of cue, and then voiding it of its
> WebVTT semantics: a rather odd and roundabout way to achieve this it seems.
> The fact that this early generic cue was implemented in this fashion
> indicates to me that placing the text attribute on VTTCue (only) and not
> retaining it on TextTrackCue leads to a convoluted implementation, and not a
> better one.
>
> [4]
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/html/track/TextTrackCueGeneric.h
> [5]
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/html/track/TextTrackCueGeneric.cpp

I admit zero knowledge of the internals of the WebKit/Blink
implementation, and will leave it to the reviewers to comment on the
best way to implement whichever spec they think Blink should follow.


The TextTrackCueGeneric class was implemented by Apple to deal with cues that come from in-band text tracks (i.e. from inside a video file), but are not in WebVTT format and therefore don't follow the WebVTT rendering rules. If following the W3C spec, that functionality would indeed now be provided by the TextTrackCue object and does not need creation of a separate class. There is no solution for the needs of the TextTrackCueGeneric class in the WHATWG spec right now. It's one of the key reasons the W3C spec has this unfortunate fork.



On Thu, Aug 22, 2013 at 3:45 AM, Ian Hickson <i...@hixie.ch> wrote:
On Wed, 21 Aug 2013, Glenn Adams wrote:
> On Wed, Aug 21, 2013 at 4:44 AM, Philip Jägenstedt <phi...@opera.com>wrote:
> >
> > As for formats other than WebVTT and TTML I have no specific
> > knowledge, but I don't understand why the generic TextTrackCue is the
> > appropriate interface for those formats, as opposed to the
> > format-specific interface, with the non-renderable data stuck in a
> > .data property or some such.
>
> See [1], Section 5.1.4 for an example of a generic use of TextTrackCue
> unrelated to WebVTT and TTML, where a specific use of the text attribute
> is prescribed.
>
> [1] http://www.cablelabs.com/specifications/CL-SP-HTML5-MAP-I02-120510.pdf

Note that this is because that spec is broken. It should be introducing a
new interface, just like WebVTTCue, it shouldn't be using the
intentionally abstract TextTrackCue.

That spec's from 5/10/12, so it's based on the old TextTrackCue spec.

We shouldn't be messing up HTML and WebKit to support a spec that's just
doing things wrong in the first place, IMHO.

Their use case is a common one. It's the same that caused Apple to create the TextTrackCueGeneric class: support for cues that are just starttime, endtime, and text and have no rendering algorithm. It's a use case that the HTML spec should provide for. The W3C spec does so, the WHATWG spec doesn't at this time.


 On Wed, Aug 21, 2013 at 8:44 PM, Philip Jägenstedt <phi...@opera.com> wrote:

As for formats other than WebVTT and TTML I have no specific
knowledge, but I don't understand why the generic TextTrackCue is the
appropriate interface for those formats, as opposed to the
format-specific interface, with the non-renderable data stuck in a
.data property or some such.


When adapting the W3C spec for the needs of generic cues (containing only starttime, endtime, text), I started from the current WHATWG spec.

The WHATWG spec has an abstract TextTrackCue as the root interface for text track cues and relies on other specs to derive concrete interfaces from it.
The only concrete interface that is currently defined AFAIK is the VTTCue interface in the WebVTT spec.

I had two options to support generic cues:

1. Keep the abstract TextTrackCue API and derive a new interface from it - call it GenericCue. It adds a text attribute and a constructor to the abstract interface.

2. Turn the abstract TextTrackCue API into the generic interface by adding the text attribute and a constructor there.


I looked at use cases for what kinds of interfaces we might expect to derive from TextTrackCue in the future: DVD cues, cues with images, TTML cues, cues with JSON etc.

Focusing just on the needs for a .text attribute, you might come to the conclusion that there will be cue types that won't have text attributes.

However, I believe that is the wrong conclusion. The Web as we know it is based on text. Everything on the Web that provides information has a text equivalent. It is a requirement for accessibility, and it is also useful for search and text analysis etc.

If we supported DVD cues, we'd provide the text equivalent of the DVD cues in a .text attribute.

So, I came to the conclusion that .text would be useful on TextTrackCue itself.

Once I made that step, it also made sense for the constructor to follow the .text attribute, since now the TextTrackCue itself contains some actual useful data and should thus be instantiable in JavaScript.

That's how I ended up going with option 2.

Silvia.

Glenn Adams

Aug 23, 2013, 5:23:30 PM
to Silvia Pfeiffer, Ian Hickson, Philip Jägenstedt, blink-dev
I'm on board with this new approach, and it is now implemented by [1]. However, a number of folks have expressed reservations about applying this patch since:

(1) the new TextTrackCue constructor breaks existing usage in a fashion that is not easily debuggable; in particular, existing uses that expect this constructor to use VTT semantics simply will stop working, i.e., not display captions; at least one person has suggested removing the TextTrackCue constructor in order to fail hard (not soft) on existing uses, and then introduce a GenericCue (or GenericTextCue) sub-interface that possesses the generic semantics you have now included in the base interface;

(2) the WHATWG and HTMLWG specs have diverged on these interface definitions;

I believe we need to have a plan for fixing these before we can progress on this patch. Since you and Ian appear to be driving the spec work, is it possible for you to reach a conclusion that resolves these issues?
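
Concern (1) can be illustrated with a sketch of how page script might guard against the soft failure. This is an assumption about what client code could do, not something proposed in the patch: `pickCaptionCueConstructor` and `CueScope` are hypothetical names, and the scope object stands in for the page's global object.

```typescript
// Constructor shape shared by both cue constructors in this sketch.
type CueConstructor = new (startTime: number, endTime: number, text: string) => object;

// Stand-in for the global object; either constructor may be missing.
interface CueScope {
  VTTCue?: CueConstructor;
  TextTrackCue?: CueConstructor;
}

// Prefer VTTCue when it exists so captions keep WebVTT semantics;
// fall back to TextTrackCue on engines that predate the split.
function pickCaptionCueConstructor(scope: CueScope): CueConstructor {
  if (typeof scope.VTTCue === "function") return scope.VTTCue;
  if (typeof scope.TextTrackCue === "function") return scope.TextTrackCue;
  throw new Error("No cue constructor available");
}
```

Code that keeps calling the TextTrackCue constructor directly gets no such signal, which is the debuggability problem described above.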

Ian Hickson

Aug 23, 2013, 5:25:23 PM
to Silvia Pfeiffer, Glenn Adams, Philip Jägenstedt, blink-dev

FWIW, this is the wrong forum for this discussion. I recommend moving it
to somewhere more appropriate, like the WHATWG list.

On Sat, 24 Aug 2013, Silvia Pfeiffer wrote:
>
> The TextTrackCueGeneric class was implemented by Apple to deal with cues
> that come from in-band text tracks (i.e. from inside a video file), but
> are not in WebVTT format and therefore don't follow the WebVTT rendering
> rules. If following the W3C spec, that functionality would indeed now be
> provided by the TextTrackCue object and does not need creation of a
> separate class. There is no solution for the needs of the
> TextTrackCueGeneric class in the WHATWG right now. It's one of the key
> reasons the W3C spec has this unfortunate fork.

To implement the formatting rules of a non-VTT format, you'd need a new
cue type, not TextTrackCue, since if you used TextTrackCue you wouldn't
get any rendering (since it's not associated with a format).


> Their use case is a common one.

Indeed. That's why we adjusted the spec to support exactly that: multiple
formats, each with their own cue rendering rules, each with their own cue
interface. Note that the use case here isn't non-rendering cues, it's cues
in a different format than VTT. There's no "generic" cue need as far as I
can tell.


> Focusing just on the needs for a .text attribute, you might come to the
> conclusion that there will be cue types that won't have text attributes.

That's the right conclusion, as evidenced by the fact that there are cue
types that don't have text (DVD image subtitles, e.g., or prerecorded
audio descriptions, or binary data blobs).


> The Web as we know it is based on text.

Text is certainly important on the Web, but it stands alongside images, video,
audio, proprietary binary blobs, and many other formats.


> Everything on the Web that provides information has a text equivalent.

This is clearly false (much to the chagrin of many of us). There's no sane
text equivalent to Rachmaninoff's Piano Concerto No. 2 in C minor. There's
no sane text equivalent to the binary data that describes how to create
the graph on a slide as a video of a professor drawing that graph plays in
the background. And more importantly, even if there could be, and even if
there should be, there's not necessarily an _actual_ equivalent in the
format in which that data is encoded. It's just not accurate to say that
every timed cue format will always have textual data representing each cue.

Glenn Adams

Aug 23, 2013, 5:35:12 PM
to Ian Hickson, Silvia Pfeiffer, Philip Jägenstedt, blink-dev

On Fri, Aug 23, 2013 at 3:25 PM, Ian Hickson <i...@hixie.ch> wrote:
On Sat, 24 Aug 2013, Silvia Pfeiffer wrote:
> Their use case is a common one.

Indeed. That's why we adjusted the spec to support exactly that: multiple
formats, each with their own cue rendering rules, each with their own cue
interface. Note that the use case here isn't non-rendering cues, it's cues
in a different format than VTT. There's no "generic" cue need as far as I
can tell.

You appear to be assuming that (1) all forms of cues need rendering rules, and (2) that one should define a new cue format specific sub-interface for every distinct format.

I don't believe the first is correct, since it doesn't recognize the utility and use cases for having JS client code access the cue content in a generic text format independently of whether it is rendered or not. As has been pointed out a number of times, there are already implementations and JS client code using this technique.

The second assumption is a design preference you are expressing, but others don't share it. Those who don't share this preference opine that a generic cue type of some sort (either in the base interface or a specific "generic" sub-interface) can address numerous use cases without forcing one to define and publish additional format specific sub-interface types.

G.

Ian Hickson

Aug 23, 2013, 6:16:52 PM
to Glenn Adams, Silvia Pfeiffer, Philip Jägenstedt, blink-dev
On Fri, 23 Aug 2013, Glenn Adams wrote:
>
> You appear to be assuming that (1) all forms of cues need rendering
> rules

Not all forms of cues need to be rendered, no. But it doesn't matter what
the rendering rules are for those that don't need to be rendered.


> and (2) that one should define a new cue format specific sub-interface
> for every distinct format.

Not necessarily every distinct format; some will have similar needs and
can reuse the same interface. For example MicroDVD and PowerDivX would
probably use the same interface, if either got implemented by an HTML UA.


> As has been pointed out a number of times, there are already
> implementations and JS client code using this technique.

Where?

Backwards compatibility concerns are the #1 way to convince me. If there's
content depending on a particular behaviour, then that changes everything.
This whole conversation is moot if there's content depending on deployed
browser implementations.


> The second assumption is a design preference you are expressing, but
> other don't share. Those who don't share this preference opine that a
> generic cue type of some sort (either in the base interface or a
> specific "generic" sub-interface) can address numerous use cases without
> forcing one to define and publish additional format specific
> sub-interface types.

A generic cue format for tracks whose cues have text is not the same thing
as an abstract interface common to all cues.

Glenn Adams

Aug 23, 2013, 6:32:56 PM
to wha...@whatwg.org, Silvia Pfeiffer, Philip Jägenstedt, blink-dev
On Fri, Aug 23, 2013 at 4:16 PM, Ian Hickson <i...@hixie.ch> wrote:
On Fri, 23 Aug 2013, Glenn Adams wrote:
>
> You appear to be assuming that (1) all forms of cues need rendering
> rules

Not all forms of cues need to be rendered, no. But it doesn't matter what
the rendering rules are for those that don't need to be rendered.


> and (2) that one should define a new cue format specific sub-interface
> for every distinct format.

Not necessarily every distinct format; some will have similar needs and
can reuse the same interface. For example MicroDVD and PowerDivX would
probably use the same interface, if either got implemented by an HTML UA.


> As has been pointed out a number of times, there are already
> implementations and JS client code using this technique.

Where?

I think I've pointed this out to you at least four times before, but I'll do so again:


See section 5.2 Closed Captioning.

PhistucK

Aug 24, 2013, 11:48:31 AM
to Glenn Adams, wha...@whatwg.org, Silvia Pfeiffer, Philip Jägenstedt, blink-dev
But where is it used?


PhistucK



Glenn Adams

Aug 24, 2013, 12:11:36 PM
to PhistucK, whatwg, Silvia Pfeiffer, Philip Jägenstedt, blink-dev
On Sat, Aug 24, 2013 at 9:48 AM, PhistucK <phis...@gmail.com> wrote:
But where is it used?

Below.
 


PhistucK


On Sat, Aug 24, 2013 at 1:32 AM, Glenn Adams <gl...@skynav.com> wrote:

On Fri, Aug 23, 2013 at 4:16 PM, Ian Hickson <i...@hixie.ch> wrote:
On Fri, 23 Aug 2013, Glenn Adams wrote:
>
> You appear to be assuming that (1) all forms of cues need rendering
> rules

Not all forms of cues need to be rendered, no. But it doesn't matter what
the rendering rules are for those that don't need to be rendered.


> and (2) that one should define a new cue format specific sub-interface
> for every distinct format.

Not necessarily every distinct format; some will have similar needs and
can reuse the same interface. For example MicroDVD and PowerDivX would
probably use the same interface, if either got implemented by an HTML UA.


> As has been pointed out a number of times, there are already
> implementations and JS client code using this technique.

Where?

I think I've pointed this out to you at least four times before, but I'll do so again:


See section 5.2 Closed Captioning.

This specification has been implemented by CableLabs in a reference implementation of a DLNA defined TV/STB platform for remote user interfaces. The "generic" usage implemented there is being used by television service provider operators to access both MPEG-2 PSI and CEA-608 data in JS client code.

Silvia Pfeiffer

Aug 24, 2013, 6:32:33 PM
to Ian Hickson, Glenn Adams, Philip Jägenstedt, blink-dev
On Sat, Aug 24, 2013 at 7:25 AM, Ian Hickson <i...@hixie.ch> wrote:

FWIW, this is the wrong forum for this discussion. I recommend moving it
to somewhere more appropriate, like the WHATWG list.

On Sat, 24 Aug 2013, Silvia Pfeiffer wrote:
>
> The TextTrackCueGeneric class was implemented by Apple to deal with cues
> that come from in-band text tracks (i.e. from inside a video file), but
> are not in WebVTT format and therefore don't follow the WebVTT rendering
> rules. If following the W3C spec, that functionality would indeed now be
> provided by the TextTrackCue object and does not need creation of a
> separate class. There is no solution for the needs of the
> TextTrackCueGeneric class in the WHATWG right now. It's one of the key
> reasons the W3C spec has this unfortunate fork.

To implement the formatting rules of a non-VTT format, you'd need a new
cue type, not TextTrackCue, since if you used TextTrackCue you wouldn't
get any rendering (since it's not associated with a format).

Having no rendering is the whole idea of it. The rendering is left to the JS dev. The browser just exposes the cues (that are in non-VTT format) to the JS dev.
 

> Their use case is a common one.

Indeed. That's why we adjusted the spec to support exactly that: multiple
formats, each with their own cue rendering rules, each with their own cue
interface. Note that the use case here isn't non-rendering cues, it's cues
in a different format than VTT. There's no "generic" cue need as far as I
can tell.

> Focusing just on the needs for a .text attribute, you might come to the
> conclusion that there will be cue types that won't have text attributes.

That's the right conclusion, as evidenced by the fact that there are cue
types that don't have text (DVD image subtitles, e.g., or prerecorded
audio descriptions, or binary data blobs).


> The Web as we know it is based on text.

Text is certainly important on the Web, but it stands alongside images, video,
audio, proprietary binary blobs, and many other formats.


> Everything on the Web that provides information has a text equivalent.

This is clearly false (much to the chagrin of many of us). There's no sane
text equivalent to Rachmaninoff's Piano Concerto No. 2 in C minor. There's
no sane text equivalent to the binary data that describes how to create
the graph on a slide as a video of a professor drawing that graph plays in
the background. And more importantly, even if there could be, and even if
there should be, there's not necessarily an _actual_ equivalent in the
format in which that data is encoded. It's just not accurate to say that
every timed cue format will always have textual data representing each cue.


Do you have a proposal to satisfy the use case? I am open to other solutions, but as it stands the WHATWG spec doesn't provide a solution for the use case.

Silvia.

Ian Hickson

Aug 25, 2013, 12:40:30 AM
to Silvia Pfeiffer, Glenn Adams, Philip Jägenstedt, blink-dev
On Sun, 25 Aug 2013, Silvia Pfeiffer wrote:
>
> Having no rendering is the whole idea of it. The rendering is left to
> the JS dev. The browser just exposes the cues (that are in non-VTT
> format) to the JS dev.

If the use case is JavaScript adding cues to the object that are then
rendered by JS, then VTTCue serves that use case fine already.

If the use case is browsers fully supporting some other text track format,
then the relevant standard (or some glue standard) should define an
interface for how cues in that text track format are exposed to the Web,
and then that's the interface that should be used.

If the use case is browsers half-heartedly implementing some other text
track format by parsing its cues but not implementing the rendering rules
for them, then we shouldn't support the use case. Such half-hearted
support is bad for the Web. It causes fragmentation, it leads to standards
failure, it's actively harmful.

If you have some other use case in mind, then you should bring it up on
the WHATWG list. I'm not aware of any having been brought up that would
involve new text track interfaces that aren't already handled.

Glenn Adams

Aug 25, 2013, 2:30:59 AM
to Ian Hickson, Silvia Pfeiffer, Philip Jägenstedt, blink-dev
On Sat, Aug 24, 2013 at 10:40 PM, Ian Hickson <i...@hixie.ch> wrote:
On Sun, 25 Aug 2013, Silvia Pfeiffer wrote:
>
> Having no rendering is the whole idea of it. The rendering is left to
> the JS dev. The browser just exposes the cues (that are in non-VTT
> format) to the JS dev.

If the use case is browsers half-heartedly implementing some other text
track format by parsing its cues but not implementing the rendering rules
for them, then we shouldn't support the use case. Such half-hearted
support is bad for the Web. It causes fragmentation, it leads to standards
failure, it's actively harmful.

This is where your thinking goes wrong: exposing content from non-VTT cues via text is not a "half hearted" implementation when there is no intention that the UA render the cue. That's almost like saying that XHR should never expose its content to JS and that every use of XHR should define a sub-interface that knows how to render/use the returned results.

That you fail to recognize the viability of this use case should not block progress with resolving this functionality. I would suggest you defer to Silvia's judgment on this matter, particularly since you have said that this is now her "baby".
 

If you have some other use case in mind, then you should bring it up on
the WHATWG list. I'm not aware of any having been brought up that would
involve new text track interfaces that aren't already handled.

In other words, you fail to see the relevance of the citations that have been offered, even though they have been implemented by the industry based on the original specification of this functionality. Just because you don't recognize that relevance doesn't mean it isn't relevant.

I find it somewhat ironic that you continue to present arguments against supporting a text attribute by citing non-existent implementation support for exposing DVD image subtitles while at the same time you choose to ignore existing implementation support for generic use of the text attribute.

Can you and Silvia please come to some understanding that allows us to move forward rather than holding us back? I don't have a dog in this fight, so whatever you two agree upon I will accept, provided it admits non-VTT based cues, whether this requires defining a new sub-interface or not.

G.

Ian Hickson

Aug 25, 2013, 4:15:52 AM
to Glenn Adams, Silvia Pfeiffer, Philip Jägenstedt, blink-dev
On Sun, 25 Aug 2013, Glenn Adams wrote:
> On Sat, Aug 24, 2013 at 10:40 PM, Ian Hickson <i...@hixie.ch> wrote:
> > On Sun, 25 Aug 2013, Silvia Pfeiffer wrote:
> > >
> > > Having no rendering is the whole idea of it. The rendering is left
> > > to the JS dev. The browser just exposes the cues (that are in
> > > non-VTT format) to the JS dev.
> >
> > If the use case is browsers half-heartedly implementing some other
> > text track format by parsing its cues but not implementing the
> > rendering rules for them, then we shouldn't support the use case. Such
> > half-hearted support is bad for the Web. It causes fragmentation, it
> > leads to standards failure, it's actively harmful.
>
> This is where your thinking goes wrong: exposing content from non-VTT
> cues via text is not a "half hearted" implementation when there is no
> intention that the UA render the cue.

Let's talk concrete formats here. Exactly what format are we talking about
browsers implementing the parsing of that don't have any rendering rules?


> I would suggest you defer to Silvia's judgment on this matter,
> particularly since you have said that this is now her "baby".

What does "this" refer to in this sentence?

Glenn Adams

Aug 25, 2013, 10:44:07 AM
to Ian Hickson, Silvia Pfeiffer, Philip Jägenstedt, blink-dev
On Sun, Aug 25, 2013 at 2:15 AM, Ian Hickson <i...@hixie.ch> wrote:
On Sun, 25 Aug 2013, Glenn Adams wrote:
> On Sat, Aug 24, 2013 at 10:40 PM, Ian Hickson <i...@hixie.ch> wrote:
> > On Sun, 25 Aug 2013, Silvia Pfeiffer wrote:
> > >
> > > Having no rendering is the whole idea of it. The rendering is left
> > > to the JS dev. The browser just exposes the cues (that are in
> > > non-VTT format) to the JS dev.
> >
> > If the use case is browsers half-heartedly implementing some other
> > text track format by parsing its cues but not implementing the
> > rendering rules for them, then we shouldn't support the use case. Such
> > half-hearted support is bad for the Web. It causes fragmentation, it
> > leads to standards failure, it's actively harmful.
>
> This is where your thinking goes wrong: exposing content from non-VTT
> cues via text is not a "half hearted" implementation when there is no
> intention that the UA render the cue.

Let's talk concrete formats here. Exactly what format are we talking about
browsers implementing the parsing of that don't have any rendering rules?

Apparently you haven't bothered reading the citation I keep quoting [1], "Mapping from MPEG-2 Transport to HTML5", which I'll refer to as [MP2MAP].


In that document, you will find (Section 5.1.1) the specification of a Program Description TextTrack, wherein:

TextTrack.kind = "metadata"
TextTrack.label = "video/mp2t track-description"
TextTrack.mode = "hidden"

and, for each PMT (Program Map Table) instance received that differs from previous PMT table instance, a cue is created (by the UA, not client JS) wherein:

TextTrackCue.startTime = current media (NPT) time
TextTrackCue.endTime = INFINITY
TextTrackCue.text = JSON representation of the PMT (as specified in 5.1.1 of [MP2MAP])
TextTrackCue.pauseOnExit = false
TextTrackCue.getCueAsHTML() returns null

An MPEG-2 Program Map Table is defined in ISO/IEC 13818-1:2000 Section 2.4.4.8, and contains an enumeration of the elementary streams contained in the TS, their PIDs (packet identifiers), and descriptive metadata about such streams.
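
A minimal sketch of how client script might consume such a program description track, assuming only the track and cue attributes listed above. `ProgramTrackLike`, `ProgramCueLike`, and `readProgramDescriptions` are hypothetical names, the track is modelled as a plain object rather than a real TextTrack, and the parsed value is left as `unknown` because the JSON layout from Section 5.1.1 of [MP2MAP] is not reproduced here.

```typescript
// Plain-object stand-ins for the UA-created track and cues.
interface ProgramCueLike {
  startTime: number;
  endTime: number;
  text: string; // JSON representation of one PMT instance
}

interface ProgramTrackLike {
  kind: string;
  label: string;
  cues: ProgramCueLike[];
}

// Collect the parsed PMT payloads from a program description track.
// Tracks with a different kind or label yield an empty list, and cues
// whose text is not valid JSON are skipped.
function readProgramDescriptions(track: ProgramTrackLike): unknown[] {
  if (track.kind !== "metadata" || track.label !== "video/mp2t track-description") {
    return [];
  }
  const tables: unknown[] = [];
  for (let i = 0; i < track.cues.length; i++) {
    try {
      tables.push(JSON.parse(track.cues[i].text));
    } catch (e) {
      // Not JSON: ignore this cue in the sketch.
    }
  }
  return tables;
}
```

Nothing here is rendered by the UA; the script simply reads `text` and interprets it.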

In addition, section 5.1.4 "Other TextTracks" of [MP2MAP] requires instantiating additional "Other TextTracks" for all "MPEG-2 stream types that are not UA recognized audio or video stream types" as follows:

TextTrack.kind = "metadata"
TextTrack.label = "video/mp2t-pid" (where pid denotes the PID that contains the stream)
TextTrack.mode = "disabled"

and, "for each PES or private data packet in the program stream represented by the TextTrack", a cue is created (by the UA, not client JS) wherein:

TextTrackCue.startTime = current media (NPT) time
TextTrackCue.endTime = INFINITY
TextTrackCue.text = BASE64 representation of packet
TextTrackCue.pauseOnExit = false
TextTrackCue.getCueAsHTML() returns null

Clearly, both of these uses of TextTrackCue are not intended to be rendered by the UA.
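
For the BASE64 case, script has to turn `text` back into bytes before it can do anything with the packet. The following is a sketch under assumptions: `decodeBase64`, `looksLikePesPacket`, and `RawPacketCue` are hypothetical names, and the decoder is written by hand only so the example has no dependency on a host-provided base64 function.

```typescript
// Standard base64 alphabet (RFC 4648).
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Decode a base64 string into bytes. Padding, whitespace, and any other
// character outside the alphabet are dropped before decoding.
function decodeBase64(text: string): Uint8Array {
  const clean = text.replace(/[^A-Za-z0-9+\/]/g, "");
  const bytes = new Uint8Array(Math.floor((clean.length * 6) / 8));
  let buffer = 0; // pending bits not yet emitted
  let bits = 0;   // number of pending bits in buffer
  let index = 0;
  for (let i = 0; i < clean.length; i++) {
    buffer = (buffer << 6) | BASE64_ALPHABET.indexOf(clean.charAt(i));
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      bytes[index++] = (buffer >> bits) & 0xff;
      buffer &= (1 << bits) - 1; // keep only the leftover low bits
    }
  }
  return bytes;
}

// PES packets begin with the start code prefix 0x000001; private data
// packets need not, so this check is only a hint.
function looksLikePesPacket(bytes: Uint8Array): boolean {
  return bytes.length >= 4 && bytes[0] === 0 && bytes[1] === 0 && bytes[2] === 1;
}

// Stand-in for a UA-created cue on an "Other TextTracks" track.
interface RawPacketCue {
  startTime: number;
  text: string; // BASE64 representation of the packet
}

function packetFromCue(cue: RawPacketCue): Uint8Array {
  return decodeBase64(cue.text);
}
```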

This spec [MP2MAP] Section 5.2 also defines a mapping for CEA-708 (including embedded 608) captions, which in MPEG-2 are encoded in user private data in the video elementary stream. At present, few UAs support the decoding/rendering of embedded 708 (DTVCC) or embedded 608 captions. Consequently, exposing this raw data to JS client code permits one to construct a polyfill to render such captions until such time that such support is widely implemented. However, that day might not come, e.g., due to lower interest among UA vendors in supporting MPEG-2 (than in supporting newer formats that support embedded WebVTT or TTML). In recognition of this state of affairs, [MP2MAP] defines (Section 5.2) a mapping to a generic text track as follows:

TextTrack.kind = "captions"
TextTrack.label = "pid" (where pid denotes the PID that contains the video ES that embeds captions)

and, "for each PES or private data packet in the program stream represented by the TextTrack", a cue is created (by the UA, not client JS) wherein:

TextTrackCue.startTime = media (NPT) time associated with start of caption
TextTrackCue.endTime = media (NPT) time associated with end of caption if known, otherwise INFINITY
TextTrackCue.text = BASE64 representation of embedded data
TextTrackCue.pauseOnExit = false
TextTrackCue.getCueAsHTML() returns an HTML representation if the UA knows how to render, else null

Of these three specified uses, only the last is potentially renderable as captions, while the first two are clearly unrenderable metadata.
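
The caption case differs from the two metadata cases in that `getCueAsHTML()` may or may not return something usable. A polyfill could branch on that, roughly as sketched here. The names `CaptionCueLike` and `captionMarkup` are hypothetical, and a string stands in for the DocumentFragment a real `getCueAsHTML()` would return.

```typescript
// Stand-in for a UA-created caption cue as described in Section 5.2 above.
interface CaptionCueLike {
  startTime: number;
  endTime: number;
  text: string;                    // BASE64 representation of embedded data
  getCueAsHTML(): string | null;   // null when the UA cannot render the cue
}

// Use the UA's own rendering when it is available; otherwise hand the raw
// cue text to a script-side renderer supplied by the polyfill.
function captionMarkup(
  cue: CaptionCueLike,
  scriptRenderer: (rawText: string) => string
): string {
  const native = cue.getCueAsHTML();
  return native !== null ? native : scriptRenderer(cue.text);
}
```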
 


> I would suggest you defer to Silvia's judgment on this matter,
> particularly since you have said that this is now her "baby".

What does "this" refer to in this sentence?

Well, you weren't specific when you said this to me, but I interpreted "this" as including WebVTT and the generic, non-WebVTT related TextTrack* APIs and related functionality. If your intention was to only have her do the former, and not the latter, then I would suggest you consider handing off the latter to her as well.

G.



Glenn Adams

Aug 25, 2013, 10:50:33 AM
to Ian Hickson, Silvia Pfeiffer, Philip Jägenstedt, blink-dev
On Sun, Aug 25, 2013 at 8:44 AM, Glenn Adams <gl...@skynav.com> wrote:
This spec [MP2MAP] Section 5.2 also defines a mapping for CEA-708 (including embedded 608) captions, which in MPEG-2 are encoded in user private data in the video elementary stream. At present, few UAs support the decoding/rendering of embedded 708 (DTVCC) or embedded 608 captions. Consequently, exposing this raw data to JS client code permits one to construct a polyfill to render such captions until such time that such support is widely implemented. However, that day might not come, e.g., due to lower interest among UA vendors in supporting MPEG-2 (than in supporting newer formats that support embedded WebVTT or TTML). In recognition of this state of affairs, [MP2MAP] defines (Section 5.2) a mapping to a generic text track as follows:

TextTrack.kind = "captions"
TextTrack.label = "pid" (where pid denotes the PID that contains the video ES that embeds captions)

and, "for each PES or private data packet in the program stream represented by the TextTrack", a cue is created (by the UA, not client JS) wherein:

Sorry, that last was a paste error, and should read:

and, "for each caption with attributes set" (see CEA-708 for definition of this construct), a cue is created (by the UA, not client JS) wherein:

Philip Jägenstedt

Aug 26, 2013, 3:32:28 AM
to Glenn Adams, Ian Hickson, Silvia Pfeiffer, blink-dev
Is there any particular reason why a new interface for in-band MPEG-2
cues isn't used, as opposed to putting extra information into the
label? Also, what is to be done with in-band text tracks which *are*
supposed to be rendered? It looks like the only option is to render
them using scripts using getCueAsHTML?

Philip

Silvia Pfeiffer

Aug 26, 2013, 6:45:01 AM
to Philip Jägenstedt, Glenn Adams, Ian Hickson, blink-dev
On Sun, Aug 25, 2013 at 2:40 PM, Ian Hickson <i...@hixie.ch> wrote:
> On Sun, 25 Aug 2013, Silvia Pfeiffer wrote:
>>
>> Having no rendering is the whole idea of it. The rendering is left to
>> the JS dev. The browser just exposes the cues (that are in non-VTT
>> format) to the JS dev.
>
> If the use case is JavaScript adding cues to the object that are then
> rendered by JS, then VTTCue serves that use case fine already.

We can't expose CEA708 captions as VTTCues, because the caption
rendering algorithm of VTTCue does not apply to cues of CEA708 format.


> If the use case is browsers fully supporting some other text track format,
> then the relevant standard (or some glue standard) should define an
> interface for how cues in that text track format are exposed to the Web,
> and then that's the interface that should be used.
>
> If the use case is browsers half-heartedly implementing some other text
> track format by parsing its cues but not implementing the rendering rules
> for them, then we shouldn't support the use case. Such half-hearted
> support is bad for the Web. It causes fragmentation, it leads to standards
> failure, it's actively harmful.

IMHO it would be mad for browsers to implement parsing & rendering
algorithms for more than 1 or 2 caption formats (a maintenance
nightmare for duplicate functionality).

But I don't see why that should inhibit browsers from exposing to Web
developers content of tracks in file formats that they support (such
as MKV or MP2 or MP4) and leave the parsing and rendering to the JS
devs. That is in fact what Apple have done in
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/html/track/TextTrackCueGeneric.h
and what is therefore already implemented in WebKit and blink.


On Mon, Aug 26, 2013 at 5:32 PM, Philip Jägenstedt <phi...@opera.com> wrote:
>
> Is there any particular reason why a new interface for in-band MPEG-2
> cues isn't used, as opposed to putting extra information into the
> label?

Just a FYI: I think the use of @label in what Glenn explained will be
replaced by using @inBandMetadataTrackDispatchType in future. The
latter attribute was created as a consequence of the in-band use cases
listed in the spec that Glenn keeps citing, since @label is supposed
to be a human readable string and thus not a good match for the use
case ( http://www.whatwg.org/specs/web-apps/current-work/#text-track-label
).


> Also, what is to be done with in-band text tracks which *are*
> supposed to be rendered?

There are specifications for how to put WebVTT into MPEG-2, TTML into
MPEG-2, SRT into MPEG-2, CEA-708 into MPEG-2, program description
tracks into MPEG-2, and likely other things. You can't expose them all
through a single generic "MPEG-2 cues" interface.

Rather you want those cues for which there is support in the browser
to be exposed by their concrete interfaces, eg WebVTT in MPEG-2 should
end up creating VTTCue objects.

The rest should just end up in a generic container that supports start
time, end time, text (which is what the W3C TextTrackCue objects are),
since no rendering is provided by the browser (yet). Information about
the format of the cue is available to the JS dev in the TextTrack's
@kind and @inBandMetadataTrackDispatchType (btw: the latter is a
poorly named attribute for non-metadata tracks).

In short: there is no need to invent an MPEG2Cue interface since it is
exactly what the W3C TextTrackCue interface provides: start time, end
time, text.

The same applies to other encapsulation formats: e.g. srt in MKV, srt
in OGG. These are also quite happily satisfied by the W3C TextTrackCue
objects.
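
The division of labour described above (browser exposes start time, end time, and text; script picks a parser from the track's @kind and @inBandMetadataTrackDispatchType) could look roughly like the following. It is a sketch under assumptions: `ScriptParserRegistry`, `ExposedTrack`, and `ExposedCue` are hypothetical names, and any dispatch type strings passed in are placeholders rather than values defined by a spec.

```typescript
// Stand-ins for what the browser exposes for an unparsed in-band cue.
interface ExposedCue {
  startTime: number;
  endTime: number;
  text: string;
}

interface ExposedTrack {
  kind: string;
  inBandMetadataTrackDispatchType: string;
}

// A script-side parser turns the raw cue text into something displayable.
type ScriptCueParser = (cue: ExposedCue) => string;

// Registry of script parsers keyed by track kind and dispatch type.
class ScriptParserRegistry {
  private parsers: Record<string, ScriptCueParser> = {};

  register(kind: string, dispatchType: string, parser: ScriptCueParser): void {
    this.parsers[kind + "|" + dispatchType] = parser;
  }

  // Returns the parsed cue, or undefined when no parser is registered
  // for this track's format.
  parse(track: ExposedTrack, cue: ExposedCue): string | undefined {
    const key = track.kind + "|" + track.inBandMetadataTrackDispatchType;
    const parser = Object.prototype.hasOwnProperty.call(this.parsers, key)
      ? this.parsers[key]
      : undefined;
    return parser ? parser(cue) : undefined;
  }
}
```

Cues that the browser does understand would bypass this path and arrive as concrete objects such as VTTCue.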


> It looks like the only option is to render
> them using scripts using getCueAsHTML?

getCueAsHTML() is a browser-provided function - if browsers don't
parse the cue content, but just expose it, they certainly won't
provide a getCueAsHTML() that's useful for rendering. This is why I
didn't add getCueAsHTML() back into the TextTrackCue interface. The
minute that a browser implements a parsing and rendering function for
a TextTrackCue format, a spec should be written that defines a new
interface that inherits from TextTrackCue and defines a getCueAsHTML()
function.

Silvia.

Glenn Adams

Aug 26, 2013, 11:27:02 AM
to Philip Jägenstedt, Ian Hickson, Silvia Pfeiffer, blink-dev
I believe Silvia's last message addresses these questions. Let me know if you feel more input is required. 

Philip Jägenstedt

Aug 27, 2013, 2:56:29 AM
to Silvia Pfeiffer, Glenn Adams, Ian Hickson, blink-dev
On Mon, Aug 26, 2013 at 12:45 PM, Silvia Pfeiffer <silv...@chromium.org> wrote:
> On Sun, Aug 25, 2013 at 2:40 PM, Ian Hickson <i...@hixie.ch> wrote:
>> On Sun, 25 Aug 2013, Silvia Pfeiffer wrote:
>>>
>>> Having no rendering is the whole idea of it. The rendering is left to
>>> the JS dev. The browser just exposes the cues (that are in non-VTT
>>> format) to the JS dev.
>>
>> If the use case is JavaScript adding cues to the object that are then
>> rendered by JS, then VTTCue serves that use case fine already.
>
> We can't expose CEA708 captions as VTTCues, because the caption
> rendering algorithm of VTTCue does not apply to cues of CEA708 format.

I don't think I follow. If CEA708 can be rendered, then surely it
should have a CEA708Cue interface or similar, no? I tried searching
for the plans for CEA708 (CEA708 TextTrackCue) but only found this
very thread :-/

>> If the use case is browsers fully supporting some other text track format,
>> then the relevant standard (or some glue standard) should define an
>> interface for how cues in that text track format are exposed to the Web,
>> and then that's the interface that should be used.
>>
>> If the use case is browsers half-heartedly implementing some other text
>> track format by parsing its cues but not implementing the rendering rules
>> for them, then we shouldn't support the use case. Such half-hearted
>> support is bad for the Web. It causes fragmentation, it leads to standards
>> failure, it's actively harmful.
>
> IMHO it would be mad for browsers to implement parsing & rendering
> algorithms for more than 1 or 2 caption formats (a maintenance
> nightmare for duplicate functionality).
>
> But I don't see why that should inhibit browsers from exposing to Web
> developers content of tracks in file formats that they support (such
> as MKV or MP2 or MP4) and leave the parsing and rendering to the JS
> devs. That is in fact what Apple have done in
> https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/html/track/TextTrackCueGeneric.h
> and what is therefore already implemented in WebKit and blink.

Does TextTrackCueGeneric have any rendering? In any event, using a
TextTrackCue which cannot be rendered doesn't sound like a good match
for in-band formats other than WebVTT. If [start, end, text] is all
the information carried by the in-band data, then exposing it as
VTTCue seems fine to me. However, if the format does have some other
information, then it seems like a really bad idea to initially expose
it as a non-renderable TextTrackCue, as scripts would come to depend
on it never being rendered and block browsers from ever rendering that
format itself.

> On Mon, Aug 26, 2013 at 5:32 PM, Philip Jägenstedt <phi...@opera.com> wrote:
>>
>> Is there any particular reason why a new interface for in-band MPEG-2
>> cues isn't used, as opposed to putting extra information into the
>> label?
>
> Just a FYI: I think the use of @label in what Glenn explained will be
> replaced by using @inBandMetadataTrackDispatchType in future. The
> latter attribute was created as a consequence of the in-band use cases
> listed in the spec that Glenn keeps citing, since @label is supposed
> to be a human readable string and thus not a good match for the use
> case ( http://www.whatwg.org/specs/web-apps/current-work/#text-track-label
> ).

Thanks, I have a vague memory of that discussion now that you mention it.

>> Also, what is to be done with in-band text tracks which *are*
>> supposed to be rendered?
>
> There are specifications for how to put WebVTT into MPEG-2, TTML into
> MPEG-2, SRT into MPEG-2, CEA-708 into MPEG-2, program description
> tracks into MPEG-2, and likely other things. You can't expose them all
> through a single generic "MPEG-2 cues" interface.
>
> Rather you want those cues for which there is support in the browser
> to be exposed by their concrete interfaces, eg WebVTT in MPEG-2 should
> end up creating VTTCue objects.
>
> The rest should just end up in a generic container that supports start
> time, end time, text (which is what the W3C TextTrackCue objects are),
> since no rendering is provided by the browser (yet). Information about
> the format of the cue is available to the JS dev in the TextTrack's
> @kind and @inBandMetadataTrackDispatchType (btw: the latter is a
> poorly named attribute for non-metadata tracks).

(The "yet" there is tricky, see above.)

> In short: there is no need to invent a MPEG2Cue interface since it is
> exactly what the W3C TextTrackCue interface provides: start time, end
> time, text.
>
> The same applies to other encapsulation formats: e.g. srt in MKV, srt
> in OGG. These are also quite happily satisfied by the W3C TextTrackCue
> objects.

Ah, right. I imagined that maybe MPEG-2 in-band text tracks had a
common format and that there was a "metadata" bit which could be set
on it, more like how the out-of-band WebVTT format works.

My current thinking is that proper renderable text tracks should just
be exposed using an appropriate interface (like VTTCue) or not at all,
but that leaves us with these in-band metadata tracks. (Continued in
response to Glenn.)

Philip

Philip Jägenstedt

Aug 27, 2013, 3:07:32 AM8/27/13
to Glenn Adams, Ian Hickson, Silvia Pfeiffer, blink-dev
Yeah, Silvia cleared up some things for me, but I'm not entirely clear
about which kinds of in-band MPEG-2 tracks you want to expose using
the TextTrackCue interface. The PDF says "For all MPEG-2 stream types
that are not UA recognized audio or video stream types, the UA MUST
create a new TextTrack in the TextTrackList of the media resource."
Does this mean that you intend to use it for any normal (non-metadata)
tracks which can be rendered but have no particular rendering rules?

As for metadata in-band tracks, are the kinds of in-band metadata
tracks completely open-ended, or why is it not feasible to expose
those using specific interfaces? For example, for PMT it seems more
reasonable to just have a PMTCue with stream_pid, pid and
es_descriptors rather than encoding that information as JSON and
putting it in TextTrackCue.text.
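
To make the contrast concrete, here is a rough JavaScript sketch of what script would have to do in each case. The JSON field names and the PMTCue class are hypothetical, borrowed only from the names mentioned above; the real JSON layout is whatever the mapping spec defines and is not reproduced here.

```javascript
// Hypothetical stand-in for a UA-created cue whose .text holds the PMT as JSON.
// The field names are assumptions for illustration, not the mapping spec's.
const jsonCue = {
  startTime: 12.5,
  endTime: Infinity,
  text: JSON.stringify({ pid: 256, es_descriptors: [{ stream_type: 0x06, pid: 257 }] }),
};

// Approach 1: script parses and validates the text payload itself.
function readPmtFromText(cue) {
  const pmt = JSON.parse(cue.text); // throws SyntaxError on malformed input
  if (!Array.isArray(pmt.es_descriptors)) {
    throw new TypeError("unexpected PMT payload");
  }
  return pmt;
}

// Approach 2: a hypothetical PMTCue exposes the same data as typed attributes.
class PMTCue {
  constructor(startTime, endTime, pid, esDescriptors) {
    this.startTime = startTime;
    this.endTime = endTime;
    this.pid = pid;
    this.es_descriptors = esDescriptors;
  }
}

const parsed = readPmtFromText(jsonCue);
const typed = new PMTCue(12.5, Infinity, 256, [{ stream_type: 0x06, pid: 257 }]);
console.log(parsed.pid === typed.pid);
```

In the first case every consumer repeats the parsing and validation; in the second the UA does it once behind the interface.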

Apologies if you've covered this already during the many months this
discussion has been going.

Philip

Glenn Adams

Aug 27, 2013, 10:56:12 AM8/27/13
to Philip Jägenstedt, Ian Hickson, Silvia Pfeiffer, blink-dev
On Tue, Aug 27, 2013 at 1:07 AM, Philip Jägenstedt <phi...@opera.com> wrote:
> On Mon, Aug 26, 2013 at 5:27 PM, Glenn Adams <gl...@skynav.com> wrote:
> > On Mon, Aug 26, 2013 at 1:32 AM, Philip Jägenstedt <phi...@opera.com>
> > wrote:
> >>
> >> On Sun, Aug 25, 2013 at 4:44 PM, Glenn Adams <gl...@skynav.com> wrote:
> >> > Of these three specified uses, only the last is potentially renderable
> >> > as
> >> > captions, while the first two are clearly unrenderable metadata.
> >>
> >> Is there any particular reason why a new interface for in-band MPEG-2
> >> cues isn't used, as opposed to putting extra information into the
> >> label? Also, what is to be done with in-band text tracks which *are*
> >> supposed to be rendered? It looks like the only option is to render
> >> them using scripts using getCueAsHTML?
> >
> >
> > I believe Silvia's last message addresses these questions. Let me know if
> > you feel more input is required.
>
> Yeah, Silvia cleared up some things for me, but I'm not entirely clear
> about which kinds of in-band MPEG-2 tracks you want to expose using
> the TextTrackCue interface.

To make it clear, I don't care if these are exposed using the TextTrackCue interface or some other GenericCue or MetadataCue interface derived from TextTrackCue. Also, this MPEG-2 spec referred to isn't my spec, but merely one I am aware of which exposes non-rendered metadata to JS client code.
 
> The PDF says "For all MPEG-2 stream types
> that are not UA recognized audio or video stream types, the UA MUST
> create a new TextTrack in the TextTrackList of the media resource."
> Does this mean that you intend to use it for any normal (non-metadata)
> tracks which can be rendered but have no particular rendering rules?

Again, it is not necessarily a spec I intend to use, but one that is defined in the industry and is starting to be deployed for use. Since I work in that industry, it is probable that some of my clients will use it if they aren't already doing so.

The general assumption in that MPEG-2 mapping spec is that UAs will *not* know how to parse let alone render most potential metadata or even a number of caption/subtitle formats, but that, by exposing to JS client, it would be possible to do so.

As for potentially renderable tracks, such as 708, DVB subtitles, EBU teletext, etc., as I mentioned in my previous posting, these are not today parsed or rendered by UAs, and it is unclear if they ever will be. It would certainly be possible to define a sub-interface for each of these, but if the only distinction in such sub-interfaces is that they implement getCueAsHTML() as appropriate to the source format but are otherwise identical, then one wonders why it is necessary to create, standardize, and implement distinct, publicly exposed interfaces when a common base interface would suffice. Indeed, the authors of the MPEG-2 mapping spec assumed this was the intent of the original definition of TextTrackCue and that such an implementation approach was the logical thing to do. The only problem with that original definition was that it tied TextTrackCue to WebVTT semantics.

The problems arose when the base interface was subdivided into VTTCue and the essential equipment needed to use TextTrackCue as a generic interface was moved (to VTTCue) as well. This change basically broke such use as a generic interface without providing an alternative. If that change had been accompanied by the definition of another interface type, such as GenericCue, that provided at least a constructor and a text attribute, then I believe there would have been no disagreement and no problem with supporting the MPEG-2 mapping spec (and similar uses).

As we know now, Silvia eventually concluded that this generic use case could be handled by the base interface itself by restoring a constructor and text attribute. I agree that is possible, but it does create a backwards compatibility problem for existing uses that assume that a WebVTT cue is constructed with the base interface constructor. Changing the semantics of the constructor (as opposed to removing it) will cause existing uses to fail soft rather than fail hard, which will be harder to correct. SimonP pointed this out in previous discussions. In my mind, this is a good argument for removing the existing constructor and adding a new GenericCue interface with its own constructor.
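
To illustrate the fail soft versus fail hard distinction, here is a minimal JavaScript sketch. The classes are plain stand-ins for the native interfaces (so it runs outside a browser), and AbstractTextTrackCue is a hypothetical name for a base interface whose constructor has been removed; none of this is actual Blink behavior.

```javascript
// Stand-in for a base interface that keeps a constructor but drops WebVTT semantics.
class TextTrackCue {
  constructor(startTime, endTime, text) {
    this.startTime = startTime;
    this.endTime = endTime;
    this.text = text; // raw text only
  }
}

// Hypothetical stand-in for a base interface with no usable constructor.
class AbstractTextTrackCue {
  constructor() {
    throw new TypeError("Illegal constructor");
  }
}

// Legacy page code written when the base constructor produced WebVTT cues.
function legacyMakeCue(Ctor) {
  const cue = new Ctor(0, 5, "<b>Hello</b>");
  cue.align = "start"; // on a generic cue this is just an ad hoc property
  return cue;
}

// Fail soft: construction succeeds, but the WebVTT behavior is silently gone.
const softCue = legacyMakeCue(TextTrackCue);
const hasWebVttBehavior = typeof softCue.getCueAsHTML === "function";

// Fail hard: the same legacy code throws immediately, pointing at the stale call.
let failedHard = false;
try {
  legacyMakeCue(AbstractTextTrackCue);
} catch (e) {
  failedHard = e instanceof TypeError;
}

console.log({ hasWebVttBehavior, failedHard });
```

The soft failure only shows up later, when captions stop rendering as expected; the hard failure surfaces at the construction site.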
 

> As for metadata in-band tracks, are the kinds of in-band metadata
> tracks completely open-ended, or why is it not feasible to expose
> those using specific interfaces?

Yes, completely open ended. Again, it is possible to create new sub-interfaces for every one that needs support, but that requires standardization even when there may be no distinction in such sub-interfaces; i.e., they all may simply use the existing text attribute to expose their data (as was done in the MPEG-2 mapping spec).

Regarding what non-AV streams are potentially exposed, one has to look first at the MPEG-2 Systems specification (ISO/IEC13818-1:2000), and then look at derivative specifications/systems. Table 2-29 defines the following non-AV stream types:

0x05 ISO/IEC 13818-1 Private Sections
0x06 ISO/IEC 13818-1 PES packets containing private data
0x07 ISO/IEC 13522 MHEG
0x08 ISO/IEC 13818-1 Annex A DSM-CC
0x09 ITU H.222.1 (ATM Multiplex - may contain AV and non-AV data)
0x0a ISO/IEC 13818-6 type A
0x0b ISO/IEC 13818-6 type B
0x0c ISO/IEC 13818-6 type C
0x0d ISO/IEC 13818-6 type D
0x0e ISO/IEC 13818-1 auxiliary
0x14 ISO/IEC 13818-6 Synchronized Download Protocol
0x15 ISO/IEC 13818-1:2000 AMD 1 Metadata carried in PES packets using Metadata Access Unit Wrapper
0x16 ISO/IEC 13818-1:2000 AMD 1 Metadata carried in metadata_sections
0x17 ISO/IEC 13818-1:2000 AMD 1 Metadata carried in ISO/IEC 13818-6 (DSM-CC) Data Carousel
0x18 ISO/IEC 13818-1:2000 AMD 1 Metadata carried in ISO/IEC 13818-6 (DSM-CC) Object Carousel
0x19 ISO/IEC 13818-1:2000 AMD 1 Metadata carried in ISO/IEC 13818-6 (DSM-CC) Synchronized Download Protocol

and stream types 0x80 to 0xff are defined for "User Private" applications.
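
As a rough illustration of how open-ended this is for client code, here is a small JavaScript sketch that classifies a stream_type value. The map holds only a subset of the entries listed above, with abbreviated descriptions; the function name and fallback string are my own and not from any spec.

```javascript
// Subset of the non-AV stream types listed above (descriptions abbreviated).
const NON_AV_STREAM_TYPES = new Map([
  [0x05, "ISO/IEC 13818-1 private sections"],
  [0x06, "ISO/IEC 13818-1 PES packets containing private data"],
  [0x07, "ISO/IEC 13522 MHEG"],
  [0x15, "Metadata carried in PES packets"],
  [0x16, "Metadata carried in metadata_sections"],
]);

// Hypothetical helper: describe a stream_type for dispatch purposes.
function describeStreamType(streamType) {
  if (streamType >= 0x80 && streamType <= 0xff) {
    return "user private";
  }
  if (NON_AV_STREAM_TYPES.has(streamType)) {
    return NON_AV_STREAM_TYPES.get(streamType);
  }
  return "not in this subset";
}

console.log(describeStreamType(0x07));
console.log(describeStreamType(0x95));
```

Even with the full table filled in, the "user private" range still requires knowledge that only the regional system or application has.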

Most of the above stream types further distinguish (meta)data subtypes, some of which are standardized for non-region specific use, and others standardized for region specific use. For example, stream type 0x05 is used to transport both DVB-SI (System Information) and ATSC PSIP (Program and System Information Protocol) which provide EPG scheduling data.

The different regional systems and other applications that employ MPEG-2 also define their own "user private" stream types, such as:

In ATSC Systems

0x95 ATSC A/90 Data Service Table, Network Resources Table
0xC2 ATSC A/90 PES packets containing streaming, synchronous data

This is just a sampling of potential sources for metadata in the MPEG-2 context. There are similar large sets of standards and usages defined for MPEG-4 and other container formats.

The bottom line is that it is neither practical nor useful to have to define format specific sub-interfaces for each of these forms of metadata carriage or the metadata itself when the simple text attribute will suffice.

Can sub-interfaces be defined for potentially renderable non-AV tracks? Sure, but when will that be (if ever)? For example, stream type 0x07 supports MHEG (ISO/IEC 13522), which is a renderable format, much of which could be rendered as HTML. Would it be useful to define a sub-interface? Perhaps, but again a simple getCueAsHTML() may suffice as well. In the meantime, exposing MPEG data directly to JS allows an author to construct a polyfill to translate to HTML provided the raw data is available via a text attribute.
 
> For example, for PMT it seems more
> reasonable to just have a PMTCue with stream_pid, pid and
> es_descriptors rather than encoding that information as JSON and
> putting it in TextTrackCue.text.

While it is possible to define a PMTCue, are we going to standardize each and every of the myriad uses of metadata already fielded and used merely to provide syntactic sugar for accessing data parsed by the UA instead of by client JS code? Well, we might create such sub-interfaces in the future, but we should do so at the expense of providing support for JS client parsing approaches that don't depend on future, unknown standardization activities.

Glenn Adams

Aug 27, 2013, 11:07:59 AM8/27/13
to Philip Jägenstedt, Ian Hickson, Silvia Pfeiffer, blink-dev
On Tue, Aug 27, 2013 at 8:56 AM, Glenn Adams <gl...@skynav.com> wrote:
> Well, we might create such sub-interfaces in the future, but we should do so at the expense of providing support for JS client parsing approaches that don't depend on future, unknown standardization activities.

s/we should do so/should we do so (?)/

Silvia Pfeiffer

Aug 27, 2013, 7:03:20 PM8/27/13
to Philip Jägenstedt, Glenn Adams, Ian Hickson, blink-dev
On Tue, Aug 27, 2013 at 4:56 PM, Philip Jägenstedt <phi...@opera.com> wrote:
> On Mon, Aug 26, 2013 at 12:45 PM, Silvia Pfeiffer <silv...@chromium.org> wrote:
>> On Sun, Aug 25, 2013 at 2:40 PM, Ian Hickson <i...@hixie.ch> wrote:
>>> On Sun, 25 Aug 2013, Silvia Pfeiffer wrote:
>>>>
>>>> Having no rendering is the whole idea of it. The rendering is left to
>>>> the JS dev. The browser just exposes the cues (that are in non-VTT
>>>> format) to the JS dev.
>>>
>>> If the use case is JavaScript adding cues to the object that are then
>>> rendered by JS, then VTTCue serves that use case fine already.
>>
>> We can't expose CEA708 captions as VTTCues, because the caption
>> rendering algorithm of VTTCue does not apply to cues of CEA708 format.
>
> I don't think I follow, if CEA708 can be rendered then surely it
> should have a CEA708Cue interface or similar, no?

If no browser wants to implement CEA708 rendering (which, AFAIK, is the
case right now), why should there be such an interface?

[..]

>
> My current thinking is that proper renderable text tracks should just
> be exposed using an appropriate interface (like VTTCue) or not at all,
> but that leaves us with these in-band metadata tracks.

There are three steps involved at which we can expose an interface for
in-band tracks:

1. unravel the in-band container encapsulation:
you end up with a sequence of {starttime, endtime, data} objects
(which we call "cues")

2. parse the content (data) in the cues:
you end up with a sequence of {starttime, endtime, data, getCueAsHTML()} objects
(incidentally, somebody on the HTML list is asking for such cues right now)

3. render the parsed data in the cues:
you end up with a sequence of {starttime, endtime, data,
getCueAsHTML()} objects with rendering rules
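
The three stages can be sketched roughly in JavaScript as follows. This is only an illustration under stated assumptions: the sample shape (pts, duration, payload), the toy parser, and the string returned by getCueAsHTML() are all made up here; the real getCueAsHTML() returns a DocumentFragment.

```javascript
// Escape text so the toy parser below produces safe markup.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Stage 1: de-encapsulation yields {startTime, endTime, data} records.
function deEncapsulate(samples) {
  return samples.map((s) => ({
    startTime: s.pts,
    endTime: s.pts + s.duration,
    data: s.payload,
  }));
}

// Stage 2: parsing adds a getCueAsHTML()-style accessor (a string in this sketch).
function parseCue(cue) {
  return { ...cue, getCueAsHTML: () => `<p>${escapeHtml(cue.data)}</p>` };
}

// Stage 3: rendering picks the cues active at time t and emits their markup.
function renderAt(cues, t) {
  return cues
    .filter((c) => c.startTime <= t && t < c.endTime)
    .map((c) => c.getCueAsHTML());
}

const samples = [
  { pts: 0, duration: 2, payload: "a & b" },
  { pts: 2, duration: 2, payload: "c" },
];
const rawCues = deEncapsulate(samples);   // what stage 1 alone exposes
const parsedCues = rawCues.map(parseCue); // stage 2
const rendered = renderAt(parsedCues, 1); // stage 3
console.log(rendered);
```

A UA that stops after stage 1 hands rawCues to script, and stages 2 and 3 become the page author's job.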


From where I stand, I saw the last two as an integrated pair: if a
browser decides to implement a parser for the cues, I would expect
them to also decide to implement rendering (since the hardest part is
already done). But we can have this discussion on the HTML list.

Your statement above, however, takes this an additional step and
assumes that browsers would either take all three steps or none. That's
where reality has shown us that it's not the case: both the Apple
implementation of TextTrackCueGeneric and the MPEG-2 parsing
specification have shown the need to expose the data in a cue that has
been de-encapsulated, but without requiring parsing and rendering.


My conclusion of this thread for now is that we should now take this
discussion back to the HTML WG. It seems to me that the extension of
TextTrackCue to include .text is not agreeable or at least needs more
discussion. It might make more sense to create a sub-interface of
UnparsedCue (as already proposed by Simon) that captures the common
functionality of {starttime, endtime, data}. But we do need to discuss
the consequences of this on the standards list.

Silvia.

Ian Hickson

Aug 28, 2013, 4:21:20 PM8/28/13
to Glenn Adams, Philip Jägenstedt, Silvia Pfeiffer, blink-dev
On Sun, 25 Aug 2013, Glenn Adams wrote:
> On Sun, Aug 25, 2013 at 2:15 AM, Ian Hickson <i...@hixie.ch> wrote:
>
> http://www.cablelabs.com/specifications/CL-SP-HTML5-MAP-I02-120510.pdf
>
> In that document, you will find (Section 5.1.1) the specification of a
> Program Description TextTrack, wherein:
>
> TextTrack.kind = "metadata"
> TextTrack.label = "video/mp2t track-description"
> TextTrack.mode = "hidden"
>
> and, for each PMT (Program Map Table) instance received that differs from
> previous PMT table instance, a cue is created (by the UA, not client JS)
> wherein:
>
> TextTrackCue.startTime = current media (NPT) time
> TextTrackCue.endTime = INFINITY
> TextTrackCue.text = JSON representation of the PMT (as specified in 5.1.1
> of [MP2MAP])

I really don't think it makes any sense at all to be using a text track
interface for this at all.

These aren't text tracks. They're not cues. Why are we shoehorning this
data into an API that really wasn't designed for it?

Just make a new API specifically for exposing PMTs.

A cue's end time isn't supposed to be "Infinity". If you find yourself
creating infinite cues, you know you're doing it wrong.


> In addition, section 5.1.4 "Other TextTracks" of [MP2MAP] require
> instantiating additional "Other TextTracks" for all "MPEG-2 stream types
> that are not UA recognized audio or video stream types" as follows:
>
> TextTrack.kind = "metadata"
> TextTrack.label = "video/mp2t-*pid*" (where *pid* denotes the PID that
> contains the stream)
> TextTrack.mode = "disabled"

Note that this (and the other cases) are wildly misusing the TextTrack
attributes. A MIME type isn't a human-readable label. The PID should be in
the id or inBandMetadataTrackDispatchType attributes.


> and, "for each PES or private data packet in the program stream represented
> by the TextTrack", a cue is created (by the UA, not client JS) wherein:
>
> TextTrackCue.startTime = current media (NPT) time
> TextTrackCue.endTime = INFINITY
> TextTrackCue.text = BASE64 representation of packet

This is an even better example of why TextTrackCue isn't appropriate.

You have binary data, and you're _base64 encoding it to shoehorn it into a
DOMString field_. Just expose it as binary data!
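
A rough JavaScript sketch of the difference, for illustration only. The cue objects are plain stand-ins, and the data attribute on the second one is a hypothetical binary field, not something any spec defines here.

```javascript
// Base64-in-text approach: script has to decode .text back into bytes.
function bytesFromBase64Text(cue) {
  const binaryString = atob(cue.text); // one character per byte
  return Uint8Array.from(binaryString, (ch) => ch.charCodeAt(0));
}

// Hypothetical alternative: the cue carries the bytes directly.
function bytesFromBinaryCue(cue) {
  return new Uint8Array(cue.data); // cue.data is an assumed ArrayBuffer attribute
}

// "RwAB" is the base64 encoding of the three bytes 0x47 0x00 0x01.
const textCue = { startTime: 0, endTime: Infinity, text: "RwAB" };
const binaryCue = {
  startTime: 0,
  endTime: Infinity,
  data: new Uint8Array([0x47, 0x00, 0x01]).buffer,
};

const decoded = bytesFromBase64Text(textCue);
const direct = bytesFromBinaryCue(binaryCue);
console.log(Array.from(decoded), Array.from(direct));
```

The encode and decode round trip adds work and size without adding any information.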


I really think the whole approach of that document is basically confused.

Are any browsers planning on implementing this?


> > > I would suggest you defer to Silvia's judgment on this matter,
> > > particularly since you have said that this is now her "baby".
> >
> > What does "this" refer to in this sentence?
>
> Well, you weren't specific when you said this to me, but I interpreted
> "this" as including WebVTT and the generic, non-WebVTT related
> TextTrack* APIs and related functionality. If your intention was to only
> have her do the former, and not the latter, then I would suggest you
> consider handing off the latter to her as well.

I meant WebVTT, which Silvia is editing.


On Mon, 26 Aug 2013, Silvia Pfeiffer wrote:
>
> We can't expose CEA708 captions as VTTCues, because the caption
> rendering algorithm of VTTCue does not apply to cues of CEA708 format.

Right, they should be using a CEA708 cue format, if browsers are going to
support the format at all.


On Tue, 27 Aug 2013, Glenn Adams wrote:
>
> To make it clear, I don't care if these are exposed using the TextTrackCue
> interface or some other GenericCue or MetadataCue interface derived from
> TextTrackCue.

Then why are we still debating this?

Glenn Adams

Aug 28, 2013, 4:26:06 PM8/28/13
to Ian Hickson, Philip Jägenstedt, Silvia Pfeiffer, blink-dev
On Wed, Aug 28, 2013 at 2:21 PM, Ian Hickson <i...@hixie.ch> wrote:
> On Tue, 27 Aug 2013, Glenn Adams wrote:
> >
> > To make it clear, I don't care if these are exposed using the TextTrackCue
> > interface or some other GenericCue or MetadataCue interface derived from
> > TextTrackCue.
>
> Then why are we still debating this?

Because you and Silvia haven't fixed the specs so that (1) there is no fork and (2) generic cues are explicitly supported. 

Ian Hickson

Aug 28, 2013, 4:30:32 PM8/28/13
to Glenn Adams, Philip Jägenstedt, Silvia Pfeiffer, blink-dev
Well Silvia can fix the W3C fork any time she wants; in the meantime, the
WHATWG spec already supports generic cues. You just create a new interface
for your cue, and derive it from TextTrackCue, in exactly the same way as
WebVTT does VTTCue. There's no changes needed to the HTML spec for this.
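
As a sketch of what that derivation looks like from the script side, here is a JavaScript model using plain classes as stand-ins for the native interfaces. Mpeg2PacketCue and its pid and data attributes are hypothetical names for whatever the spec that creates the cues would define.

```javascript
// Stand-in for the native base interface (timing only in this sketch).
class TextTrackCue {
  constructor(startTime, endTime) {
    this.startTime = startTime;
    this.endTime = endTime;
  }
}

// Stand-in for the WebVTT sub-interface.
class VTTCue extends TextTrackCue {
  constructor(startTime, endTime, text) {
    super(startTime, endTime);
    this.text = text;
  }
}

// Hypothetical sub-interface defined by a container mapping spec.
class Mpeg2PacketCue extends TextTrackCue {
  constructor(startTime, endTime, pid, data) {
    super(startTime, endTime);
    this.pid = pid;   // assumed attribute
    this.data = data; // assumed attribute: raw bytes
  }
}

// Script can dispatch on the concrete cue type.
function describeCue(cue) {
  if (cue instanceof VTTCue) return `WebVTT cue: ${cue.text}`;
  if (cue instanceof Mpeg2PacketCue) {
    return `MPEG-2 packet cue, PID ${cue.pid}, ${cue.data.length} bytes`;
  }
  return "unrecognized cue type";
}

const cues = [
  new VTTCue(0, 2, "Hello"),
  new Mpeg2PacketCue(0, 2, 257, new Uint8Array(188)),
];
const descriptions = cues.map(describeCue);
console.log(descriptions);
```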

Glenn Adams

Aug 28, 2013, 4:33:19 PM8/28/13
to Ian Hickson, Philip Jägenstedt, Silvia Pfeiffer, blink-dev
On Wed, Aug 28, 2013 at 2:30 PM, Ian Hickson <i...@hixie.ch> wrote:
> On Wed, 28 Aug 2013, Glenn Adams wrote:
> > On Wed, Aug 28, 2013 at 2:21 PM, Ian Hickson <i...@hixie.ch> wrote:
> > > On Tue, 27 Aug 2013, Glenn Adams wrote:
> > > >
> > > > To make it clear, I don't care if these are exposed using the
> > > > TextTrackCue interface or some other GenericCue or MetadataCue
> > > > interface derived from TextTrackCue.
> > >
> > > Then why are we still debating this?
> >
> > Because you and Silvia haven't fixed the specs so that (1) there is no
> > fork and (2) generic cues are explicitly supported.
>
> Well Silvia can fix the W3C fork any time she wants; in the meantime, the
> WHATWG spec already supports generic cues. You just create a new interface
> for your cue, and derive it from TextTrackCue, in exactly the same way as
> WebVTT does VTTCue. There's no changes needed to the HTML spec for this.

ok, then Silvia needs to define a GenericCue interface sub-type in the real HTML (i.e., W3C) spec, and you can ignore it in your WHATWG sandbox as you see fit; that works for me
 

Ian Hickson

Aug 28, 2013, 4:48:07 PM8/28/13
to Glenn Adams, Philip Jägenstedt, Silvia Pfeiffer, blink-dev
It's not something you'd put in HTML. It's something you'd put in whatever
spec creates the cues (e.g. the MPEG2 spec or whatever).

Glenn Adams

Aug 28, 2013, 5:05:27 PM8/28/13
to Ian Hickson, Philip Jägenstedt, Silvia Pfeiffer, blink-dev
On Wed, Aug 28, 2013 at 2:48 PM, Ian Hickson <i...@hixie.ch> wrote:
> On Wed, 28 Aug 2013, Glenn Adams wrote:
> > On Wed, Aug 28, 2013 at 2:30 PM, Ian Hickson <i...@hixie.ch> wrote:
> > > On Wed, 28 Aug 2013, Glenn Adams wrote:
> > > > On Wed, Aug 28, 2013 at 2:21 PM, Ian Hickson <i...@hixie.ch> wrote:
> > > > > On Tue, 27 Aug 2013, Glenn Adams wrote:
> > > > > >
> > > > > > To make it clear, I don't care if these are exposed using the
> > > > > > TextTrackCue interface or some other GenericCue or MetadataCue
> > > > > > interface derived from TextTrackCue.
> > > > >
> > > > > Then why are we still debating this?
> > > >
> > > > Because you and Silvia haven't fixed the specs so that (1) there is
> > > > no fork and (2) generic cues are explicitly supported.
> > >
> > > Well Silvia can fix the W3C fork any time she wants; in the meantime,
> > > the WHATWG spec already supports generic cues. You just create a new
> > > interface for your cue, and derive it from TextTrackCue, in exactly
> > > the same way as WebVTT does VTTCue. There's no changes needed to the
> > > HTML spec for this.
> >
> > ok, then Silvia needs to define a GenericCue interface sub-type in the
> > real HTML (i.e., W3C) spec, and you can ignore it in your WHATWG sandbox
> > as you see fit; that works for me
>
> It's not something you'd put in HTML. It's something you'd put in whatever
> spec creates the cues (e.g. the MPEG2 spec or whatever).

Given Silvia's remarks, it would seem that she doesn't agree with you (on the need for a generic cue interface). Personally, I think it a bad design principle that increases Web fragmentation to insist on requiring a format specific interface for every type of metadata cues.

Philip Jägenstedt

Aug 29, 2013, 3:07:06 AM8/29/13
to Silvia Pfeiffer, Glenn Adams, Ian Hickson, blink-dev
On Wed, Aug 28, 2013 at 1:03 AM, Silvia Pfeiffer <silv...@chromium.org> wrote:
> On Tue, Aug 27, 2013 at 4:56 PM, Philip Jägenstedt <phi...@opera.com> wrote:
>> On Mon, Aug 26, 2013 at 12:45 PM, Silvia Pfeiffer <silv...@chromium.org> wrote:
>>> On Sun, Aug 25, 2013 at 2:40 PM, Ian Hickson <i...@hixie.ch> wrote:
>>>> On Sun, 25 Aug 2013, Silvia Pfeiffer wrote:
>>>>>
>>>>> Having no rendering is the whole idea of it. The rendering is left to
>>>>> the JS dev. The browser just exposes the cues (that are in non-VTT
>>>>> format) to the JS dev.
>>>>
>>>> If the use case is JavaScript adding cues to the object that are then
>>>> rendered by JS, then VTTCue serves that use case fine already.
>>>
>>> We can't expose CEA708 captions as VTTCues, because the caption
>>> rendering algorithm of VTTCue does not apply to cues of CEA708 format.
>>
>> I don't think I follow, if CEA708 can be rendered then surely it
>> should have a CEA708Cue interface or similar, no?
>
> If no browser wants to implement CEA708 rendering (which, AFAIK, is the
> case right now), why should there be such an interface?

I'm just assuming that if a browser wants to support CEA708 then
they'll also want to render it. To only parse it into unrenderable
cues sounds rather strange, if that's what you're suggesting.

>> My current thinking is that proper renderable text tracks should just
>> be exposed using an appropriate interface (like VTTCue) or not at all,
>> but that leaves us with these in-band metadata tracks.
>
> There are three steps involved at which we can expose an interface for
> in-band tracks:
>
> 1. unravel the in-band container encapsulation:
> you end up with a sequence of {starttime, endtime, data} objects
> (which we call "cues")
>
> 2. parse the content (data) in the cues:
> you end up with a sequence of {starttime, endtime, data, getCueAsHTML()} objects
> (incidentally, somebody on the HTML list is asking for such cues right now)
>
> 3. render the parsed data in the cues:
> you end up with a sequence of {starttime, endtime, data,
> getCueAsHTML()} objects with rendering rules
>
>
> From where I stand, I saw the last two as an integrated pair: if a
> browser decides to implement a parser for the cues, I would expect
> them to also decide to implement rendering (since the hardest part is
> already done). But we can have this discussion on the HTML list.
>
> Your statement above, however, takes this an additional step and
> assumes that browsers would either take all three steps or none. That's
> where reality has shown us that it's not the case: both the Apple
> implementation of TextTrackCueGeneric and the MPEG-2 parsing
> specification have shown the need to expose the data in a cue that has
> been de-encapsulated, but without requiring parsing and rendering.

Thanks, that's a helpful analysis. I do indeed think that browsers
ought to support a format fully or not at all. However, since the
default rendering rules of WebVTT are pretty good I think that any
format that has only a start time, end time and text could simply be
exposed using VTTCue.

> My conclusion of this thread for now is that we should now take this
> discussion back to the HTML WG. It seems to me that the extension of
> TextTrackCue to include .text is not agreeable or at least needs more
> discussion. It might make more sense to create a sub-interface of
> UnparsedCue (as already proposed by Simon) that captures the common
> functionality of {starttime, endtime, data}. But we do need to discuss
> the consequences of this on the standards list.

Sure, I'll await a new thread in the WHATWG, Text Tracks CG or HTML WG.

Philip

Philip Jägenstedt

Aug 29, 2013, 3:28:49 AM8/29/13
to Glenn Adams, Ian Hickson, Silvia Pfeiffer, blink-dev
[snip]

> As we know now, Silvia eventually concluded that this generic use case could
> be handled by the base interface itself by restoring a constructor and text
> attribute. I agree that is possible, but does create a backwards
> compatibility problem for existing uses that assume that a WebVTT cue is
> constructed with the base interface constructor. By changing the semantics
> of the constructor (as opposed to removing it), this will cause existing
> uses to fail soft rather than fail hard, which will be harder to correct.
> SimonP pointed this out in previous discussions. In my mind, this is a good
> argument for removing the existing constructor and adding a new GenericCue
> interface with its own constructor.

It sounds like we're agreeing that removing the existing constructor
is a good idea. Given that, do you have a strong preference
for whether the .text property should go on TextTrackCue or VTTCue? I
don't think it's very important, the only thing is that it's a lot
easier to later move it from VTTCue to TextTrackCue than the other way
around, so I slightly favor moving it.

If you're not intending to implement the MPEG-2 spec right now, do the
Blink changes need to block on resolving the remaining issues?

Philip

Silvia Pfeiffer

Aug 29, 2013, 7:38:02 AM8/29/13
to Philip Jägenstedt, Glenn Adams, Ian Hickson, blink-dev
On Thu, Aug 29, 2013 at 5:07 PM, Philip Jägenstedt <phi...@opera.com> wrote:
> On Wed, Aug 28, 2013 at 1:03 AM, Silvia Pfeiffer <silv...@chromium.org> wrote:
>> On Tue, Aug 27, 2013 at 4:56 PM, Philip Jägenstedt <phi...@opera.com> wrote:
>>> On Mon, Aug 26, 2013 at 12:45 PM, Silvia Pfeiffer <silv...@chromium.org> wrote:
>>>> On Sun, Aug 25, 2013 at 2:40 PM, Ian Hickson <i...@hixie.ch> wrote:
>>>>> On Sun, 25 Aug 2013, Silvia Pfeiffer wrote:
>>>>>>
>>>>>> Having no rendering is the whole idea of it. The rendering is left to
>>>>>> the JS dev. The browser just exposes the cues (that are in non-VTT
>>>>>> format) to the JS dev.
>>>>>
>>>>> If the use case is JavaScript adding cues to the object that are then
>>>>> rendered by JS, then VTTCue serves that use case fine already.
>>>>
>>>> We can't expose CEA708 captions as VTTCues, because the caption
>>>> rendering algorithm of VTTCue does not apply to cues of CEA708 format.
>>>
>>> I don't think I follow, if CEA708 can be rendered then surely it
>>> should have a CEA708Cue interface or similar, no?
>>
>> If no browser wants to implement CEA708 rendering (which, AFAIK, is the
>> case right now), why should there be such an interface?
>
> I'm just assuming that if a browser wants to support CEA708 then
> they'll also want to render it. To only parse it into unrenderable
> cues sounds rather strange, if that's what you're suggesting.

Yes, that's exactly what I'm suggesting and it's not based on
hypothesis, but on reality as the implementation of WebKit proves.
That might be the theoretically cleaner situation. I prefer to be
pragmatic. TextTrackCue is a container for delivery of timed content.
I don't expect browsers to go beyond delivering timed data - having it
parsed and rendered is additional luxury.


> However, since the
> default rendering rules of WebVTT are pretty good I think that any
> format that has only a start time, end time and text could simply be
> exposed using VTTCue.

That will not work when the content is of kind=captions (as is the
case for CEA708) or other content for which VTTCue has a parsing and
rendering algorithm. Let's not have this argument again - it's a red
herring.


>> My conclusion of this thread for now is that we should now take this
>> discussion back to the HTML WG. It seems to me that the extension of
>> TextTrackCue to include .text is not agreeable or at least needs more
>> discussion. It might make more sense to create a sub-interface of
>> UnparsedCue (as already proposed by Simon) that captures the common
>> functionality of {starttime, endtime, data}. But we do need to discuss
>> the consequences of this on the standards list.
>
> Sure, I'll await a new thread in the WHATWG, Text Tracks CG or HTML WG.

Sorry, I promise to start the thread over the weekend - having a busy week.

Silvia.

Glenn Adams

Aug 29, 2013, 9:31:32 AM8/29/13
to Philip Jägenstedt, Ian Hickson, Silvia Pfeiffer, blink-dev
Yes, removed from TextTrackCue and relanded onto a generic cue interface.
 
> Given that, do you have a strong preference
> for whether the .text property should go on TextTrackCue or VTTCue?

There needs to be some interface that exposes raw, unparsed cues via the text attribute. It should not be VTTCue that serves this purpose. These use cases have no relation with VTT.
 
> I don't think it's very important, the only thing is that it's a lot
> easier to later move it from VTTCue to TextTrackCue than the other way
> around, so I slightly favor moving it.

My current thinking is it is best to follow the approach suggested by Simon of creating a new UnparsedCue (or GenericCue) interface that has the text attribute, and that is either a sibling of VTTCue or the parent of VTTCue (from an inheritance perspective).
 

> If you're not intending to implement the MPEG-2 spec right now, do the
> Blink changes need to block on resolving the remaining issues?

Perhaps you misunderstood me. I am not implementing the MPEG-2 mapping spec. I only pointed it out as an example of non-rendered, metadata tracks unrelated to VTT, which you and Ian had asked for to justify the use case of a generic cue interface. However, it is my understanding that CableLabs has implemented the MPEG-2 mapping spec in a DLNA reference implementation that is based on WebKit. I don't know if they have plans to submit their mods to WK.

I am only focusing on implementing the new subdivision of TextTrackCue into VTTCue, already implemented and reviewed, but not yet committed [1], as a preliminary step to enable implementing TTMLCue in order to address [2].


Elliot (esprehn) asked earlier in this thread [3] to hold off on a commit "until we can get some agreement between the specs". So we need Silvia to resolve the fork [4] on the definition of TextTrackCue between the two specs first. I believe we have a resolution now that is nominally acceptable, and awaiting Silvia to implement the changes, which she has promised to do shortly.



 

Philip
