Mismatch between HTML parser and createElement() et al


Anne van Kesteren

Aug 3, 2015, 6:47:17 AM
to blink-dev
Bringing this up here per Dimitri's suggestion:

https://www.w3.org/Bugs/Public/show_bug.cgi?id=27228

As you might know the HTML parser and createElement() have different
checks for what constitutes a correct local name. Is this something
the Blink project is interested in eliminating and being the guinea
pig for? Or should we simply give up on this and hope element
constructors bridge this naming gap somehow?


--
https://annevankesteren.nl/

Kouhei Ueno

Aug 3, 2015, 7:13:31 AM
to Anne van Kesteren, blink-dev
Hi Anne,

As someone who touches the HTML parser code regularly, I welcome an experiment toward a saner spec.
However, I would like more feedback from the HTML DOM folks.



--
Kouhei Ueno

Philip Jägenstedt

Aug 3, 2015, 9:02:54 AM
to Kouhei Ueno, Anne van Kesteren, blink-dev
The difference is pretty useless since it's still possible to create
these attributes, as illustrated by
http://jsperf.com/specialsetattribute

It sounds like it would be pretty easy to measure the impact of this,
any suggestion for the precise phrasing of the spec to compare
against?

Anne van Kesteren

Aug 3, 2015, 9:12:17 AM
to Philip Jägenstedt, Kouhei Ueno, blink-dev
On Mon, Aug 3, 2015 at 3:02 PM, Philip Jägenstedt <phi...@opera.com> wrote:
> The difference is pretty useless since it's still possible to create
> these attributes, as illustrated by
> http://jsperf.com/specialsetattribute
>
> It sounds like it would be pretty easy to measure the impact of this,
> any suggestion for the precise phrasing of the spec to compare
> against?

Note that it affects attributes as well as elements. Anywhere the
specification now throws a TypeError due to XML-originating checks,
we'd instead need to do the same kind of check the HTML parser does,
which mostly depends on which code points would have you end up in a
different state, I think. So, e.g., ">" or ASCII whitespace would
throw.


--
https://annevankesteren.nl/

Joe Gregorio

Aug 3, 2015, 9:15:53 AM
to Philip Jägenstedt, Kouhei Ueno, Anne van Kesteren, blink-dev
On Mon, Aug 3, 2015 at 9:02 AM Philip Jägenstedt <phi...@opera.com> wrote:
> The difference is pretty useless since it's still possible to create
> these attributes, as illustrated by
> http://jsperf.com/specialsetattribute


Actually, won't that workaround stop working under DOM 4, when Attr no longer inherits from Node and thus cloneNode() is no longer available? Or is there another way to achieve the same effect?

Philip Jägenstedt

Aug 3, 2015, 10:46:10 AM
to Joe Gregorio, Kouhei Ueno, Anne van Kesteren, blink-dev
Unfortunately, yes, and this was discussed a bit on public-html a while back:
https://lists.w3.org/Archives/Public/public-html/2015May/thread.html#msg90

Attr.cloneNode() usage isn't measured, but if it's non-trivial then
making Attr not inherit from Node would also require adding
Attr.cloneNode() separately, at which point the split starts to look a
bit silly.

Philip

Philip Jägenstedt

Aug 3, 2015, 10:48:24 AM
to Anne van Kesteren, Kouhei Ueno, blink-dev
Assuming that the new set of forbidden code points is a strict subset
of the existing set of forbidden code points, then it still seems
pretty likely to be web compatible.

It's hard to speculate any further, but easy to measure given a definition.

Philip

Anne van Kesteren

Aug 3, 2015, 12:38:29 PM
to Philip Jägenstedt, Kouhei Ueno, blink-dev
On Mon, Aug 3, 2015 at 4:48 PM, Philip Jägenstedt <phi...@opera.com> wrote:
> Assuming that the new set of forbidden code points is a strict subset
> of the existing set of forbidden code points, then it still seems
> pretty likely to be web compatible.

It seems like that is not the case for elements. The HTML parser
requires that the first code point is a case-insensitive ASCII letter,
which is completely incompatible with XML:
http://www.w3.org/TR/xml/#NT-NameStartChar Any remaining code point
for a tag in HTML must not be U+0009, U+000A, U+000C, U+0020, U+002F,
U+003E, or U+0000. (Though we could allow U+0000 and treat it as
U+FFFD as the HTML parser does.) That does seem far more liberal than
XML allows and a full subset.

For attributes, they must not start with U+0009, U+000A, U+000C,
U+0020, U+002F, U+003E, or U+0000. (Again, U+0000 could be treated as
U+FFFD.) Any subsequent code point must not be any of those, and
also not U+003D. That again seems far more liberal than XML and a full
subset.
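
The two rules above can be sketched as plain string checks. This is only an illustration of the rules as stated in this message, not spec text: the function names `isHtmlParserElementName` and `isHtmlParserAttributeName` are made up for the example, and U+0000 is simply rejected here rather than mapped to U+FFFD.

```typescript
// Hypothetical helper names; the rules follow the message above.
// Forbidden code points: U+0009, U+000A, U+000C, U+0020, U+002F, U+003E, U+0000.
const FORBIDDEN: number[] = [0x09, 0x0a, 0x0c, 0x20, 0x2f, 0x3e, 0x00];

function isAsciiLetter(cu: number): boolean {
  return (cu >= 0x41 && cu <= 0x5a) || (cu >= 0x61 && cu <= 0x7a);
}

// Every forbidden code point is in the BMP, so inspecting UTF-16 code
// units gives the same answer as inspecting code points.
function isHtmlParserElementName(name: string): boolean {
  // The first code point must be an ASCII letter (either case).
  if (name.length === 0 || !isAsciiLetter(name.charCodeAt(0))) return false;
  for (let i = 1; i < name.length; i++) {
    if (FORBIDDEN.indexOf(name.charCodeAt(i)) !== -1) return false;
  }
  return true;
}

function isHtmlParserAttributeName(name: string): boolean {
  if (name.length === 0) return false;
  for (let i = 0; i < name.length; i++) {
    const cu = name.charCodeAt(i);
    if (FORBIDDEN.indexOf(cu) !== -1) return false;
    // U+003D (=) is only forbidden after the first code point.
    if (i > 0 && cu === 0x3d) return false;
  }
  return true;
}
```

Under this sketch "a=b" passes as an element name but not as an attribute name, and "\u00e9" fails as an element name even though XML accepts it as a name start.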


--
https://annevankesteren.nl/

Anne van Kesteren

Aug 4, 2015, 3:34:36 AM
to Philip Jägenstedt, Kouhei Ueno, blink-dev
On Mon, Aug 3, 2015 at 6:30 PM, Anne van Kesteren <ann...@annevk.nl> wrote:
> It seems like that is not the case for elements. The HTML parser
> requires that the first code point is a case-insensitive ASCII letter,
> which is completely incompatible with XML:
> http://www.w3.org/TR/xml/#NT-NameStartChar Any remaining code point
> for a tag in HTML must not be U+0009, U+000A, U+000C, U+0020, U+002F,
> U+003E, or U+0000. (Though we could allow U+0000 and treat it as
> U+FFFD as the HTML parser does.) That does seem far more liberal than
> XML allows and a full subset.
>
> For attributes, they must not start with U+0009, U+000A, U+000C,
> U+0020, U+002F, U+003E, or U+0000. (Again, U+0000 could be treated as
> U+FFFD.) Any subsequent code point must not be any of those, and
> also not U+003D. That again seems far more liberal than XML and a full
> subset.

Domenic suggested that the answer was that we could check that the
given name either matches the XML production or the above outlined
HTML production. Then we'd basically end up throwing less and be
compatible with both. An even bolder approach would be to remove
exceptions altogether and basically allow any non-empty string. That
might create weird serialization issues though that we don't have
today (in HTML anyway).
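
A rough sketch of the "either production" idea: the XML side below follows the Name production (NameStartChar followed by NameChar) from XML 1.0, while the HTML side is passed in as a predicate so the sketch does not commit to any particular HTML check. The names `matchesXmlName` and `isAcceptableName` are hypothetical.

```typescript
// Inclusive code point ranges of the XML 1.0 NameStartChar production.
const NAME_START_RANGES: number[][] = [
  [0x3a, 0x3a], [0x41, 0x5a], [0x5f, 0x5f], [0x61, 0x7a],
  [0xc0, 0xd6], [0xd8, 0xf6], [0xf8, 0x2ff], [0x370, 0x37d],
  [0x37f, 0x1fff], [0x200c, 0x200d], [0x2070, 0x218f], [0x2c00, 0x2fef],
  [0x3001, 0xd7ff], [0xf900, 0xfdcf], [0xfdf0, 0xfffd], [0x10000, 0xeffff],
];
// Ranges that NameChar allows in addition to NameStartChar:
// "-", ".", 0-9, U+00B7, U+0300-U+036F, U+203F-U+2040.
const NAME_CHAR_EXTRA_RANGES: number[][] = [
  [0x2d, 0x2e], [0x30, 0x39], [0xb7, 0xb7], [0x300, 0x36f], [0x203f, 0x2040],
];

function inRanges(cp: number, ranges: number[][]): boolean {
  for (let i = 0; i < ranges.length; i++) {
    if (cp >= ranges[i][0] && cp <= ranges[i][1]) return true;
  }
  return false;
}

// Decode UTF-16 into code points; lone surrogates are kept as-is and
// fall outside every range above, so they are rejected.
function toCodePoints(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        out.push((hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000);
        i++;
        continue;
      }
    }
    out.push(hi);
  }
  return out;
}

function matchesXmlName(name: string): boolean {
  const cps = toCodePoints(name);
  if (cps.length === 0 || !inRanges(cps[0], NAME_START_RANGES)) return false;
  for (let i = 1; i < cps.length; i++) {
    const ok = inRanges(cps[i], NAME_START_RANGES) ||
      inRanges(cps[i], NAME_CHAR_EXTRA_RANGES);
    if (!ok) return false;
  }
  return true;
}

// Accept a name if either production matches; throw only if both reject it.
function isAcceptableName(
  name: string,
  matchesHtmlName: (name: string) => boolean
): boolean {
  return matchesXmlName(name) || matchesHtmlName(name);
}
```

With this shape, a name like "a\"b" is rejected by the XML production, so whether it is accepted depends entirely on the HTML predicate supplied by the caller.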


--
https://annevankesteren.nl/

Philip Jägenstedt

Aug 4, 2015, 7:50:37 AM
to Anne van Kesteren, Kouhei Ueno, blink-dev
Interesting, so there are already cases with HTML element names that
cannot be roundtripped:
http://software.hixie.ch/utilities/js/live-dom-viewer/saved/3580

Unless that's a hole worth plugging, estimating the impact of removing
the restrictions entirely might be worthwhile. The union of patterns
allowed by HTML and XML doesn't seem worthwhile if it still requires
HTML and XML serializers to handle (or ignore?) things that cannot be
roundtripped.

Other ideas?

Philip

Anne van Kesteren

Aug 4, 2015, 8:12:06 AM
to Philip Jägenstedt, Kouhei Ueno, blink-dev
On Tue, Aug 4, 2015 at 1:50 PM, Philip Jägenstedt <phi...@opera.com> wrote:
> Interesting, so there are already cases with HTML element names that
> cannot be roundtripped:
> http://software.hixie.ch/utilities/js/live-dom-viewer/saved/3580

Good point.


> Unless that's a hole worth plugging, estimating the impact of removing
> the restrictions entirely might be worthwhile. The union of patterns
> allowed by HTML and XML doesn't seem worthwhile if it still requires
> HTML and XML serializers to handle (or ignore?) things that cannot be
> roundtripped.
>
> Other ideas?

No, that seems like the best idea and likely matches the internal
element creation factory, no?


--
https://annevankesteren.nl/

Philip Jägenstedt

Aug 4, 2015, 9:32:46 AM
to Anne van Kesteren, Kouhei Ueno, blink-dev
Sure, in the end there's always a bit of code that creates the
element/attribute without validation. I think the risk here is that,
once the checks are removed, unrelated code with un-asserted
invariants about attribute/element names is silently broken, and that
may not be discovered until later.

Still, this seems at least worth investigating. Which are the entry
points in DOM that would be affected? I see these internally:

Document.createAttribute/NS
Document.createElement/NS
Document.createProcessingInstruction
DOMImplementation.createDocumentType
Element.setAttribute/NS
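
One way to see how these entry points treat a given name is a small probe that records which calls throw. This is a sketch only: `DocumentLike`, `ElementLike`, and `probeEntryPoints` are hypothetical names, the interfaces cover just the non-NS methods listed above, and in a browser one would presumably pass the global `document`.

```typescript
// Minimal structural types covering only the methods the probe calls.
interface ElementLike {
  setAttribute(name: string, value: string): void;
}
interface DocumentLike {
  createElement(localName: string): ElementLike;
  createAttribute(localName: string): unknown;
  createProcessingInstruction(target: string, data: string): unknown;
  implementation: {
    createDocumentType(
      qualifiedName: string,
      publicId: string,
      systemId: string
    ): unknown;
  };
}

// Returns true if calling fn throws, false otherwise.
function throwsOnCall(fn: () => void): boolean {
  try {
    fn();
    return false;
  } catch (e) {
    return true;
  }
}

// Maps each entry point to whether it threw for the given name.
function probeEntryPoints(
  doc: DocumentLike,
  name: string
): { [entryPoint: string]: boolean } {
  return {
    createAttribute: throwsOnCall(() => { doc.createAttribute(name); }),
    createElement: throwsOnCall(() => { doc.createElement(name); }),
    createProcessingInstruction: throwsOnCall(() => {
      doc.createProcessingInstruction(name, "");
    }),
    createDocumentType: throwsOnCall(() => {
      doc.implementation.createDocumentType(name, "", "");
    }),
    setAttribute: throwsOnCall(() => {
      doc.createElement("div").setAttribute(name, "");
    }),
  };
}
```

Running the probe over a list of candidate names before and after a change would show which entry points start accepting names they used to reject.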

Philip