
HTML5 parser status


Henri Sivonen

Nov 24, 2008, 12:00:46 PM
As discussed with peterv on IRC, here's an update on what's going on
with the HTML5 parser:

Done so far
===========

The parser core (tokenizer, tree implementation-independent parts of the
tree builder and some supporting classes) is written in Java. The code
compiles into Java bytecode using a Java compiler. It compiles to
JavaScript using the Google Web Toolkit compiler. For use in Gecko, it's
mechanically translated to C++ using a program I've written for this
purpose. (That is, nothing here creates a Gecko runtime dependency on a
JVM.)

The parser core itself is customizable, etc., but the C++ classes
written by the translator aren't customizable on the C++ level. That
is, the C++
translator generates code with a fixed set of customizations with
generics concretized, without inheritance or polymorphism and with
virtual methods kept to a minimum. (The idea is that for a different set
of customizations, one regenerates the C++ code.)

The tokenizer is fully suspendable after any character, so the tokenizer
is suitable for apps that support document.write.
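
To illustrate the contract, here's a rough sketch (made-up names, not
the actual translated code):

  // Sketch only: the tokenizer consumes one character at a time and
  // can stop between any two characters, reporting how far it got so
  // that a document.write handler can resume from that exact point.
  final class SuspendableTokenizer {
      private boolean suspensionRequested = false;

      /** Request suspension; takes effect after the current character. */
      void requestSuspension() { suspensionRequested = true; }

      /**
       * Tokenizes buf[start..end) and returns the index of the first
       * unconsumed character so the caller can resume later.
       */
      int tokenizeBuffer(char[] buf, int start, int end) {
          int pos = start;
          while (pos < end) {
              stateLoop(buf[pos]);  // one character through the state machine
              pos++;
              if (suspensionRequested) {
                  suspensionRequested = false;
                  break;            // caller keeps buf[pos..end) for later
              }
          }
          return pos;
      }

      private void stateLoop(char c) {
          // HTML5 tokenizer state transitions elided in this sketch.
      }
  }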

The parser core needs an IO driver that manages the input stream and
performs the <meta> charset prescan. For Java, there's a driver that
uses SAX InputSource and java.io streams. For GWT, there's a driver that
accepts data as DOM strings (no <meta> prescan needed). For Gecko, I've
written a new implementation of nsIParser that acts as the IO driver for
the parser core. The <meta> prescan isn't done yet. document.write is
handled as described in
http://lists.w3.org/Archives/Public/public-html/2008Nov/0205.html

I'm not using CParserContext or nsScanner.

The tree builder needs an implementation of a set of methods for dealing
with a concrete content tree implementation. The methods are roughly
covered in
http://groups.google.com/group/mozilla.dev.platform/msg/fdc1f20146559f54
(Previously, I've implemented concrete tree builders for Java DOM, XOM,
the browser JS DOM API via GWT and for a tree implementation I call SAX
Tree. I've also implemented a streaming SAX "tree builder".)
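
To give a rough idea of the shape of that method set, a sketch with
illustrative signatures (the message linked above has the real list):

  // Sketch: T is the concrete node type of the tree implementation.
  // These signatures are approximations, not the actual abstract methods.
  abstract class TreeBuilder<T> {
      protected abstract T createElement(String ns, String name,
              java.util.Map<String, String> attributes);
      protected abstract void appendElement(T child, T parent);
      protected abstract void appendCharacters(T parent, char[] buf,
              int start, int length);
      protected abstract void appendComment(T parent, char[] buf,
              int start, int length);
      // Hooks fired when elements are pushed onto or popped off the
      // stack (discussed below); no-ops by default.
      protected void elementPushed(String ns, String name, T node) { }
      protected void elementPopped(String ns, String name, T node) { }
  }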

In C++, the abstract and concrete parts of the tree builder go into one
class using the C preprocessor. (In Java, there's an abstract superclass
and a concrete subclass.)

Based on that thread, I've proceeded without building a mechanism for
old-style deferred notifications. However, since there isn't enough
infrastructure in place to go directly to deferred insertions, I have
proceeded by writing code that requires very few modifications under a
deferred insertion scheme but that doesn't actually defer anything just
yet. Specifically, the missing piece of infrastructure is a callback that
the parser could count on getting called before a script accesses the
DOM in *any* way. I figured that it's better to make the parser run at
all first even if inefficiently than to try to add the missing piece of
infrastructure on my own.

From previous discussion I gathered that it's important to defer (or,
more to the point, batch) insertions but that there isn't a need to
defer element node creation. Therefore, I have not designed for deferred
element creation. However, I have designed for deferred text node
creation. Except for table foster parenting cases, the tree builder
always waits until it can create an entire text node in one go. That is,
under normal (non-foster-parented) conditions, it doesn't add text to
nodes that have already been inserted into the tree.
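
Conceptually, the batching is just an accumulation buffer that gets
flushed when the next non-character token arrives. A sketch (reusing
the TreeBuilder sketch above; not the real implementation):

  // Sketch: characters accumulate until a non-character token forces a
  // flush, so each text node is created in one go.
  final class TextAccumulator<T> {
      private final StringBuilder pending = new StringBuilder();

      void characters(char[] buf, int start, int length) {
          pending.append(buf, start, length);  // no node created yet
      }

      /** Called before any non-character token is processed. */
      void flush(TreeBuilder<T> tree, T parent) {
          if (pending.length() > 0) {
              char[] text = new char[pending.length()];
              pending.getChars(0, pending.length(), text, 0);
              tree.appendCharacters(parent, text, 0, text.length);
              pending.setLength(0);
          }
      }
  }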

The abstract tree builder notifies the concrete part of the tree builder
whenever an element is pushed or popped from the stack (including for
leaf nodes that aren't *really* pushed and popped). These aren't called
for foster parented nodes or nodes that are rearranged as part of the
adoption agency algorithm. The original motivation was to provide hooks for
the streaming SAX pseudo-tree builder to fire startElement and
endElement events. In the Gecko concrete tree builder implementation, I
use these for tracking discretionary suspends for incremental rendering,
for XLink autolinks and for the stuff that nsXMLContentSink does in its
CloseElement() method. (In general, I've tried to adapt code from the
XML content sink--not from the HTML content sink, which is weirder and
doesn't deal with SVG and MathML.)

I made a perhaps questionable choice early on and inherited the
nsIParser implementation class from nsContentSink instead of inheriting
the tree builder from nsContentSink. (I wanted to keep all generated C++
classes as plain C++ classes that are invisible to XPCOM.) This means
that the concrete tree builder calls into the outer parser class instead
of calling into a superclass of its own for the kind of things that
nsContentSink provides. (Also, it follows that the nsIParser
implementation class holds a strong ref to itself, but that cycle is
broken the same way as the old parser breaks the strong ref cycle
between itself and the sink.)

None of the code checks for memory allocation failures. I'm assuming
that the parser can use the new no-fail allocator.

The main reason why none of the above-described C++ code runs yet is
that I'm having linkage problems between /parser/ and /content/. (I
initially started putting code in /parser/html5/src/, because that
seemed like a natural place for it.)


Not Done Yet
============

The deferred insertion mechanism I envision is a list of objects that
each represent an insertion in a sequence of insertions. The objects
would encapsulate the operands of the type of insertion and would have a
common superclass with a method called actuate(), commit() or somesuch
that would actually perform the insertion. Flushing pending insertions
would simply call the actuation method in sequence for each object on
the list and then empty the list.
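
In sketch form (illustrative Java standing in for the eventual C++,
reusing the TreeBuilder sketch from earlier; none of this exists yet):

  // Sketch: each pending operation captures its operands and knows how
  // to perform itself; flushing runs the operations in order.
  abstract class TreeOperation<T> {
      abstract void actuate();  // actually perform the insertion
  }

  final class AppendOperation<T> extends TreeOperation<T> {
      private final TreeBuilder<T> tree;
      private final T child;
      private final T parent;
      AppendOperation(TreeBuilder<T> tree, T child, T parent) {
          this.tree = tree;
          this.child = child;
          this.parent = parent;
      }
      void actuate() { tree.appendElement(child, parent); }
  }

  final class OperationQueue<T> {
      private final java.util.List<TreeOperation<T>> pending =
              new java.util.ArrayList<TreeOperation<T>>();
      void defer(TreeOperation<T> op) { pending.add(op); }
      void flush() {  // run the insertions in sequence, then clear
          for (TreeOperation<T> op : pending) {
              op.actuate();
          }
          pending.clear();
      }
  }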

I think the deferred insertion objects could also be used from a future
reimplementation of the XML tree builder. However, I don't see value in
trying to abstract HTML and XML node insertions more than that. The
HTML5 spec constrains the HTML5 parser into being pretty close to the
content tree (including being able to read from it!).

The meta prescan isn't done yet. I'm not sure yet if it should wait for
512 bytes and then run the prescan to completion or whether it should be
fancy and incremental if the first 512 bytes arrive slowly. Either way,
I think I'd write it as part of the common Java code base and translate
it into C++. (The Java code base already has a slightly out-of-date
implementation of the prescan algorithm, and that implementation
wouldn't be incremental when used together with non-blocking IO.)
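
In sketch form, the simple non-incremental variant would look roughly
like this (made-up names):

  // Sketch: buffer the first 512 bytes, then run the <meta> prescan
  // exactly once. Not incremental; slow-arriving bytes delay parsing.
  final class MetaPrescanBuffer {
      private final byte[] prefix = new byte[512];
      private int filled = 0;
      private boolean done = false;

      /** Feed network bytes; returns a sniffed encoding name or null. */
      String feed(byte[] buf, int offset, int length, boolean streamEnded) {
          if (done) {
              return null;
          }
          int toCopy = Math.min(length, prefix.length - filled);
          System.arraycopy(buf, offset, prefix, filled, toCopy);
          filled += toCopy;
          if (filled == prefix.length || streamEnded) {
              done = true;
              return prescan(prefix, filled);  // the spec's prescan algorithm
          }
          return null;  // keep waiting for more bytes
      }

      private String prescan(byte[] bytes, int length) {
          // <meta charset> prescan per the HTML5 spec, elided here.
          return null;
      }
  }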

The BOM scan and chardet integration isn't done, either.

I haven't yet implemented the renavigation of the browsing context when
a meta charset is found late in the parse. (The code for detecting this
condition is part of the auto-translated core.)

I haven't yet resolved whether the parser object should be resettable or
whether the object should be discarded after one parse.

I haven't dealt with lazy loading of the MathML UA style sheet. (I'm
inclined to think that the way the XML side deals with it isn't a good
way to begin with.)

I haven't yet implemented line number tracking or error reporting on the
C++ side. (The Java code obviously has these, since it's used in a
validator. :-)

The parser core supports fragment parsing per spec, but I haven't
implemented support for fragment parsing in the Gecko IO driver yet. I'd
be interested in feedback on an optimization I have in mind:
http://lists.w3.org/Archives/Public/public-html/2008Nov/0192.html

I haven't investigated parsing requirements for clipboard import at all.

None of the data structures that allocate more memory depending on the
input have maximum limits yet, so malicious documents could exhaust
memory. Need to figure out sensible caps here.

Parser-internal memory management needs more work to avoid leaks.


Surprises
=========

The speculative script prescan thread caught me by surprise. It wasn't
there when I planned this. I haven't tried to replicate its
functionality yet, although eventually I should, of course.

The existence of an HTML sanitizer in Gecko was also news to me. I'm not
yet fully aware of the constraints there. It's probably pretty easy to
add similar functionality.


Duty Mismatches
===============

The old tokenizer is used in view source highlighting. The new one isn't
suitable for that purpose. Further, I think making it suitable would be
particularly bad for performance. I think it makes sense to either keep
the old tokenizer for view source or look into something completely
different like Diavolo.

The legacy bookmarks.html parsing uses a special content sink. With the
new parser, the bookmark import would have to traverse an HTML DOM
instead.

--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/

Boris Zbarsky

Nov 24, 2008, 12:58:51 PM
Henri Sivonen wrote:
> I haven't yet resolved whether the parser object should be resettable or
> whether the object should be discarded after one parse.

What are the implications of the latter for repeated innerHTML set
performance?

> The old tokenizer is used in view source highlighting. The new one isn't
> suitable for that purpose. Further, I think making it suitable would be
> particularly bad for performance.

Is there a simple-to-explain reason for this? Just trying to get a
sense of what the issues here are....

> The legacy bookmarks.html parsing uses a special content sink. With the
> new parser, the bookmark import would have to traverse an HTML DOM
> instead.

That seems perfectly fine to me.

> I'd be interested in feedback on an optimization I have in mind:
> http://lists.w3.org/Archives/Public/public-html/2008Nov/0192.html

Seems reasonable to me, as long as we handle the <html> and <body> stuff
correctly, as well as any other things that would normally affect stuff
outside the fragment (<base> comes to mind).

Also have to be a bit careful with form controls, probably....

-Boris

Henri Sivonen

Nov 24, 2008, 2:15:56 PM
In article <gqadnQaex67BdrfU...@mozilla.org>,
Boris Zbarsky <bzba...@mit.edu> wrote:

> Henri Sivonen wrote:
> > I haven't yet resolved whether the parser object should be resettable or
> > whether the object should be discarded after one parse.
>
> What are the implications of the latter for repeated innerHTML set
> performance?

I have no idea if there'd be a notable difference or in which
direction. But if existing code assumes that it can reset, perhaps it's
better to support resetting.

> > The old tokenizer is used in view source highlighting. The new one isn't
> > suitable for that purpose. Further, I think making it suitable would be
> > particularly bad for performance.
>
> Is there a simple-to-explain reason for this? Just trying to get a
> sense of what the issues here are....

For syntax highlighting, the tokens should be syntactic items within a
tag. However, from the HTML5 point of view, the entire tag is a "token"
and things like inter-attribute whitespace and element name case are
forgotten early.

It seems to me that adding reporting of microtoken boundaries inside a
tag would have to come at a price when doing the kind of tokenization
that is necessary when browsing the Web: there'd have to be a non-zero
cost decision (branch on boolean or virtual invocation) *not* to report
microtoken boundaries at every microtoken boundary when not doing view
source. (Unless there'd be a separately generated view source tokenizer,
but that would come back to not using the same tokenizer.)
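
In sketch form, every microtoken boundary on the hot path would grow
something like this (illustrative, not real tokenizer code):

  // Sketch: the branch (or a virtual call on a sink) is paid at every
  // microtoken boundary even when view source is off.
  final class TokenizerWithViewSourceHook {
      private final MicrotokenSink sink;  // null when not doing view source

      TokenizerWithViewSourceHook(MicrotokenSink sink) { this.sink = sink; }

      // Called at every microtoken boundary (attribute name start,
      // inter-attribute whitespace, etc.).
      private void boundary(int pos) {
          if (sink != null) {  // non-zero cost on the Web-browsing path
              sink.microtokenBoundary(pos);
          }
      }
  }

  interface MicrotokenSink {
      void microtokenBoundary(int pos);
  }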

It would be somewhat feasible (at the cost of having to decide not to do it
often) to use the HTML5 tokenizer code structure to emit a stream of
index boundaries of interesting pieces of the input while expecting the
caller to hold a copy of the source text. However, expecting the
tokenizer to report e.g. an element name in its original literal case
would be an even more invasive and optimization-breaking change.

> > The legacy bookmarks.html parsing uses a special content sink. With the
> > new parser, the bookmark import would have to traverse an HTML DOM
> > instead.
>
> That seems perfectly fine to me.

OK.

> > I'd be interested in feedback on an optimization I have in mind:
> > http://lists.w3.org/Archives/Public/public-html/2008Nov/0192.html
>
> Seems reasonable to me, as long as we handle the <html> and <body> stuff
> correctly, as well as any other things that would normally affect stuff
> outside the fragment (<base> comes to mind).

<base> I hadn't thought about.

> Also have to be a bit careful with form controls, probably....

In what way?

Boris Zbarsky

Nov 24, 2008, 3:44:00 PM
Henri Sivonen wrote:
>> What are the implications of the latter for repeated innerHTML set
>> performance?
>
> I have no idea if there'd be a notable difference and to which
> direction. But if existing code assumes that it can reset, perhaps it's
> better to support resetting.

Existing code received a noticeable performance boost from moving to
resetting instead of throwing out and recreating the parser, since the
object creation was a significant part of the cost.

>> Is there a simple-to-explain reason for this? Just trying to get a
>> sense of what the issues here are....
>
> For syntax highlighting, the tokens should be syntactic items within a
> tag. However, from the HTML5 point of view, the entire tag is a "token"
> and things like inter-attribute whitespace and element name case are
> forgotten early.

That's also the case for our tokenizer right now (except in view-source
mode), no? But I can see how adding such a view-source mode
might be pretty annoying.

> It seems to me that adding reporting of microtoken boundaries inside a
> tag would have to come at a price when doing the kind of tokenization
> that is necessary when browsing the Web: there'd have to be a non-zero
> cost decision (branch on boolean or virtual invocation) *not* to report
> microtoken boundaries at every microtoken boundary when not doing view
> source.

Indeed. Right now we pay the branching cost, but I've never seen this
show up in a profile, to be honest.

> It would be somewhat feasible (at the cost of having to decide not to do it
> often) to use the HTML5 tokenizer code structure to emit a stream of
> index boundaries of interesting pieces of the input while expecting the
> caller to hold a copy of the source text. However, expecting the
> tokenizer to report e.g. an element name in its original literal case
> would be an even more invasive and optimization-breaking change.

OK.

>> Also have to be a bit careful with form controls, probably....
>
> In what way?

Right now we associate them with the <form> at parse time, not insertion
time, to handle them being hoisted out of a table. That would need to
not happen during fragment parsing, I would think.

-Boris

Gervase Markham

Nov 24, 2008, 4:09:20 PM
Henri Sivonen wrote:
> As discussed with peterv on IRC, here's an update on what's going on
> with the HTML5 parser:

Forgive me for not having the wider context but: is there a long-term
plan by the platform team to replace Gecko's current HTML parser with
your one?

Gerv

Boris Zbarsky

Nov 24, 2008, 4:14:11 PM
Gervase Markham wrote:
> Forgive me for not having the wider context but: is there a long-term
> plan by the platform team to replace Gecko's current HTML parser with
> your one?

Yes.

-Boris

P.S. Or rather "God, yes, pretty please with a cherry on top."

Henri Sivonen

Dec 3, 2008, 6:56:02 AM
Now it builds and runs:
http://hsivonen.iki.fi/html5-gecko-build/

(With lots and lots of broken things still...)

Damon Sicore

Dec 9, 2008, 10:39:55 PM
to Boris Zbarsky, dev-pl...@lists.mozilla.org

I know I'm late on catching up on this thread, but this is really
really funny. :)


