Speculative HTML tree building

Henri Sivonen

unread,

Mar 11, 2009, 5:58:49 AM3/11/09

to

(Moving here from public-html.)

After the HTML5 parsing meeting in December, the plan was to perform
speculative tokenization while scripts load and remember the tokens to
avoid retokenization when feasible.

The feedback from the HTML5 tree builder into the tokenizer state
transitions makes this hard when <math> or <svg> has been seen.

I mentioned speculative tree building as a potential solution on
public-html
(http://lists.w3.org/Archives/Public/public-html/2009Mar/0200.html), but
I was too pessimistic about its failure condition.

Thinking about this more, I think the failure condition can be narrowed
so significantly that document.written ads above SVG content could work.
Here's how:

When a </script> end tag (either HTML or SVG </script>) is seen, a
speculation discontinuity point would be established. From the point of
view of setting up input stream rewinding possibility, this would work
the same way as speculative tokenization. The internal state of the tree
builder (insertion mode, in foreign flag, secondary insertion mode,
stack and the list of active formatting elements) would be copied and
stored with the discontinuity point. This state snapshot would be taken
immediately after the script element has been popped off the stack. A
new queue of tree operations would be associated with the discontinuity
point.

The speculative parsing would then start filling up the new tree
operation queue. If more </script> tags are seen, new discontinuity
points and new queues would need to be started while keeping the earlier
ones around.

When a script starts doing document.writes, the speculation would stop
and the tree builder state from the discontinuity point data would be
copied back to the tree builder and the parser would parse the
document.written content.

After the script has run and all its document.writes have been parser,
the state of the tokenizer would be inspected and the internal state of
the tree builder would be compared against the copy of the state stored
that is waiting around with the pending queue of speculative tree
operations.

If the state of the tokenizer is between tokens and the tree builder
state matches the stored snapshot, the speculated tree operations can be
handed to the main thread for actuation. If not, the speculative tree
operations would need to be trashed and the stream rewinded.

The practical effect would be that document.writes that don't trigger
residual style handling and that write a balanced subtree of content
(optionally with text before) would not trash the speculated tree
operations.

(Writing some text after the balanced would be a slight requiring
potential pending text node merging.)

This would require one spec change (which a multi-threaded
implementation would require anyway, it seems):
Instead of storing references to element nodes on the list of active
formatting elements, the tree builder should store the element name and
the list of attributes as supplied by the tokenizer. Then, instead of
doing a shallow clone of a live node, the name and the attribute list
would be used for creating new nodes wherever the spec now clones. This
way, the indeterministic 'cloning' time and scripted attribute mutations
to the live nodes wouldn't interfere with each other.

Pros:
* Doesn't require code for queuing up tokens. The tree builder
operation queuing code is enough, and that code needs to exist anyway.
The tokenizer and the tree builder would be connected to each other the
same way whether speculating or not.
* Works with the MathML in text/html and SVG in text/html proposals
without special casing those.

Cons:
* There'd be more cases where the speculative work is trashed compared
to speculative tokenization: when the document.writes don't write a
balanced tree but still finish on a token boundary or when
document.writes trigger residual style handling.
* When speculative work is trashed, there'd be more work to trash,
since the HTML5 tree builder algorithm will have run on the speculative
data.

(It may be possible to relax the failure conditions when residual style
handling is triggered. I'll have to think about that some more.)

Opinions?

--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/

Henri Sivonen

unread,

Mar 11, 2009, 6:01:29 AM3/11/09

to

In article <hsivonen-4356E5...@news.mozilla.org>,
Henri Sivonen <hsiv...@iki.fi> wrote:

> (Writing some text after the balanced would be a slight requiring
> potential pending text node merging.)

Oops. That's a completely bogus sentence.

Writing some text after the balanced subtree would be a slight
complication requiring the text to be merged into a pending text node if
the speculated operation queue starts with a text node.

Jonas Sicking

unread,

Mar 18, 2009, 8:04:43 PM3/18/09

to

My gut feeling is that it would be nice to get some data here. While I
think that someone writing half a token is probably pretty rare, I
strongly suspect that is less the case for writing a start tag.

It would be great to spider some set of sites and see how common it is
for document.write to output an unbalaced tree.

That said, my gut feeling is that we might want to cache both the token
stream, and the parser-action stream. It's probably common enough that
the parser-action stream can be saved (such as when there is no
document.write at all) that having the saved parser actions ready-to-go
would be in general beneficial.

But it's possibly also common enough to write just a start tag that
being able to replay an existing token stream would also be worth it.
Though I am less sure about this. And it'd probably be ok to not have
this part figured out by the time we do the initial landing of the HTML5
parser.

The most critical thing is that we're as performant as trunk is. We can
always figure out how to get even more performance later.

/ Jonas

Mike Shaver

unread,

Mar 18, 2009, 10:06:00 PM3/18/09

to Jonas Sicking, dev-pl...@lists.mozilla.org

On Wed, Mar 18, 2009 at 8:04 PM, Jonas Sicking <jo...@sicking.cc> wrote:
> My gut feeling is that it would be nice to get some data here. While I think
> that someone writing half a token is probably pretty rare, I strongly
> suspect that is less the case for writing a start tag.

I think the only case you're likely to see is

document.write("</scr");
document.write("ipt>");

as one form of hiding the </script> from the HTML parser.

document.write("</scr" + "ipt>");

is more common, but I've seen the former.

Mike

Jonas Sicking

unread,

Mar 19, 2009, 1:30:41 PM3/19/09

to

That is fine, we're talking about the result of all document.writes in a
script producing half a token. So something like

would *not* count as outputting half a token. Whereas

would. Which feels like it would be very rare. Probably, the most common
case is:

/ Jonas

Henri Sivonen

unread,

Mar 24, 2009, 10:32:36 AM3/24/09

to

In article <lsmdnXR2DswPEVzU...@mozilla.org>,
Jonas Sicking <jo...@sicking.cc> wrote:

> My gut feeling is that it would be nice to get some data here. While I
> think that someone writing half a token is probably pretty rare, I
> strongly suspect that is less the case for writing a start tag.

I think it would make sense to instrument the tree builder for getting
this data before implementing any flavor of speculation.

> That said, my gut feeling is that we might want to cache both the token
> stream, and the parser-action stream.

I'd want to avoid saving the token stream entirely if data indicates
that saving the tree builder action stream works most the time.