(crossposted to blink-dev@ and security-dev@, and CCing some specific folks who might be interested)

# Contact Emails

# Spec

None yet. I will submit pull requests to HTML if we decide that the experiments discussed below are at all reasonable.

# Summary

This Intent is a little vague, as I'm still exploring the space, but I'm adding metrics for a few concrete proposals, so poking at the list seems like a reasonable thing to do (both for visibility and for new/better ideas). In particular:

1. https://codereview.chromium.org/2626243002 adds metrics for `\n` and `<` occurring inside `<base target>`.

2. https://codereview.chromium.org/2628723004 adds metrics for `<textarea>` and `<select>` being closed by end-of-file, and a flag to block form submission in those cases.
3. https://codereview.chromium.org/2629393002 adds metrics for `\n` and `<` characters occurring during URL parsing (e.g. as a result of processing `<img src>` or `<link href>`). https://codereview.chromium.org/2634893003 adds a flag to cause such resolutions to error out for non-`data:`/`javascript:` URLs.
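To make the third proposal concrete, here is roughly the check I have in mind, sketched in Python rather than the actual Blink code (the real logic would live in the URL resolver, and the function name here is mine):

```python
# Sketch of the proposed flag's behavior (hypothetical name, not Blink code):
# after resolving a URL, error out if the resolved URL contains a newline
# or '<', unless the scheme is data: or javascript:.

def should_block_resolution(resolved_url: str) -> bool:
    """Return True if this resolution should error out under the flag."""
    scheme = resolved_url.split(":", 1)[0].lower()
    if scheme in ("data", "javascript"):
        return False  # exempt: these schemes legitimately embed markup
    return "\n" in resolved_url or "<" in resolved_url

# The motivating attack: an injected, unterminated attribute value can
# swallow following markup (and any secrets in it) into the URL; the
# '<' and '\n' characters it absorbs are what this check catches.
assert should_block_resolution("https://evil.example/?\n<form>secret")
assert not should_block_resolution("https://example.com/page")
assert not should_block_resolution("data:text/html,<b>hi</b>")
```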
I have a concern that this is going to make us slower. We already know that the parser is in general "slow" because it's so branchy; simplifying the parser has yielded 20x+ improvements in experiments. I worry that adding even more complexity to the parser is doubling down on that slowness. I'm also concerned about making URL processing slower for pages with lots of URLs, like Wikipedia articles and long email threads.
On Tue, Jan 17, 2017 at 2:43 AM, Mike West <mk...@chromium.org> wrote:
> 2. https://codereview.chromium.org/2628723004 adds metrics for
> `<textarea>` and `<select>` being closed by end-of-file, and a flag
> to block form submission in those cases.

This puts an extra branch (and a couple of nested tests) in the cleanup loop at the end of parsing. I'd guess most pages don't have lots of tags open at that point, though?

> 3. https://codereview.chromium.org/2629393002 adds metrics for `\n`
> and `<` characters occurring during URL parsing (e.g. as a result of
> processing `<img src>` or `<link href>`).
> https://codereview.chromium.org/2634893003 adds a flag to cause such
> resolutions to error out for non-`data:`/`javascript:` URLs.

This puts three string scans over every URL we resolve in the engine, just for a use counter. I'd actually like us to revert this one: I don't think we'd want the spec to require this, and I don't think we should be paying that cost on every URL in the document. Our URL processing is already a bottleneck on pages with lots of links, like Wikipedia.
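If we keep the counter at all, a single pass over the characters would at least avoid repeated scans. A rough illustration in Python (not the actual KURL code, and assuming the counter only needs to know whether each character class occurs):

```python
# Illustrative single-pass scan (Python, not Blink's actual URL code):
# gather both facts in one traversal instead of one scan per character.

def scan_url_once(url: str):
    """One pass recording whether '\n' and/or '<' occur in the URL."""
    has_newline = has_lt = False
    for ch in url:
        if ch == "\n":
            has_newline = True
        elif ch == "<":
            has_lt = True
        if has_newline and has_lt:
            break  # both already seen; nothing more to learn
    return has_newline, has_lt

assert scan_url_once("https://a.example/<\npath") == (True, True)
assert scan_url_once("https://a.example/ok") == (False, False)
```

This doesn't remove the per-URL cost, but it bounds it at one traversal regardless of how many characters the counter tracks.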
I chatted w/ esprehn@ about this. (Please correct me if I got things wrong.)

Some facts:

- The HTML tokenizer / tree builder is far from optimal. An internal experiment showed that it can be two orders of magnitude faster than its current state if we drastically simplify the grammar.
  - Even with the full spec grammar, I'm quite confident that we can apply modern interpreter optimization techniques to get close to that.
- However, the amount of time spent in the tokenizer / tree builder seems trivial compared to other components, which doesn't really justify the optimization work atm.
On Wed, Jan 18, 2017 at 11:52 PM, 'Kouhei Ueno' via blink-dev <blin...@chromium.org> wrote:
> HTML tokenizer / tree-builder is far from optimal. Internal
> experiment showed that it can be two orders of magnitude faster than
> its current state if we drastically simplify the grammar.

This reported number seems to vary a lot. Could someone share data?
> 2. https://codereview.chromium.org/2628723004 adds metrics for
> `<textarea>` and `<select>` being closed by end-of-file, and a flag
> to block form submission in those cases.
Is there a particular reason to do this for EOF only as opposed to
defaulting to parser-created "select" being flagged as "do not submit"
and a </select> end tag marking it as not "do not submit" as I
suggested on GitHub? (That would account for more cases of lack of end
tag but would have the kind of compat risk discussed for <button> on
GitHub.)
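To sketch the alternative I'm suggesting (Python pseudocode with hypothetical names; the real logic would live in the tree builder, not look like this):

```python
# Sketch of the suggested scheme (names are mine, not Blink's):
# a parser-created select starts out flagged "do not submit", and an
# explicit </select> end tag clears the flag, so only properly closed
# controls ever submit -- covering EOF and other missing-end-tag cases.

class SelectElement:
    def __init__(self, parser_created: bool):
        # Parser-created elements are suspect until explicitly closed.
        self.do_not_submit = parser_created

    def on_explicit_end_tag(self):
        # An actual </select> token in the markup clears the flag.
        self.do_not_submit = False

# Well-formed markup: <select>...</select> seen by the tokenizer.
ok = SelectElement(parser_created=True)
ok.on_explicit_end_tag()
assert not ok.do_not_submit

# Truncated or injection-mangled markup: the end tag is never seen,
# whether the parser hit EOF or merely implied the close.
bad = SelectElement(parser_created=True)
assert bad.do_not_submit
```

The point is that the flag keys off the presence of the end tag rather than off EOF specifically, so it also covers implied closes mid-document.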
> The number we measured in the experiment was 19x faster at
> parsing. The corpus we parsed was the HTML specification
> (at the time, which was a few years ago). Obviously, the parser
> resulted in a different DOM, so the comparison isn't as direct as
> between two implementations that produce the same DOM.
Making parsing changes that'd change the resulting DOM would be
harmful both for interop generally and for markup generators to be
able to have security-sensitive expectations of how stuff parses.