
Re: character encoding in the HTML parser


Simon Sapin

Mar 29, 2014, 6:56:36 PM
to dev-...@lists.mozilla.org
On 10/03/2014 23:54, Keegan McAllister wrote:
> [...]
>
> Also, should we follow Gecko in representing the input stream as a
> queue of UTF-16 buffers? With UTF-8 we would have about half as much
> data to stream through the parser, and (on 64-bit) we could do
> case-insensitive ASCII operations 8 characters at a time.
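
A minimal sketch of that 8-bytes-at-a-time idea, assuming the pattern
being matched consists only of lowercase ASCII letters (the helper is
hypothetical, not taken from any existing parser):

    /// Case-insensitively compare 8 input bytes against a pattern made
    /// up entirely of lowercase ASCII letters.
    fn eq_ignore_ascii_case_8(input: &[u8; 8], lowercase_pattern: &[u8; 8]) -> bool {
        let folded = u64::from_ne_bytes(*input) | 0x2020_2020_2020_2020;
        // Every pattern byte already has bit 0x20 set, so after OR-ing it
        // into the input, the only bytes that still compare equal are the
        // exact lowercase or uppercase forms of each pattern letter.
        folded == u64::from_ne_bytes(*lowercase_pattern)
    }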

UTF-8 would be nice, but I think we’re stuck with UTF-16 for the reasons
below. (Actually sequences of 16-bit integers containing
potentially-invalid UTF-16. Is that called "UCS-2"?)

Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8
without the artificial restriction of not encoding surrogates. But I
feel bad just suggesting it. (This restriction, as well as the upper
limit of U+10FFFF, exist to align with the value-space of UTF-16.
UTF-8’s underlying algorithm is perfectly fine without either.)
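
Concretely, a lone surrogate would just use the ordinary three-byte
pattern that standard UTF-8 refuses to emit. A hypothetical encoder
(not from any real library):

    /// Encode a surrogate code point (U+D800..U+DFFF) with the generic
    /// three-byte UTF-8 pattern, skipping the validity check.
    /// For example, U+D800 becomes ED A0 80.
    fn encode_surrogate(code_point: u16) -> [u8; 3] {
        assert!((0xD800..=0xDFFF).contains(&code_point));
        [
            0xE0 | (code_point >> 12) as u8,
            0x80 | ((code_point >> 6) & 0x3F) as u8,
            0x80 | (code_point & 0x3F) as u8,
        ]
    }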


> Speaking of which, [5]
>
>> Any character that is a not a Unicode character, i.e. any isolated
>> surrogate, is a parse error. (These can only find their way into
>> the input stream via script APIs such as document.write().)
>
> I don't see a designated error-recovery behavior, unlike most parse
> errors in the spec. Is there an implicit behavior that applies to
> input stream preprocessing? Anyway I hope that this means we don't
> need to represent isolated surrogates in the input to a UTF-8
> parser.
>
> [5] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#input-stream

As far as I understand, a "parse error" in the spec is meant for
conformance checkers (validators), not user agents. There is no error
recovery behavior, because this is not an error.


I’d be surprised if anyone relies on truly isolated surrogates, because
Chrome gets very confused when rendering them:

data:text/html,<script>document.write("a\uD800b")</script>


Unfortunately I’d be less surprised if someone relies on having the two
halves of a surrogate pair in separate document.write() calls, as this
seems more interoperable:

data:text/html,<script>document.write("\uD83D");document.write("\uDCA9")</script>

--
Simon Sapin

Boris Zbarsky

Mar 29, 2014, 7:15:25 PM
to mozilla-...@lists.mozilla.org
On 3/29/14 6:56 PM, Simon Sapin wrote:
> Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8
> without the artificial restriction of not encoding surrogates.

http://en.wikipedia.org/wiki/CESU-8

> As far as I understand, a "parse error" in the spec is meant for
> conformance checkers (validators), not user agents. There is no error
> recovery behavior, because this is not an error.

Technically a UA is allowed to just stop parsing at a "parse error".
For a browser, of course, that's not web-compatible as you note.

-Boris

Simon Sapin

Mar 30, 2014, 4:52:46 AM
to dev-...@lists.mozilla.org
On 29/03/2014 23:15, Boris Zbarsky wrote:
> On 3/29/14 6:56 PM, Simon Sapin wrote:
>> Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8
>> without the artificial restriction of not encoding surrogates.
> http://en.wikipedia.org/wiki/CESU-8

CESU-8 is evil too, but it’s not what I had in mind. Its main
characteristic is encoding non-BMP characters as surrogate pairs, which
does not change the value space.

But http://www.unicode.org/reports/tr26/ is unclear about whether CESU-8
allows unpaired surrogates (which was the issue in the previous
message). I suppose it does not, by virtue of valid UTF-16 not allowing
them either.

--
Simon Sapin

Simon Sapin

Mar 30, 2014, 12:18:01 PM
to dev-...@lists.mozilla.org
On 29/03/2014 22:56, Simon Sapin wrote:
> On 10/03/2014 23:54, Keegan McAllister wrote:
>> Speaking of which, [5]
>>
>>> Any character that is a not a Unicode character, i.e. any isolated
>>> surrogate, is a parse error. (These can only find their way into
>>> the input stream via script APIs such as document.write().)
>>
>> I don't see a designated error-recovery behavior, unlike most parse
>> errors in the spec. Is there an implicit behavior that applies to
>> input stream preprocessing? Anyway I hope that this means we don't
>> need to represent isolated surrogates in the input to a UTF-8
>> parser.
>>
>> [5] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#input-stream
>
> As far as I understand, a "parse error" in the spec is meant for
> conformance checkers (validators), not user agents. There is no error
> recovery behavior, because this is not an error.
>
>
> I’d be surprised if anyone relies on truly isolated surrogates, because
> Chrome gets very confused when rendering them:
>
> data:text/html,<script>document.write("a\uD800b")</script>
>
>
> Unfortunately I’d be less surprised if someone relies on having the two
> halves of a surrogate pair in separate document.write() calls, as this
> seems more interoperable:
>
> data:text/html,<script>document.write("\uD83D");document.write("\uDCA9")</script>

More on this:

https://www.w3.org/Bugs/Public/show_bug.cgi?id=11298#c2
https://github.com/html5lib/html5lib-tests/issues/19

(Thanks Geoffrey Sneddon for doing the archaeology work.)


--
Simon Sapin

James Graham

Mar 31, 2014, 7:19:32 AM
to dev-...@lists.mozilla.org
On 10/03/14 23:54, Keegan McAllister wrote:
> Should we implement character encoding detection [1] at the same time
> as the rest of the HTML parser? It seems to be separable; the only
> design interactions I see are:
>
> - The character decoder and script APIs can write into the same input
> stream - The encoding can change during parsing [2], which can be
> handled as a mostly-normal navigation
>
> Also, should we follow Gecko in representing the input stream as a
> queue of UTF-16 buffers? With UTF-8 we would have about half as much
> data to stream through the parser, and (on 64-bit) we could do
> case-insensitive ASCII operations 8 characters at a time.
>
> Most content [3] is UTF-8, and the trend is in that direction. But
> for non-UTF-8 content we would pay the price of two conversions, if
> the ultimate product is UCS-2 DOM strings. However I don't think
> we've finalized our representation of DOM strings, and we might do
> the UCS-2 conversion lazily [4].

Apart from the compat problems that Simon already mentioned (which I
think are important), it's worth considering in what cases the parser is
likely to be a bottleneck. I believe that in load-type situations the
parser is almost never the performance limiting factor (assuming a
reasonably well optimised implementation); it's much more likely that
network performance, layout, or scripting will dominate the time to
load. On the other hand I think that there exist pages which do things
like run innerHTML in performance-critical loops. Therefore I expect it
makes more sense for the parser to operate on the same string type as
the DOM than the same type as the network data.

Keegan McAllister

Mar 31, 2014, 10:01:13 PM
to James Graham, dev-...@lists.mozilla.org
> Unfortunately I’d be less surprised if someone relies on having the two
> halves of a surrogate pair in separate document.write() calls, as this
> seems more interoperable:
>
> data:text/html,<script>document.write("\uD83D");document.write("\uDCA9")</script>

The tokenizer's input is a queue of buffers, and I'm imagining that document.write will insert a new buffer into that queue (at the script's insertion point) without modifying existing buffers. In that case we can track an optional trailing high surrogate and/or leading low surrogate for each buffer, with the rest of the content as UTF-8. This is a hack but seems pretty straightforward.
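
As a rough sketch of the shape I have in mind (names invented for
illustration, not actual library code):

    /// One buffer in the tokenizer's input queue: valid UTF-8 in the
    /// middle, plus at most one unpaired surrogate half at each end,
    /// waiting to be matched against a neighbouring buffer.
    struct InputBuffer {
        /// A low surrogate (0xDC00..=0xDFFF) that pairs with the previous
        /// buffer's trailing high surrogate, if any.
        leading_low_surrogate: Option<u16>,
        /// The bulk of the buffer, as ordinary UTF-8.
        content: String,
        /// A high surrogate (0xD800..=0xDBFF) waiting for a later
        /// document.write() to supply its other half.
        trailing_high_surrogate: Option<u16>,
    }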

At that point are there any remaining compat issues? It does seem like replacing truly lone surrogates with U+FFFD would be an acceptable deviation from the spec, but maybe we want to avoid those absolutely.


> Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8
> without the artificial restriction of not encoding surrogates.

We'd lose the ability to use Rust's native ~str, so I think it's not worth it unless we can demonstrate a significant performance win from a smaller and ASCII-compatible encoding.


> I believe that in load-type situations the parser is almost never
> the performance limiting factor...
> On the other hand I think that there exist pages which do things
> like run innerHTML in performance-critical loops. Therefore I expect it
> makes more sense for the parser to operate on the same string type as
> the DOM than the same type as the network data.

Good point. Even if the DOM is a mix of UTF-8 lazily converted to UCS-2, the argument to document.write or the innerHTML setter is a JS string.

I was thinking about network data as the bulk of the parser input by volume, but you're right that performance requirements differ between net and script. Also I've worked on webapps that produced the majority of their HTML from client-side templates in JS; this may be fairly common by now.

Since the tokenizer is ≈complete and passing tests, perhaps I'll prepare a branch which uses UTF-16 strings so I can compare performance. I haven't profiled or optimized the tokenizer at all yet, so I'll need to do some of that first to get relevant results.

It might be that a UTF-8 parser is no faster, but still worth it for the optimization of keeping DOM strings (especially long text nodes) as UTF-8 until touched by script. I don't know how we would easily test that at this stage.


Much as I'd like to use UTF-8 on principle, it does seem doubtful that practical benefits will justify shoehorning it into a platform which is basically incompatible :/

keegan

Boris Zbarsky

Mar 31, 2014, 10:25:22 PM
to mozilla-...@lists.mozilla.org
On 3/31/14 10:01 PM, Keegan McAllister wrote:
> Good point. Even if the DOM is a mix of UTF-8 lazily converted to UCS-2, the argument to document.write or the innerHTML setter is a JS string.

Which may itself in the future be some mix of UCS-2 and ASCII, or UCS-2
and "evil UTF-8", or just UTF-16, or UTF-8. The JS folks are very
interested in trying to move away from pure-UCS-2, for memory reasons...

-Boris

Simon Sapin

Apr 1, 2014, 5:50:29 AM
to dev-...@lists.mozilla.org, Henri Sivonen
On 01/04/2014 03:01, Keegan McAllister wrote:
> It does seem like replacing truly lone surrogates with U+FFFD would
> be an acceptable deviation from the spec, but maybe we want to avoid
> those absolutely.

As much as I’d like this to be true, I don’t know. Henri seemed pretty
opposed to changing the spec that way, a few years ago:

https://www.w3.org/Bugs/Public/show_bug.cgi?id=11298#c2


Henri, any comment?

--
Simon Sapin

Keegan McAllister

Apr 1, 2014, 3:07:13 PM
to Boris Zbarsky, mozilla-...@lists.mozilla.org
> The JS folks are very
> interested in trying to move away from pure-UCS-2, for memory reasons...

That's very interesting. Servo seems like a good place to prototype that with a DOM to match.

Who should I talk to about JS string representation changes?

keegan

Boris Zbarsky

Apr 1, 2014, 3:18:46 PM
to mozilla-...@lists.mozilla.org
On 4/1/14 3:07 PM, Keegan McAllister wrote:
> Who should I talk to about JS string representation changes?

Eddy Bruel. ejpbruel at mozilla dot com.

-Boris

Robert O'Callahan

Apr 2, 2014, 9:25:12 AM
to Boris Zbarsky, Eddy Bruel, mozilla-...@lists.mozilla.org
On Tue, Apr 1, 2014 at 3:18 PM, Boris Zbarsky <bzba...@mit.edu> wrote:

> On 4/1/14 3:07 PM, Keegan McAllister wrote:
>
>> Who should I talk to about JS string representation changes?
>>
>
> Eddy Bruel. ejpbruel at mozilla dot com.


If we could get the JS engine to use evil-UTF8 with some hack to handle
charAt and friends efficiently (e.g. tacking on a UCS-2 version of the
string when necessary), then we could use evil-UTF8 exclusively in Servo
and that would be awesome.

Rob
--
Jtehsauts tshaei dS,o n" Wohfy Mdaon yhoaus eanuttehrotraiitny eovni
le atrhtohu gthot sf oirng iyvoeu rs ihnesa.r"t sS?o Whhei csha iids teoa
stiheer :p atroa lsyazye,d 'mYaonu,r "sGients uapr,e tfaokreg iyvoeunr,
'm aotr atnod sgaoy ,h o'mGee.t" uTph eann dt hwea lmka'n? gBoutt uIp
waanndt wyeonut thoo mken.o w

Luke Wagner

Apr 2, 2014, 10:45:50 AM
to rob...@ocallahan.org, Boris Zbarsky, mozilla-...@lists.mozilla.org, Nicholas Nethercote, Eddy Bruel
Nick just brought up the topic of adding a second compact string representation to the JS engine (motivated by string memory use). One question was whether to use ASCII (which V8 does, iirc) or UTF8. Several DOM people have pointed out over the years that if SM would accept UTF8 it'd be really good for Gecko, so I was leaning toward UTF8 but hoping for some measurement of how much it'd win over ASCII to back this up. Sounds like UTF8 would serve Servo well too.

Another question is how much of the total win would result from adding the ability to parse UTF8 directly (technically, the JSAPI accepts utf8 chars, but they are immediately inflated) vs. JS strings passed in/out through JSAPI.

----- Original Message -----
> On Tue, Apr 1, 2014 at 3:18 PM, Boris Zbarsky <bzba...@mit.edu> wrote:
>
> > On 4/1/14 3:07 PM, Keegan McAllister wrote:
> >
> >> Who should I talk to about JS string representation changes?
> >>
> >
> > Eddy Bruel. ejpbruel at mozilla dot com.
>
>
> If we could get the JS engine to use evil-UTF8 with some hack to handle
> charAt and friends efficiently (e.g. tacking on a UCS-2 version of the
> string when necessary), then we could use evil-UTF8 exclusively in Servo
> and that would be awesome.
>
> Rob
> --
> Jtehsauts tshaei dS,o n" Wohfy Mdaon yhoaus eanuttehrotraiitny eovni
> le atrhtohu gthot sf oirng iyvoeu rs ihnesa.r"t sS?o Whhei csha iids teoa
> stiheer :p atroa lsyazye,d 'mYaonu,r "sGients uapr,e tfaokreg iyvoeunr,
> 'm aotr atnod sgaoy ,h o'mGee.t" uTph eann dt hwea lmka'n? gBoutt uIp
> waanndt wyeonut thoo mken.o w
> _______________________________________________
> dev-servo mailing list
> dev-...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-servo
>

Robert O'Callahan

Apr 2, 2014, 11:15:24 PM
to Luke Wagner, Boris Zbarsky, mozilla-...@lists.mozilla.org, Nicholas Nethercote, Eddy Bruel
On Wed, Apr 2, 2014 at 10:45 AM, Luke Wagner <lu...@mozilla.com> wrote:

> Nick just brought up the topic of adding a second compact string
> representation to the JS engine (motivated by string memory use). One
> question was whether to use ASCII (which V8 does, iirc) or UTF8. Several
> DOM people have pointed out over the years that if SM would accept UTF8
> it'd be really good for Gecko, so I was leaning toward UTF8 but hoping for
> some measurement of how much it'd win over ASCII to back this up. Sounds
> like UTF8 would serve Servo well too.
>

I think the biggest win for using a UTF8-alike is that it lets you write a
single code path that's optimal for ASCII but also handles everything else.

Henri Sivonen

Apr 3, 2014, 7:37:25 AM
to dev-...@lists.mozilla.org
My position at the time was based on the assumption that all real
browser implementations represent DOM strings as 16-bit units. I saw
no value in adding complexity for the sake of Unicode theoretical
purity when it was really 16-bit units in every real browser
implementation.

If Servo is seriously moving to 8-bit code units, then I think I'd be
OK with lone surrogates converting to the REPLACEMENT CHARACTER at the
WebIDL layer (conceptually; as an optimization, you'd probably want to
do it in the parser in the case of document.write() or innerHTML).

--
Henri Sivonen
hsiv...@hsivonen.fi
https://hsivonen.fi/

Henri Sivonen

Apr 3, 2014, 8:03:35 AM
to Robert O'Callahan, Boris Zbarsky, mozilla-...@lists.mozilla.org, Eddy Bruel
On Wed, Apr 2, 2014 at 4:25 PM, Robert O'Callahan <rob...@ocallahan.org> wrote:
> If we could get the JS engine to use evil-UTF8 with some hack to handle
> charAt and friends efficiently (e.g. tacking on a UCS-2 version of the
> string when necessary)

Have we instrumented Gecko to find out what the access patterns are
like? If the main performance-sensitive access pattern is sequential
iteration over the string, instead of storing a UCS-2 copy, we could
store the next expected UCS-2 index and the next UTF-8 index. charAt
would then start by checking whether its argument equals the next
expected UCS-2 index, in which case the read would start at the next
UTF-8 index.
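
A rough sketch of that cursor-caching idea (hypothetical code, nothing
like SpiderMonkey's actual string representation; it also assumes
BMP-only content, whereas real code would have to split 4-byte
sequences into surrogate pairs):

    use std::cell::Cell;

    struct Utf8String {
        text: String,           // UTF-8 contents
        next_unit: Cell<usize>, // next expected UCS-2 index
        next_byte: Cell<usize>, // byte offset corresponding to next_unit
    }

    impl Utf8String {
        /// charAt in 16-bit-unit terms: cheap when indices arrive
        /// sequentially, falls back to rescanning from the start otherwise.
        fn char_at(&self, i: usize) -> u16 {
            if i < self.next_unit.get() {
                // Went backwards: reset the cached cursor.
                self.next_unit.set(0);
                self.next_byte.set(0);
            }
            let mut unit = self.next_unit.get();
            let mut byte = self.next_byte.get();
            let mut result = 0u16;
            for c in self.text[byte..].chars() {
                byte += c.len_utf8();
                result = c as u16; // BMP-only assumption
                unit += 1;
                if unit > i {
                    break;
                }
            }
            // Remember where we stopped so the next sequential call is cheap.
            self.next_unit.set(unit);
            self.next_byte.set(byte);
            result
        }
    }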

Boris Zbarsky

Apr 3, 2014, 11:07:00 AM
to mozilla-...@lists.mozilla.org
On 4/3/14 8:03 AM, Henri Sivonen wrote:
> Have we instrumented Gecko to find out what the access patterns are
> like?

We have not, but I will bet money the answer is "different for
benchmarks and actual content"...

-Boris

Luke Wagner

Apr 3, 2014, 12:02:38 PM
to Henri Sivonen, Boris Zbarsky, mozilla-...@lists.mozilla.org, Robert O'Callahan, Eddy Bruel
Another option we've just been discussing is to lazily compute a flag on the string indicating "contents are 7-bit ASCII" that would allow us to use array indexing. I'd expect this to often be true. There are also many cases where we'd eagerly have this flag (atoms produced during parsing, strings converted from numbers, concatenations of 7-bit ASCII strings, substrings of 7-bit ASCII strings, as a parameter from the embedding, etc.) so that we would be able to avoid much of the overhead of this check. (One could even imagine a background thread that computed 7-bit-ness ;)
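
Roughly this shape, as a hypothetical sketch (a real engine would fold the flag into existing header bits rather than a separate field):

    use std::cell::Cell;

    struct FlatString {
        bytes: Vec<u8>,                 // UTF-8 contents
        ascii_flag: Cell<Option<bool>>, // None = not computed yet
    }

    impl FlatString {
        fn is_ascii(&self) -> bool {
            // Lazily compute and cache the "contents are 7-bit" flag.
            if let Some(flag) = self.ascii_flag.get() {
                return flag;
            }
            let flag = self.bytes.iter().all(|&b| b < 0x80);
            self.ascii_flag.set(Some(flag));
            flag
        }

        fn char_at(&self, i: usize) -> u16 {
            if self.is_ascii() {
                // 7-bit contents: byte index and UCS-2 index coincide.
                self.bytes[i] as u16
            } else {
                unimplemented!("fall back to a slower UTF-8-aware path")
            }
        }
    }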

----- Original Message -----
> On Wed, Apr 2, 2014 at 4:25 PM, Robert O'Callahan <rob...@ocallahan.org>
> wrote:
> > If we could get the JS engine to use evil-UTF8 with some hack to handle
> > charAt and friends efficiently (e.g. tacking on a UCS-2 version of the
> > string when necessary)
>
> Have we instrumented Gecko to find out what the access patterns are
> like? If the main performance-sensitive access pattern is sequential
> iteration over the string, instead of storing a UCS-2 copy, we could
> store the next expected UCS-2 index and the next UTF-8 index. charAt
> would then start by comparing if its argument equals the next expected
> UCS-2 index in which case read would start at the next UTF-8 index.
>
> --
> Henri Sivonen
> hsiv...@hsivonen.fi
> https://hsivonen.fi/

Keegan McAllister

Apr 22, 2014, 7:15:56 PM
to Luke Wagner, Henri Sivonen, Boris Zbarsky, mozilla-...@lists.mozilla.org, Robert O'Callahan, Eddy Bruel
You could use a rope where individual chunks can be either UTF-8 or UCS-2. UTF-8 strings would also record whether they happen to be 7-bit ASCII, and UCS-2 strings would record whether they contain any surrogates (and also maybe whether all surrogates are correctly paired according to UTF-16). Together with the lengths stored throughout the rope's tree structure, this would enable efficient indexing by either UCS-2 or Unicode codepoint index.
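
A hypothetical sketch of that structure (names invented for illustration):

    /// A rope leaf: each chunk keeps its own encoding plus the
    /// cheap-to-test properties mentioned above.
    enum Chunk {
        Utf8 { bytes: Vec<u8>, all_ascii: bool },
        Ucs2 { units: Vec<u16>, has_surrogates: bool },
    }

    /// Lengths of a child in both unit systems; internal nodes can walk
    /// these totals to route a UCS-2 index or a codepoint index to the
    /// right leaf.
    struct ChildLengths {
        ucs2_units: usize,
        code_points: usize,
    }

    enum Rope {
        Leaf(Chunk),
        Node(Vec<(ChildLengths, Rope)>),
    }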

That's a fair bit of complexity and so probably only worth it for large strings built in complicated ways. I don't know whether these are common in the wild.

If you want to get really fancy, I think you could use succinct rank/select dictionaries or wavelet trees to make UTF-8 codepoint-indexable with minimal space overhead. (But I don't actually know much about these data structures!)

keegan

Keegan McAllister

Apr 22, 2014, 10:31:12 PM
to Luke Wagner, Henri Sivonen, Boris Zbarsky, mozilla-...@lists.mozilla.org, Robert O'Callahan, Eddy Bruel
After some general performance improvements, I prepared a UTF-16 version [1] of the tokenizer, using Ms2ger's DOMString with some Rust updates and performance fixes [2], and benchmarked it against the UTF-8 version on master. Even excluding the cost of UTF-8 to UTF-16 conversion, the UTF-16 tokenizer is about 10% - 20% slower. The exception is documents consisting of (for example) Chinese text with no markup, for which UTF-16 is significantly faster. But real documents contain a lot of ASCII markup, whatever the text language. I measured a 15% slowdown for http://sina.com.cn.

What of jgraham's point that parsing initiated by script is more performance-sensitive? Here the proper comparison is a native UTF-16 parser vs. a conversion [3] from UTF-16 going into the UTF-8 parser. The latter is about 25% - 55% slower on realistic HTML; I tested complete documents as well as small fragments you might get from innerHTML. Tokenizing plain text is much slower with the conversion, but still fast in absolute terms (15 ms for 1 MB of text).

So this is a significant penalty to script-initiated parsing today. But it sounds like there is interest in extending SpiderMonkey to use UTF-8, perhaps alongside UTF-16. Servo seems like a good place to try this out. And we might want a UTF-8 DOM anyway [4]. So I'd like to stick with UTF-8 in the parser. It's also more friendly to non-Servo users; ideally this library will be the standard HTML parser for Rust programs.

I also benchmarked against Hubbub's tokenizer. It's hard to get an exact comparison because Hubbub doesn't copy the strings making up tokens; it just passes pointers into the original input. (We could do this too, but the memory safety story is much more complicated. My hope is that it's ultimately a wash because the strings will necessarily get copied at some point before they go into the DOM.) Even so, we win on some benchmarks: we're 2.8x as fast as Hubbub on webapps.html.

I also experimented with using SSE 4.2 string instructions to accelerate tokenization. In a standalone program [5] that looks for spaces, it gives more than a 10x speedup. But my first attempt [6] to integrate this with the HTML tokenizer produced only a 5% - 7% overall improvement. I think we can push this quite a bit higher by using more of the information from the fast search and by using it in more places. (If you want to understand this code, look at [6] because it's the version with comments. :) These instructions can handle UCS-2 as well, but I didn't try that yet.
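
For reference, the core of the PCMPESTRI trick looks roughly like this in current Rust. This is only a sketch with an assumed delimiter set, not the code in [5] or [6]:

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "sse4.2")]
    unsafe fn find_special_byte(haystack: &[u8]) -> Option<usize> {
        use std::arch::x86_64::*;
        // Bytes the tokenizer cares about (whitespace, '<', '&'), padded to
        // 16; the real needle count is passed to PCMPESTRI separately.
        let needles = *b"\t\n\x0C\r <&\0\0\0\0\0\0\0\0\0";
        let set = _mm_loadu_si128(needles.as_ptr() as *const __m128i);
        let mut offset = 0;
        while offset + 16 <= haystack.len() {
            let chunk = _mm_loadu_si128(haystack.as_ptr().add(offset) as *const __m128i);
            // Mode 0 = unsigned bytes, "equal any": index of the first byte
            // of `chunk` matching any needle, or 16 if none matched.
            let idx = _mm_cmpestri(set, 7, chunk, 16, 0);
            if idx < 16 {
                return Some(offset + idx as usize);
            }
            offset += 16;
        }
        // Finish the trailing partial block with a plain scan.
        haystack[offset..]
            .iter()
            .position(|b| b"\t\n\x0C\r <&".contains(b))
            .map(|i| offset + i)
    }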

keegan

[1] https://github.com/kmcallister/html5/tree/utf16-vec
[2] https://github.com/kmcallister/html5/blob/utf16-vec/src/util/domstring.rs
[3] https://github.com/kmcallister/html5/blob/utf16to8/bench/tokenizer.rs#L63
[4] https://github.com/mozilla/servo/issues/1880
[5] https://gist.github.com/kmcallister/11200458
[6] https://github.com/kmcallister/html5/blob/sse/src/tokenizer/buffer_queue.rs#L182-L252