Re: [whatwg] Default encoding to UTF-8?

Henri Sivonen

Jan 3, 2012, 3:33:02 AM

to wha...@whatwg.org

On Thu, Dec 22, 2011 at 12:36 PM, Leif Halvard Silli <l...@russisk.no> wrote:
>> It's unclear to me if you are talking about HTTP-level charset=UNICODE
>> or charset=UNICODE in a meta. Is content labeled with charset=UNICODE
>> BOMless?
>
> Charset=UNICODE in meta, as generated by MS tools (Office or IE, eg.)
> seems to usually be "BOM-full". But there are still enough occurrences
> of pages without BOM. I have found UTF-8 pages with the charset=unicode
> label in meta. But the few pages I found contained either BOM or
> HTTP-level charset=utf-8. I have too little "research material" when it
> comes to UTF-8 pages with charset=unicode inside.

Making 'unicode' an alias of UTF-16 or UTF-16LE would be useless for
pages that have a BOM, because the BOM is already inspected before
<meta> and if HTTP-level charset is unrecognized, the BOM wins.

Making 'unicode' an alias of UTF-16 or UTF-16LE would be useful for
UTF-8-encoded pages that say charset=unicode in <meta> if alias
resolution happens before UTF-16 labels are mapped to UTF-8.

Making 'unicode' an alias for UTF-16 or UTF-16LE would be useless for
pages that are (BOMless) UTF-16LE and that have charset=unicode in
<meta>, because the <meta> prescan doesn't see UTF-16-encoded metas.
Furthermore, it doesn't make sense to make the <meta> prescan look for
UTF-16-encoded metas, because it would make sense to honor the value
only if it matched a flavor of UTF-16 appropriate for the pattern of
zero bytes in the file, so it would be more reliable and
straightforward to just analyze the pattern of zero bytes without bothering to
look for UTF-16-encoded <meta>s.
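
A minimal sketch of what "analyzing the pattern of zero bytes" could look
like, assuming mostly-ASCII markup. The function name, the 1024-byte sample
and both thresholds are hypothetical choices for illustration, not something
any browser is known to use.

```python
from typing import Optional


def guess_utf16_flavor(data: bytes, sample: int = 1024,
                       threshold: float = 0.4) -> Optional[str]:
    """Guess UTF-16LE or UTF-16BE from where the zero bytes fall.

    Returns None when the pattern is inconclusive.
    """
    head = data[:sample]
    head = head[: len(head) - len(head) % 2]  # keep whole 16-bit units only
    if not head:
        return None
    units = len(head) // 2
    even_ratio = head[0::2].count(0) / units  # first byte of each unit
    odd_ratio = head[1::2].count(0) / units   # second byte of each unit
    # ASCII-range characters in UTF-16LE have a zero second byte;
    # in UTF-16BE they have a zero first byte.
    if odd_ratio >= threshold and even_ratio < 0.05:
        return "utf-16le"
    if even_ratio >= threshold and odd_ratio < 0.05:
        return "utf-16be"
    return None
```

Note that the label inside the <meta> plays no part here: the byte pattern
alone decides the flavor, which is the point made above.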

> When the detector says UTF-8 - that is step 7 of the sniffing algorithm,
> no?
> http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

Yes.

>> 2) Start the parse assuming UTF-8 and reload as Windows-1252 if the
>> detector says non-UTF-8.
...
> I think you are mistaken there: If parsers perform UTF-8 detection,
> then unlabelled pages will be detected, and no reparsing will happen.
> Not even increase. You at least need to explain this negative spiral
> theory better before I buy it ... Step 7 will *not* lead to reparsing
> unless the default encoding is WINDOWS-1252. If the default encoding is
> UTF-8 and step 7 detects UTF-8, then parsing can continue
> uninterrupted.

That would be what I labeled as option #2 above.

> What we will instead see is that those using legacy encodings must be
> more clever in labelling their pages, or else they won't be detected.

Many pages that use legacy encodings are legacy pages that aren't
actively maintained. Unmaintained pages aren't going to become more
clever about labeling.

> I am a bit baffled here: It sounds like you say that there will be bad
> consequences if browsers become more reliable ...

Becoming more reliable can be bad if the reliability comes at the cost
of performance, which would be the case if the kind of heuristic
detector that e.g. Firefox has was turned on for all locales. (I don't
mean the performance impact of running a detector state machine. I
mean the performance impact of reloading the page or, alternatively,
the loss of incremental rendering.)

A solution that would border on reasonable would be decoding as
US-ASCII up to the first non-ASCII byte and then deciding between
UTF-8 and the locale-specific legacy encoding by examining the first
non-ASCII byte and up to 3 bytes after it to see if they form a valid
UTF-8 byte sequence. But trying to gain more statistical confidence
about UTF-8ness than that would be bad for performance (either due to
stalling stream processing or due to reloading).
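
A rough sketch of that check, operating on a complete buffer for simplicity.
The function names and the "windows-1252" fallback are assumptions made for
illustration; in a browser the fallback would be the locale-specific legacy
encoding. Keeping the legacy default for all-ASCII input is also an
assumption of this sketch.

```python
from typing import Optional


def first_non_ascii_is_utf8(data: bytes) -> Optional[bool]:
    """Check only the first non-ASCII sequence; None means all ASCII."""
    start = next((i for i, b in enumerate(data) if b >= 0x80), None)
    if start is None:
        return None
    lead = data[start]
    if 0xC2 <= lead <= 0xDF:
        length = 2
    elif 0xE0 <= lead <= 0xEF:
        length = 3
    elif 0xF0 <= lead <= 0xF4:
        length = 4
    else:
        return False  # continuation byte or invalid lead byte
    try:
        # Strict decoding rejects truncated, overlong and surrogate sequences.
        data[start:start + length].decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def choose_encoding(data: bytes, legacy: str = "windows-1252") -> str:
    # All-ASCII input (None) keeps the legacy default in this sketch.
    return "utf-8" if first_non_ascii_is_utf8(data) else legacy
```

At most four bytes are inspected after the ASCII prefix, so the decision
never needs more lookahead than one UTF-8 sequence.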

> Apart from UTF-16, Chrome seems quite aggressive w.r.t. encoding
> detection. So it might still be a competitive advantage.

It would be interesting to know what exactly Chrome does. Maybe
someone who knows the code could enlighten us?

>>> * Let's say that I *kept* ISO-8859-1 as default encoding, but instead
>>> enabled the Universal detector. The frame then works.
>>> * But if I make the frame page very short, 10 * the letter "ø" as
>>> content, then the Universal detector fails - on a test on my own
>>> computer, it guesses the page to be Cyrillic rather than Norwegian.
>>> * What's the problem? The Universal detector is too greedy - it tries
>>> to fix more problems than I have. I only want it to guess on "UTF-8".
>>> And if it doesn't detect UTF-8, then it should fall back to the locale
>>> default (including fall back to the encoding of the parent frame).
>>>
>>> Wouldn't that be an idea?
>>
>> No. The current configuration works for Norwegian users already. For
>> users from different silos, the ad might break, but ad breakage is
>> less bad than spreading heuristic detection to more locales.
>
> Here I must disagree: Less bad for whom?

For users performance-wise.

--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/

Henri Sivonen

Jan 3, 2012, 3:50:26 AM

to wha...@whatwg.org

On Tue, Jan 3, 2012 at 10:33 AM, Henri Sivonen <hsiv...@iki.fi> wrote:
> A solution that would border on reasonable would be decoding as
> US-ASCII up to the first non-ASCII byte and then deciding between
> UTF-8 and the locale-specific legacy encoding by examining the first
> non-ASCII byte and up to 3 bytes after it to see if they form a valid
> UTF-8 byte sequence. But trying to gain more statistical confidence
> about UTF-8ness than that would be bad for performance (either due to
> stalling stream processing or due to reloading).

And it's worth noting that the above paragraph states a "solution" to
the problem that is: "How to make it possible to use UTF-8 without
declaring it?"

Adding autodetection wouldn't actually force authors to use UTF-8, so
the problem Faruk stated at the start of the thread (authors not using
UTF-8 throughout systems that process user input) wouldn't be solved.

Leif Halvard Silli

Jan 3, 2012, 5:34:31 PM

to Henri Sivonen, wha...@whatwg.org

Henri Sivonen, Tue Jan 3 00:33:02 PST 2012:

> On Thu, Dec 22, 2011 at 12:36 PM, Leif Halvard Silli wrote:

> Making 'unicode' an alias of UTF-16 or UTF-16LE would be useful for
> UTF-8-encoded pages that say charset=unicode in <meta> if alias
> resolution happens before UTF-16 labels are mapped to UTF-8.

Yup.

> Making 'unicode' an alias for UTF-16 or UTF-16LE would be useless for
> pages that are (BOMless) UTF-16LE and that have charset=unicode in
> <meta>, because the <meta> prescan doesn't see UTF-16-encoded metas.

Hm. Yes. I see that I misread something, and ended up believing that
the <meta> would *still* be used if the mapping from 'UTF-16' to
'UTF-8' turned out to be incorrect. I guess I had not understood, well
enough, that the meta prescan *really* doesn't see UTF-16-encoded
metas. Also contributing was the fact that I did not realize that IE
doesn't actually read the page as UTF-16 but as Windows-1252:
<http://www.hughesrenier.be/actualites.html>. (Actually, browsers do
see the UTF-16 <meta>, but only if the default encoding is set to be
UTF-16 - see step 1 of '8.2.2.4 Changing the encoding while parsing'
<http://dev.w3.org/html5/spec/parsing.html#change-the-encoding>.)

> Furthermore, it doesn't make sense to make the <meta> prescan look for
> UTF-16-encoded metas, because it would make sense to honor the value
> only if it matched a flavor of UTF-16 appropriate for the pattern of
> zero bytes in the file, so it would be more reliable and
> straightforward to just analyze the pattern of zero bytes without bothering to
> look for UTF-16-encoded <meta>s.

Makes sense.

[ snip ]

>> What we will instead see is that those using legacy encodings must be
>> more clever in labelling their pages, or else they won't be detected.
>
> Many pages that use legacy encodings are legacy pages that aren't
> actively maintained. Unmaintained pages aren't going to become more
> clever about labeling.

But their Non-UTF-8-ness should be picked up in the first 1024 bytes?

[... sniff - sorry, meant snip ;-) ...]

> I mean the performance impact of reloading the page or,
> alternatively, the loss of incremental rendering.)
>

> A solution that would border on reasonable would be decoding as
> US-ASCII up to the first non-ASCII byte

Thus possibly prescan of more than 1024 bytes? Is it faster to scan
ASCII? (In Chrome, there does not seem to be an end to the prescan, as
long as the text source code is ASCII only.)

> and then deciding between
> UTF-8 and the locale-specific legacy encoding by examining the first
> non-ASCII byte and up to 3 bytes after it to see if they form a valid
> UTF-8 byte sequence.

Except for the specifics, that sounds like more or less the idea I
tried to state. Maybe it could be made into a bug in Mozilla? (I could
do it, but ...)

However, there is one thing that should be added: The parser should
default to UTF-8 even if it does not detect any UTF-8-ish non-ASCII. Is
that part of your idea? Because, if it does not behave like that, then
it would work as Google Chrome works now. For the following UTF-8
encoded (but charset-unlabelled) page, that means it defaults to
UTF-8:

<!DOCTYPE html><title>æøå</title></html>

While for this - identical - page, it would default to the locale
encoding, because the ASCII-based character entities mean that it does
not detect any UTF-8-ish characters:

<!DOCTYPE html><title>&aelig;&oslash;&aring;</title></html>

A weird variant of the latter example is UTF-8 based data URIs, where
all browsers (that I could test - IE only supports data URIs in the
@src attribute, including <script@src>) default to the locale encoding
(apart from Mozilla Camino - which has character detection enabled by
default):

data:text/html,<!DOCTYPE html><title>%C3%A6%C3%B8%C3%A5</title></html>

All the 3 examples above should default to UTF-8, if the "border on
sane" approach was applied.
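
To make the difference between the three cases concrete, here is a small
sketch of the bytes a detector would actually see. It assumes the
entity-escaped variant uses named entities, and the helper name
has_non_ascii is made up for this illustration.

```python
import html
from urllib.parse import unquote_to_bytes

# 1) Raw UTF-8 characters in the title.
raw = "<!DOCTYPE html><title>æøå</title></html>".encode("utf-8")
# 2) The same title written with ASCII-only character entities (assumed form).
escaped = b"<!DOCTYPE html><title>&aelig;&oslash;&aring;</title></html>"
# 3) The body of the data URI after percent-decoding.
data_uri_body = unquote_to_bytes(
    "<!DOCTYPE html><title>%C3%A6%C3%B8%C3%A5</title></html>"
)


def has_non_ascii(data: bytes) -> bool:
    """True if a byte-level detector has any non-ASCII byte to look at."""
    return any(b >= 0x80 for b in data)


for name, data in [("raw", raw), ("escaped", escaped), ("data URI", data_uri_body)]:
    print(name, has_non_ascii(data))

# Decoding the UTF-8 bytes with a legacy default gives the familiar mojibake.
print(raw.decode("windows-1252"))
```

The percent-decoded data URI body is byte-for-byte the same as the first
page, while the entity-escaped page gives a byte-level detector nothing to
work with, even though all three render the same title once decoded
correctly.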

> But trying to gain more statistical confidence
> about UTF-8ness than that would be bad for performance (either due to
> stalling stream processing or due to reloading).

So here you say that it is better to start to present early, and
eventually reload [I think] if during the presentation the encoding
choice shows itself to be wrong, than it would be to investigate too
much and be absolutely certain before starting to present the page.

Later, at Jan 3 00:50:26 PST 2012, you added:

> And it's worth noting that the above paragraph states a "solution" to
> the problem that is: "How to make it possible to use UTF-8 without
> declaring it?"

Indeed.

> Adding autodetection wouldn't actually force authors to use UTF-8, so
> the problem Faruk stated at the start of the thread (authors not using
> UTF-8 throughout systems that process user input) wouldn't be solved.

If we take that logic to its end, then it would not make sense for the
validator to display an error when a page contains a form without being
UTF-8 encoded, either. Because, after all, the backend/whatever could
be non-UTF-8 based. The only way to solve that problem on those
systems, would be to send form content as character entities. (However,
then too the form based page should still be UTF-8 in the first place,
in order to be able to take any content.)

[ Original letter continued: ]

>> Apart from UTF-16, Chrome seems quite aggressive w.r.t. encoding
>> detection. So it might still be a competitive advantage.
>
> It would be interesting to know what exactly Chrome does. Maybe
> someone who knows the code could enlighten us?

+1 (But their approach looks similar to the 'border on sane' approach
you presented. Except that they seek to detect also non-UTF-8.)
--
Leif Halvard Silli

Henri Sivonen

Apr 3, 2012, 7:59:25 AM

to wha...@whatwg.org

On Wed, Jan 4, 2012 at 12:34 AM, Leif Halvard Silli
<xn--mlf...@xn--mlform-iua.no> wrote:
>> I mean the performance impact of reloading the page or,
>> alternatively, the loss of incremental rendering.)
>>
>> A solution that would border on reasonable would be decoding as
>> US-ASCII up to the first non-ASCII byte
>
> Thus possibly prescan of more than 1024 bytes?

I didn't mean a prescan. I meant proceeding with the real parse and
switching decoders in midstream. This would have the complication of
also having to change the encoding the document object reports to
JavaScript in some cases.
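
A sketch of what switching decoders in midstream could look like, with no
prescan and no reload. The class name, its attributes and the "windows-1252"
fallback are hypothetical; the encoding attribute stands in for the value the
document object would report, which is why it can change after parsing has
already started.

```python
import codecs


def _starts_valid_utf8(window: bytes) -> bool:
    """True if window begins with one well-formed UTF-8 multi-byte sequence."""
    lead = window[0]
    n = 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    try:
        window[:n].decode("utf-8")  # strict: bad lead bytes fail here too
        return True
    except UnicodeDecodeError:
        return False


class SwitchingDecoder:
    """Decode as ASCII until the first non-ASCII byte, then commit."""

    def __init__(self, legacy: str = "windows-1252"):
        self.legacy = legacy
        self.encoding = None   # undecided while only ASCII has been seen
        self._decoder = None
        self._pending = b""    # bytes held back while waiting for lookahead

    def feed(self, chunk: bytes, final: bool = False) -> str:
        if self._decoder is not None:
            return self._decoder.decode(chunk, final)
        data = self._pending + chunk
        self._pending = b""
        pos = next((i for i, b in enumerate(data) if b >= 0x80), None)
        if pos is None:
            return data.decode("ascii")
        if len(data) - pos < 4 and not final:
            # Not enough lookahead yet: emit the ASCII prefix, keep the rest.
            self._pending = data[pos:]
            return data[:pos].decode("ascii")
        valid = _starts_valid_utf8(data[pos:pos + 4])
        self.encoding = "utf-8" if valid else self.legacy
        self._decoder = codecs.getincrementaldecoder(self.encoding)(errors="replace")
        return data[:pos].decode("ascii") + self._decoder.decode(data[pos:], final)
```

Only the few bytes around the first non-ASCII sequence are ever held back,
so incremental output of the ASCII prefix is not stalled.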

>> and then deciding between
>> UTF-8 and the locale-specific legacy encoding by examining the first
>> non-ASCII byte and up to 3 bytes after it to see if they form a valid
>> UTF-8 byte sequence.
>
> Except for the specifics, that sounds like more or less the idea I
> tried to state. Maybe it could be made into a bug in Mozilla?

It's not clear that this is actually worth implementing or spending
time on at this stage.

> However, there is one thing that should be added: The parser should
> default to UTF-8 even if it does not detect any UTF-8-ish non-ASCII.

That would break form submissions.

>> But trying to gain more statistical confidence
>> about UTF-8ness than that would be bad for performance (either due to
>> stalling stream processing or due to reloading).
>
> So here you say that it is better to start to present early, and
> eventually reload [I think] if during the presentation the encoding
> choice shows itself to be wrong, than it would be to investigate too
> much and be absolutely certain before starting to present the page.

I didn't intend to suggest reloading.

>> Adding autodetection wouldn't actually force authors to use UTF-8, so
>> the problem Faruk stated at the start of the thread (authors not using
>> UTF-8 throughout systems that process user input) wouldn't be solved.
>
> If we take that logic to its end, then it would not make sense for the
> validator to display an error when a page contains a form without being
> UTF-8 encoded, either. Because, after all, the backend/whatever could
> be non-UTF-8 based. The only way to solve that problem on those
> systems, would be to send form content as character entities. (However,
> then too the form based page should still be UTF-8 in the first place,
> in order to be able to take any content.)

Presumably, when an author reacts to an error message, (s)he not only
fixes the page but also the back end. When a browser makes encoding
guesses, it obviously cannot fix the back end.

> [ Original letter continued: ]
>>> Apart from UTF-16, Chrome seems quite aggressive w.r.t. encoding
>>> detection. So it might still be a competitive advantage.
>>
>> It would be interesting to know what exactly Chrome does. Maybe
>> someone who knows the code could enlighten us?
>
> +1 (But their approach looks similar to the 'border on sane' approach
> you presented. Except that they seek to detect also non-UTF-8.)

I'm slightly disappointed but not surprised that this thread hasn't
gained a message explaining what Chrome does exactly.

Anne van Kesteren

Apr 3, 2012, 3:08:41 PM

to wha...@whatwg.org, Henri Sivonen

On Tue, 03 Apr 2012 13:59:25 +0200, Henri Sivonen <hsiv...@iki.fi> wrote:
> On Wed, Jan 4, 2012 at 12:34 AM, Leif Halvard Silli
> <xn--mlf...@xn--mlform-iua.no> wrote:
>>> A solution that would border on reasonable would be decoding as
>>> US-ASCII up to the first non-ASCII byte
>>
>> Thus possibly prescan of more than 1024 bytes?
>
> I didn't mean a prescan. I meant proceeding with the real parse and
> switching decoders in midstream. This would have the complication of
> also having to change the encoding the document object reports to
> JavaScript in some cases.

On IRC (#whatwg) zcorpan pointed out this would break URLs where entities
are used to encode non-ASCII code points in the query component.
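
A small illustration of why this matters, assuming the usual behavior where
non-ASCII characters in the query component are percent-encoded using the
document's encoding. The href value and the encode_query helper are invented
for this sketch.

```python
import html
from urllib.parse import quote

# ASCII-only markup: the non-ASCII code points are written as entities.
href = "search?q=&aelig;&oslash;&aring;"
url = html.unescape(href)  # "search?q=æøå" once the entities are resolved


def encode_query(url: str, document_encoding: str) -> str:
    """Percent-encode the query component using the document's encoding."""
    path, _, query = url.partition("?")
    return path + "?" + quote(query, safe="=&", encoding=document_encoding)


print(encode_query(url, "windows-1252"))  # search?q=%E6%F8%E5
print(encode_query(url, "utf-8"))         # search?q=%C3%A6%C3%B8%C3%A5
```

The page bytes are identical in both cases, so a server that expects the
Windows-1252 form would receive different bytes if the document's encoding
ended up being UTF-8.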