Review of changes to Web compat-sensitive prefs in localizations

Henri Sivonen

Feb 22, 2013, 9:37:32 AM

to dev-platform

I've been finding and, to a lesser extent, reporting and writing
patches for bugs where a localization sets the fallback encoding to a
value that doesn't suit the purpose of the fallback. In some cases,
there is such bogosity in the intl.properties file (e.g. translation of
the word "windows" as part of a charset label) that I suspect that
changes to intl.properties have been landing without review.

I propose we adopt a rule that says that localizations need review
from the HTML parser module owner (i.e. me) to change the values of
preferences that modify the behavior of the HTML parser. (In practice,
this means the localizable properties intl.charset.default and
intl.charset.detector.)

Opinions?

--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/

Axel Hecht

Feb 22, 2013, 10:26:17 AM

I don't think that .platform is the right group to discuss policies for
l10n, tbh.

Anyway, I don't think that it requires your review. For one, these rules
just don't work in practice. We're facing the very same problem with
search engines. There's just no other way than post-mortem work. That's
one of the reasons why we're not taking arbitrary changesets to ship to
any audience beyond aurora and nightly; for beta and release, we've got
to have technical checks in place.

I usually catch regressions to intl.properties when reviewing requests
for updates to those changesets.

That said, I don't know what intl.charset.detector should be set to,
aside from nothing. Looking at your patch, the comment doesn't make that
any clearer either; I'll follow up there.

Axel

Henri Sivonen

Feb 22, 2013, 12:41:13 PM

to dev-platform

On Feb 22, 2013 5:30 PM, "Axel Hecht" <l1...@mozilla.com> wrote:
> There's just no other way than post-mortem work. That's one of the
> reasons why we're not taking arbitrary changesets to ship to any audience
> beyond aurora and nightly, for beta and release, we got to have technical
> checks in place.

Where should I file bugs to add checks to this set of checks?

Axel Hecht

Feb 22, 2013, 1:03:48 PM

Not sure which checks you're talking about, so I can't really tell what
you want.

Axel

L. David Baron

Feb 22, 2013, 2:02:41 PM

to Henri Sivonen, dev-platform

On Friday 2013-02-22 16:37 +0200, Henri Sivonen wrote:
> I've been finding and, to a lesser extent, reporting and writing
> patches for bugs where a localization sets the fallback encoding to a
> value that doesn't suit the purpose of the fallback. In some cases,
> there is such bogosity in the intl.properties file (e.g. translation of
> the word "windows" as part of a charset label) that I suspect that
> changes to intl.properties have been landing without review.

It might not be a bad idea to have a better explanation in
http://mxr.mozilla.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/intl.properties
of why one would want to change intl.charset.default and
intl.charset.detector, explaining clearly that they should only be
set to "interesting" values to deal with a substantial body of
legacy content that requires those values, and then saying what they
should be in the absence of such legacy content (the latter should
clearly be empty; I'm not sure whether the former should be UTF-8 or
ISO-8859-1, but we should have a consistent policy).

That said, I don't actually know whether the tools localizers use to
do localization lead them to read the text.

The reality is that I suspect it may be important for you to do
occasional audits of these values; it could also be valuable to have
a tool that exposes all of them in a single place (perhaps even a
place with history, like an automatically-generated wiki page).
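
A minimal sketch of such an audit tool, under stated assumptions: it assumes one directory per locale under a common root and that each locale keeps the file at `toolkit/chrome/global/intl.properties`. Both the layout and the function names are illustrative, not taken from an existing Mozilla tool, and the parser handles only simple `key=value` lines.

```python
from pathlib import Path

# The two localizable prefs discussed in this thread.
PREFS = ("intl.charset.default", "intl.charset.detector")


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def collect_prefs(l10n_root,
                  relative_path="toolkit/chrome/global/intl.properties"):
    """Return {locale: {pref: value}} for each locale directory found.

    The relative path is an assumption about the repository layout.
    """
    table = {}
    for locale_dir in sorted(Path(l10n_root).iterdir()):
        props_file = locale_dir / relative_path
        if not props_file.is_file():
            continue
        props = parse_properties(props_file.read_text(encoding="utf-8"))
        table[locale_dir.name] = {pref: props.get(pref, "") for pref in PREFS}
    return table


def format_table(table):
    """Render the collected values as tab-separated text, one locale per line."""
    lines = ["locale\t" + "\t".join(PREFS)]
    for locale, prefs in table.items():
        lines.append(locale + "\t" + "\t".join(prefs[p] for p in PREFS))
    return "\n".join(lines)
```

Committing the output of `format_table(collect_prefs(root))` to a repository or publishing it on a generated page would give the history mentioned above.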

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla http://www.mozilla.org/ 𝄂

Axel Hecht

Feb 22, 2013, 2:12:33 PM

Henri filed https://bugzilla.mozilla.org/show_bug.cgi?id=844042 before
posting here (or at least around the same time).

Axel

Henri Sivonen

Feb 27, 2013, 3:30:16 AM

to dev-platform

I meant checks like flagging attempts to go to beta with either of the
following:
* Detector pref not being blank except for a specific white list of
particular values for the ru, uk, ja, ja-JP-Mac and zh-TW locales.
* Fallback charset set to UTF-8 in any locale that doesn't already
have it set to UTF-8.
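
The two checks above can be sketched as follows. This is an illustration only: the function name is hypothetical, the pref dictionaries are assumed to be already parsed from the old and new intl.properties, and the whitelist of permitted detector values is passed in because the message names the locales but not the values.

```python
def flag_for_beta(locale, old_props, new_props, allowed_detectors):
    """Return a list of problems that should block going to beta.

    old_props / new_props: dicts of pref name -> value for this locale.
    allowed_detectors: dict of locale -> set of permitted detector values
    (the real values are not given in this thread).
    """
    problems = []

    # Check 1: the detector pref must be blank unless whitelisted.
    detector = new_props.get("intl.charset.detector", "").strip()
    if detector and detector not in allowed_detectors.get(locale, set()):
        problems.append(
            f"{locale}: detector {detector!r} is not blank or whitelisted")

    # Check 2: the fallback must not newly become UTF-8.
    old_fallback = old_props.get("intl.charset.default", "").strip().lower()
    new_fallback = new_props.get("intl.charset.default", "").strip().lower()
    if new_fallback == "utf-8" and old_fallback != "utf-8":
        problems.append(f"{locale}: fallback charset changed to UTF-8")

    return problems
```

An empty list means neither check fired; anything else would be surfaced to whoever signs off on the changeset.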

Axel Hecht

Feb 27, 2013, 7:28:43 AM

I'm doing a source-based review, which at least catches regressions to
those settings.

And I think we're doing charset detector settings wrong. Let me see if I
understand correctly what we're doing:

- most content should be labeled for charset
- if not, let's see if we can guess the encoding
-- if we assume the language of the content, we can guess better
-- many languages really only have one option
-- ru, uk, ja, zh-TW do have options, so we use a charset detector

Now, I don't think it's right to use the UI language to guess content
language. We have a list of user-preferred languages (with good defaults
based on UI language). We should go through that list, and pick charsets
to try for unlabeled content from there.

That's rather orthogonal to what you're currently trying to do, but it's
also indicating to me that we should remove all of those settings from
intl.properties, and just leave accept-lang, and deduce the rest.

You also mentioned in the bug that you didn't get the OK to use
telemetry to gather further data. I think if we just collect the data
about the charset optimization and how well it's doing, we should be OK.
I.e., at the point where the locale doesn't matter, but just cp-1252
etc., the entropy goes up a good deal. In particular for small locales.
I'd argue that this might even make sense to be part of the health report.

Axel

Anne van Kesteren

Aug 28, 2013, 7:12:21 AM

On Wednesday, February 27, 2013 12:28:43 PM UTC, Axel Hecht wrote:
> That's rather orthogonal to what you're currently trying to do, but it's
> also indicating to me that we should remove all of those settings from
> intl.properties, and just leave accept-lang, and deduce the rest.

So how about the parser just accepts a locale value and implements the locale-to-fallback encoding map? Given the numerous problems discovered[1], the fact that locale defaults are actually part of the HTML Standard, and that having the setting available as an option to change encourages people to tweak it, I think that would be a better way forward.

I wonder if there are similar settings that are in a sense too technical to leave up to localization teams.

[1] Recent issues discovered by hsivonen:
* https://bugzilla.mozilla.org/show_bug.cgi?id=910163
* https://bugzilla.mozilla.org/show_bug.cgi?id=910165
* https://bugzilla.mozilla.org/show_bug.cgi?id=910169 (bogus value, even)

Henri Sivonen

Aug 28, 2013, 8:19:35 AM

to Anne van Kesteren, dev-platform

On Wed, Aug 28, 2013 at 2:12 PM, Anne van Kesteren <ann...@annevk.nl> wrote:

> On Wednesday, February 27, 2013 12:28:43 PM UTC, Axel Hecht wrote:
> > That's rather orthogonal to what you're currently trying to do, but it's
> > also indicating to me that we should remove all of those settings from
> > intl.properties, and just leave accept-lang, and deduce the rest.
>
> So how about the parser just accepts a locale value and implements the
> locale-to-fallback encoding map?
>

Good idea. Bug filed: https://bugzilla.mozilla.org/show_bug.cgi?id=910192

As mentioned in the third paragraph of the bug description, a
non-localizable override pref is probably still needed so that the
user--not the localizer--can deal with situations like using an en-US build
in a non-windows-1252 context. (E.g. because you are a developer and always
use en-US builds despite being located in a non-windows-1252 country.)
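
A rough sketch of the idea, assuming a built-in locale-to-fallback map plus a non-localizable user override. The names are hypothetical and the map holds only a few illustrative entries; the authoritative table is the one in the HTML Standard, which should be consulted for actual values.

```python
# Illustrative subset only; not a copy of the HTML Standard's table.
FALLBACK_BY_LOCALE = {
    "ru": "windows-1251",
    "uk": "windows-1251",
    "ja": "Shift_JIS",
    "zh-TW": "Big5",
    "zh-CN": "gbk",
}
DEFAULT_FALLBACK = "windows-1252"


def fallback_encoding(locale, user_override=""):
    """Pick the fallback encoding for a build locale.

    user_override models the non-localizable pref: when the user has set
    it, it wins over anything derived from the locale.
    """
    if user_override:
        return user_override
    if locale in FALLBACK_BY_LOCALE:
        return FALLBACK_BY_LOCALE[locale]
    # Fall back to the bare language subtag, e.g. "ja-JP-Mac" -> "ja".
    language = locale.split("-")[0]
    return FALLBACK_BY_LOCALE.get(language, DEFAULT_FALLBACK)
```

With this shape, a developer running an en-US build in a windows-1251 environment would set only the override, and localizers would have nothing to edit.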

--
Henri Sivonen
hsiv...@hsivonen.fi
http://hsivonen.iki.fi/

Axel Hecht

Aug 28, 2013, 8:20:05 AM

On 8/28/13 1:12 PM, Anne van Kesteren wrote:
> On Wednesday, February 27, 2013 12:28:43 PM UTC, Axel Hecht wrote:
>> That's rather orthogonal to what you're currently trying to do, but it's
>> also indicating to me that we should remove all of those settings from
>> intl.properties, and just leave accept-lang, and deduce the rest.
> So how about the parser just accepts a locale value and implements the locale-to-fallback encoding map? Given the numerous problems discovered[1], locale-defaults actually being part of the HTML Standard, and it being available as option to change encourages people to tweak it, I think that would be a better way forward.

I don't think that 'a locale value' is correct. We should use content
languages and not UI language. But from the list of preferred content
languages, we can help the parser. It is a bit more tricky in general
than we have right now, as for some users, we'll end up with mismatches
between the fallback encodings. We could just use the first language for
which we have one, though. At least as first step.
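
As a sketch of that first step (an illustration, not existing code): walk the preferred content languages in order and take the fallback of the first one that has an entry. The function name is hypothetical, the input is assumed to be a comma-separated list like the accept-lang value, and the map is supplied by the caller with lowercase keys.

```python
def fallback_from_accept_languages(accept_languages, fallback_by_language,
                                   default="windows-1252"):
    """Return the fallback for the first preferred language that has one.

    accept_languages: comma-separated tags, e.g. "ru-RU, ru, en-US, en".
    fallback_by_language: dict with lowercase language tags as keys.
    """
    for item in accept_languages.split(","):
        # Drop any ";q=..." weight and normalize case.
        tag = item.split(";")[0].strip().lower()
        if not tag:
            continue
        if tag in fallback_by_language:
            return fallback_by_language[tag]
        # Try the primary language subtag, e.g. "ru-ru" -> "ru".
        primary = tag.split("-")[0]
        if primary in fallback_by_language:
            return fallback_by_language[primary]
    return default
```

Later languages in the list are ignored once a match is found, which is exactly where the mismatches mentioned above would come from.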

I don't know which locale defaults are part of the HTML spec. Before I
read it all, can you elaborate?

>
> I wonder if there are similar settings that are in a sense too technical to leave up to localization teams.

We have a few. We're trying to set them up these days such that garbage
values mean en-US default, and provide patches and edits for the others.

Axel

Henri Sivonen

Aug 28, 2013, 8:33:45 AM

to Axel Hecht, dev-platform

On Wed, Aug 28, 2013 at 3:20 PM, Axel Hecht <l1...@mozilla.com> wrote:

> On 8/28/13 1:12 PM, Anne van Kesteren wrote:
>
>> On Wednesday, February 27, 2013 12:28:43 PM UTC, Axel Hecht wrote:
>>
>>> That's rather orthogonal to what you're currently trying to do, but it's
>>> also indicating to me that we should remove all of those settings from
>>> intl.properties, and just leave accept-lang, and deduce the rest.
>>>
>> So how about the parser just accepts a locale value and implements the
>> locale-to-fallback encoding map? Given the numerous problems discovered[1],
>> locale-defaults actually being part of the HTML Standard, and it being
>> available as option to change encourages people to tweak it, I think that
>> would be a better way forward.
>>
> I don't think that 'a locale value' is correct.

It's not, logically, but it's what we and other browsers currently use in
the absence of a better solution. Moving to what Anne suggested plus my
elaboration would not make us worse off compared to the status quo.

> We should use content languages and not UI language. But from the list of
> preferred content languages, we can help the parser.

I'm not at all fond of the idea of letting *that* obscure piece of
configurability have parser behavior implications.

If we want to use inputs to the guessing other than the inputs we are using
today, that's a research project and not a bug fix project. If I were
starting such a research project, I'd start by testing hypotheses about TLD
correlation with legacy encodings. The first thing I'd like to test would
be whether it would be an improvement to make builds that have Traditional
Chinese as the UI language use gbk (as opposed to big5) as the fallback
encoding when browsing content loaded from a .cn domain.

> It is a bit more tricky in general than we have right now, as for some
> users, we'll end up with mismatches between the fallback encodings. We
> could just use the first language for which we have one, though. At least
> as first step.
>

I'd rather not block solving the problem raised in this thread on research
about how well novel inputs to the guessing process would work.

> I don't know which locale-defaults are part of the html spec, before I
> read it all, can you elaborate?
>

See the table under step 9 of
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

Henri Sivonen

Aug 28, 2013, 8:46:27 AM

to dev-platform

On Wed, Aug 28, 2013 at 3:33 PM, Henri Sivonen <hsiv...@hsivonen.fi> wrote:

> If I were starting such a research project, I'd start by testing
> hypotheses about TLD correlation with legacy encodings. The first thing I'd
> like to test would be whether it would be an improvement to make builds
> that have Traditional Chinese as the UI language use gbk (as opposed to
> big5) as the fallback encoding when browsing content loaded from a .cn
> domain.
>

To elaborate, we could first have a lookup table from country TLDs to
legacy encodings and then, only as a second step, use the lookup from
the UI localization to legacy encodings for TLDs that don't have a strong
country affiliation. So for example, we'd map .cn to gbk, .tw to big5, .ru
to windows-1251 and .de, .fr, .se, .nl, .fi etc. to windows-1252, but for
.com, .org and such we'd base the guess on the UI locale like today but
using a less brittle way of managing the mapping.
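
The two-step lookup can be sketched like this. The TLD entries are the examples given above; the function name and the way the UI-locale fallback is passed in are assumptions made for illustration.

```python
# Country TLDs with a strong affiliation to one legacy encoding
# (the examples from the message; not an exhaustive table).
TLD_FALLBACK = {
    "cn": "gbk",
    "tw": "big5",
    "ru": "windows-1251",
    "de": "windows-1252",
    "fr": "windows-1252",
    "se": "windows-1252",
    "nl": "windows-1252",
    "fi": "windows-1252",
}


def guess_fallback(hostname, ui_locale_fallback):
    """Prefer the TLD-based guess; otherwise use the UI-locale fallback.

    Generic TLDs such as .com or .org are absent from the table, so they
    fall through to the locale-based guess.
    """
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return TLD_FALLBACK.get(tld, ui_locale_fallback)
```

For a Traditional Chinese build, whose locale-based fallback is big5, `guess_fallback("example.cn", "big5")` yields gbk while `guess_fallback("example.com", "big5")` stays with big5.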

But anyway, that would be improving the guessing instead of just fixing how
the current guessing mechanism is managed. I don't want better to be a
blocker for good here.

Henri Sivonen

Aug 28, 2013, 9:07:17 AM

to dev-platform

On Wed, Aug 28, 2013 at 3:46 PM, Henri Sivonen <hsiv...@hsivonen.fi> wrote:

> On Wed, Aug 28, 2013 at 3:33 PM, Henri Sivonen <hsiv...@hsivonen.fi> wrote:
>
>> If I were starting such a research project, I'd start by testing
>> hypotheses about TLD correlation with legacy encodings. The first thing I'd
>> like to test would be whether it would be an improvement to make builds
>> that have Traditional Chinese as the UI language use gbk (as opposed to
>> big5) as the fallback encoding when browsing content loaded from a .cn
>> domain.
>>
>
> To elaborate, we could first have a lookup table from country TLDs to
> legacy encodings and then only as a second step would use the lookup from
> the UI localization to legacy encodings for TLDs that don't have a strong
> country affiliation. So for example, we'd map .cn to gbk, .tw to big5, .ru
> to windows-1251 and .de, .fr, .se, .nl, .fi etc. to windows-1252, but for
> .com, .org and such we'd base the guess on the UI locale like today but
> using a less brittle way of managing the mapping.
>

Filed as: https://bugzilla.mozilla.org/show_bug.cgi?id=910211