new RFCs on language tags

L. David Baron

Sep 21, 2006, 4:47:41 PM
to dev-...@lists.mozilla.org
For what it's worth, there are two new RFCs on language tags (things
like fr, en-US, etc.). The main one of interest is:
http://www.rfc-editor.org/rfc/rfc4646.txt

There's also a new registry of language tags:
http://www.iana.org/assignments/language-subtag-registry
This document has sections for languages, scripts, regions, and
variants, which are the important parts, and also for grandfathered and
redundant tags.

Further information is at:
http://www.w3.org/International/articles/language-tags/
http://www.w3.org/International/questions/qa-lang-2or3

I think it would be good if, at least for new localizations, we used
language tags that were described by this spec and registry as much as
possible.


For what it's worth, existing localizations that I'd think would be
named differently if we were doing things this way (no idea whether
people would care enough to rename these) would be (although I don't
necessarily have a clue what I'm saying here):

zh-TW -> zh-Hant
zh-CN -> zh-Hans
ja-JP-mac -> ja-x-mac
es-AR -> es-419 (419 is the region code for Latin America and the
Caribbean)
pa-IN -> pa-Guru or just pa, since Gurmukhi is listed as the default
script in the registry (unless there's some reason it's also specific
to India in addition to being specific to the Gurmukhi script)
mn -> mn-Cyrl

(I don't know enough to say whether certain existing localizations
should or shouldn't have country codes, although there are some that
have country codes that seem unnecessary to me, such as nb-NO and
nn-NO.)

-David

--
L. David Baron <URL: http://dbaron.org/ >
Technical Lead, Layout & CSS, Mozilla Corporation

Simon Montagu

Sep 21, 2006, 6:15:47 PM
L. David Baron wrote:
> For what it's worth, existing localizations that I'd think would be
> named differently if we were doing things this way (no idea whether
> people would care enough to rename these) would be (although I don't
> necessarily have a clue what I'm saying here):
>
> zh-TW -> zh-Hant
> zh-CN -> zh-Hans
> ja-JP-mac -> ja-x-mac
> es-AR -> es-419 (419 is the region code for Latin America and the
> Caribbean)
> pa-IN -> pa-Guru or just pa, since Gurmukhi is listed as the default
> script in the registry (unless there's some reason it's also specific
> to India in addition to being specific to the Gurmukhi script)
> mn -> mn-Cyrl

pa-Guru should not be used acc. to RFC 4646. Gurmukhi is listed as
Suppress-Script, which is not quite the same as a default. Apart from
that, I agree with what you say.
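
The Suppress-Script point can be sketched in Python. This is only a rough illustration: the small table below is a hand-copied, hypothetical excerpt (the real values should be read from the IANA registry), and the function name is made up for this example.

```python
# Hypothetical excerpt of Suppress-Script data; the authoritative values
# live in the IANA language subtag registry and should be read from there.
SUPPRESS_SCRIPT = {
    "en": "Latn",  # the en-Latn case: Latin script is assumed for English
    "pa": "Guru",  # Punjabi, as discussed above
}


def drop_suppressed_script(tag: str) -> str:
    """Remove the script subtag when it equals the language's Suppress-Script."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        language = subtags[0].lower()
        script = subtags[1].title()  # script subtags are conventionally title case
        if SUPPRESS_SCRIPT.get(language) == script:
            return "-".join([subtags[0]] + subtags[2:])
    return tag
```

With this table, pa-Guru collapses to pa, while zh-Hant and mn-Cyrl are left alone because no Suppress-Script entry is assumed for those languages.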

Axel Hecht

Sep 21, 2006, 6:39:33 PM

I kinda doubt es-AR vs es-419. Regarding zh-Han[st], is there a
reference for that?

I'm a tad too tired to read the spec fully to reverse engineer those,
but I don't see us following that lightly.

http://www.rfc-editor.org/rfc/rfc4646.txt, 2.1. Syntax says:


   region = 2ALPHA   ; ISO 3166 code
          / 3DIGIT   ; UN M.49 code

Sounds to me like anything we did should still be OK. I kinda read
2.2.4. to suggest that if there is a 3166 code, one must use it.
Though I get totally confuzzled right after that.
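
As a rough illustration of the region production quoted above, here is a minimal Python sketch. It assumes only the two alternatives shown (two letters or three digits) and checks syntax only, not whether the code is actually registered; the function name is made up for this example.

```python
import re

# The two alternatives of the "region" production: 2ALPHA / 3DIGIT.
ISO_3166_REGION = re.compile(r"[A-Za-z]{2}")
UN_M49_REGION = re.compile(r"[0-9]{3}")


def region_kind(subtag: str) -> str:
    """Say which alternative of the region production a subtag matches."""
    if ISO_3166_REGION.fullmatch(subtag):
        return "ISO 3166"
    if UN_M49_REGION.fullmatch(subtag):
        return "UN M.49"
    return "not a region"
```

So both the AR in es-AR and the 419 in es-419 are syntactically valid regions; the grammar alone does not say which one to prefer.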

Axel

Simon Montagu

Sep 21, 2006, 7:01:15 PM
Axel Hecht wrote:
> I kinda doubt es-AR vs es-419. Regarding zh-Han[st], is there a
> reference for that?
>
> I'm a tad too tired to read the spec fully to reverse engineer those,
> but I don't see us following that lightly.
>
> http://www.rfc-editor.org/rfc/rfc4646.txt, 2.1. Syntax says:
>
>
>    region = 2ALPHA   ; ISO 3166 code
>           / 3DIGIT   ; UN M.49 code
>
> Sounds to me like anything we did should still be OK.

If what we did conformed to RFC 3066, then it conforms to RFC 4646, by
definition. The question is more whether we could do some things better
now. (In my mind, that is. I'm not trying to put words in dbaron's mouth)

As for es-AR vs. es-419, it depends what that localization is intended
for. If it's specific to Argentina, it should remain es-AR. If it's
intended as general Latin American Spanish, es-419 is a more accurate
description.

I'm not sure what you're asking about Hans and Hant. Does this help?
(from
http://www.w3.org/International/articles/language-tags/Overview.en.php)

"Although for common uses of language tags it is not likely that you
will need to specify the script, there are one or two situations that
have been crying out for it for some time. One such example is Chinese.
There are many Chinese dialects, often mutually unintelligible, but
these dialects are all written using either Simplified or Traditional
Chinese script. People typically want to label Chinese text as either
Simplified or Traditional, but until recently there was no way to do so.
People had to bend something like zh-CN (meaning Chinese as spoken in
China) to mean Simplified Chinese, even in Singapore, and zh-TW (meaning
Chinese as spoken in Taiwan) for Traditional Chinese. Some people,
however, use zh-HK for Traditional Chinese. The availability of zh-Hans
and zh-Hant for Chinese written in Simplified and Traditional scripts
should improve consistency and accuracy, and is already becoming widely
used."

Marek Stępień

Sep 21, 2006, 8:02:23 PM
Simon Montagu wrote:

> As for es-AR vs. es-419, it depends what that localization is intended
> for. If it's specific to Argentina, it should remain es-AR. If it's
> intended as general Latin American Spanish, es-419 is a more accurate
> description.

Well, I'm no expert in Spanish, but es-AR contains Argentinian search
plugins for Yahoo and eBay (Mercado Libre) and an Argentinian RSS Feed.

So, it's really "AR", not 419. I don't even know why we call it "Spanish
(Latin America)" on the downloads page...

--
Marek Stępień <mar...@aviary.pl>
AviaryPL - Polish Mozilla localization team
http://www.firefox.pl/ | http://www.mozilla.org.pl/

Mark Tyndall

Sep 22, 2006, 4:38:49 AM
L. David Baron wrote:
>
[...]

> For what it's worth, existing localizations that I'd think would be
> named differently if we were doing things this way (no idea whether
> people would care enough to rename these) would be (although I don't
> necessarily have a clue what I'm saying here):
>
[...]
> mn -> mn-Cyrl

Section 4.1 says that language tags SHOULD "Use as precise a tag as
possible, but no more specific than is justified".

The example given is en-Latn - it notes that Latn is unnecessary since
English is primarily written in a Latin script: so (IMO) for Mongolian,
mn-Cyrl would only be appropriate if the use of Cyrillic script was in
some way unusual.

regards,
Mark..

--
British English localisations of:
SeaMonkey <http://www.tyndall.org.uk/moz_en-gb.html>
Firefox <http://www.tyndall.org.uk/fb_en-gb.html>
Thunderbird <http://www.tyndall.org.uk/tb_en-gb.html>

Simon Montagu

Sep 22, 2006, 5:05:04 AM
Mark Tyndall wrote:
> L. David Baron wrote:
>>
> [...]
>> For what it's worth, existing localizations that I'd think would be
>> named differently if we were doing things this way (no idea whether
>> people would care enough to rename these) would be (although I don't
>> necessarily have a clue what I'm saying here):
>>
> [...]
>> mn -> mn-Cyrl
>
> Section 4.1 says that language tags SHOULD "Use as precise a tag as
> possible, but no more specific than is justified".
>
> The example given is en-Latn - it notes that Latn is unnecessary since
> English is primarily written in a Latin script: so (IMO) for Mongolian,
> mn-Cyrl would only be appropriate if the use of Cyrillic script was in
> some way unusual.

I think you're probably right, although it's less clear-cut than the
case of en-Latn. Mongolian can be written either in Cyrillic (mn-Cyrl)
or Mongolian script (mn-Mong), but
http://www.omniglot.com/writing/mongolian.htm says "The average person
in Mongolia knows little or nothing about the Classical Mongol script,
though there is high literacy in Cyrillic."

Axel Hecht

Sep 22, 2006, 5:06:24 AM
Simon Montagu wrote:
> Axel Hecht wrote:
>> I kinda doubt es-AR vs es-419. Regarding zh-Han[st], is there a
>> reference for that?
>>
>> I'm a tad too tired to read the spec fully to reverse engineer those,
>> but I don't see us following that lightly.
>>
>> http://www.rfc-editor.org/rfc/rfc4646.txt, 2.1. Syntax says:
>>
>>
>>    region = 2ALPHA   ; ISO 3166 code
>>           / 3DIGIT   ; UN M.49 code
>>
>> Sounds to me like anything we did should still be OK.
>
> If what we did conformed to RFC 3066, then it conforms to RFC 4646, by
> definition. The question is more whether we could do some things better
> now. (In my mind, that is. I'm not trying to put words in dbaron's mouth)
>
> As for es-AR vs. es-419, it depends what that localization is intended
> for. If it's specific to Argentina, it should remain es-AR. If it's
> intended as general Latin American Spanish, es-419 is a more accurate
> description.

Looking at
http://wiki.mozilla.org/L10n:Localization_Teams#Teams_with_no_locales_in_bugzilla_or_CVS,
there are requests for Chile and Mexico. Those aren't really alive and
kicking, and I haven't invested in asking in-depth questions on the
linguistic differences, though.

In general, I don't want us to use numbers in language codes. I don't
really care if 419 is an accurate code for Latin America, to me, it's
just cryptic and will make life tricky. I'd rather stick in a region
that's roughly right and gives me a clue on whether a search engine or
feed makes sense than to have a perfect number that I have to look up
each time I or Mic have to work on it.

> I'm not sure what you're asking about Hans and Hant. Does this help?
> (from
> http://www.w3.org/International/articles/language-tags/Overview.en.php)
>
> "Although for common uses of language tags it is not likely that you
> will need to specify the script, there are one or two situations that
> have been crying out for it for some time. One such example is Chinese.
> There are many Chinese dialects, often mutually unintelligible, but
> these dialects are all written using either Simplified or Traditional
> Chinese script. People typically want to label Chinese text as either
> Simplified or Traditional, but until recently there was no way to do so.
> People had to bend something like zh-CN (meaning Chinese as spoken in
> China) to mean Simplified Chinese, even in Singapore, and zh-TW (meaning
> Chinese as spoken in Taiwan) for Traditional Chinese. Some people,
> however, use zh-HK for Traditional Chinese. The availability of zh-Hans
> and zh-Hant for Chinese written in Simplified and Traditional scripts
> should improve consistency and accuracy, and is already becoming widely
> used."

Thanks, that quote helps.

In this particular case, I'd need some in-depth understanding and
feedback on the linguistic part. When I recall two Chinese friends of
mine chatting with each other (one from Taiwan, one from Hong Kong), the
script was the only thing they used to cover the linguistic differences.
Whether those would have mattered in a browser is a completely different
beast.
In addition, China is not only an issue of script, but also one of
freedom of choice. For example see
https://bugzilla.mozilla.org/show_bug.cgi?id=351096#c9.

I think that the numeric codes are really for language only, and we're
dealing with a mix of language and region, so I second what Marek says.

And technically, renaming locales ain't a piece of cake either, so I'd
rather not do that.

Bottom line: Renames, nah, not really. Numeric region codes? Too little
value practically, too high cost in the process of product/business
development. (Not revenue, to clarify, just process costs of digging up
numbers for regions.)

Axel

L. David Baron

Sep 22, 2006, 5:48:05 PM
Marek Stępień wrote:
> Simon Montagu wrote:
>> As for es-AR vs. es-419, it depends what that localization is intended
>> for. If it's specific to Argentina, it should remain es-AR. If it's
>> intended as general Latin American Spanish, es-419 is a more accurate
>> description.
>
> Well, I'm no expert in Spanish, but es-AR contains Argentinian search
> plugins for Yahoo and eBay (Mercado Libre) and an Argentinian RSS Feed.
>
> So, it's really "AR", not 419. I don't even know why we call it "Spanish
> (Latin America)" on the downloads page...

The download page is listing builds by language. Should we list them by
language and region instead? I don't think there are significant enough
linguistic differences to merit separate localizations for different
parts of Latin America. (I think if there were, volunteers probably
would have stepped up and done such localizations by now.)

I thought our localizations were based on languages, and then we were
including appropriate content for the region where the language is
spoken. If we're actually targeting regions, then I think getting
additional Latin American builds ought to be a priority.

L. David Baron

Sep 22, 2006, 5:58:00 PM
L. David Baron wrote:

> Marek Stępień wrote:
>> Well, I'm no expert in Spanish, but es-AR contains Argentinian search
>> plugins for Yahoo and eBay (Mercado Libre) and an Argentinian RSS Feed.
>>
>> So, it's really "AR", not 419. I don't even know why we call it "Spanish
>> (Latin America)" on the downloads page...

And to add one more thought:

Did we end up with Argentinian-specific content because we went through
the steps:

1. somebody wanted to write a Latin American Spanish localization
2. somebody (we or the localizers) had to figure out what to call it,
since there was no good way to represent the concept of Latin American
Spanish, and picked es-AR, perhaps because of where the localizers were from
3. when the localizers added regional content, they added
Argentinian-specific content because it seemed appropriate given the
name we were using for the build

or because we went through the steps:

1. somebody wanted to write an Argentinian Spanish localization, with
the intent that we would also have Mexican Spanish, Colombian Spanish, etc.

?

Marek Stępień

Sep 22, 2006, 6:22:02 PM
L. David Baron wrote:

> Did we end up with Argentinian-specific content because we went through
> the steps:
> 1. somebody wanted to write a Latin American Spanish localization
[...]

> or because we went through the steps:
> 1. somebody wanted to write an Argentinian Spanish localization, with
> the intent that we would also have Mexican Spanish, Colombian Spanish, etc.

Looking at this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=253423
I'd say the latter.

We're calling es-AR "Spanish, Argentina" here:
http://wiki.mozilla.org/L10n:Localization_Teams#Spanish.2C_Argentina_.28es-AR.29
and here:
https://bugzilla.mozilla.org/describecomponents.cgi?product=Mozilla%20Localizations

but "Spanish (Latin America)" on mozilla.com.

As the language itself probably doesn't differ much between Argentina
and, say, Chile, I think that ye olde Suite idea of having separate
language and content packs wasn't that bad after all.

pascal

Sep 22, 2006, 6:26:53 PM
L. David Baron wrote:

European Spanish and Latin American Spanish(es) diverge in several ways.
Argentinian is in my opinion the version that is most different from
es-ES but also from other Latin American versions of Spanish, with strong
differences in syntax.

These differences are very striking in informal speech, with a slightly
different way of ending verbs and of addressing people informally,
but Marcelo and Ricardo could tell you more about it. I would also add
that Argentinian has a nice pronunciation that makes it sound like
Italian ;)

Another difference is in technical vocabulary and this is really a
Europe/America difference. Latin America is influenced in its vocabulary
by US words while Spain is influenced by French words (I should add that
French is also influenced by Spanish words in other lexical fields).

For example, a Latin American would say computadora (computer) and
ciencias de la computacion (computing science) while a European will say
ordenador (ordinateur) and informatica (informatique).

Since the divergence in es-AR is mostly in informal speech, I guess that
for South Americans, formal Argentinian with computer words inherited
from US English is more familiar than European Spanish, but it is just a
guess.

AFAIK, there are projects for Chilean and Mexican versions of Firefox.

Pascal

Marek Stępień

Sep 22, 2006, 6:31:37 PM
pascal wrote:

> AFAIK, there are projects for Chilean and Mexican versions of Firefox.

http://www.firefox.cl/ links to es-AR builds. These guys were trying to
register es-CL, but it hasn't happened yet. (bugs: 285765, 294959).

pascal

Sep 22, 2006, 6:43:31 PM
Marek Stępień wrote:

> pascal wrote:
>> AFAIK, there are projects for Chilean and Mexican versions of Firefox.
>
> http://www.firefox.cl/ links to es-AR builds. These guys were trying to
> register es-CL, but it hasn't happened yet. (bugs: 285765, 294959).
>

Yes, I meant informal projects among Latin Americans to make more es-XX
builds, not registered projects in bugzilla :)

Pascal

Robert Kaiser

Sep 22, 2006, 6:50:22 PM
Axel Hecht wrote:

> Sounds to me like anything we did should still be OK.

Nothing we do is illegal now. There are just ways to do things better,
if we want. We have not even allowed all of RFC 3066 tags though, so the
question is if we want to allow more now.

As a reference, http://wiki.mozilla.org/L10n:Simple_locale_names
documents what locale names/tags we currently allow.

If we want to change that, we also likely need some changes in code
(esp. tinderbox or some other server-side stuff, AFAIK) about the
assumptions/regexes we have for locale names.
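
To make the regex concern concrete, here is a minimal Python sketch of a pattern that accepts a language, an optional script, an optional region (two letters or three digits) and an optional private-use part. It is a simplification made up for this example: it is not the pattern used by tinderbox or any other Mozilla tool, and it ignores variants, extensions and grandfathered tags.

```python
import re

# Simplified, hypothetical pattern for illustration only.
LOCALE_TAG = re.compile(
    r"""
    (?P<language>[A-Za-z]{2,3})                  # e.g. zh, es, ja
    (?:-(?P<script>[A-Za-z]{4}))?                # e.g. Hant, Hans, Cyrl
    (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?       # e.g. AR or 419
    (?:-x-(?P<private>[A-Za-z0-9]{1,8}
        (?:-[A-Za-z0-9]{1,8})*))?                # e.g. the mac in ja-x-mac
    """,
    re.VERBOSE,
)


def split_locale_tag(tag):
    """Return the named parts of a tag, or None if the pattern rejects it."""
    match = LOCALE_TAG.fullmatch(tag)
    if match is None:
        return None
    return {name: value for name, value in match.groupdict().items()
            if value is not None}
```

Under this simplified pattern, zh-Hant, es-419 and ja-x-mac are accepted, while ja-JP-mac is rejected because "mac" is not introduced by "x".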

Robert Kaiser

Axel Hecht

Sep 22, 2006, 6:59:37 PM

https://bugzilla.mozilla.org/show_bug.cgi?id=253423 mentions "I wanna do
it because I do it since Phoenix 0.4".

So, only Marcelo would know.

Axel

Axel Hecht

Sep 22, 2006, 7:01:06 PM

Which raises the question, why did we kill region packs? That story
predates my involvement in l10n, does anybody else know the pros and cons?

Axel

Robert Kaiser

Sep 23, 2006, 9:05:41 AM
Axel Hecht wrote:

> Which raises the question, why did we kill region packs?

To make the story short: Because no one shipped any region packs
independently of language packs, and no language shipped more than one
region pack.
To make the story even shorter: Because they weren't actively used.

Robert Kaiser

Benjamin Smedberg

Sep 24, 2006, 6:36:00 AM
Axel Hecht wrote:

> Which raises the question, why did we kill region packs? That story
> predates my involvement in l10n, does anybody else know the pros and cons?

It was difficult or impossible to separate out what actually belonged in the
region pack or in the language pack. The use case was frequently described
as shipping Canadian/English/American region packs for a single English
language pack, but since the spellings are different, it wasn't a practical
arrangement.

--BDS
