LOCALIZATION NOTE, DONT_TRANSLATE and more

Axel Hecht

Mar 23, 2006, 5:28:42 AM

Dwayne opened a bug [1] with the following initial comment:
<quote>
The current method of using DONT_TRANSLATE is very good at conveying
information to translators and it is standard enough to allow automated
tools to use the information.

I would like to propose that we do a similar thing for configuration
information in DTD and .properties files.

Eg: in a DTD

<!ENTITY test.label "true">

You could even have some sort of regex

<!--CONFIG (dialog.width) valid="[0-9]+em" comment="The width of the
save dialog"-->

Comments can follow the same format in .properties files.

With this information a tool could choose not to convert/display items
with CONFIG markers. Or it could display them, provide a useful
comment and check the values provided by the translator.

The advantage to Mozilla l10n vouchers is that they could simply run a
tool across the files and check to ensure that these CONFIG items are
compliant.
</quote>
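
As a rough illustration of the check proposed above, the sketch below
reads CONFIG markers from a DTD and tests each entity value against the
marker's regex. It assumes the marker is closed with --> and names the
entity in parentheses, as in the dialog.width example; the function
names are hypothetical and do not come from any existing tool.

```python
import re

# Assumed marker syntax, taken from the example in the proposal:
# <!--CONFIG (name) valid="regex" comment="text"-->
CONFIG_RE = re.compile(
    r'<!--\s*CONFIG\s*\((?P<name>[^)]+)\)\s*'
    r'valid="(?P<valid>[^"]*)"\s*'
    r'(?:comment="(?P<comment>[^"]*)")?\s*-->'
)
ENTITY_RE = re.compile(r'<!ENTITY\s+(?P<name>[\w.\-]+)\s+"(?P<value>[^"]*)">')


def parse_config_markers(dtd_text):
    """Return {entity name: (validation regex, comment)} for all CONFIG markers."""
    markers = {}
    for m in CONFIG_RE.finditer(dtd_text):
        # Collapse line wraps inside the comment into single spaces.
        comment = " ".join((m.group("comment") or "").split())
        markers[m.group("name").strip()] = (m.group("valid"), comment)
    return markers


def check_entities(dtd_text):
    """Return (name, value) pairs whose value does not match its CONFIG regex."""
    markers = parse_config_markers(dtd_text)
    failures = []
    for m in ENTITY_RE.finditer(dtd_text):
        name, value = m.group("name"), m.group("value")
        if name in markers and not re.fullmatch(markers[name][0], value):
            failures.append((name, value))
    return failures
```

A voucher could run check_entities over each localized DTD and review
only the entries it reports.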

I resolved that bug, as I don't see tree-wide coding style changes as a
matter of a single bug. While looking at it I found (quoting myself)

<quote>
The first step is to really formalize the localization notes.

Being compatible here would mean something like

I'm not sure if there is a historic difference between DONT_TRANSLATE
and "Do not translate ..."
</quote>

http://developer.mozilla.org/en/docs/XUL_Coding_Style_Guidelines doesn't
give any guidelines on how localization notes should look. I would like
us to gather some information on the formats we have out there, what
they try to convey and then find a common denominator and possibly
enhance that to make it clearer what values should be used by localizers.

Axel

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=331438

Dwayne Bailey

Mar 24, 2006, 6:27:47 AM

to dev-...@lists.mozilla.org

That would work, it stays close to the original with little new clutter.

> I'm not sure if there is a historic difference between DONT_TRANSLATE
> and "Do not translate ..."

We probably need to distinguish between 1) don't translate the item or
2) don't translate some text in the item:

eg:
1) "Mozilla Firefox"
2) "IMAP command LIST failed"

ie in 2) we want to specify that neither IMAP nor LIST should be
translated. While in 1) we don't want anything translated.
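
The two cases can be checked mechanically. The following is a minimal
sketch, assuming a tool already knows which terms are protected; how
that list would be written in the note is not settled here, and both
function names are made up for illustration.

```python
import re


def is_untranslated(source, translation):
    """Case 1: the whole item must be identical to the source."""
    return source == translation


def missing_terms(translation, protected_terms):
    """Case 2: return the protected terms that do not appear verbatim."""
    missing = []
    for term in protected_terms:
        # Whole-word, case-sensitive match, so "LIST" is not satisfied
        # by "list" or by a longer word such as "LISTE".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if not re.search(pattern, translation):
            missing.append(term)
    return missing
```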

I would suggest that DONT_TRANSLATE be used to exclude whole messages

And that

be used for issue 2.

VARIABLES:

This could fit into the DONT_TRANSLATE items above but for variables it
would be useful to be able to provide comments specific to the variable

I propose

VAR="%s:The downloaded filename"

You should be able to define multiple VAR entries. I separate this from
the DONT_TRANSLATE as it allows tools to consider things such as
variable renaming
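
A tool could read such entries roughly as follows. This is only a
sketch of the VAR="variable:comment" form proposed above; the helper
names are hypothetical, and a note that declares the same variable
twice would keep only the last comment.

```python
import re

# VAR="variable:comment", where the variable itself contains no colon.
VAR_RE = re.compile(r'VAR="(?P<name>[^":]+):(?P<comment>[^"]*)"')


def parse_vars(note):
    """Return {variable: comment} for every VAR entry in a localization note."""
    return {m.group("name"): m.group("comment") for m in VAR_RE.finditer(note)}


def variables_preserved(note, translation):
    """True if every variable declared in the note still occurs in the translation."""
    return all(var in translation for var in parse_vars(note))
```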

SPLIT MESSAGES:

Personally I think these should all be made to go away. Far away :)
But as programmers will forever insist on breaking strings and in fact
merging something might be too invasive for stable code, I suggest this
NOTE

MERGE="first.label:1:<textbox>"
MERGE="second.label:2"

So we have "entity label:order:joining marker"

So a translator's tool could present the item as:

First string <textbox> second string

The joining marker is completely arbitrary but should convey something
useful and not tempt the translator to translate it. Of course tools
can also test that the item has not been translated.

The tools themselves are responsible for unpacking the merged entities
back into their respective segment.
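
The round trip described above could be sketched like this, assuming
the "entity label:order:joining marker" form with the marker optional
on the last segment. The function names are invented for the example,
and a segment without a marker is assumed to be the final one.

```python
import re

MERGE_RE = re.compile(
    r'MERGE="(?P<entity>[^":]+):(?P<order>\d+)(?::(?P<marker>[^"]*))?"'
)


def parse_merge(notes):
    """Return [(order, entity, marker)] sorted by order; marker is '' if absent."""
    parts = []
    for m in MERGE_RE.finditer(notes):
        parts.append((int(m.group("order")), m.group("entity"), m.group("marker") or ""))
    return sorted(parts)


def merge(entities, parts):
    """Join the entity values in order, placing each marker after its segment."""
    pieces = []
    for _, entity, marker in parts:
        pieces.append(entities[entity])
        if marker:
            pieces.append(marker)
    return " ".join(pieces)


def unmerge(text, parts):
    """Split a merged (translated) string back into per-entity segments."""
    result = {}
    rest = text
    for _, entity, marker in parts:
        if marker:
            segment, found, rest = rest.partition(marker)
            if not found:
                # The translator removed or translated the joining marker.
                raise ValueError("joining marker %r is missing" % marker)
        else:
            segment, rest = rest, ""
        result[entity] = segment.strip()
    return result
```

With the two MERGE notes above, merge produces
"First string <textbox> second string", and unmerge raises an error if
the translator has altered the marker.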

> </quote>
>
> http://developer.mozilla.org/en/docs/XUL_Coding_Style_Guidelines doesn't
> give any guidelines on how localization notes should look, I would like
> us to gather some information on the formats we have out there, what
> they try to convey and then find a common denominator and possibly
> enhance that to make it clearer what values should be used by localizers.

In the Translate Toolkit we found that they were quite consistent.
There are a few anomalies though.

The bigger issue is to make them consistent between .properties
and .dtd's

It's easy to pull the items from a DTD, as they are in very clear
comments, so you can easily tell the start and end. However,
in .properties, as comments are simply ^#, it is difficult to tell if
a wrapped line is a continuation of the localisation note or not. This
could be solved either by insisting that comments in the localisation note
be closed or by using a closing tag, e.g. /NOTE. I'd be in favour of the
second.

>
> Axel
>
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=331438
> _______________________________________________
> dev-l10n mailing list
> dev-...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-l10n
--
Dwayne Bailey
Translate.org.za

+27-12-460-1095 (w)
+27-83-443-7114 (cell)

Benjamin Smedberg

Mar 24, 2006, 7:51:44 AM

Why would we have an entire item that shouldn't be translated in a
localizable file? We should just remove case 1) entirely.

--BDS

Axel Hecht

Mar 24, 2006, 9:12:23 AM

Yes, that's a bit odd, but we have quite a few of those. It seems they
are used for rebranding etc. All of these cases should be revisited and
we may want to put some of them into either
- the corresponding file, or
- a content dtd, i.e. chrome://mod/content/helper.dtd

One tricky aspect here is, there are cases in our trademarks policy
where we say that l10n folks can transcribe a name.
http://www.mozilla.org/foundation/trademarks/l10n-policy.html, point 8.

I guess the localization note should say so, if so.

Do we have any locale that does that?

We could break our format and use #expand for most locales instead. But
this should probably be limited to the branding directories.

Axel

Marek Stepien

Mar 24, 2006, 9:18:29 AM

Axel Hecht napisał(a):

> One tricky aspect here is, there are cases in our trademarks policy
> where we say that l10n folks can transcribe a name.
> http://www.mozilla.org/foundation/trademarks/l10n-policy.html, point 8.
> I guess the localization note should say so, if so.
>
> Do we have any locale that does that?

Belarusian does that (Mozilla Firefox -> Мазіла Файэфокс)
http://lxr.mozilla.org/l10n-mozilla1.8/source/be/other-licenses/branding/firefox/brand.dtd?raw=1

Arabic and Gujarati also seem to do so, but as I am unable to read any
of these languages, I'm not sure if they transcribe it or translate it:

http://lxr.mozilla.org/l10n-mozilla1.8/source/ar/other-licenses/branding/firefox/brand.dtd?raw=1
http://lxr.mozilla.org/l10n-mozilla1.8/source/gu-IN/other-licenses/branding/firefox/brand.dtd?raw=1

Axel Hecht

Mar 24, 2006, 9:32:24 AM

Let's not over-engineer this. I'd rather incrementally fix this.
Plus, localization notes are, like all comments, notes from humans to
humans, from my point of view, and thus they need to make sense to
humans. If they end up machine readable, that's nice, but that's just
the icing on the comment cake.

From a coding style point of view, as DTD entities don't support
arguments, we can't work around split strings there, unless we generate
the UI from properties, which we sometimes do. But that makes both UI
design and localization harder, I'd say, as you can only find out where
your translated string ends up by reading both the xul and the js.

>> </quote>
>>
>> http://developer.mozilla.org/en/docs/XUL_Coding_Style_Guidelines doesn't
>> give any guidelines on how localization notes should look, I would like
>> us to gather some information on the formats we have out there, what
>> they try to convey and then find a common denominator and possibly
>> enhance that to make it clearer what values should be used by localizers.
>
> In the Translate Toolkit we found that they were quite consistent.
> There are a few anomalies though.
>
> The bigger issue is to make them consistent between .properties
> and .dtd's
>
> It's easy to pull the items from a DTD, as they are in very clear
> comments, so you can easily tell the start and end. However,
> in .properties, as comments are simply ^#, it is difficult to tell if
> a wrapped line is a continuation of the localisation note or not. This
> could be solved either by insisting that comments in the localisation note
> be closed or by using a closing tag, e.g. /NOTE. I'd be in favour of the
> second.
>

I'd consider everything following the LOCALIZATION NOTE up to the
localizable entry to be part of the LOCALIZATION NOTE.
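
Under that rule, a .properties reader needs no closing tag. The sketch
below is a simplified illustration: it only treats lines starting with
# as comments and lines containing = as entries, which is narrower than
the full .properties syntax, and collect_notes is a hypothetical name.

```python
def collect_notes(properties_text):
    """Map each key to the localization note that precedes it.

    A note starts at a comment line containing "LOCALIZATION NOTE" and
    includes every following comment line up to the next key=value entry.
    """
    notes = {}
    current = None  # list of note lines while inside a note, else None
    for raw in properties_text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            text = line.lstrip("#").strip()
            if "LOCALIZATION NOTE" in text:
                current = [text]
            elif current is not None:
                current.append(text)
        elif "=" in line:
            key = line.split("=", 1)[0].strip()
            if current is not None:
                notes[key] = " ".join(part for part in current if part)
            current = None
        # Blank lines neither end nor extend a note under this rule.
    return notes
```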

Axel

Axel Hecht

Mar 24, 2006, 12:10:17 PM

Nice, thanks. I sure hope those are transcribed only.

Axel

Filip Miletic

Mar 24, 2006, 6:13:17 PM

Axel Hecht wrote:
>>>> DONT_TRANSLATE and "Do not translate ..."

>>> eg:
>>> 1) "Mozilla Firefox"

>> localizable file? We should just remove case 1) entirely.

> Do we have any locale that does that?

The removal of "Mozilla Firefox" is not good for the locales where the
names must be transcribed. Point 8 of the Mozilla Trademark policy indeed.

f

Filip Miletic

Mar 24, 2006, 6:17:37 PM

Axel Hecht wrote:
> Nice, thanks. I sure hope those are transcribed only.

Add Serbian (sr-CS) to that too. The Serbian locale is not yet in the
official builds, but we've completed the community translations of both
Firefox and Thunderbird.

Serbian style guides require that the name be translated completely,
thus not only transcribed. Thus Thunderbird becomes „Громоптица“ etc.

f

Axel Hecht

Mar 24, 2006, 9:26:43 PM

Please get permission to do this from lice...@mozilla.org (CC me). The
trademarks policy we have only allows transcription, independent of what
a local style guide says. I'm not ruling it out, just want to note that
this won't go without discussion and approval.

Axel

Filip Miletic

Mar 25, 2006, 8:36:03 AM

Axel Hecht wrote:
> a local style guide says. I'm not ruling it out, just want to note that
> this won't go without discussion and approval.

Thanks for noting. I understand your concern.

But first, could someone be kind to untie this knot at Bugzilla so we
can actually commit some translations:
https://bugzilla.mozilla.org/show_bug.cgi?id=283456

f

Axel Hecht

Mar 25, 2006, 6:43:11 PM

Reading that bug, it looks like Sale is out of the loop for getting
owner, but he's glad to help, and there is you for Thunderbird and
Nikola for Firefox?

How about toolkit, are there differences there? We'd need to get those
resolved.

Axel

Filip Miletic

Mar 25, 2006, 6:59:15 PM

Axel Hecht wrote:
> owner, but he's glad to help, and there is you for Thunderbird and
> Nikola for Firefox?

Nikola took on the task of translating Firefox and providing an XPI. I
was one of the several people who translated Thunderbird together.

> How about toolkit, are there differences there? We'd need to get those
> resolved.

Could you be so kind as to explain? Sorry if I'm being dense, but I don't
know enough about mozilla internals to know what you mean here.

The translation approach I used so far has been simplistic. I took an
english XPI and used the translation toolkit (tt) to produce the PO
files. These were translated and the tt was used to put the things back
together. AFAIK quite a lot of l10n strings have been translated and the
bulk of the work is likely going to be applying these to the current ff
and tb, as well as providing the translations in both cyrillic and latin
scripts.

f

Simon Montagu

Mar 26, 2006, 3:07:33 AM

These are both transcriptions.

Marek Stepien

Mar 26, 2006, 10:01:24 AM

Filip Miletic napisał(a):

> Serbian style guides require that the name be translated completely,
> thus not only transcribed. Thus Thunderbird becomes „Громоптица“ etc.

Is this your own style guide or a general Serbian style guide from some
authorities? Microsoft does not translate the names "Windows" or
"Office": http://www.microsoft.com/scg/windowsxp/default.mspx

Axel Hecht

Mar 26, 2006, 4:41:06 PM

Filip Miletic wrote:
> Axel Hecht wrote:
>> owner, but he's glad to help, and there is you for Thunderbird and
>> Nikola for Firefox?
> Nikola took on the task of translating Firefox and providing an XPI. I
> was one of the several people who translated Thunderbird together.
>
>> How about toolkit, are there differences there? We'd need to get those
>> resolved.
>
> Could you be so kind as to explain? Sorry if I'm being dense, but I don't
> know enough about mozilla internals to know what you mean here.

Firefox and Thunderbird share parts of their localization; if you start
digging through a source tree for your localization, you will start to notice.

Thus the localization teams for fx and tb need to agree on an owner for
toolkit and merge their work (if not already derived from each other).

Late night short answer, I'm traveling tomorrow.

> The translation approach I used so far has been simplistic. I took an
> english XPI and used the translation toolkit (tt) to produce the PO
> files. These were translated and the tt was used to put the things back
> together. AFAIK quite a lot of l10n strings have been translated and the
> bulk of the work is likely going to be applying these to the current ff
> and tb, as well as providing the translations in both cyrillic and latin
> scripts.

The two scripts look like a potential problem. I'm tempted to think that
we'd want one only. Not sure how the two relate, and if there is a
strong argument to maintain both. It may end up with being twice the
work in our scheme.

Axel

Filip Miletic

Mar 27, 2006, 4:11:27 AM

Axel Hecht wrote:
> Thus the localization teams for fx and tb need to agree on an owner for
> toolkit and merge their work (if not already derived from each other).

The translations were not derived from each other, although I did use
parts of the FF translations to fill the TB in at some places. Right
now, FF uses more the 'computerese' jargon, while I tried to soften that
and use as many translated terms as possible.

However both approaches were somewhat 'uneducated' guesses, and I
suppose someone with a background in language should have a say too. We
should be able to come to an agreement there.

> The two scripts look like a potential problem. I'm tempted to think that
> we'd want one only. Not sure how the two relate, and if there is a
> strong argument to maintain both. It may end up with being twice the
> work

The latin script translation can be produced automatically from the
cyrillic one (the other way around does not work). I fail to see why in
that case there could not be both a Serbian (Cyrillic) and a Serbian (Latin)
edition of FF and TB.

The Serbian language uses two scripts and this is stipulated in the
constitution (art. 8 of the Constitution, Republic of Serbia
http://www.parlament.sr.gov.yu/content/cir/akta/ustav/ustav_1.asp). It
says, roughly, that in Serbia "[...] the cyrillic script is in the
official use, and latin is used as stipulated by law".

While the Constitution seems to have a preference towards cyrillic,
there is no clear preference among the people, as both scripts are
taught at schools and used in everyday communication. I consider it
only fair that both varieties be offered so that end users can make
their own choice, instead of the choice being imposed on them. Given that one
can obtain both translations while maintaining only one, the duality
comes at no extra cost from where I'm standing.

Two important issues need to be mentioned here. One is that the official
institutions must use the cyrillic script. The other is that the duality
has been acknowledged by Microsoft, which has provided the Serbian
language pack for Windows XP in both scripts.

f

Dwayne Bailey

Mar 27, 2006, 12:59:27 AM

to dev-...@lists.mozilla.org

On Fri, 2006-03-24 at 15:32 +0100, Axel Hecht wrote:

[snip]

> Let's not over-engineer this. I'd rather incrementally fix this.

Good point. Then perhaps we should define the format for the
DONT_TRANSLATE entries as a first step?

> Plus, localization notes are, like all comments, notes from humans to
> humans, from my point of view, and thus they need to make sense to
> humans. If they end up machine readable, that's nice, but that's just
> the icing on the comment cake.

I would say that we must achieve both. If these comments aren't machine
readable then we can't automate any checks. And if we're scaling up the
number of locales we need some things to be machine readable.

> From a coding style point of view, as DTD entities don't support
> arguments, we can't work around split strings there, unless we generate
> the UI from properties, which we sometimes do. But that makes both UI
> design and localization harder, I'd say, as you can only find out where
> your translated string ends up by reading both the xul and the js.

I don't think we should make anyone's job harder :) My merge marker might
be a bit unwieldy in format, perhaps others have better ideas? But
we've lived with this for a while so I'd place it low on the pecking
order.

My priorities, in terms of things to consider, are:

* Clarify/standardise DONT_TRANSLATE (transliteration also seems to belong here)
* Configuration related items: VALUE=
* Variables
* Split items

> >> http://developer.mozilla.org/en/docs/XUL_Coding_Style_Guidelines doesn't
> >> give any guidelines on how localization notes should look, I would like
> >> us to gather some information on the formats we have out there, what
> >> they try to convey and then find a common denominator and possibly
> >> enhance that to make it clearer what values should be used by localizers.
> >
> > In the Translate Toolkit we found that they were quite consistent.
> > There are a few anomalies though.
> >
> > The bigger issue is to make them consistent between .properties
> > and .dtd's
> >
> > It's easy to pull the items from a DTD, as they are in very clear
> > comments, so you can easily tell the start and end. However,
> > in .properties, as comments are simply ^#, it is difficult to tell if
> > a wrapped line is a continuation of the localisation note or not. This
> > could be solved either by insisting that comments in the localisation note
> > be closed or by using a closing tag, e.g. /NOTE. I'd be in favour of the
> > second.
> >
>
> I'd consider everything following the LOCALIZATION NOTE up to the
> localizable entry to be part of the LOCALIZATION NOTE.

So that is the first step in standardisation :)

Do we need to start a Wiki page to outline the items under consideration
and those that are standardised?

Axel Hecht

Mar 27, 2006, 7:00:31 PM

Of course ;-)

Axel

Axel Hecht

Mar 29, 2006, 1:28:37 AM

So, from my point of view, there is a significant cost per locale in
terms of build resources, QA resources, ftp-and-mirror resources,
release-schedule-approval time. In addition, our process doesn't provide
means to generate one localization from another, so that'd partially
double the load on your team, too.

On top of that, there are localized items like dialog sizes (and maybe
font preferences) that need to be adapted to the script.

I'd say that the cyrillic version is the one that I
would suggest, as going from there to latin is easier than the other way
around, plus it is a tad more official.
If you want to create a latin-scripted version based on the cyrillic
one, we need to talk about that in much detail, and given our current
situation with build and QA being totally stressed out, I don't see a
short term solution there. That is, we don't have the resources to
develop a technical solution, let alone deploy and maintain it.

That said, I'm curious about how the mapping of cyrillic to latin works
technically, so that I can give a better guesstimate of the cost.

Axel

Filip Miletic

Mar 29, 2006, 6:06:09 PM

Axel Hecht wrote:
> So, from my point of view, there is a significant cost per locale in
> terms of build resources, QA resources, ftp-and-mirror resources,

[...]

> I'd say that the cyrillic version is the one that I
> would suggest, as going from there to latin is easier than the other way
> around, plus it is a tad more official.

My interest is to provide the cyrillic translation, so I am in principle
willing to stop there.

However, leaving it at that would beyond doubt inspire criticism from
the users. Therefore, I want to at least negotiate the way to add the
latin translation to the cyrillic one in the future, in case it is not
possible to do so from the beginning.

I understand your concern about inserting the new locales and the burden
it incurs on the release schedule and logistics. But I also admit I kind
of thought that's the whole point of localization: trade the user
convenience off for the extra resources used. Could you be so kind as to
explain what complications these two new locales would introduce? Or
better yet, if there's a document, point me to it so I can read it?

> short term solution there. That is, we don't have the resources to
> develop a technical solution, let alone deploy and maintain it.

Would pushing this task towards the L10N team not lift some burden off
of you? Think of it as one team that provides two locales. I cannot see
a reason why your development process would not support it.

From where I stand, there is no (special) problem to deliver two locales
instead of one. In that case you should not care what process the L10N
team uses to make them.

> That said, I'm curious about how the mapping of cyrillic to latin works
> technically, so that I can give a better guesstimate of the cost.

Given a file containing cyrillic UTF-8 encoded text as input, a sed
script along the following lines is used to output an equivalent file in
latin script.

#! /bin/sed -f
s/а/a/g
s/б/b/g
# ...
# 60 lines in total. Each letter of the cyrillic
# alphabet maps to either a one-character or
# a two-character latin string.
# ...
s/Ш/Š/g
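
For comparison, the same idea can be written as a lookup table. This
is a sketch, not the script actually used: it assumes the standard
Serbian letter correspondence and maps an uppercase digraph letter to a
capitalised pair (Љ to Lj), so an all-caps word would need extra
handling.

```python
# Serbian cyrillic to latin, one entry per lowercase letter (30 letters).
LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Add the uppercase letters, giving 60 entries in total.
TABLE = dict(LOWER)
TABLE.update({cyr.upper(): lat.capitalize() for cyr, lat in LOWER.items()})
TRANSLATION = str.maketrans(TABLE)


def to_latin(text):
    """Replace every cyrillic letter; all other characters pass through unchanged."""
    return text.translate(TRANSLATION)
```

Because every character is looked up independently, entity names,
placeholders such as %s and markup are left untouched as long as they
contain no cyrillic letters.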

hth,
f

Axel Hecht

Mar 30, 2006, 3:20:12 PM

Filip Miletic wrote:
> Axel Hecht wrote:
>> So, from my point of view, there is a significant cost per locale in
>> terms of build resources, QA resources, ftp-and-mirror resources,
> [...]
>> I'd say that the cyrillic version is the one that I
>> would suggest, as going from there to latin is easier than the other way
>> around, plus it is a tad more official.
>
> My interest is to provide the cyrillic translation, so I am in principle
> willing to stop there.
>
> However, leaving it at that would beyond doubt inspire criticism from
> the users. Therefore, I want to at least negotiate the way to add the
> latin translation to the cyrillic one in the future, in case it is not
> possible to do so from the beginning.
>
> I understand your concern about inserting the new locales and the burden
> it incurs on the release schedule and logistics. But I also admit I kind
> of thought that's the whole point of localization: trade the user
> convenience off for the extra resources used. Could you be so kind as to
> explain what complications these two new locales would introduce? Or
> better yet, if there's a document, point me to it so I can read it?

QA and trademarks approval are one part, build resources are another, and
administrative effort during release etc. is extra work that needs to be
done for each locale.

I will be investigating whether we can at some time in the future support
derived localizations, but that is not really a short-term goal. And, I
admit, I'm not going to drive that on behalf of the latin script of
the Serbian locale. We could use something like this for the Japanese
Mac locale, though, so that's why I'm going to look at something in that
direction. Without any timeline, though.

>> short term solution there. That is, we don't have the resources to
>> develop a technical solution, let alone deploy and maintain it.
>
> Would pushing this task towards the L10N team not lift some burden off
> of you? Think of it as one team that provides two locales. I cannot see
> a reason why your development process would not support it.
>
> From where I stand, there is no (special) problem to deliver two locales
> instead of one. In that case you should not care what process the L10N
> team uses to make them.
>
>> That said, I'm curious about how the mapping of cyrillic to latin works
>> technically, so that I can give a better guesstimate of the cost.
>
> Given a file containing cyrillic UTF-8 encoded text as input, a sed
> script along the following lines is used to output an equivalent file in
> latin script.
>
> #! /bin/sed -f
> s/а/a/g
> s/б/b/g
> # ...
> # 60 lines in total. Each letter of the cyrillic
> # alphabet maps to either a one-character or
> # a two-character latin string.
> # ...
> s/Ш/Š/g

This only holds for the utf-8 encoded files, so there is a limit to this
algorithm. That doesn't make it impossible, just more involved.

Axel