
supporting obscure languages


Albert Cahalan

Nov 26, 2009, 1:28:50 PM
to bug-gnu...@gnu.org
First, you may assume that the locale is UTF-8. We only care about
the messages and getting stuff like iswprint or towupper to work in
the default (no Turkish i, etc.) Unicode way.

Given a fairly normal program, how can the user force the use of a
specific known messages file? Consider /tmp/testfile.mo in a locale
that isn't otherwise defined in any way.

Now suppose that the file is stored in the expected place. The user
wants to use /usr/share/locale/zam/LC_MESSAGES/someprog.mo with a
program that claims to be someprog. Again, the locale isn't supported
in any other way; there is merely a *.mo file installed. Without
giving the full path, and with minimal complexity, how can the user
get this file to be used?

How can a program offer a non-environment way to override the source
of messages? The obvious setlocale(LC_ALL,"zam") does not work, nor
does the troublesome (because other locales need more) substitution
of setlocale(LC_MESSAGES,"zam").

BTW, please consider it a bug that that doesn't just work.


Bruno Haible

Nov 27, 2009, 7:42:38 AM
to Albert Cahalan, bug-gnu...@gnu.org
Hi Albert,

You did not state what you are trying to do. I understand it like this:
"How do I add support for a specific, rarely used language to my system
in such a way that I can localize programs for this language?"

The answer is:

1) You need to define a locale identifier for it. This is important,
because the users and all translators must agree on it - if a
translator uses a different identifier than the user, her
translations will not be found. The standardized identifiers
are those in ISO 639-1 and ISO 639-2, and also found in glibc's
glibc/locale/iso-639.def.

If your language is a distinct one, you should find the language
identifier in this list. If your language is a dialect of another
language, you can use a variant tag. For example, if by "zam" you
mean the language "Zapotec, Miahuatlán", it is a dialect of
Zapotec, which has the identifier "zap". So you will likely
choose the language identifier "zap@miahuatlan" (all ASCII please).

2) You may need to define a glibc locale. This is necessary for a
distinct language and optional for a variant (need it only if you
want to override some localizations). You need it because things
like month name, time display rules and the like are not defined
by .po files but through a locale definition. To create a locale, use the
"localedef" command together with a locale definition file. There
are dozens of examples of these locale definition files in a
directory mentioned in the output of "localedef --help".

3) Then you can create .mo files from .po files for that language,
as described in the GNU gettext documentation.

Albert Cahalan wrote:
> Given a fairly normal program, how can the user force the use of a
> specific known messages file? Consider /tmp/testfile.mo in a locale
> that isn't otherwise defined in any way.

You can set the environment variable LOCALEDIR to, say, /tmp/locale,
and then store the file as /tmp/locale/$localeID/LC_MESSAGES/testfile.mo.
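
A minimal C sketch of the directory layout described above, assuming the program itself reads the base directory (for instance from a variable such as LOCALEDIR) and hands it to bindtextdomain(). The helper name catalog_path is hypothetical; it only builds the path and refuses locale names that contain a slash.

```c
#include <stdio.h>
#include <string.h>

/* Build "<base>/<locale>/LC_MESSAGES/<domain>.mo" into out.
 * Returns 0 on success, -1 if the locale name contains '/' or the
 * result does not fit into the buffer. */
static int catalog_path(char *out, size_t outlen, const char *base,
                        const char *locale, const char *domain)
{
    int n;

    /* Keep the result below base: a locale name is a single component. */
    if (strchr(locale, '/') != NULL)
        return -1;

    n = snprintf(out, outlen, "%s/%s/LC_MESSAGES/%s.mo",
                 base, locale, domain);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With base "/tmp/locale", locale "zam" and domain "testfile" this yields the file name used in the example above.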

> Now suppose that the file is stored in the expected place. The user
> wants to use /usr/share/locale/zam/LC_MESSAGES/someprog.mo with a
> program that claims to be someprog. Again, the locale isn't supported
> in any other way; there is merely a *.mo file installed.

This is only supported in the dialect case. If you choose the
language identifier "zap@miahuatlan" and the user's locale identifier
is zap_MX.UTF-8@miahuatlan, you need only a zap_MX.UTF-8 locale; a
zap_MX.UTF-8@miahuatlan locale does not need to exist.
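
To make the parts of such an identifier concrete, here is a small C sketch that splits "language[_territory][.codeset][@modifier]" into its components. The struct and function names are hypothetical, and the code does not reproduce the lookup that glibc or gettext performs; it only shows which part of zap_MX.UTF-8@miahuatlan is the variant.

```c
#include <string.h>

struct locale_parts {
    char language[32];
    char territory[32];
    char codeset[32];
    char modifier[32];
};

/* Split a locale identifier into its parts. Missing parts are left as
 * empty strings. Returns 0 on success, -1 if the language is empty or
 * any part does not fit into its buffer. */
static int split_locale(const char *id, struct locale_parts *p)
{
    const char *s;
    size_t n;

    memset(p, 0, sizeof *p);

    n = strcspn(id, "_.@");
    if (n == 0 || n >= sizeof p->language)
        return -1;
    memcpy(p->language, id, n);
    s = id + n;

    if (*s == '_') {
        s++;
        n = strcspn(s, ".@");
        if (n >= sizeof p->territory)
            return -1;
        memcpy(p->territory, s, n);
        s += n;
    }
    if (*s == '.') {
        s++;
        n = strcspn(s, "@");
        if (n >= sizeof p->codeset)
            return -1;
        memcpy(p->codeset, s, n);
        s += n;
    }
    if (*s == '@') {
        s++;
        n = strlen(s);
        if (n >= sizeof p->modifier)
            return -1;
        memcpy(p->modifier, s, n);
    }
    return 0;
}
```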

> How can a program offer a non-environment way to override the source
> of messages? The obvious setlocale(LC_ALL,"zam") does not work, nor
> does the troublesome (because other locales need more) substitution
> of setlocale(LC_MESSAGES,"zam").

There is setlocale, and there is bindtextdomain. But you should have a
locale first.

> BTW, please consider it a bug that that doesn't just work.

No, not a bug. This is the way locales are designed.

Bruno


John Cowan

Nov 27, 2009, 1:17:20 PM
to Bruno Haible, Albert Cahalan, bug-gnu...@gnu.org
Bruno Haible scripsit:

> 1) You need to define a locale identifier for it. This is important,
> because the users and all translators must agree on it - if a
> translator uses a different identifier than the user, her
> translations will not be found. The standardized identifiers
> are those in ISO 639-1 and ISO 639-2, and also found in glibc's
> glibc/locale/iso-639.def.

Is there any reason why ISO 639-3 identifiers cannot be used for
appropriate languages? 639-3 is much more comprehensive than 639-2, and
the identifiers correspond (that is, since 'haw' is Hawaiian in 639-2,
it has the same meaning in 639-3).

> If your language is a distinct one, you should find the language
> identifier in this list. If your language is a dialect of another
> language, you can use a variant tag. For example, if by "zam" you
> mean the language "Zapotec, Miahuatlán", it is a dialect of
> Zapotec, which has the identifier "zap". So you will likely
> choose the language identifier "zap@miahuatlan" (all ASCII please).

"Zapotec" is what 639-3 calls a macrolanguage: that is, it is a collection
of closely related languages that is for some purposes treated as a
single language. The Zapotec macrolanguage encompasses 58 languages.
I emphasize that these are distinct languages, not at all mutually
intelligible. Calling them "dialects of Zapotec" is exactly like calling
French, Spanish, and Italian "dialects of Latin": it reflects an old
unity that has long since been lost.

Furthermore, no one Zapotec language is either numerically or culturally
dominant: Isthmus Zapotec (zai), the largest, has perhaps 85,000 speakers
out of a total Zapotec-speaking population of 500,000. This makes it
quite different from better known macrolanguages such as Arabic (which
encompasses about 30 languages, with Standard Arabic culturally but
not numerically dominant) and Chinese (which encompasses 13 languages,
with Mandarin both culturally and numerically dominant).

In short, unless there is some technical barrier to using 639-3 code
elements, it is more appropriate to code this language as "zam" rather
than as "zap@miahuatlan".

--
The Imperials are decadent, 300 pound John Cowan <co...@ccil.org>
free-range chickens (except they have http://www.ccil.org/~cowan
teeth, arms instead of wings, and
dinosaurlike tails). --Elyse Grasso


Albert Cahalan

Nov 27, 2009, 6:09:12 PM
to Bruno Haible, bug-gnu...@gnu.org
On Fri, Nov 27, 2009 at 7:42 AM, Bruno Haible <br...@clisp.org> wrote:

> You did not state what you are trying to do. I understand it like this:
> "How do I add support for a specific, rarely used language to my system
> in such a way that I can localize programs for this language?"

I'm interested in that, and I think it should be trivial, but I'm
actually dealing with this from the view of a software developer
with existing *.mo files. I'm working on Tux Paint.

(think of the children)

At this point I'm seriously considering ripping out the gettext
stuff because it is fighting me every step of the way. It looks
like less trouble to write my own; we already do this for audio
and fonts. I hope you wish for gettext to be easy to work with.

> 1) You need to define a locale identifier for it. This is important,
> because the users and all translators must agree on it - if a
> translator uses a different identifier than the user, her
> translations will not be found. The standardized identifiers
> are those in ISO 639-1 and ISO 639-2, and also found in glibc's
> glibc/locale/iso-639.def.

Done. It's some Zapotec thing that I know very little about.
I'm not the translator. The translator(s) decided, and I'm
certainly not about to argue.

Well, that's the language I'm currently using for testing.
I'm sure it's not the only thing failing. I have:

af.po de.po fi.po id.po nb.po shs.po tlh.po
ar.po el.po fo.po is.po nl.po sk.po tr.po
ast.po en_AU.po fr.po it.po nn.po sl.po twi.po
az.po en_CA.po ga.po ja.po nr.po son.po uk.po
be.po en_GB.po gd.po ka.po oc.po sq.po ve.po
bg.po en_ZA.po gl.po km.po oj.po sr.po vi.po
bo.po eo.po gos.po ko.po pl.po sv.po wa.po
br.po es.po gu.po ku.po pt.po sw.po wo.po
ca.po es_MX.po he.po lt.po pt_BR.po ta.po xh.po
cs.po et.po hi.po lv.po ro.po te.po zam.po
cy.po eu.po hr.po mk.po ru.po th.po zh_CN.po
da.po fa.po hu.po ms.po rw.po tl.po zh_TW.po

(ever see that many at once before?)

> 2) You may need to define a glibc locale. This is necessary for a
> distinct language and optional for a variant (need it only if you
> want to override some localizations). You need it because things
> like month name, time display rules and the like are not defined
> by .po files but through a locale definition.

Frankly, I don't give a shit. If somebody decides they care, they
can define these things. Tux Paint sure doesn't need any of that.
I don't need month name, time display rules, telephone formats...

All I care about: LC_MESSAGES for "zam", LC_CTYPE not lobotomized

I need two ways to make this happen. First, via the environment.
Second, via function calls so that I can have the --locale=zam
and --lang=zapotec options work.

> To create a locale, use the
> "localedef" command together with a locale definition file. There
> are dozens of examples of these locale definition files in a
> directory mentioned in the output of "localedef --help".

That would be complicated sysadmin work. These machines probably
run es_MX.UTF-8 most of the time, or maybe C. Nobody wants to wait
for Fedora or Debian to take their sweet time adding "zam".
This also, somehow, needs to work for Windows and MacOS X. It will
be a cold day in Hell before Microsoft or Apple supports Zapotec.

> 3) Then you can create .mo files from .po files for that language,
> as described in the GNU gettext documentation.

Done. Tux Paint includes 84 translations.

BTW, we'd like fallback to similar translations in case something
is missing. When zh_TW.mo lacks something, zh_CN.mo should be the
next place to look.

>> How can a program offer a non-environment way to override the source
>> of messages? The obvious setlocale(LC_ALL,"zam") does not work, nor
>> does the troublesome (because other locales need more) substitution
>> of setlocale(LC_MESSAGES,"zam").
>
> There is setlocale, and there is bindtextdomain. But you should have a
> locale first.

It doesn't work. I end up with glibc's broken "C" locale.
Tux Paint's code does this now:

setlocale(LC_ALL, loc); // loc="" or loc="zam"
ctype_utf8(); // setlocale(LC_CTYPE,x) for many x until iswprint works
bindtextdomain("tuxpaint", LOCALEDIR);
bind_textdomain_codeset("tuxpaint", "UTF-8");
textdomain("tuxpaint");

The i18n source is here:
http://tuxpaint.cvs.sf.net/viewvc/tuxpaint/tuxpaint/src/i18n.c?revision=1.72

The interesting stuff starts in the set_current_locale(char *locale)
function, with the requested locale being "" or from the command line.
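
A rough C sketch of the probing idea behind the ctype_utf8() comment above: try LC_CTYPE candidates until iswprint() accepts U+00F7. The candidate list and the function name are assumptions, not Tux Paint's actual code, and the check only tests what the comment describes (iswprint working), not the codeset itself.

```c
#include <locale.h>
#include <stddef.h>
#include <wctype.h>

/* Try each candidate as the LC_CTYPE locale and keep the first one in
 * which U+00F7 (DIVISION SIGN) is classified as printable. Returns the
 * accepted name, or NULL after resetting LC_CTYPE to "C" if no
 * candidate qualifies. */
static const char *pick_utf8_ctype(void)
{
    /* Hypothetical candidate list; "" means "ask the environment". */
    static const char *const candidates[] = {
        "", "C.UTF-8", "en_US.UTF-8", "es_MX.UTF-8", "fr_FR.UTF-8",
    };
    size_t i;

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        /* setlocale() returns NULL when the locale is not installed. */
        if (setlocale(LC_CTYPE, candidates[i]) != NULL
            && iswprint((wint_t)0xf7))
            return candidates[i];
    }
    setlocale(LC_CTYPE, "C");
    return NULL;
}
```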

>> BTW, please consider it a bug that that doesn't just work.
>
> No, not a bug. This is the way locales are designed.

That makes it a design bug.

My current hack: LANGUAGE=zam LC_ALL=fr_FR.UTF-8

Yep, I'm telling gettext that this is French. That's disgusting.

There are quite a few design bugs here, none of which would cause
huge problems all by itself. Together, they are a disaster.

a. The implementation-specific "" locale is "C". (it need not be)
b. The "C" locale is not UTF-8. (this need not be the case)
c. The "C" locale makes iswprint((wchar_t)0xf7) be false. (very bad)
d. The "C" locale ignores LC_MESSAGES, even if not "C".
e. The locale reverts to "C" if some portion is missing/unknown.

The result is that none of these work:

a. setlocale(LC_ALL,"zam");
b. setlocale(LC_MESSAGES,"zam");
c. setlocale(LC_MESSAGES,"zam"); setlocale(LC_CTYPE,"UTF-8");

All should do the job, using any info that is available and picking
generic modern choices for the rest.

There just doesn't seem to be any reasonable way to kick gettext into
UTF-8 mode and feed it a *.mo file. This should be more than easy; it
should be what you tend to end up with when things aren't consistent.

I could see "C" being Latin-1 by default instead of UTF-8 (though wide
character functions should still support full Unicode), and I could see
having message lookup disabled if **nothing** non-C is enabled. Once I
call bind_textdomain_codeset("tuxpaint","UTF-8") or setlocale(foo,"zam")
though, it should be obvious what gettext needs to do.


Bruno Haible

Nov 28, 2009, 4:28:57 AM
to John Cowan, Albert Cahalan, bug-gnu...@gnu.org
John Cowan wrote:
> Is there any reason why ISO 639-3 identifiers cannot be used for
> appropriate languages? 639-3 is much more comprehensive than 639-2, and
> the identifiers correspond (that is, since 'haw' is Hawaiian in 639-2,
> it has the same meaning in 639-3).

ISO 639-2 has stronger rules about stability: codes in ISO 639-2/B will not
be changed; the other ISO 639-2 codes can be changed, but an old code will not
be reused for another language for 5 years. In ISO 639-3, by contrast, there
are large changes every year [1].

> In short, unless there is some technical barrier to using 639-3 code
> elements, it is more appropriate to code this language as "zam" rather
> than as "zap@miahuatlan".

From the linguistic point of view, you may be right.

From the point of stability of the code, if you choose "zam", you have to
consider the possibility that the code be changed at some point in the
future. This is not impossible to do - we had a change from 'no@nynorsk' to
'nn' a couple of years ago - but it causes trouble to the users of that
language for some years.

Bruno

[1] http://www.sil.org/iso639-3/changes.asp


Bruno Haible

Nov 28, 2009, 5:34:55 AM
to Albert Cahalan, bug-gnu...@gnu.org
Albert Cahalan wrote:
> Well, that's the language I'm currently using for testing.
> I'm sure it's not the only thing failing. I have:
>
> af.po de.po fi.po id.po nb.po shs.po tlh.po
> ar.po el.po fo.po is.po nl.po sk.po tr.po
> ast.po en_AU.po fr.po it.po nn.po sl.po twi.po
> az.po en_CA.po ga.po ja.po nr.po son.po uk.po
> be.po en_GB.po gd.po ka.po oc.po sq.po ve.po
> bg.po en_ZA.po gl.po km.po oj.po sr.po vi.po
> bo.po eo.po gos.po ko.po pl.po sv.po wa.po
> br.po es.po gu.po ku.po pt.po sw.po wo.po
> ca.po es_MX.po he.po lt.po pt_BR.po ta.po xh.po
> cs.po et.po hi.po lv.po ro.po te.po zam.po
> cy.po eu.po hr.po mk.po ru.po th.po zh_CN.po
> da.po fa.po hu.po ms.po rw.po tl.po zh_TW.po

84 languages! That is impressive. The largest number of translations
of a package in the Translation Project is currently 58 languages.

> > 2) You may need to define a glibc locale. This is necessary for a
> > distinct language and optional for a variant (need it only if you
> > want to override some localizations). You need it because things
> > like month name, time display rules and the like are not defined
> > by .po files but through a locale definition.
>
> ... Tux Paint sure doesn't need any of that.
> I don't need month name, time display rules, telephone formats...
>
> All I care about: LC_MESSAGES for "zam", LC_CTYPE not lobotomized

Then your workaround of doing
LANGUAGE=zam LC_ALL=fr_FR.UTF-8
is just fine.

> I need two ways to make this happen. First, via the environment.
> Second, via function calls so that I can have the --locale=zam
> and --lang=zapotec options work.

For the first way, you can refer the user to the GNU gettext documentation
http://www.gnu.org/software/gettext/manual/html_node/Users.html
or tell them to set LANGUAGE, if you prefer that.

For the second way, you can call
setenv ("LC_ALL", "fr_FR.UTF-8", 1);
setenv ("LANGUAGE", "zam", 1);
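
Put together with the gettext setup quoted further down, the two setenv calls could be wrapped as in the following C sketch. The function name and its parameters are hypothetical; it assumes POSIX setenv() and a libintl that provides bindtextdomain(), bind_textdomain_codeset() and textdomain(), as glibc does.

```c
#include <libintl.h>
#include <locale.h>
#include <stdlib.h>

/* Select message catalogs for `lang` while taking everything else from
 * `base_locale`, which must be a locale installed on the system.
 * Returns 0 on success, -1 on failure. */
static int override_messages(const char *lang, const char *base_locale,
                             const char *domain, const char *localedir)
{
    if (setenv("LC_ALL", base_locale, 1) != 0)
        return -1;
    if (setenv("LANGUAGE", lang, 1) != 0)
        return -1;

    /* Re-read the environment; NULL means base_locale is not installed. */
    if (setlocale(LC_ALL, "") == NULL)
        return -1;

    if (bindtextdomain(domain, localedir) == NULL)
        return -1;
    if (bind_textdomain_codeset(domain, "UTF-8") == NULL)
        return -1;
    if (textdomain(domain) == NULL)
        return -1;
    return 0;
}
```

A --lang=zam option handler would call something like override_messages("zam", "fr_FR.UTF-8", "tuxpaint", LOCALEDIR) before the first gettext() call.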

> BTW, we'd like fallback to similar translations in case something
> is missing. When zh_TW.mo lacks something, zh_CN.mo should be the
> next place to look.

That's a built-in feature in GNU gettext: just set the LANGUAGE variable to
zh_TW:zh_CN
and you're done.
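
If the program computes the fallback chain itself, the colon-separated value can be assembled as in this C sketch (the helper name join_languages is hypothetical):

```c
#include <string.h>

/* Join language codes with ':' in priority order, the format used by
 * the LANGUAGE variable. Returns 0 on success, -1 if out is too small. */
static int join_languages(char *out, size_t outlen,
                          const char *const *langs, size_t count)
{
    size_t used = 0, i;

    if (outlen == 0)
        return -1;
    out[0] = '\0';

    for (i = 0; i < count; i++) {
        size_t len = strlen(langs[i]);
        size_t need = len + (i ? 1 : 0);   /* code plus separator */

        if (used + need >= outlen)         /* keep room for the NUL */
            return -1;
        if (i)
            out[used++] = ':';
        memcpy(out + used, langs[i], len + 1);
        used += len;
    }
    return 0;
}
```

The resulting string would then be passed to setenv("LANGUAGE", value, 1).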

> I end up with glibc's broken "C" locale.
> Tux Paint's code does this now:
>
> setlocale(LC_ALL, loc); // loc="" or loc="zam"
> ctype_utf8(); // setlocale(LC_CTYPE,x) for many x until iswprint works

Yes, you have no guarantee that a particular locale is installed on the user's
system. You have to try some. setlocale(LC_ALL, "") is a good first guess.

> bindtextdomain("tuxpaint", LOCALEDIR);
> bind_textdomain_codeset("tuxpaint", "UTF-8");
> textdomain("tuxpaint");

Right.

> The i18n source is here:
> http://tuxpaint.cvs.sf.net/viewvc/tuxpaint/tuxpaint/src/i18n.c?revision=1.72
>
> The interesting stuff starts in the set_current_locale(char *locale)
> function, with the requested locale being "" or from the command line.

Looks reasonable.

> My current hack: LANGUAGE=zam LC_ALL=fr_FR.UTF-8
>
> Yep, I'm telling gettext that this is French. That's disgusting.

No, you are telling the system to use a UTF-8 encoding for strings,
French rules for time, sorting, numbers etc, and Zapotec for messages.
If it fits well with your program, all fine.

> There are quite a few design bugs here, none of which would cause
> huge problems all by itself. Together, they are a disaster.
>
> a. The implementation-specific "" locale is "C". (it need not be)

No, when you call setlocale(LC_ALL,"") it uses the locale that the
user has set, not "C".

> b. The "C" locale is not UTF-8. (this need not be the case)

The "C" locale was defined at a time when there was no UTF-8. This
choice accommodates for output devices that cannot display arbitrary
Unicode characters (think of ssh into an older Unix system).

> c. The "C" locale makes iswprint((wchar_t)0xf7) be false. (very bad)

I agree with you that wide characters are a mess in ISO C, because the
meaning of (wchar_t)0xf7 depends on locales: in some locale it may be
a DIVISION SIGN, in another one a CYRILLIC SMALL LETTER YI, in another
one a LATIN SMALL LETTER S WITH ACUTE, and in another one it's invalid.

> d. The "C" locale ignores LC_MESSAGES, even if not "C".

What do you expect the system to do when you set LC_ALL to "C" and
then LC_MESSAGES to "zh_CN"? All characters are US-ASCII but messages
should be in Chinese? In earlier versions of glibc, the Chinese strings
were converted to "?????? ??? ?????? 32 ?????" before being displayed.
This was not really helpful; so now the translations are ignored
entirely in this case.

> e. The locale reverts to "C" if some portion is missing/unknown.

What's wrong with having a fallback if some portion is missing?

> The result is that none of these work:
>
> a. setlocale(LC_ALL,"zam");
> b. setlocale(LC_MESSAGES,"zam");
> c. setlocale(LC_MESSAGES,"zam"); setlocale(LC_CTYPE,"UTF-8");

None of these work because you don't have a "zam" locale installed in
the first place. setlocale is about designating locales to use.

> There just doesn't seem to be any reasonable way to kick gettext into
> UTF-8 mode and feed it a *.mo file.

You found the way and showed it to us.

Bruno


Eric Blake

Nov 28, 2009, 8:14:09 AM
to Bruno Haible, Albert Cahalan, bug-gnu...@gnu.org
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Bruno Haible on 11/28/2009 3:34 AM:


>> b. The "C" locale is not UTF-8. (this need not be the case)
>
> The "C" locale was defined at a time when there was no UTF-8. This
> choice accommodates for output devices that cannot display arbitrary
> Unicode characters (think of ssh into an older Unix system).

But POSIX explicitly allows the "C" locale to use UTF-8, and in fact, that
is the case on cygwin 1.7. Per POSIX, the "C" locale is only portable for
character operations for bytes < 128.

- --
Don't work too hard, make some time for fun as well!

Eric Blake eb...@byu.net
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAksRIiEACgkQ84KuGfSFAYAyPQCdHbfP/UQ57rkWeqVRLrwpN/PN
xEYAn1pkhQtHYFYA0yUGRgvha9EDhOca
=sAbk
-----END PGP SIGNATURE-----


Bruno Haible

Nov 28, 2009, 8:47:21 AM
to Eric Blake, Albert Cahalan, bug-gnu...@gnu.org
Eric Blake wrote:
> >> b. The "C" locale is not UTF-8. (this need not be the case)
> >
> > The "C" locale was defined at a time when there was no UTF-8. This
> > choice accommodates for output devices that cannot display arbitrary
> > Unicode characters (think of ssh into an older Unix system).
>
> But POSIX explicitly allows the "C" locale to use UTF-8, and in fact, that
> is the case on cygwin 1.7.

True. It's an implementation choice whether the "C" locale is in US-ASCII or
UTF-8. In glibc, you would only have to change 1 file: glibc/locale/C-ctype.c.
Or you can create a C.UTF-8 locale for yourself, using 'localedef'.

But it would not help Albert's problem: When the LC_MESSAGES locale is "C",
translations are disabled, regardless of the LANGUAGE environment variable.
This is required for POSIX compatibility of tools (such as "cp" or "tar")
which use gettext() for their internationalization.
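
A program can at least detect this situation before relying on LANGUAGE. A minimal C sketch, with a hypothetical function name, assuming the POSIX LC_MESSAGES category is available in <locale.h>:

```c
#include <locale.h>
#include <string.h>

/* Return 1 if the current LC_MESSAGES locale is "C" or "POSIX", the
 * case in which translations are disabled regardless of LANGUAGE,
 * and 0 otherwise. */
static int messages_locale_is_c(void)
{
    /* A NULL second argument queries the category without changing it. */
    const char *name = setlocale(LC_MESSAGES, NULL);

    return name == NULL
           || strcmp(name, "C") == 0
           || strcmp(name, "POSIX") == 0;
}
```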

Bruno


Bruno Haible

Nov 28, 2009, 10:49:03 AM
to Albert Cahalan, bug-gnu...@gnu.org
Hello Albert,

> >> All I care about: LC_MESSAGES for "zam", LC_CTYPE not lobotomized
> >
> > Then your workaround of doing
> > LANGUAGE=zam LC_ALL=fr_FR.UTF-8
> > is just fine.
>
> Don't you think that is terribly gross? (French with
> different words!)

It is similar to LC_MESSAGES=zam_MX.UTF-8 LANG=fr_FR.UTF-8, which would
be a perfectly reasonable choice for a user with French preferences but
Zapotec language. POSIX allows users to combine different aspects of
locales in this way.

> Don't you think it's doubly gross to have a program
> calling setenv() to control a library via environment
> variables intended for users instead of a proper API?
> ... setenv as an API
> is really disturbing. I greatly prefer to treat the environment
> as read-only.

It is gross, but it is a consequence of your desire to use a language
for which the locale is not existent or not installed, and therefore
to do in your program what normally the users do in their system. This is
not typical. The normal case is that users set their preferences in a
central location and these preferences get transmitted to the programs via
environment variables.

> The library doesn't even get immediate notice that there
> has been a change unless you have evil hooks into the
> setenv and getenv functions.

You don't have such hooks in the setlocale function either. Sadly.

> I'm depending on some random unrelated locale
> just to get normal UTF-8 behavior.

Yes, this is worrying. But nowadays, on most desktop systems, at least
one user locale is installed, it uses UTF-8 encoding, and you can
enquire it through setlocale(LC_ALL,"").

The systems with only the "C" locale are small-memory devices like
routers.

> > No, when you call setlocale(LC_ALL,"") it uses the locale that the
> > user has set, not "C".
>
> I mean when the user has done nothing either. The "" doesn't
> get filled in by some environment variable. You make it all the
> way to the lowest-priority environment variable ("LANG") and
> still have "". At that point, the implementation-specific locale
> is chosen... and it is "C".

If you are in this case, you are either on a misconfigured desktop
system, or on a small-memory system on which your program is likely
not meant to run.
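
The lookup order discussed here (LC_ALL, then the category variable, then LANG, then the implementation default) can be written out as a C sketch. The function name is hypothetical, and the final "C" reflects the behaviour reported for glibc in this thread, not a requirement of the standard.

```c
#include <stdlib.h>

/* Name that setlocale(LC_MESSAGES, "") would be asked to use: LC_ALL
 * first, then LC_MESSAGES, then LANG. Unset or empty variables are
 * skipped; with nothing set, fall back to "C". */
static const char *requested_messages_locale(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_MESSAGES", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);
        if (v != NULL && v[0] != '\0')
            return v;
    }
    return "C";
}
```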

> Basically: use what is there, and assume something close
> to "C.UTF-8" for anything missing/broken. Maybe you could
> find choices that are more generic than "C", like 24-hour time
> and PA4 paper size. Maybe round-trip the case for U+1E9E,
> avoiding expansion troubles. You could call it "default.UTF-8".
>
> The details aren't terribly critical; the main thing is to let a
> random loose UTF-8 *.mo file work without hacks or fuss,
> along with the wchar_t functions working beyond ASCII.

Internationalization of a program consists of three parts:
1) Make use of the Unicode character set.
2) Provide translations for messages.
3) Do the following in a locale dependent way: display of time,
display of currency, computations with calendar, display of
Hanzi ideographs (Chinese vs. Japanese - same Unicode code
point, different glyphs), form for entering a postal address,
arrangement of GUI components (right-to-left), etc.

With a "C" locale in UTF-8 encoding, you would get part 1). You would
not get part 2), because gettext() must not use the translation message
catalogs in the "C" locale. You would also not get part 3), because
strftime etc. also must not use localized values in the "C" locale.
That's because in POSIX, the "C" locale is the locale to be set when you
want to know ahead of time the output format of "ls", "df", "date" etc.

Conclusion: In general, a program cannot be internationalized if it
relies on the "C" locale.

Therefore only a few programs would profit from a "C" locale in UTF-8
encoding.

But I agree with you that it would be useful if more Linux distributors
would install an en_US.UTF-8 locale always.

Bruno


Albert Cahalan

Nov 28, 2009, 8:02:40 AM
to Bruno Haible, bug-gnu...@gnu.org
On Sat, Nov 28, 2009 at 5:34 AM, Bruno Haible <br...@clisp.org> wrote:
> Albert Cahalan wrote:
>> I don't need month name, time display rules, telephone formats...
>>
>> All I care about: LC_MESSAGES for "zam", LC_CTYPE not lobotomized
>
> Then your workaround of doing
> LANGUAGE=zam LC_ALL=fr_FR.UTF-8
> is just fine.

Don't you think that is terribly gross? (French with
different words!)

Don't you think it's doubly gross to have a program
calling setenv() to control a library via environment
variables intended for users instead of a proper API?

>> BTW, we'd like fallback to similar translations in case something
>> is missing. When zh_TW.mo lacks something, zh_CN.mo should be the
>> next place to look.
>
> That's a built-in feature in GNU gettext: just set the LANGUAGE variable to
> zh_TW:zh_CN
> and you're done.

I guess we'll probably do that. Still, setenv as an API
is really disturbing. I greatly prefer to treat the environment
as read-only.

The library doesn't even get immediate notice that there
has been a change unless you have evil hooks into the
setenv and getenv functions. You'd have to either do a
slow getenv each time, or cache the value and hope the
program doesn't try to change things later.

>> setlocale(LC_ALL, loc); // loc="" or loc="zam"
>> ctype_utf8(); // setlocale(LC_CTYPE,x) for many x until iswprint works
>
> Yes, you have no guarantee that a particular locale is installed on the user's
> system. You have to try some. setlocale(LC_ALL, "") is a good first guess.

That guess is just "C" on my system.

>> My current hack: LANGUAGE=zam LC_ALL=fr_FR.UTF-8
>>
>> Yep, I'm telling gettext that this is French. That's disgusting.
>
> No, you are telling the system to use a UTF-8 encoding for strings,
> French rules for time, sorting, numbers etc, and Zapotec for messages.
> If it fits well with your program, all fine.

Eh, the Zapotec dialect of French. It does work, as long as
the user happens to have fr_FR.UTF-8 installed.

That's trouble. I'm depending on some random unrelated locale
just to get normal UTF-8 behavior.

>> There are quite a few design bugs here, none of which would cause
>> huge problems all by itself. Together, they are a disaster.
>>
>> a. The implementation-specific "" locale is "C". (it need not be)
>
> No, when you call setlocale(LC_ALL,"") it uses the locale that the
> user has set, not "C".

I mean when the user has done nothing either. The "" doesn't
get filled in by some environment variable. You make it all the
way to the lowest-priority environment variable ("LANG") and
still have "". At that point, the implementation-specific locale
is chosen... and it is "C".

>> b. The "C" locale is not UTF-8. (this need not be the case)
>
> The "C" locale was defined at a time when there was no UTF-8. This
> choice accommodates for output devices that cannot display arbitrary
> Unicode characters (think of ssh into an older Unix system).

I can sort of understand this. I own a real VT510 terminal.

It's not a working protection though. Linux distributions often
set a UTF-8 locale, then fail to translate or otherwise protect
logins on the serial tty devices. This happens to be why procps
replaces UTF-8 characters containing the 0x9b byte. (but of
course that is potentially hostile data, not translations, and
Red Hat patches out the protection anyway)

Having "C" not be i18n-friendly (serving up UTF-8 messages
and full Unicode on wchar_t) wouldn't be a big deal except
for the fact that the locale so easily ends up being "C".
(when unspecified, when a locale is broken/unknown, etc.)

>> c. The "C" locale makes iswprint((wchar_t)0xf7) be false. (very bad)
>
> I agree with you that wide characters are a mess in ISO C, because the
> meaning of (wchar_t)0xf7 depends on locales: in some locale it may be
> a DIVISION SIGN, in another one a CYRILLIC SMALL LETTER YI, in another
> one a LATIN SMALL LETTER S WITH ACUTE, and in another one it's invalid.

Locales with non-Unicode wchar_t are far worse than locales
with non-UTF-8 char. Lots of software breaks, and nobody will
fix it. There comes a time to deprecate dysfunctional locales.

>> d. The "C" locale ignores LC_MESSAGES, even if not "C".
>
> What do you expect the system to do when you set LC_ALL to "C" and
> then LC_MESSAGES to "zh_CN"? All characters are US-ASCII but messages
> should be in Chinese? In earlier versions of glibc, the Chinese strings
> were converted to "?????? ??? ?????? 32 ?????" before being displayed.
> This was not really helpful; so now the translations are ignored
> entirely in this case.

Just be binary-clean. Remember why UTF-8 was invented.
If glibc were binary clean, messages would normally just work.
They would certainly work for typical GUI stuff using Pango,
and would even work in many terminal situations.

>> e. The locale reverts to "C" if some portion is missing/unknown.
>
> What's wrong with having a fallback if some portion is missing?

Nothing. The problem is how this interacts with the other stuff.
If the fallback were something like "C.UTF-8" or the "C" locale
wasn't severely limited, there would be no problem.

It's only the combination of all these design issues that results in
a problem. Individually, no one design issue is really a problem.

>> The result is that none of these work:
>>
>> a. setlocale(LC_ALL,"zam");
>> b. setlocale(LC_MESSAGES,"zam");
>> c. setlocale(LC_MESSAGES,"zam"); setlocale(LC_CTYPE,"UTF-8");
>
> None of these work because you don't have a "zam" locale installed in
> the first place. setlocale is about designating locales to use.

I have a piece of a locale installed. (my "zam.mo" file)
To use that, I mainly just need a binary-clean library.
Getting iswprint() and towupper() would be nice too, but
it's not a huge problem for me to write my own.

Basically: use what is there, and assume something close
to "C.UTF-8" for anything missing/broken. Maybe you could
find choices that are more generic than "C", like 24-hour time
and PA4 paper size. Maybe round-trip the case for U+1E9E,
avoiding expansion troubles. You could call it "default.UTF-8".

The details aren't terribly critical; the main thing is to let a
random loose UTF-8 *.mo file work without hacks or fuss,
along with the wchar_t functions working beyond ASCII.

>> There just doesn't seem to be any reasonable way to kick gettext into
>> UTF-8 mode and feed it a *.mo file.
>
> You found the way and showed it to us.

Trying random unrelated locales and calling putenv() is
pretty far from reasonable IMHO.


Bruno Haible

Nov 28, 2009, 3:11:25 PM11/28/09
to Albert Cahalan, bug-gnu...@gnu.org
Albert Cahalan wrote:
> Maybe round-trip the case for U+1E9E, avoiding expansion troubles.

Unicode 5.1 introduced the character U+1E9E "LATIN CAPITAL LETTER SHARP S",
but the habits in Germany have not changed. The upper-case variant of "Ruß"
is still "RUSS". German people don't care about whether this round-trips
or not. "ß" uppercases to "SS". It has been like this for centuries.

Therefore if you want your program to do case conversions right for German
(and Turkish, Greek, Lithuanian etc.), you need to perform case conversions
on entire strings, not merely on characters one by one. In C programs,
you can use GNU libunistring [1] for this purpose. It has all the special cases
built-in.
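
To illustrate why a character-by-character loop cannot do this, here is a minimal sketch that deliberately does not use libunistring. The function name upcase_expand_sharp_s() is hypothetical, and it handles only the single special case "ß" -> "SS"; everything else goes through towupper(), so it is not a substitute for a real string-level case mapping.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: uppercases src into dst, expanding U+00DF to "SS".
 * dst_len is the capacity of dst in wide characters, terminator included.
 * Returns the number of wide characters written (without the terminator),
 * or (size_t)-1 if dst is too small. */
size_t upcase_expand_sharp_s(wchar_t *dst, size_t dst_len, const wchar_t *src)
{
    size_t out = 0;

    for (; *src != L'\0'; src++) {
        if (*src == 0x00DF) {
            /* One input character becomes two output characters, which a
             * wint_t -> wint_t function like towupper() cannot express. */
            if (out + 2 >= dst_len)
                return (size_t)-1;
            dst[out++] = L'S';
            dst[out++] = L'S';
        } else {
            if (out + 1 >= dst_len)
                return (size_t)-1;
            dst[out++] = (wchar_t)towupper((wint_t)*src);
        }
    }
    if (out >= dst_len)
        return (size_t)-1;
    dst[out] = L'\0';
    return out;
}
```

The point of the sketch is the buffer handling: the output can be longer than the input, so the conversion has to operate on whole strings.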

Bruno

[1] http://www.gnu.org/software/libunistring/


Albert Cahalan

Nov 28, 2009, 2:53:26 PM11/28/09
to Bruno Haible, bug-gnu...@gnu.org
On Sat, Nov 28, 2009 at 10:49 AM, Bruno Haible <br...@clisp.org> wrote:

> It is similar to LC_MESSAGES=zam_MX.UTF-8 LANG=fr_FR.UTF-8, which would
> be a perfectly reasonable choice for a user with French preferences but
> Zapotec language. POSIX allows users to combine different aspects of
> locales in this way.

POSIX does, but the library does not. If the library followed POSIX
then I could combine LC_MESSAGES=zam with LANG=C.

In other words, this looks like a POSIX violation to me.

> It is gross, but it is a consequence of your desire to use a language
> for which the locale is nonexistent or not installed, and therefore
> to do in your program what normally the users do in their system. This is
> not typical. The normal case is that users set their preferences in a
> central location and these preferences get transmitted to the programs via
> environment variables.

The only part I need is installed: zam.mo

Since I never try to format time, the library shouldn't even try
to load the data for that. The missing stuff shouldn't affect
anything since I'm not attempting to use it. Supposing I did
try to format time though, that could do some typical thing.

Basically this isn't fail-safe. Some chunk of locale data goes
missing, and suddenly the whole thing dies.
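
A program can approximate the fail-safe behavior on its own side by setting the categories one at a time instead of using LC_ALL, so that one missing piece does not discard the rest. The sketch below shows the idea; set_locale_best_effort() is a made-up name, and the category list is limited to the six POSIX categories. It does not fix the case described in this thread, where only a *.mo file is installed, because the LC_MESSAGES call itself is reported to fail there.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: tries to apply `name` to each POSIX locale category
 * separately and returns how many categories accepted it. Categories that
 * fail keep whatever locale they had before the call. */
int set_locale_best_effort(const char *name)
{
    static const int categories[] = {
        LC_COLLATE, LC_CTYPE, LC_MESSAGES,
        LC_MONETARY, LC_NUMERIC, LC_TIME
    };
    int applied = 0;
    size_t i;

    for (i = 0; i < sizeof categories / sizeof categories[0]; i++) {
        if (setlocale(categories[i], name) != NULL)
            applied++;
    }
    return applied;
}
```

A return value below 6 tells the caller that the locale was only partially applied, which it can then report or ignore as it sees fit.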

>> I'm depending on some random unrelated locale
>> just to get normal UTF-8 behavior.
>
> Yes, this is worrying. But nowadays, on most desktop systems, at least
> one user locale is installed, it uses UTF-8 encoding, and you can
> enquire it through setlocale(LC_ALL,"").
>
> The systems with only the "C" locale are small-memory devices like
> routers.

That was my system until I started debugging this problem,
and in fact an apt-get hook wipes out locales every time I
install packages.

This is because en_US.UTF-8 has defective collation order,
and because I don't normally need translations. If I were to
set either LANGUAGE or LC_MESSAGES alone though,
that ought to get me translations despite anything else.

> Internationalization of a program consists of three parts:
> 1) Make use of the Unicode character set.
> 2) Provide translations for messages.
> 3) Do the following in a locale dependent way: display of time,
> display of currency, computations with calendar, display of
> Hanzi ideographs (Chinese vs. Japanese - same Unicode code
> point, different glyphs), form for entering a postal address,
> arrangement of GUI components (right-to-left), etc.

Well no, not unless the program needs it. OTOH, Tux Paint
localizes things you don't even handle: audio clips, fonts,
font size, font vertical position, and right-to-left text rendering.

In any case, part of a locale is better than none. Right now
you're essentially saying that incomplete localization isn't
allowed; it's all or nothing.

> With a "C" locale in UTF-8 encoding, you would get part 1). You would
> not get part 2), because gettext() must not use the translation message
> catalogs in the "C" locale. You would also not get part 3), because
> strftime etc. also must not use localized values in the "C" locale.
> That's because in POSIX, the "C" locale is the locale to be set when you
> want to know ahead of time the output format of "ls", "df", "date" etc.

Ah, but I asked for a different locale.

LANGUAGE: not set to "C"
LC_ALL: not set to "C"
LC_MESSAGES: not set to "C"
LANG: not set to "C"
setlocale's 2nd parameter: not set to "C"

That right there means I didn't want the "C" locale. Additionally,
at least one of those things is not blank/empty/missing, so you
certainly know which locale I want. I expect best-effort.
I even called bind_textdomain_codeset, so UTF-8 is explicit.
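
The precedence in question can be written down as a small pure function. This is a sketch of the documented rule (LANGUAGE, then LC_ALL, then LC_MESSAGES, then LANG, with LANGUAGE ignored when the locale is "C"), not the library's actual code; message_language_setting() is a hypothetical name. It works on the variable values only, whereas the real library consults the locale that setlocale() managed to activate, which is where the trouble starts when the locale is not installed.

```c
#include <stddef.h>
#include <string.h>

static int is_set(const char *v)
{
    return v != NULL && v[0] != '\0';
}

/* Hypothetical helper: given the four settings, returns the one that
 * decides the message language. Unset values may be NULL or "". */
const char *message_language_setting(const char *language,
                                     const char *lc_all,
                                     const char *lc_messages,
                                     const char *lang)
{
    const char *locale = "C";   /* default when nothing is set */

    if (is_set(lc_all))
        locale = lc_all;
    else if (is_set(lc_messages))
        locale = lc_messages;
    else if (is_set(lang))
        locale = lang;

    /* In the "C" locale, LANGUAGE is ignored. */
    if (strcmp(locale, "C") == 0 || strcmp(locale, "POSIX") == 0)
        return locale;

    return is_set(language) ? language : locale;
}
```

With LANGUAGE=zam and LANG=fr_FR.UTF-8 the result is "zam"; with LANGUAGE=zam and nothing else set, the result is "C" and the catalog is never consulted.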

Had I set nothing, I still wouldn't be asking for "C". You could
give me a "generic.UTF-8" or "NULL.UTF-8" locale that works.

BTW, even the strings being passed to gettext() are UTF-8.
I have things like the ellipsis, so it's still UTF-8 even when the
translation is dumped on the floor.

> But I agree with you that it would be useful if more Linux distributors
> would install an en_US.UTF-8 locale always.

Debian seems to have chosen to add C.UTF-8. From my reading of
the code, it looks like that will fail. They'll patch it I'm sure.


Bruno Haible

Nov 28, 2009, 5:51:18 PM11/28/09
to Albert Cahalan, bug-gnu...@gnu.org
Albert Cahalan wrote:
> Sooner or later, a de...@SHARPS.UTF-8 locale will be demanded.

Yes, certainly. Maybe in 5 years, or 10 years, or in 20 years. But currently,
hardly any font contains the U+1E9E "LATIN CAPITAL LETTER SHARP S" character.
Therefore currently, we should stay with the traditional rule of "ß" -> "SS".

> In any case, you won't be getting "SS" out of towupper.

Yes. It is for this reason that
1. towupper(L'ß') == L'ß',
2. a simple loop that calls towupper is *not* the right way to uppercase an
arbitrary string.

And lowercasing does not work with a simple loop over towlower either, because
of GREEK CAPITAL LETTER SIGMA that needs special treatment.
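
A minimal sketch of the sigma case, assuming a much simplified rule: a capital sigma becomes final sigma (U+03C2) when it follows a letter and is not followed by one, and small sigma (U+03C3) otherwise. The function lower_basic_greek() is hypothetical, covers only the basic Greek capitals U+0391..U+03A9, and ignores the finer points of the Unicode rule (case-ignorable characters, accented letters).

```c
#include <stddef.h>
#include <wchar.h>

/* Basic Greek capitals (U+03A2 is unassigned) and small letters only. */
static int is_basic_greek_letter(wchar_t c)
{
    return (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2)
        || (c >= 0x03B1 && c <= 0x03C9);
}

/* Hypothetical helper: lowercases basic Greek capitals in place. The
 * mapping of capital sigma depends on its neighbors, so it cannot be
 * decided by looking at one character alone. */
void lower_basic_greek(wchar_t *s)
{
    size_t i;

    for (i = 0; s[i] != L'\0'; i++) {
        if (s[i] == 0x03A3) {
            int after_letter = i > 0 && is_basic_greek_letter(s[i - 1]);
            int before_letter = is_basic_greek_letter(s[i + 1]);
            s[i] = (after_letter && !before_letter) ? 0x03C2 : 0x03C3;
        } else if (s[i] >= 0x0391 && s[i] <= 0x03A9 && s[i] != 0x03A2) {
            s[i] += 0x20;   /* capitals and small letters are 0x20 apart */
        }
    }
}
```

Applied to the two words U+039F U+0394 U+039F U+03A3 and U+03A3 U+039F U+03A6 U+039F U+03A3, the word-final sigmas come out as U+03C2 and the word-initial one as U+03C3, which a loop over towlower() has no way to distinguish.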

> I hope libunistring doesn't impede the evolution of languages.

libunistring is free software: it can be changed to fit particular needs.

Bruno


Albert Cahalan

Nov 28, 2009, 4:36:39 PM11/28/09
to Bruno Haible, bug-gnu...@gnu.org
On Sat, Nov 28, 2009 at 3:11 PM, Bruno Haible <br...@clisp.org> wrote:
> Albert Cahalan wrote:
>> Maybe round-trip the case for U+1E9E, avoiding expansion troubles.
>
> Unicode 5.1 introduced the character U+1E9E "LATIN CAPITAL LETTER SHARP S",
> but the habits in Germany have not changed. The upper-case variant of "Ruß"
> is still "RUSS". German people don't care about whether this round-trips
> or not. "ß" uppercases to "SS". It has been like this for centuries.

Germans with "ß" in their last name are people too, and they care.
U+1E9E exists solely because there is real evidence that people care.
It is pretty common to uppercase "ß" as itself; clearly people care.

Sooner or later, a de...@SHARPS.UTF-8 locale will be demanded.

German rules have changed a number of times in the 1900s, and
they certainly can change again.

In any case, you won't be getting "SS" out of towupper.

> Therefore if you want your program to do case conversions right for German
> (and Turkish, Greek, Lithuanian etc.), you need to perform case conversions
> on entire strings, not merely on characters one by one. In C programs,
> you can use GNU libunistring [1] for this purpose. It has all the special cases
> built-in.

Yes, of course, but that doesn't work for towupper.

John Cowan

Nov 28, 2009, 11:44:07 PM11/28/09
to Albert Cahalan, bug-gnu...@gnu.org, Bruno Haible
Albert Cahalan scripsit:

> > Unicode 5.1 introduced the character U+1E9E "LATIN CAPITAL LETTER SHARP S",
> > but the habits in Germany have not changed. The upper-case variant of "Ruß"
> > is still "RUSS". German people don't care about whether this round-trips
> > or not. "ß" uppercases to "SS". It has been like this for centuries.
>
> Germans with "ß" in their last name are people too, and they care.

I don't see why those people would expect to see their names in upper case.

> U+1E9E exists solely because there is real evidence that people care.

True, but the people in question are mostly the designers of
advertisements who want to put headlines in all caps.

As a certified alter kocker, I personally think that German looks horrible
in all caps, but that battle was lost a generation ago.

> It is pretty common to uppercase "ß" as itself; clearly people care.

Especially if they have blindly applied simple Unicode uppercasing rather
than proper German uppercasing.

Turkic, though, is a more complex problem. Knowing how to properly recase
German or Greek just requires applying the algorithm: you don't have to
know that the text actually is German or Greek. To correctly change the
case of a Turkic language, you have to know for sure that you are dealing
with Turkic text. I can well understand if people fail to get that right.

--
John Cowan co...@ccil.org http://www.ccil.org/~cowan
Is it not written, "That which is written, is written"?

