Bug#522776: Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

Albert Cahalan

Nov 26, 2009, 10:30:02 PM

Roger Leigh writes:
> On Tue, Apr 07, 2009 at 09:24:38PM +0200, Adeodato Simó wrote:
>> + Thorsten Glaser (Tue, 07 Apr 2009 18:54:59 +0000):

>>> Except the ton which sets LC_ALL=C to get sane (parsable,
>>> dependable, historically compatible) output.
>>
>>> These would then unset all other LC_* and LANG and LANGUAGE,
>>> and only set LC_CTYPE to C.UTF-8 to get "old" behaviour but
>>> with UTF-8 (and mbrtowc and iswctype and and and) available.
>>
>> Isn't setting LC_ALL=C.UTF-8 going to be about the same and less work?
>> I'm genuinely interested if that would behave any different to what you
>> said (unsetting all, setting LC_CTYPE).
>
> % sudo localedef -c -i POSIX -f UTF-8 C.UTF-8
>
> % LANG=C.UTF8 locale charmap
> UTF-8
>
> % LANG=C locale charmap
> ANSI_X3.4-1968
>
> This appears to work correctly at first glance.
>
> However, I would ideally like the C/POSIX locales to be UTF-8
> by default as on other systems (with a C.ASCII variant if required).

By far the most critical thing is that the <wctype.h> functions
work in the normal Unicode manner, with wchar_t assumed to be
purely Unicode. This means iswupper() works, towupper() works, etc.

This applies to locales called "", "C", and "some-unknown-junk".
The only possible exception would be when there are environment
variables set which are known to need something else. Unrecognized
locales and all other defaults have to support full Unicode.

Note that none of the above necessarily requires UTF-8, though UTF-8
seems desirable. You could use Latin-1 and still have wchar_t work.
This could all be configurable of course. Suppose /etc/locale had:

"" UTF-8 # setlocale with "" and no environment variables
"C" Latin-1 # if the "C" locale is specifically requested
unknown UTF-8 # if we don't recognize the locale
broken UTF-8 # if parts of the locale info are missing/broken

Right now, gettext doesn't even distinguish those cases. This could
be considered part of the problem. When I put a zam.mo file (Zapotec)
in the right place and set LC_ALL to "zam", I get the "C" locale!!!
Any imperfection in a locale results in "C", as ASCII as can be.
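
As a rough illustration of the /etc/locale idea above: the file and its format are hypothetical, and nothing in glibc or gettext reads such a file. The Python sketch below parses the four sample lines and picks an encoding for a requested locale name. The function names and the `known_locales` parameter are assumptions made for the example.

```python
# Sketch only: /etc/locale is the hypothetical file proposed above,
# not something any libc actually reads.
SAMPLE = '''\
"" UTF-8 # setlocale with "" and no environment variables
"C" Latin-1 # if the "C" locale is specifically requested
unknown UTF-8 # if we don't recognize the locale
broken UTF-8 # if parts of the locale info are missing/broken
'''


def parse_locale_table(text):
    """Map each case ('', 'C', 'unknown', 'broken') to its encoding."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        case, encoding = line.split()
        table[case.strip('"')] = encoding  # '""' becomes the empty string
    return table


def fallback_encoding(requested, known_locales, table):
    """Return the table's encoding for the default, "C", and unknown cases.

    Returns None when the locale is recognized, since it then brings its
    own charmap. The "broken" case is not modelled here.
    """
    if requested in ("", "C"):
        return table[requested]
    if requested not in known_locales:
        return table["unknown"]
    return None


table = parse_locale_table(SAMPLE)
print(fallback_encoding("zam", {"en_US.UTF-8"}, table))
```

With this table, an unrecognized name such as "zam" resolves to UTF-8 instead of silently collapsing to ASCII.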

Albert Cahalan

Nov 26, 2009, 11:10:02 PM

Andrew McMillan writes:
> On Wed, 2009-04-08 at 10:15 +0200, Giacomo A. Catenazzi wrote:

>> So I've a question: what does UTF-8 mean in this context (C.UTF-8) ?
...
> So given a character which is outside of the 0x00 <= 0x7f range, in an
> environment which does not specify an encoding, I would like to one day
> be able to categorically state that "Debian will by default assume that
> character is unicode, encoded according to UTF-8".

Damn right. The obscure languages of the world are numerous. Unlike
the languages of countries that were wealthy enough to participate
in native-language computing prior to UTF-8, these less-popular
languages are getting done in UTF-8. We mostly aren't inventing
new incompatible encodings.

> In such an environment, with a C.UTF-8 encoding selected, when I start a
> word processing program and insert an a-umlaut in there, I would expect
> that my file will be written with a UTF-8 encoded unicode character in
> it. I would not expect that if I sort the lines in that file, that the
> lines beginning with a-umlaut would sort before 'z'.

Right...

> I would not expect
> that if I grep such a file for '^[[:alpha:]]$' that my a-umlaut line
> would appear.

No. It's a letter in the Unicode spec, so it should match.

> The proposal, at this stage is only that the C.UTF-8 locale is
> *installed* and *available* by default. Not that it *be* the default,
> but that it *be there* as a default. People will naturally continue to
> be free to uninstall it, or to leave their locale to 'C'.

What if you don't set your locale to anything, or if you set it
to something that isn't recognized? You should get UTF-8 in any
of those cases.

The mechanism isn't so important. It could be that the fallback
locale used by gettext is no longer "C" (perhaps "C.UTF-8"), or it
could be that the "C" locale does UTF-8.

LC_ALL=pirate --> you get UTF-8, with messages from pirate.mo
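
Until the C library behaves this way itself, a program can approximate the fallback on its own. The sketch below uses Python's standard `locale.setlocale`. The candidate order and the function name are assumptions, and "C.UTF-8" only succeeds on systems where that locale is installed, which is what this bug asks for.

```python
import locale


def set_ctype_with_fallback(candidates=("", "C.UTF-8", "C")):
    """Try each locale name in order and return the first one accepted.

    "" means "take it from the environment"; setlocale() raises
    locale.Error when the name is not available on this system.
    """
    for name in candidates:
        try:
            return locale.setlocale(locale.LC_CTYPE, name)
        except locale.Error:
            continue
    return None


chosen = set_ctype_with_fallback()
print(chosen)
```

This only covers LC_CTYPE. Message catalogs, as in the pirate.mo case, are looked up separately.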

> Yes, I think that the C.UTF-8 locale offers something different that the
> C locale doesn't. Primarily it offers us a way out of the current
> default encodings which are legacy encodings, without jumping boots and
> all into a world where suddenly our sort ordering is changed, and our
> users are screaming at us that en_US.UTF-8 is wrong for *them*, or that
> 'sort' is suddenly putting 'A' next to 'a' and all of their legacy shell
> scripts are broken because they expect a different behaviour.

> I believe that the list above might be the set of smallest useful
> incremental changes in this process. I would really like to see that
> second step taken too, where the default locale is set to the most basic
> UTF-8 locale possible, but I'm happy to see a second bug and further
> discussion, if that's what we need to do to get agreement.

There are different meanings of "default".

By default, the locale should not be set in the environment.
That should give UTF-8. It could map to "C", "C.UTF-8", "(nil)",
or whatever.

>> I still think that "en_US.UTF-8" is the right default (note:
>> I'm not a US citizen, nor do I speak English).

As a US citizen who does speak English, I guess I'm an authority
on the en_US.UTF-8 locale. It is offensively defective. It sorts
stuff in a crazy order designed by some moronic committee.
I doubt it even accepts Cyrillic and Korean as having letters.

Albert Cahalan

Nov 26, 2009, 11:30:01 PM

Giacomo A. Catenazzi writes:
> [Andrew McMillan probably]

> I think nobody should use "C" or "C.UTF-8" as user encoding.
> And I really hope that Debian will try to convince users to
> use a proper locale.

Debian doesn't ship a proper locale. I want sorting according
to the raw Unicode values. I want iswprint() to return non-zero
for a Cyrillic character, a Korean character, etc.
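
To make "sorting according to the raw Unicode values" concrete: Python compares `str` values by code point regardless of the locale, and sorting the UTF-8 encoded bytes gives the same order, because UTF-8 preserves code point order under bytewise comparison. A small sketch with made-up sample words:

```python
def sort_by_code_point(words):
    # Python compares str objects code point by code point,
    # independent of any LC_COLLATE setting.
    return sorted(words)


def sort_by_utf8_bytes(words):
    # UTF-8 preserves code point order under bytewise comparison.
    return sorted(words, key=lambda w: w.encode("utf-8"))


words = ["zebra", "äpfel", "apple", "Apple", "яблоко", "사과"]
print(sort_by_code_point(words))
print(sort_by_code_point(words) == sort_by_utf8_bytes(words))
```

In this ordering "Apple" comes before "apple", and "äpfel" comes after "zebra", which is the behaviour a raw code point collation gives and a language-specific collation does not.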

Debian shouldn't be setting locale-related environment variables
unless the user specifically chooses. The implementation-specific
defaults, applied in the absence of any environment variables,
should support Unicode.

>> * All ISO8859 locales are moved to a new locales-legacy-encodings
>> package.
>
> This encoding is used also on CD/, floppy, remote filesystems,
> USB pens, on a lot of internet pages, etc.

Nope.

It's actually UTF-16 in VFAT, Joliet, CIFS, and so on. Linux has
mount options to control how that gets made POSIX-compatible.
You can choose UTF-8. (this should be Debian's default)

> But an ASCII7 "C" encoding allows you to do the same things. It doesn't
> forbid 8 bit characters (thus UTF-8). Unix is transparent on characters
> (i.e. binary and text are the same, you can grep binaries, ...).
>
> So scripts should use LANG=C in most cases.

That leaves iswprint() and towupper() broken. (not that it must)

Thorsten Glaser

Nov 27, 2009, 6:10:01 AM

Albert Cahalan dixit:

>Any imperfection in a locale results in "C", as ASCII as can be.

Yes, and "C" shall not imply latin1; it should be 7-bit ASCII, yet
8-bit transparent.

//mirabilos
--
Sometimes they [people] care too much: pretty printers [and syntax
highlighting, d.A.] mechanically produce pretty output that accentuates irrelevant
detail in the program, which is as sensible as putting all the prepositions
in English text in bold font. -- Rob Pike in "Notes on Programming in C"

Thorsten Glaser

Nov 27, 2009, 6:10:03 AM

Albert Cahalan dixit:

>Giacomo A. Catenazzi writes:

>> I think nobody should use "C" or "C.UTF-8" as user encoding.

I’d use it.

>Debian doesn't ship a proper locale. I want sorting according
>to the raw Unicode values.

Also called ASCIIbetically ☺ But C exists, C.UTF-8 doesn’t.

>>> * All ISO8859 locales are moved to a new locales-legacy-encodings
>>> package.
>>
>> This encoding is used also on CD/, floppy, remote filesystems,
>> USB pens, on a lot of internet pages, etc.
>
>Nope.
>
>It's actually UTF-16 in VFAT, Joliet, CIFS, and so on.

And cp437 (or, worse, cp850) in FAT SFNs.

>> So scripts should use LANG=C in most cases.
>
>That leaves iswprint() and towupper() broken. (not that it must)

No, LANG is *also* wrong. Scripts relying on certain behaviour
use LC_ALL=C (and, on GNU OSes, must also “unset LANGUAGE”). But
some things just require UTF-8, so the current approach is to
unset everything beginning with LC_*, set LANG=C (or unset it),
set LC_ALL=en_US.UTF-8 or en_GB.UTF-8 or whatever, and hope
that that locale is installed… not acceptable!
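
The alternative suggested earlier in this thread (drop LANG, LANGUAGE and every LC_* variable, then set only LC_CTYPE to C.UTF-8) could look roughly like this from a Python wrapper script. The helper name is made up for the example, and the approach only works where a C.UTF-8 locale is actually installed.

```python
import os


def scrubbed_locale_env(environ, ctype="C.UTF-8"):
    """Return a copy of environ with all locale variables removed
    and only LC_CTYPE set, so every other category falls back to "C"."""
    env = {
        key: value
        for key, value in environ.items()
        if key not in ("LANG", "LANGUAGE") and not key.startswith("LC_")
    }
    env["LC_CTYPE"] = ctype
    return env


env = scrubbed_locale_env(os.environ)
print(sorted(k for k in env if k.startswith("LC_")))
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` so that a child process sees "C" behaviour for sorting and messages while still decoding UTF-8.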

bye,
//mirabilos
--
16:47⎜«mika:#grml» .oO(mira is simply good....) 23:22⎜«mikap:#grml»
mirabilos: and your bootloader is awesome :) 23:29⎜«mikap:#grml» and I
think it's totally awesome that I have a BSD to boot with grml, I'll have to
install that on a USB stick right away -- Michael Prokop on MirOS bsd4grml
