Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

Albert Cahalan

Nov 26, 2009, 9:20:01 PM

Steve Langasek writes:
> On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:

>> If you need a specific locale (as seems from "mksh", not
>> sure if it is a bug in that program), you need to set it.
>>
>> You can only set a locale on a glibc-based system if it's
>> installed beforehand, which root needs to do.

This is of course a horrid bug. I'm fighting it right now.
I install a zam.mo file, nothing else, and I damn well expect
that file to get used for messages! Obviously, it's UTF-8.
Obviously, I expect towupper() to follow Unicode defaults.

> You can build-depend on the locales package and generate the locales
> you want locally, using LOCPATH to reference them. There's no need
> for Debian to guarantee the presence of a particular locale ahead of
> time - particularly one that isn't actually useful to end users,
> as C.UTF-8 would be.
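A minimal sketch of the approach described above, assuming glibc's localedef tool and the locale sources are available at build time. The directory name, the choice of en_US.UTF-8, and the helper names are hypothetical and chosen only for illustration:

```python
import os
import subprocess
from pathlib import Path


def localedef_argv(locale_dir, source="en_US", charmap="UTF-8"):
    """Build the localedef command that compiles one locale into locale_dir."""
    name = f"{source}.{charmap}"
    # An output path containing a slash makes localedef write there
    # instead of into the system locale directory.
    return ["localedef", "-i", source, "-f", charmap, str(Path(locale_dir) / name)]


def locale_env(locale_dir, name="en_US.UTF-8", base=None):
    """Return an environment that makes glibc look for locales in locale_dir."""
    env = dict(os.environ if base is None else base)
    env["LOCPATH"] = str(locale_dir)
    env["LC_ALL"] = name
    return env


def generate_locale(locale_dir, source="en_US", charmap="UTF-8"):
    """Compile the locale; needs localedef and the locale sources installed."""
    Path(locale_dir).mkdir(parents=True, exist_ok=True)
    # No check=True here: localedef may exit non-zero on mere warnings.
    return subprocess.run(localedef_argv(locale_dir, source, charmap)).returncode
```

A build script would call generate_locale("build/locales") once and then run its tests with env=locale_env("build/locales"), so no root access and no system-wide locale is needed.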

Unless plain "C" goes UTF-8, that's exactly the locale I need.
The stupid broken en_US.UTF-8 fucks up the sort order.

Granted, fixing en_US.UTF-8 would be sweet, but it may be far too late.

We really need a do-nothing locale that follows the Unicode spec
using the UTF-8 encoding. We could also use a do-nothing locale
that follows the Unicode spec using the Latin-1 encoding.
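To make the sort-order complaint concrete, here is a small sketch. It assumes that a "do-nothing" UTF-8 locale would collate by code point, the way the plain C locale compares bytes; the thread does not spell out the collation, so treat that as an assumption. The sample words are made up for illustration:

```python
words = ["zebra", "Apple", "\u00e9clair", "apple", "Zoo"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Comparing the UTF-8 encodings byte by byte yields the same order,
# because UTF-8 preserves code point order.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_code_point)
print(by_code_point == by_utf8_bytes)
```

A language-specific locale such as en_US.UTF-8 typically interleaves upper and lower case and sorts accented letters next to their base letters, which is the behaviour objected to here.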


Thorsten Glaser

Nov 27, 2009, 6:10:03 AM

Albert Cahalan dixit:

>Unless plain "C" goes UTF-8

Not going to happen, it’s not binary-safe. (I fought that in
MirBSD with the OPTU-8/16 encoding scheme.)

>The stupid broken en_US.UTF-8 fucks up the sort order.

So true… (and paper size!)

>We really need a do-nothing locale that follows the Unicode spec
>using the UTF-8 encoding.

Yes, my proposal exactly.

>We could also use a do-nothing locale
>that follows the Unicode spec using the Latin-1 encoding.

No, for two reasons:
① legacy encodings must die
② then you need one for EVERY legacy encoding (why special-case one?)

bye,
//mirabilos
--
Sometimes they [people] care too much: pretty printers [and syntax highligh-
ting, d.A.] mechanically produce pretty output that accentuates irrelevant
detail in the program, which is as sensible as putting all the prepositions
in English text in bold font. -- Rob Pike in "Notes on Programming in C"

Giacomo A. Catenazzi

Dec 1, 2009, 12:00:02 PM

Thorsten Glaser wrote:
> Albert Cahalan dixit:
>
>> Unless plain "C" goes UTF-8
>
> Not going to happen, it’s not binary-safe. (I fought that in
> MirBSD with the OPTU-8/16 encoding scheme.)

Why not? Note that usual functions work on bytes, not on characters, and
in POSIX utilities the old/classical options work on bytes by default.
POSIX introduced new options for characters. E.g. the -c option of 'wc'
really means bytes, not characters (which are counted by -m). Not so
logical, but compatible with the expected old behaviour.
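The bytes-versus-characters distinction can be sketched as follows. The function names and the sample string are hypothetical; the sketch only mimics what 'wc -c' and 'wc -m' count and is not how wc is implemented:

```python
def count_bytes(data: bytes) -> int:
    """What 'wc -c' reports: the number of bytes."""
    return len(data)


def count_chars(data: bytes, encoding: str = "utf-8") -> int:
    """What 'wc -m' reports in a UTF-8 locale: the number of characters."""
    return len(data.decode(encoding))


# Two of the letters below need two bytes each in UTF-8.
sample = "na\u00efve caf\u00e9\n".encode("utf-8")

print(count_bytes(sample), count_chars(sample))  # prints "13 11"
```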

POSIX was discussing whether it is "legal" to have a UTF-8 POSIX/C locale.
IIRC the doubts were about the language in the standard, not about real
problems. OTOH they acknowledged that real bugs could appear.

OTOH I use a UTF-8 locale by default, because I don't expect that
Debian will corrupt my data. And I think system utilities will do
the right thing with the locale.

I am starting to think that moving C to UTF-8 would be the simplest and
fastest way to *hide* most of the encoding bugs.

ciao
cate

Thorsten Glaser

Dec 1, 2009, 1:40:01 PM

Giacomo A. Catenazzi dixit:

>> Not going to happen, it’s not binary-safe. (I fought that in
>> MirBSD with the OPTU-8/16 encoding scheme.)
>
> Why not? Note that usual functions work on bytes

Not really.

Running 'tr u x' on a binary file can trash it, depending on the
implementation of tr: if tr handles 'tr ¥ €' correctly in a UTF-8
locale, it must use mbsrtowcs, which POSIX requires to fail on
byte sequences that are not valid characters in the locale's encoding.
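The failure mode can be sketched like this. It is an analogy, not the actual tr code: Python's strict UTF-8 decoding stands in for mbsrtowcs, and the function names and sample data are hypothetical:

```python
def tr_bytes(data: bytes, src: bytes, dst: bytes) -> bytes:
    """Byte-oriented translation: safe for arbitrary binary input."""
    return data.translate(bytes.maketrans(src, dst))


def tr_chars(data: bytes, src: str, dst: str) -> bytes:
    """Character-oriented translation: must decode the input first."""
    # Strict decoding raises UnicodeDecodeError on invalid UTF-8,
    # much as mbsrtowcs reports an error for an invalid sequence.
    text = data.decode("utf-8")
    return text.translate(str.maketrans(src, dst)).encode("utf-8")


binary = b"\x00u\xff\xfeu"  # contains bytes that are never valid in UTF-8

print(tr_bytes(binary, b"u", b"x"))
print(tr_chars("\u00a51".encode("utf-8"), "\u00a5", "\u20ac").decode("utf-8"))

try:
    tr_chars(binary, "u", "x")
except UnicodeDecodeError as exc:
    print("character-based tr cannot process this input:", exc.reason)
```

The byte-oriented version handles the binary data but cannot translate multi-byte characters such as ¥ to €; the character-oriented version does the opposite.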

In MirBSD, we have solved that by clever use of the PUA.
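The message gives no details of the MirBSD scheme, so the following is only a related technique, not OPTU-8 itself: Python's "surrogateescape" error handler maps each undecodable byte to a reserved code point (a lone surrogate rather than a PUA character) and maps it back on encoding, so binary data survives a character-oriented pass. The function name and sample data are hypothetical:

```python
def tr_chars_lossless(data: bytes, src: str, dst: str) -> bytes:
    """Character-oriented translation that round-trips invalid bytes."""
    # Invalid bytes become reserved code points U+DC80..U+DCFF on decode
    # and are turned back into the original bytes on encode.
    text = data.decode("utf-8", errors="surrogateescape")
    translated = text.translate(str.maketrans(src, dst))
    return translated.encode("utf-8", errors="surrogateescape")


binary = b"\x00u\xff\xfeu" + "\u00a5".encode("utf-8")

print(tr_chars_lossless(binary, "u\u00a5", "x\u20ac"))
```

Both the ASCII replacement and the ¥ to € replacement are applied, while the invalid bytes pass through unchanged.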

//mirabilos