According to Vincent Lefevre on 11/14/2008 7:12 PM:
Hello Vincent, and thanks for the report.
> A number of utilities, including bison, m4 and findutils, install a
> charset.alias file under Mac OS X, and this breaks the locale system
> (any locale is regarded as UTF-8) [*]. The problem has been around
> for a couple of years.
You are better off filing this report against gettext, which is the source
of this file, and/or gnulib, which ships the latest version of this file
even in packages (such as m4) that have not yet undergone the I18n
conversion to use gettext. Feel free to drop everything besides
bug-gnulib and bug-gnu-gettext when replying.
At any rate, the charset.alias file has been shipped for years by gettext
and by any project that uses gettext, in order to accomplish message
translation. If this file interferes with correct locale behavior, then
that is a bug that should be fixed at the gettext source, rather than by
filing a bug report against every package that happens to use gettext or
gnulib when those packages have no control over the problem.
>
> Moreover, even if it didn't break anything, the file is overwritten
> when another such software is installed.
The installation process is supposed to modify, not overwrite, the
charset.alias file to append the names of additional packages that are
using it.
>
> [*]
> http://trac.macports.org/ticket/11474
> http://trac.macports.org/ticket/11968
> http://trac.macports.org/ticket/17084
>
--
Don't work too hard, make some time for fun as well!
Eric Blake eb...@byu.net
On 2008-11-14 21:11:56 -0700, Eric Blake wrote:
> You are better off filing this report against gettext, which is the
> source of this file, and/or gnulib, which ships the latest version
> of this file even in packages (such as m4) that have not yet
> undergone the I18n conversion to use gettext.
I've just reported a bug against gettext:
https://savannah.gnu.org/bugs/index.php?25235
> > Moreover, even if it didn't break anything, the file is overwritten
> > when another such software is installed.
>
> The installation process is supposed to modify, not overwrite, the
> charset.alias file to append the names of additional packages that are
> using it.
The problem is that MacPorts works by installing everything in
some private directory (hence different versions of the same port
can be "installed" at the same time); MacPorts then activates
some version of a port by adding hard links. This is a bit like
how GNU Stow works (though Stow uses symbolic links).
With a file that is shared by various software, this system leads
to conflicts (a file in standard directories can belong to only one
port). I wonder if there's a standard way to declare a file like
charset.alias as shared by various software.
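A minimal sketch of the conflict described above. The directory layout (private/<port>/lib and prefix/lib) and the activate function are made up for illustration; they are not MacPorts' real paths or its real activation code. The sketch only shows that a second hard link to the same destination name is refused, so a given file name in the shared prefix can belong to only one port:

```shell
#!/bin/sh
# Sketch only: hypothetical layout, not MacPorts' actual implementation.
tmp=$(mktemp -d) || exit 1
mkdir -p "$tmp/private/m4/lib" "$tmp/private/bison/lib" "$tmp/prefix/lib"
echo "# installed by m4" > "$tmp/private/m4/lib/charset.alias"
echo "# installed by bison" > "$tmp/private/bison/lib/charset.alias"

# "Activate" a port by hard-linking its private copy into the shared prefix.
# Plain ln (no -f) refuses to replace an existing destination.
activate() {
  ln "$tmp/private/$1/lib/charset.alias" "$tmp/prefix/lib/charset.alias" 2>/dev/null
}

if activate m4; then first=activated; else first=conflict; fi
if activate bison; then second=activated; else second=conflict; fi
owner=$(cat "$tmp/prefix/lib/charset.alias")

echo "m4: $first"
echo "bison: $second"
echo "file in prefix: $owner"
rm -rf "$tmp"
```

On a file system that supports hard links, the first activation should succeed and the second should report a conflict, with the file in the prefix still being the one that came from m4.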
I've seen that a charset.alias file is not installed by utilities
like m4 under GNU/Linux, so that the above problem cannot appear
on this system. If the correction of the gettext bug makes the
charset.alias file also disappear under Mac OS X, then everything
will be fine.
--
Vincent Lefèvre <vin...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)
You brought up two different issues:
1) The fact that config.charset for MacOS X recognizes only UTF-8 locales.
2) The conflicts when packaging software sees a charset.alias file that
belongs to different packages.
Ad 1)
Vincent Lefevre wrote:
> > You are better off filing this report against gettext, which is the
> > source of this file, and/or gnulib, which ships the latest version
> > of this file even in packages (such as m4) that have not yet
> > undergone the I18n conversion to use gettext.
>
> I've just reported a bug against gettext:
>
> https://savannah.gnu.org/bugs/index.php?25235
See my response there. In summary, locales with an encoding other than
UTF-8 are not supported by MacOS X because filenames MUST be in UTF-8 on
this platform.
The generated charset is user-editable, though. (That's the very reason
why charset.alias is a separate file.) You can edit it as you like.
Ad 2)
> > The installation process is supposed to modify, not overwrite, the
> > charset.alias file to append the names of additional packages that are
> > using it.
> ...
> With a file that is shared by various software, this system leads
> to conflicts (a file in standard directories can belong to only one
> port). I wonder if there's a standard way to declare a file like
> charset.alias as shared by various software.
>
> I've seen that a charset.alias file is not installed by utilities
> like m4 under GNU/Linux, so that the above problem cannot appear
> on this system. If the correction of the gettext bug makes the
> charset.alias file also disappear under Mac OS X, then everything
> will be fine.
I hardcoded the contents of charset.alias for glibc systems, so as to
avoid such conflicts for RPM and Debian packaging tools. Shall I do
the same for MacOS X? It will resolve the problem with the conflict,
but users will then not be able to customize the file.
Bruno
I've replied. I don't use non-ASCII characters in filenames, so
there's no problem with non-UTF-8 locales. Anyway, there would be the same
problems under Linux: try to re-read UTF-8 encoded filenames (e.g.
created by a graphical application or by another user) under non-UTF-8
locales...
> The generated charset is user-editable, though. (That's the very reason
> why charset.alias is a separate file.) You can edit it as you like.
But the default file makes utilities fail to give correct messages
under non-UTF-8 locales, whereas without this file, there are no
problems at all (whatever locales are used).
> I hardcoded the contents of charset.alias for glibc systems, so as to
> avoid such conflicts for RPM and Debian packaging tools. Shall I do
> the same for MacOS X? It will resolve the problem with the conflict,
> but users will then not be able to customize the file.
I don't see why they would need to customize it.
Your argument that people may want to use 'grep' in ISO-8859-1 encoded text
files is convincing. I'm applying this patch, to support non-UTF-8 locales
on MacOS X:
2009-01-24  Bruno Haible  <br...@clisp.org>

        Add support for non-UTF-8 locales on MacOS X.
        * lib/config.charset: Add CP1131, ARMSCII-8, PT154 to the list of
        canonical encodings. For Darwin 7 and newer, don't map traditional
        encodings to UTF-8.
        Reported by Vincent Lefevre <vin...@vinc17.org>
        at <http://savannah.gnu.org/bugs/?25235>.

--- lib/config.charset.orig 2009-01-25 00:48:05.000000000 +0100
+++ lib/config.charset 2009-01-25 00:43:46.000000000 +0100
@@ -1,7 +1,7 @@
#! /bin/sh
# Output a system dependent table of character encoding aliases.
#
-# Copyright (C) 2000-2004, 2006-2008 Free Software Foundation, Inc.
+# Copyright (C) 2000-2004, 2006-2009 Free Software Foundation, Inc.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -63,12 +63,13 @@
# CP922 aix
# CP932 aix woe32 dos
# CP943 aix
-# CP949 osf woe32 dos
+# CP949 osf darwin woe32 dos
# CP950 woe32 dos
# CP1046 aix
# CP1124 aix
# CP1125 dos
# CP1129 aix
+# CP1131 darwin
# CP1250 woe32
# CP1251 glibc solaris netbsd openbsd darwin woe32
# CP1252 aix woe32
@@ -82,15 +83,17 @@
# EUC-KR Y glibc aix hpux irix osf solaris freebsd netbsd darwin
# EUC-TW glibc aix hpux irix osf solaris netbsd
# BIG5 Y glibc aix hpux osf solaris freebsd netbsd darwin
-# BIG5-HKSCS glibc solaris
-# GBK glibc aix osf solaris woe32 dos
-# GB18030 glibc solaris netbsd
+# BIG5-HKSCS glibc solaris darwin
+# GBK glibc aix osf solaris darwin woe32 dos
+# GB18030 glibc solaris netbsd darwin
# SHIFT_JIS Y hpux osf solaris freebsd netbsd darwin
# JOHAB glibc solaris woe32
# TIS-620 glibc aix hpux osf solaris
# VISCII Y glibc
# TCVN5712-1 glibc
+# ARMSCII-8 glibc darwin
# GEORGIAN-PS glibc
+# PT154 glibc
# HP-ROMAN8 hpux
# HP-ARABIC8 hpux
# HP-GREEK8 hpux
@@ -449,7 +452,8 @@
echo "ko_KR.EUC EUC-KR"
;;
darwin*)
- # Darwin 7.5 has nl_langinfo(CODESET), but it is useless:
+ # Darwin 7.5 has nl_langinfo(CODESET), but sometimes its value is
+ # useless:
# - It returns the empty string when LANG is set to a locale of the
# form ll_CC, although ll_CC/LC_CTYPE is a symlink to an UTF-8
# LC_CTYPE file.
@@ -476,6 +480,36 @@
# minimize the use of decomposed Unicode. Unfortunately, through the
# Darwin file system, decomposed UTF-8 strings are leaked into user
# space nevertheless.
+ # Then there are also the locales with encodings other than US-ASCII
+ # and UTF-8. These locales can be occasionally useful to users (e.g.
+ # when grepping through ISO-8859-1 encoded text files), when all their
+ # file names are in US-ASCII.
+ echo "ISO8859-1 ISO-8859-1"
+ echo "ISO8859-2 ISO-8859-2"
+ echo "ISO8859-4 ISO-8859-4"
+ echo "ISO8859-5 ISO-8859-5"
+ echo "ISO8859-7 ISO-8859-7"
+ echo "ISO8859-9 ISO-8859-9"
+ echo "ISO8859-13 ISO-8859-13"
+ echo "ISO8859-15 ISO-8859-15"
+ echo "KOI8-R KOI8-R"
+ echo "KOI8-U KOI8-U"
+ echo "CP866 CP866"
+ echo "CP949 CP949"
+ echo "CP1131 CP1131"
+ echo "CP1251 CP1251"
+ echo "eucCN GB2312"
+ echo "GB2312 GB2312"
+ echo "eucJP EUC-JP"
+ echo "eucKR EUC-KR"
+ echo "Big5 BIG5"
+ echo "Big5HKSCS BIG5-HKSCS"
+ echo "GBK GBK"
+ echo "GB18030 GB18030"
+ echo "SJIS SHIFT_JIS"
+ echo "ARMSCII-8 ARMSCII-8"
+ echo "PT154 PT154"
+ #echo "ISCII-DEV ?"
echo "* UTF-8"
;;
beos* | haiku*)
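To illustrate what the last hunk changes, here is a small sketch of how a table of "alias canonical" lines can be consulted: the first line whose alias equals the codeset name reported by the system wins, and "*" matches anything. The function name lookup_charset and the two sample tables are made up for this illustration. The real lookup is done in C in lib/localcharset.c, and this shell version merely assumes first-match-wins semantics; the "before" table reflects the behavior reported in this thread (any locale regarded as UTF-8).

```shell
#!/bin/sh
# Sketch only: lookup_charset is a hypothetical helper, not part of gnulib.
# Usage: lookup_charset TABLE CODESET
# TABLE holds "alias canonical" lines; "*" is a catch-all alias.
lookup_charset() {
  while read -r from to; do
    case $from in
      '' | '#'*) continue ;;   # skip blank lines and comments
    esac
    if [ "$from" = "$2" ] || [ "$from" = "*" ]; then
      printf '%s\n' "$to"
      return 0
    fi
  done <<EOF
$1
EOF
  printf '%s\n' "$2"           # no match: keep the name unchanged
}

# Before the patch: every codeset name is mapped to UTF-8.
old_table='* UTF-8'

# After the patch (excerpt): traditional encodings come before the catch-all.
new_table='ISO8859-1 ISO-8859-1
eucJP EUC-JP
SJIS SHIFT_JIS
* UTF-8'

lookup_charset "$old_table" ISO8859-1      # catch-all applies
lookup_charset "$new_table" ISO8859-1      # specific entry applies
lookup_charset "$new_table" eucJP
lookup_charset "$new_table" SomethingElse  # falls through to the catch-all
```

Because the catch-all stays on the last line, names that are not listed still end up as UTF-8, while the listed traditional encodings keep their own canonical names.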
OK. Then a hardcoded aliases list will do. I'm applying this.
2009-01-25  Bruno Haible  <br...@clisp.org>

        Don't install charset.alias on MacOS X >= 10.3.
        * lib/localcharset.c (DARWIN7): New macro.
        (get_charset_aliases): Hardcode the result for Darwin7.
        * modules/localcharset (install-exec-local): Don't install
        charset.alias on MacOS X >= 10.3, if the file does not yet exist.

--- lib/localcharset.c.orig 2009-01-25 20:13:24.000000000 +0100
+++ lib/localcharset.c 2009-01-25 18:54:07.000000000 +0100
@@ -1,6 +1,6 @@
/* Determine a canonical name for the current locale's character encoding.
- Copyright (C) 2000-2006, 2008 Free Software Foundation, Inc.
+ Copyright (C) 2000-2006, 2008-2009 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -28,6 +28,10 @@
#include <string.h>
#include <stdlib.h>
+#if defined __APPLE__ && defined __MACH__ && HAVE_LANGINFO_CODESET
+# define DARWIN7 /* Darwin 7 or newer, i.e. MacOS X 10.3 or newer */
+#endif
+
#if defined _WIN32 || defined __WIN32__
# define WIN32_NATIVE
#endif
@@ -112,7 +116,7 @@
cp = charset_aliases;
if (cp == NULL)
{
-#if !(defined VMS || defined WIN32_NATIVE || defined __CYGWIN__)
+#if !(defined DARWIN7 || defined VMS || defined WIN32_NATIVE || defined __CYGWIN__)
FILE *fp;
const char *dir;
const char *base = "charset.alias";
@@ -213,6 +217,39 @@
#else
+# if defined DARWIN7
+ /* To avoid the trouble of installing a file that is shared by many
+ GNU packages -- many packaging systems have problems with this --,
+ simply inline the aliases here. */
+ cp = "ISO8859-1" "\0" "ISO-8859-1" "\0"
+ "ISO8859-2" "\0" "ISO-8859-2" "\0"
+ "ISO8859-4" "\0" "ISO-8859-4" "\0"
+ "ISO8859-5" "\0" "ISO-8859-5" "\0"
+ "ISO8859-7" "\0" "ISO-8859-7" "\0"
+ "ISO8859-9" "\0" "ISO-8859-9" "\0"
+ "ISO8859-13" "\0" "ISO-8859-13" "\0"
+ "ISO8859-15" "\0" "ISO-8859-15" "\0"
+ "KOI8-R" "\0" "KOI8-R" "\0"
+ "KOI8-U" "\0" "KOI8-U" "\0"
+ "CP866" "\0" "CP866" "\0"
+ "CP949" "\0" "CP949" "\0"
+ "CP1131" "\0" "CP1131" "\0"
+ "CP1251" "\0" "CP1251" "\0"
+ "eucCN" "\0" "GB2312" "\0"
+ "GB2312" "\0" "GB2312" "\0"
+ "eucJP" "\0" "EUC-JP" "\0"
+ "eucKR" "\0" "EUC-KR" "\0"
+ "Big5" "\0" "BIG5" "\0"
+ "Big5HKSCS" "\0" "BIG5-HKSCS" "\0"
+ "GBK" "\0" "GBK" "\0"
+ "GB18030" "\0" "GB18030" "\0"
+ "SJIS" "\0" "SHIFT_JIS" "\0"
+ "ARMSCII-8" "\0" "ARMSCII-8" "\0"
+ "PT154" "\0" "PT154" "\0"
+ /*"ISCII-DEV" "\0" "?" "\0"*/
+ "*" "\0" "UTF-8" "\0";
+# endif
+
# if defined VMS
/* To avoid the troubles of an extra file charset.alias_vms in the
sources of many GNU packages, simply inline the aliases here. */
--- modules/localcharset.orig 2009-01-25 20:13:25.000000000 +0100
+++ modules/localcharset 2009-01-25 18:45:42.000000000 +0100
@@ -42,7 +42,9 @@
install-exec-local: all-local
if test $(GLIBC21) = no; then \
case '$(host_os)' in \
- cygwin* | mingw* | pw32* | cegcc*) \
+ darwin[56]*) \
+ need_charset_alias=true ;; \
+ darwin* | cygwin* | mingw* | pw32* | cegcc*) \
need_charset_alias=false ;; \
*) \
need_charset_alias=true ;; \
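The order of the case patterns in the hunk above matters: the shell takes the first branch that matches, so darwin[56]* has to come before the broader darwin*. A minimal sketch, using a hypothetical function name and leaving out the surrounding "test $(GLIBC21) = no" check from the real rule:

```shell
#!/bin/sh
# Sketch only: needs_charset_alias is a hypothetical wrapper around the
# case statement from the patched install-exec-local rule. The outer
# GLIBC21 check of the real rule is deliberately omitted here.
needs_charset_alias() {
  case "$1" in
    darwin[56]*)
      echo true ;;    # Darwin 5 and 6: the file is still wanted
    darwin* | cygwin* | mingw* | pw32* | cegcc*)
      echo false ;;   # Darwin 7 and newer, and the Windows ports
    *)
      echo true ;;    # everything else not handled by the outer check
  esac
}

for host_os in darwin6.8 darwin7.9.0 darwin9.6.0 cygwin freebsd7.0; do
  printf '%s -> %s\n' "$host_os" "$(needs_charset_alias "$host_os")"
done
```

If the two darwin branches were swapped, darwin* would match first and the darwin[56]* branch would never be reached.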