The output of locale is:
$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
What can I do to fix this problem please?
Syd
OK. Stop right there. *WHY* are you using a 6 year old operating
system for anything you care about? Seriously, if at all possible,
update to CentOS 4.7 at a minimum, preferably 5.3. You'll get much
better international language support.
> However there are some special characters (u with 2 dots overhead,
> for example) in the data which appear as ? in the Linux file created.
> I am told the database uses CP 1252, which means the u with 2 dots
> overhead is character 252.
>
> The output of locale is:
> $ locale
> LANG=C
> LC_CTYPE="C"
> LC_NUMERIC="C"
> LC_TIME="C"
> LC_COLLATE="C"
> LC_MONETARY="C"
> LC_MESSAGES="C"
> LC_PAPER="C"
> LC_NAME="C"
> LC_ADDRESS="C"
> LC_TELEPHONE="C"
> LC_MEASUREMENT="C"
> LC_IDENTIFICATION="C"
> LC_ALL=
>
> What can I do to fix this problem please?
Well, it depends. The strings you are handling are not 7-bit ASCII
text, which is what the 'C' locale is generally for; they're
effectively binary data. Treat them as such. If you need them to be
visible, consider setting your LANG and other settings to German or
whatever language with umlauts they were originally written in.
What are you passing this data to? Is it possible that your viewer for
the Linux text file is simply mishandling the generated non-English
character set?
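Something along these lines, as a rough sketch (assuming a German
latin1 locale such as de_DE.ISO-8859-1 is compiled on the box;
'your_export_command' is only a placeholder for whatever writes the
Linux file):

$ locale -a | grep '^de_DE'      # list the compiled German locales, if any
$ LANG=de_DE.ISO-8859-1 LC_ALL=de_DE.ISO-8859-1 your_export_command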
> *WHY* are you using a 6 year old operating system?
Well, it is a 5-year-old install, and only now are we being fed data
with "odd" characters.
There is a new 5.3 platform coming soon, but the old system will be
around until next year at least.
Based on the example I mentioned, using LANG=de would be a possible
solution.
But we are seeing French, Spanish and German "special" characters
which are supported by MS's CP 1252.
Any other ideas?
TIA
Syd
> Based on the example I mentioned, using LANG=de would be
> a possible solution.
No, the default CTYPE for de is ISO-8859-1.
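A quick way to check that, assuming a plain de_DE locale is compiled
on the box:

$ LANG=de_DE locale charmap      # should print ISO-8859-1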
> But we are seeing French, Spanish and German "special"
> characters which are supported by MS's CP 1252.
Check if your libc supports CP1252 :
$ locale -m | grep '^CP'
CP10007
CP1125
CP1250
CP1251
CP1252
CP1253
CP1254
CP1255
CP1256
CP1257
CP1258
CP737
CP775
CP949
If it does: LANG=en_US.CP1252
Of course, you can replace "en_US" with whatever you prefer.
The important part here is the ".CP1252", which defines the
locale's character set (and encoding). This is independent
from language (the "en") and region (the "_US").
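If en_US.CP1252 is not already compiled on the box, it can normally
be built from the charmap with localedef (a sketch, assuming your
glibc ships the CP1252 charmap and the en_US locale source, and that
you can run this as root):

$ localedef -i en_US -f CP1252 en_US.CP1252
$ LANG=en_US.CP1252 locale charmap      # should now print CP1252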
Exactly. Amongst those ‘other things’ are the frequently used
quotation marks (U+2018..U+201F) :
→ printf '“„”\n' | iconv -tlatin1 | iconv -flatin1
iconv: illegal input sequence at position 0
→ printf '“„”\n' | iconv -tcp1252 | iconv -fcp1252
“„”
>> → printf '“„”\n' | iconv -tlatin1 | iconv -flatin1
>> iconv: illegal input sequence at position 0
>> → printf '“„”\n' | iconv -tcp1252 | iconv -fcp1252
>> “„”
>
> Not quite sure how you did the printf above tho.
The three quotes above are actually encoded in UTF-8,
because that is what my terminal understands.
The first iconv on the second printf line converts from
UTF-8 (my default in LANG) to CP1252 and doesn't
report an error, meaning that those characters are
valid in CP1252 encoding. The second iconv does the
inverse : translate from CP1252 to UTF-8, and the
result is the original string.
The first printf passes the same UTF-8 encoded quotes
to iconv, but asks to convert to latin1 (ISO-8859-1), and
this time iconv says "illegal input sequence", because
these quotes do not exist in latin1.
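The same kind of round trip can be done with the characters from the
original post; ü, ë and ñ all exist in CP1252, so no error is
expected (a sketch, typed on a UTF-8 terminal):

$ printf 'ü ë ñ\n' | iconv -f UTF-8 -t CP1252 | iconv -f CP1252 -t UTF-8
ü ë ñ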
> And not quite sure what I should set to say LANG
> and LC_ALL to en_us first and check that out?
Try,
LANG=en_US.CP1252 locale
LANG=en_US.ISO-8859-15 locale
LANG=en_US.ISO-8859-1 locale
LANG=en_US.UTF-8 locale
and see if any of these does *not* produce an error
like this:
$ LANG=en_US.FOO locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
Obviously, character encoding FOO doesn't exist.
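It also helps to see which en_US locales are actually compiled on
the box; 'locale -a' lists every compiled locale:

$ locale -a | grep -i '^en_US'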
> I did not originally set up the box (actually there are 6 or 8
> of them) but I think that LANG=C was done cos there was
> a problem with LANG=en_US.
Anything is possible, but centos 3.8 isn't that old.
In your OP you write :
« However there are some special characters (u with 2 dots
» overhead, for example) in the data which appear as ? in
» the linux file created. »
Is that a normal question mark, or is it inverse (white in
a black hexagon or square), like this : �
In the latter case, all you would have to do is convert the
output from the db application with 'iconv -fcp1252 -tutf8'.
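For a whole file that would be something like this (just a sketch;
'dbdump.txt' is a placeholder name, and iconv writes to stdout, so
redirect into a new file):

$ iconv -f CP1252 -t UTF-8 dbdump.txt > dbdump.utf8.txt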
It is a normal question mark.
I entered the commands as suggested
> LANG=en_US.CP1252 locale -> Bad
> LANG=en_US.ISO-8859-15 locale -> Good
> LANG=en_US.ISO-8859-1 locale -> Good
> LANG=en_US.UTF-8 locale -> Good
++++
$ LANG=en_US.CP1252 locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.CP1252
LC_CTYPE="en_US.CP1252"
LC_NUMERIC="en_US.CP1252"
LC_TIME="en_US.CP1252"
LC_COLLATE="en_US.CP1252"
LC_MONETARY="en_US.CP1252"
LC_MESSAGES="en_US.CP1252"
LC_PAPER="en_US.CP1252"
LC_NAME="en_US.CP1252"
LC_ADDRESS="en_US.CP1252"
LC_TELEPHONE="en_US.CP1252"
LC_MEASUREMENT="en_US.CP1252"
LC_IDENTIFICATION="en_US.CP1252"
LC_ALL=
$ LANG=en_US.ISO-8859-15 locale
LANG=en_US.ISO-8859-15
LC_CTYPE="en_US.ISO-8859-15"
LC_NUMERIC="en_US.ISO-8859-15"
LC_TIME="en_US.ISO-8859-15"
LC_COLLATE="en_US.ISO-8859-15"
LC_MONETARY="en_US.ISO-8859-15"
LC_MESSAGES="en_US.ISO-8859-15"
LC_PAPER="en_US.ISO-8859-15"
LC_NAME="en_US.ISO-8859-15"
LC_ADDRESS="en_US.ISO-8859-15"
LC_TELEPHONE="en_US.ISO-8859-15"
LC_MEASUREMENT="en_US.ISO-8859-15"
LC_IDENTIFICATION="en_US.ISO-8859-15"
LC_ALL=
$ LANG=en_US.ISO-8859-1 locale
LANG=en_US.ISO-8859-1
LC_CTYPE="en_US.ISO-8859-1"
LC_NUMERIC="en_US.ISO-8859-1"
LC_TIME="en_US.ISO-8859-1"
LC_COLLATE="en_US.ISO-8859-1"
LC_MONETARY="en_US.ISO-8859-1"
LC_MESSAGES="en_US.ISO-8859-1"
LC_PAPER="en_US.ISO-8859-1"
LC_NAME="en_US.ISO-8859-1"
LC_ADDRESS="en_US.ISO-8859-1"
LC_TELEPHONE="en_US.ISO-8859-1"
LC_MEASUREMENT="en_US.ISO-8859-1"
LC_IDENTIFICATION="en_US.ISO-8859-1"
LC_ALL=
$ LANG=en_US.UTF-8 locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
But I am not sure how to proceed. I have this from CP1252 "ë 00EB 235"
which I want to handle in centos 3.8.
And the glibc supports CP1252
$ locale -m | grep '^CP'
...
CP1252
But "LANG=en_US.CP1252 locale" does not work.
But with LANG=C, which I thought was only 7 bits, the following
printfs work just fine.
$ printf "(octal 353) is the character \0353\n"
(octal 353) is the character ë
printf "(octal 361) is the character \0361\n"
(octal 361) is the character ñ
These are two of the characters in the MSSQL db which the application
(not open source) handles as "?".
Puzzled now!
Please help!
> I entered the commands as suggested
>> LANG=en_US.CP1252 locale -> Bad
>> LANG=en_US.ISO-8859-15 locale -> Good
>> LANG=en_US.ISO-8859-1 locale -> Good
>> LANG=en_US.UTF-8 locale -> Good
Then run your application with latin9 :
LANG=en_US.ISO-8859-15 application ...
It should no longer convert the ‘special characters’ to
question marks. Simple test :
LANG=en_US.ISO-8859-15 tr -d '\000-\177' <file | od -b
> But with LANG=C which I thought was only 7 bits the following
> printfs work just fine.
>
> $ printf "(octal 353) is the character \0353\n"
> (octal 353) is the character ë
> printf "(octal 361) is the character \0361\n"
> (octal 361) is the character ñ
Good, the typeface (font) has the characters you need.
> These are two of the characters in the MSSQL db which the
> application (not open source) handles as "?".
Check if the application is really the cause of the problem.
For file 'foo', generated by application, run :
LANG=en_US.ISO-8859-15 tr -d '\000-\177' <foo | od -b
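To see what the pipeline prints when non-ASCII bytes *are* present,
you can feed it a small test file first (a sketch; \353 and \361 are
the ë and ñ bytes in latin1/latin9/CP1252):

$ printf 'abc\353def\361\n' > testfile
$ LANG=en_US.ISO-8859-15 tr -d '\000-\177' <testfile | od -b
0000000 353 361
0000002

If 'foo' gives only the offset line (0000000) and nothing else, it
contains no bytes above octal 0177 at all.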
$ LANG=en_US.ISO-8859-15 tr -d '\000-\177' <foo | od -b
0000000
not quite sure what this does ;-)
And on a Centos 5.3 box I just invoked
$ printf "(octal 353) is the character \0353\n"
(octal 353) is the character 3
on the centos 3.8 box I got the expected output of ë
>> $ od -b foo
>> 0000000 101 077 117 012
>> 0000004
>> octal 101 = A
>> octal 077 = ?
>> octal 117 = O
>> middle char should be capital ñ
Ok, the problem is caused by your application.
Try running it with latin9 ctype :
LANG=en_US.ISO-8859-15 application ...
>> $ LANG=en_US.ISO-8859-15 tr -d '\000-\177' <foo | od -b
>> 0000000
>> not quite sure what this does ;-)
> Ah yes I am - it deletes all "normal" chars and passes
> the remainder to od...
Yes, I thought 'foo' might be a big file, with mostly us-ascii.
In a file with 10000 ascii characters and only 10 non-ascii
the output of od (without the tr filter) might be a bit of a
challenge. ;-)
> not quite sure what all zeros as the output means tho...
Od starts each line with an address, the offset of the first
byte in that line. The first line starts at offset 0, unless you
invoke od with the -j option.
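A small demonstration (a sketch; od prints 16 bytes per line, and
the addresses are octal, so the second line starts at 0000020):

$ printf '%020d\n' 0 | od -b
0000000 060 060 060 060 060 060 060 060 060 060 060 060 060 060 060 060
0000020 060 060 060 060 012
0000025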
> And on a Centos 5.3 box I just invoked
> $ printf "(octal 353) is the character \0353\n"
> (octal 353) is the character 3
Yes, that is what POSIX printf is required to do. From
http://www.opengroup.org/onlinepubs/9699919799/utilities/printf.html
« [...] "\ddd", where ddd is a one, two, or three-digit octal
» number, shall be written as a byte with the numeric value
» specified by the octal number. »
In the printf above, "\035" is an escape sequence, and the
following "3" is a normal digit. To write octal 353, the
printf format string should be '(octal 353) ... \353\n'.
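So the portable way to write that earlier test is with a three-digit
octal escape and no leading zero; on a latin1/latin9/CP1252 terminal
the 0353 byte then shows up as ë:

$ printf '(octal 353) is the character \353\n'
(octal 353) is the character ë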
> on the centos 3.8 box I got the expected output of ë
Are you using zsh on the centos 3.8 box?
The zsh built-in printf expects octal escape sequences to
start with '\0' followed by zero, one, three or four octal
digits.
posix: \353 => zsh: \0353
posix: \75 => zsh: \075
> $ printf "(hex EB) is the character \xEB\n"
> (hex EB) is the character ë
> this works on 3.8 and 5.3
This is a nice one to try. It shows the terminal mapping:
printf '\xa4 \x80\n'
To understand that:
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/ISO-8859-15.gz
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/ISO-8859-1.gz
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/CP1252.gz
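For reference: 0xA4 is the currency sign (¤) in ISO-8859-1 but the
euro sign (€) in ISO-8859-15, and 0x80 is € in CP1252 while it is a
C1 control code in both ISO sets. On a UTF-8 terminal you can check
the same thing with iconv:

$ printf '\xa4' | iconv -f ISO-8859-1 -t UTF-8     # ¤ (U+00A4)
$ printf '\xa4' | iconv -f ISO-8859-15 -t UTF-8    # € (U+20AC)
$ printf '\x80' | iconv -f CP1252 -t UTF-8         # € (U+20AC)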
Thanks very much for your input.
Syd