The output of locale is:
$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
What can I do to fix this problem please?
Syd
OK. Stop right there. *WHY* are you using a 6 year old operating
system for anything you care about? Seriously, if at all possible,
update to CentOS 4.7 at a minimum, preferably 5.3. You'll get much
better international language support.
> However there are some special characters (u with 2 dots overhead,
> for example) in the data which appear as ? in the Linux file created.
> I am told the database uses CP 1252, which means the u with 2 dots
> overhead is character 252.
>
> The output of locale is:
> $ locale
> LANG=C
> LC_CTYPE="C"
> LC_NUMERIC="C"
> LC_TIME="C"
> LC_COLLATE="C"
> LC_MONETARY="C"
> LC_MESSAGES="C"
> LC_PAPER="C"
> LC_NAME="C"
> LC_ADDRESS="C"
> LC_TELEPHONE="C"
> LC_MEASUREMENT="C"
> LC_IDENTIFICATION="C"
> LC_ALL=
>
> What can I do to fix this problem please?
Well, it depends. The strings you are handling are not 7-bit ASCII
text, which is what the 'C' locale is generally for; they're
effectively binary data. Treat them as such. If you need them to be
visible, consider setting your LANG and other settings to German or
whatever language with umlauts they were originally written in.
What are you passing this data to? Is it possible that your viewer for
the Linux text file is simply mishandling the generated non-English
character set?
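Something along these lines, as a rough sketch (assuming a German
latin1 locale such as de_DE.ISO-8859-1 is compiled on the box;
'your_export_command' is only a placeholder for whatever writes the
Linux file):

$ locale -a | grep '^de_DE'      # list the compiled German locales, if any
$ LANG=de_DE.ISO-8859-1 LC_ALL=de_DE.ISO-8859-1 your_export_command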
> *WHY* are you using a 6 year old operating system?
Well, it is a 5-year-old install, and only now are we being fed data
with "odd" characters.
There is a new 5.3 platform coming soon, but the old system will be
around until next year at least.
Based on the example I mentioned, using LANG=de would be a possible
solution.
But we are seeing French, Spanish and German "special" characters
which are supported by MS's CP 1252.
Any other ideas?
TIA
Syd
> Based on the example I mentioned, using LANG=de would be
> a possible solution.
No, the default CTYPE for de is ISO-8859-1.
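A quick way to check that, assuming a plain de_DE locale is compiled
on the box:

$ LANG=de_DE locale charmap      # should print ISO-8859-1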
> But we are seeing French, Spanish and German "special"
> characters which are supported by MS's CP 1252.
Check if your libc supports CP1252 :
$ locale -m | grep '^CP'
CP10007
CP1125
CP1250
CP1251
CP1252
CP1253
CP1254
CP1255
CP1256
CP1257
CP1258
CP737
CP775
CP949
If it does: LANG=en_US.CP1252
Of course, you can replace "en_US" with whatever you prefer.
The important part here is the ".CP1252", which defines the
locale's character set (and encoding). This is independent
from language (the "en") and region (the "_US").
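If en_US.CP1252 is not already compiled on the box, it can normally
be built from the charmap with localedef (a sketch, assuming your
glibc ships the CP1252 charmap and the en_US locale source, and that
you can run this as root):

$ localedef -i en_US -f CP1252 en_US.CP1252
$ LANG=en_US.CP1252 locale charmap      # should now print CP1252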
Exactly. Amongst those ‘other things’ are the frequently used
quotation marks (U+2018..U+201F) :
→ printf '“„”\n' | iconv -tlatin1 | iconv -flatin1
iconv: illegal input sequence at position 0
→ printf '“„”\n' | iconv -tcp1252 | iconv -fcp1252
“„”
>> → printf '“„”\n' | iconv -tlatin1 | iconv -flatin1
>> iconv: illegal input sequence at position 0
>> → printf '“„”\n' | iconv -tcp1252 | iconv -fcp1252
>> “„”
>
> Not quite sure how you did the printf above tho.
The three quotes above are actually encoded in UTF-8,
because that is what my terminal understands.
The first iconv on the second printf line converts from
UTF-8 (my default in LANG) to CP1252 and doesn't
report an error, meaning that those characters are
valid in CP1252 encoding. The second iconv does the
inverse : translate from CP1252 to UTF-8, and the
result is the original string.
The first printf passes the same UTF-8 encoded quotes
to iconv, but asks to convert to latin1 (ISO-8859-1), and
this time iconv says "illegal input sequence", because
these quotes do not exist in latin1.
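The same kind of round trip can be done with the characters from the
original post; ü, ë and ñ all exist in CP1252, so no error is
expected (a sketch, typed on a UTF-8 terminal):

$ printf 'ü ë ñ\n' | iconv -f UTF-8 -t CP1252 | iconv -f CP1252 -t UTF-8
ü ë ñ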
> And not quite sure what I should set to say LANG
> and LC_ALL to en_us first and check that out?
Try,
LANG=en_US.CP1252 locale
LANG=en_US.ISO-8859-15 locale
LANG=en_US.ISO-8859-1 locale
LANG=en_US.UTF-8 locale
and see if any of these does *not* produce an error
like this:
$ LANG=en_US.FOO locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
Obviously, character encoding FOO doesn't exist.
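It also helps to see which en_US locales are actually compiled on
the box; 'locale -a' lists every compiled locale:

$ locale -a | grep -i '^en_US'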
> I did not originally set up the box (actually there are 6 or 8
> of them) but I think that LANG=C was done cos there was
> a problem with LANG=en_US.
Anything is possible, but centos 3.8 isn't that old.
In your OP you write :
« However there are some special characters (u with 2 dots
» overhead, for example) in the data which appear as ? in
» the linux file created. »
Is that a normal question mark, or is it inverse (white in
a black hexagon or square), like this : �
In the latter case, all you would have to do is convert the
output from the db application with 'iconv -fcp1252 -tutf8'.
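For a whole file that would be something like this (just a sketch;
'dbdump.txt' is a placeholder name, and iconv writes to stdout, so
redirect into a new file):

$ iconv -f CP1252 -t UTF-8 dbdump.txt > dbdump.utf8.txt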
It is a normal question mark.
I entered the commands as suggested
> LANG=en_US.CP1252 locale -> Bad
> LANG=en_US.ISO-8859-15 locale -> Good
> LANG=en_US.ISO-8859-1 locale -> Good
> LANG=en_US.UTF-8 locale -> Good
++++
$ LANG=en_US.CP1252 locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.CP1252
LC_CTYPE="en_US.CP1252"
LC_NUMERIC="en_US.CP1252"
LC_TIME="en_US.CP1252"
LC_COLLATE="en_US.CP1252"
LC_MONETARY="en_US.CP1252"
LC_MESSAGES="en_US.CP1252"
LC_PAPER="en_US.CP1252"
LC_NAME="en_US.CP1252"
LC_ADDRESS="en_US.CP1252"
LC_TELEPHONE="en_US.CP1252"
LC_MEASUREMENT="en_US.CP1252"
LC_IDENTIFICATION="en_US.CP1252"
LC_ALL=
$ LANG=en_US.ISO-8859-15 locale
LANG=en_US.ISO-8859-15
LC_CTYPE="en_US.ISO-8859-15"
LC_NUMERIC="en_US.ISO-8859-15"
LC_TIME="en_US.ISO-8859-15"
LC_COLLATE="en_US.ISO-8859-15"
LC_MONETARY="en_US.ISO-8859-15"
LC_MESSAGES="en_US.ISO-8859-15"
LC_PAPER="en_US.ISO-8859-15"
LC_NAME="en_US.ISO-8859-15"
LC_ADDRESS="en_US.ISO-8859-15"
LC_TELEPHONE="en_US.ISO-8859-15"
LC_MEASUREMENT="en_US.ISO-8859-15"
LC_IDENTIFICATION="en_US.ISO-8859-15"
LC_ALL=
$ LANG=en_US.ISO-8859-1 locale
LANG=en_US.ISO-8859-1
LC_CTYPE="en_US.ISO-8859-1"
LC_NUMERIC="en_US.ISO-8859-1"
LC_TIME="en_US.ISO-8859-1"
LC_COLLATE="en_US.ISO-8859-1"
LC_MONETARY="en_US.ISO-8859-1"
LC_MESSAGES="en_US.ISO-8859-1"
LC_PAPER="en_US.ISO-8859-1"
LC_NAME="en_US.ISO-8859-1"
LC_ADDRESS="en_US.ISO-8859-1"
LC_TELEPHONE="en_US.ISO-8859-1"
LC_MEASUREMENT="en_US.ISO-8859-1"
LC_IDENTIFICATION="en_US.ISO-8859-1"
LC_ALL=
$ LANG=en_US.UTF-8 locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
But I am not sure how to proceed. I have this from CP1252 "ë 00EB 235"
which I want to handle in centos 3.8.
And the glibc supports CP1252
$ locale -m | grep '^CP'
...
CP1252
But "LANG=en_US.CP1252 locale" does not work.
But with LANG=C, which I thought was only 7 bits, the following
printfs work just fine.
$ printf "(octal 353) is the character \0353\n"
(octal 353) is the character ë
printf "(octal 361) is the character \0361\n"
(octal 361) is the character ñ
These are two of the characters in the MSSQL db which the application
(not open source) handles as "?".
Puzzled now!
Please help!
> I entered the commands as suggested
>> LANG=en_US.CP1252 locale -> Bad
>> LANG=en_US.ISO-8859-15 locale -> Good
>> LANG=en_US.ISO-8859-1 locale -> Good
>> LANG=en_US.UTF-8 locale -> Good
Then run your application with latin9 :
LANG=en_US.ISO-8859-15 application ...
It should no longer convert the ‘special characters’ to
question marks. Simple test :
LANG=en_US.ISO-8859-15 tr -d '\000-\177' <file | od -b
> But with LANG=C which I thought was only 7 bits the following
> printfs work just fine.
>
> $ printf "(octal 353) is the character \0353\n"
> (octal 353) is the character ë
> printf "(octal 361) is the character \0361\n"
> (octal 361) is the character ñ
Good, the typeface (font) has the characters you need.
> These are two of the characters in the MSSQL db which the
> application (not open source) handles as "?".
Check if the application is really the cause of the problem.
For file 'foo', generated by application, run :
LANG=en_US.ISO-8859-15 tr -d '\000-\177' <foo | od -b
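To see what the pipeline prints when non-ASCII bytes *are* present,
you can feed it a small test file first (a sketch; \353 and \361 are
the ë and ñ bytes in latin1/latin9/CP1252):

$ printf 'abc\353def\361\n' > testfile
$ LANG=en_US.ISO-8859-15 tr -d '\000-\177' <testfile | od -b
0000000 353 361
0000002

If 'foo' gives only the offset line (0000000) and nothing else, it
contains no bytes above octal 0177 at all.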
$ LANG=en_US.ISO-8859-15 tr -d '\000-\177' <foo | od -b
0000000
not quite sure what this does ;-)
And on a Centos 5.3 box I just invoked
$ printf "(octal 353) is the character \0353\n"
(octal 353) is the character 3
on the centos 3.8 box I got the expected output of ë
>> $ od -b foo
>> 0000000 101 077 117 012
>> 0000004
>> octal 101 = A
>> octal 077 = ?
>> octal 117 = O
>> middle char should be capital ñ
Ok, the problem is caused by your application.
Try running it with latin9 ctype :
LANG=en_US.ISO-8859-15 application ...
>> $ LANG=en_US.ISO-8859-15 tr -d '\000-\177' <foo | od -b
>> 0000000
>> not quite sure what this does ;-)
> Ah yes I am - it deletes all "normal" chars and passes
> the remainder to od...
Yes, I thought 'foo' might be a big file, with mostly us-ascii.
In a file with 10000 ascii characters and only 10 non-ascii
the output of od (without the tr filter) might be a bit of a
challenge. ;-)
> not quite sure what all zeros as the output means tho...
Od starts each line with an address, the offset of the first
byte in that line. The first line starts at offset 0, unless you
invoke od with the -j option.
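A small demonstration (a sketch; od prints 16 bytes per line, and
the addresses are octal, so the second line starts at 0000020):

$ printf '%020d\n' 0 | od -b
0000000 060 060 060 060 060 060 060 060 060 060 060 060 060 060 060 060
0000020 060 060 060 060 012
0000025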
> And on a Centos 5.3 box I just invoked
> $ printf "(octal 353) is the character \0353\n"
> (octal 353) is the character 3
Yes, that is what POSIX printf is required to do. From
http://www.opengroup.org/onlinepubs/9699919799/utilities/printf.html
« [...] "\ddd", where ddd is a one, two, or three-digit octal
» number, shall be written as a byte with the numeric value
» specified by the octal number. »
In the printf above, "\035" is an escape sequence, and the
following "3" is a normal digit. To write octal 353, the
printf format string should be '(octal 353) ... \353\n'.
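So the portable way to write that earlier test is with a three-digit
octal escape and no leading zero; on a latin1/latin9/CP1252 terminal
the 0353 byte then shows up as ë:

$ printf '(octal 353) is the character \353\n'
(octal 353) is the character ë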
> on the centos 3.8 box I got the expected output of ë
Are you using zsh on the centos 3.8 box?
The zsh built-in printf expects octal escape sequences to
start with '\0' followed by zero, one, three or four octal
digits.
posix: \353 => zsh: \0353
posix: \75 => zsh: \075
> $ printf "(hex EB) is the character \xEB\n"
> (hex EB) is the character ë
> this works on 3.8 and 5.3
This is a nice one to try. It shows the terminal mapping:
printf '\xa4 \x80\n'
To understand that:
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/ISO-8859-15.gz
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/ISO-8859-1.gz
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/CP1252.gz
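For reference: 0xA4 is the currency sign (¤) in ISO-8859-1 but the
euro sign (€) in ISO-8859-15, and 0x80 is € in CP1252 while it is a
C1 control code in both ISO sets. On a UTF-8 terminal you can check
the same thing with iconv:

$ printf '\xa4' | iconv -f ISO-8859-1 -t UTF-8     # ¤ (U+00A4)
$ printf '\xa4' | iconv -f ISO-8859-15 -t UTF-8    # € (U+20AC)
$ printf '\x80' | iconv -f CP1252 -t UTF-8         # € (U+20AC)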
Thanks very much for your input.
Syd