Unicode Character not displaying on Dialog Screen

Tony C.

Nov 25, 2009, 12:27:39 AM

Hi All,

I'm trying to display a degree symbol (character 176).
I've converted my app to Unicode and can get the TCHAR into a
CStringW; I can see the character in the variable when I debug it.

But when I go to display it on the screen, I get a garbage character
(a square or something).

I'm using an Arial font (the Arial font works in VB to display
character 176). Here is some code:

===========================================

strDegr.Format( _T("%d"), CD.m_intNormDegree);


TCHAR ds = 176;
CStringW strDegSymbol(ds);

strDegr = strDegr + L" " + strDegSymbol;

dlgDC->TextOutW(ptY.x, ptY.y, strDegr );
============================================

any ideas why it might be printing garbage instead of a "degree"
symbol?


Thanks in advance...
Tony C.

Tony C.

Nov 25, 2009, 1:01:03 AM

Oops!
Cancel that!!

I just ran across one of Newcomer's posts from 2005...

I forgot to select the new font into my DC...
I didn't realize I wasn't using the right font.

and that did the trick..!!

Tony C.

Mikel

Nov 25, 2009, 4:25:54 AM

Hi, I see you've already got a solution.
But I would like to point out that you should use WCHAR with CStringW,
or use TCHAR and CString, but mixing both can lead to problems.

Giovanni Dicanio

Nov 25, 2009, 9:05:43 AM
"Mikel" <mikel...@gmail.com> wrote in message
news:35169e67-e4bf-41fb...@d21g2000yqn.googlegroups.com...

>>
>> strDegr.Format( _T("%d"), CD.m_intNormDegree);
>>
>> TCHAR ds = 176;
>> CStringW strDegSymbol(ds);
>>
>> strDegr = strDegr + L" " + strDegSymbol;
>>

I think that a simple CStringW::Format() would be OK, embedding the Unicode
character 'DEGREE SIGN' (U+00B0) (
http://www.fileformat.info/info/unicode/char/00b0/index.htm ) in the
format string argument:

CStringW strDegr;
strDegr.Format( L"%d \u00B0", CD.m_intNormDegree );

(or load the string from resources as well...).

Giovanni
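
For readers without MFC at hand, the same idea can be sketched in standard C++11. The helper name makeDegreeLabel is hypothetical and does not come from the thread; the point is only that the \u00B0 escape keeps the source file pure ASCII while still producing the degree sign in a wide string.

```cpp
#include <string>

// Hypothetical helper (not from the thread): builds "<n> °" as a wide string.
// The degree sign is spelled \u00B0, so the source file contains no byte
// above 0x7F and does not depend on the code page used to read it.
std::wstring makeDegreeLabel(int degrees)
{
    return std::to_wstring(degrees) + L" \u00B0";
}
```

Whether the space belongs there is a localization question, as noted later in the thread, so real code would more likely take the whole format string from resources.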

Joseph M. Newcomer

Nov 25, 2009, 12:10:24 PM
I would have written

TCHAR ds = _T('°');
and not bothered with the rest of the code, but the extra concatenation is unnecessary; I
would have written

CString strDegr;
strDegr.Format(_T("%d °"), CD.m_intNormDegree);

note that "degree" is not a "Unicode" character; in the ISO-8859-1 standard it is code
0xB0, which is a valid character in even an 8-bit character string. I was writing code
like this in 16-bit Windows.
joe

Joseph M. Newcomer [MVP]
email: newc...@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

Giovanni Dicanio

Nov 25, 2009, 1:27:12 PM
"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:e1pqg5l8ptdce5h5f...@4ax.com...

> I would have written
>
> TCHAR ds = _T('°');
> and not bothered with the rest of the code, but the extra concatenation is
> unnecessary; I
> would have written
>
> CString strDegr;
> strDegr.Format(_T("%d °"), CD.m_intNormDegree);
>
> note that "degree" is not a "Unicode" character; in the ISO-8859-1
> standard it is code
> 0xB0, which is a valid character in even an 8-bit character string. I was
> writing code
> like this in 16-bit Windows.

I would not insert non-ASCII characters in C/C++ source codes.

0xB0 has the most significant bit set, it is not a pure-ASCII (I mean: 7-bit
ASCII) character; so I would prefer using the \u00B0 encoding in source
code, instead of directly using ° .

My 2 cents,
Giovanni

Mihai N.

Nov 26, 2009, 2:09:29 AM

> TCHAR ds = _T('°');

This is not portable.
You will have problems if you try to compile the thing on a non-English OS
(more precisely: an OS with the system code page set to something other than 1252).


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Mihai N.

Nov 26, 2009, 2:12:14 AM

> (or load the string from resources as well...).
++1

In your example you have a space between the number and the degree.
In some countries that should not be there.
And in some other countries they might actually use the degree
sign *and* the "C" for Celsius.
So it is a localizable thing.

Goran

Nov 26, 2009, 7:26:16 AM

Yes, but that should still work with at least MS's compiler if the file is
saved in Unicode (e.g. UTF-8). I used "Save with encoding" with
success, or so I think ;-).

No idea what would happen on other compilers/systems, though...

Goran.

Tony C.

Nov 26, 2009, 11:34:22 AM
Thanks for all the good info...

Tony C.

Mihai N.

Nov 26, 2009, 1:55:35 PM
> Yes, but that should still work with at least MS's compiler if the file is
> saved in Unicode (e.g. UTF-8). I used "Save with encoding" with
> success, or so I think ;-).

It works OK if you save as UTF-8 (or UTF-16) with a BOM
(and if you compile your app as Unicode).

> No idea what would happen on other compilers/systems, though...

On Linux you will have to set the LANG to a UTF-8 locale
(like en_US.UTF-8 or ja_JP.UTF-8, but not en_US.ISO-8859-1 or
ja_JP.Shift_JIS). The .UTF-8 locales are quite often the default,
but there are no guarantees.

I have no clue about Mac OS.


But the thing uses TCHAR, and _T, so it is not too portable anyway.
In this case the danger is just moving between different Windows machines.

Joseph M. Newcomer

Nov 26, 2009, 2:20:43 PM
Actually, the Microsoft C compiler can accept in a quoted string any character in the
range of 32-255, and historically every C compiler I have worked with since the 1980s has
worked this way. This includes C compilers from a variety of vendors, on Unix, MS-DOS,
Windows, and several embedded platforms. (Below 32 is dangerous; while the MS compiler
can accept a tab character, I met one C compiler that rejected *any* character code below
32, and we had to use \t, which was always my preference anyway. I have grown to despise
the tab character in my career, and only use it in very limited contexts, usually using it
as a field-delimiting token rather than as something that controls layout. I always,
without exception, turn off tabs in source code and ONLY use spaces for doing layout. Any
source file that passes through my editor is automatically detabified on save)

\u00b0 should work only on L" " or L' ' contexts since it would not be legal in 8-bit
character strings.

The only problem arises when the code page doesn't support the degree symbol in its font,
e.g., MS-DOS vs. Windows code pages (CP_OEM vs. CP_ANSI) but the degree symbol at code
point 0xB0 is the only one listed in Unicode (there are °F and °C in the 2100 area, but
surprisingly, not a °K).

Across dozens of text editors over the years, I've never found one that failed to properly
handle non-7-bit characters in quoted strings. And no C compiler since at least 1983 has
failed to accept these in a quoted string, across a huge number of vendors, including
embedded systems.
joe

On Wed, 25 Nov 2009 19:27:12 +0100, "Giovanni Dicanio"
<giovanniD...@REMOVEMEgmail.com> wrote:

>"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
>news:e1pqg5l8ptdce5h5f...@4ax.com...
>
>> I would have written
>>
>> TCHAR ds = _T('°');
>> and not bothered with the rest of the code, but the extra concatenation is
>> unnecessary; I
>> would have written
>>
>> CString strDegr;
>> strDegr.Format(_T("%d °"), CD.m_intNormDegree);
>>
>> note that "degree" is not a "Unicode" character; in the ISO-8859-1
>> standard it is code
>> 0xB0, which is a valid character in even an 8-bit character string. I was
>> writing code
>> like this in 16-bit Windows.
>
>I would not insert non-ASCII characters in C/C++ source codes.
>
>0xB0 has the most significant bit set, it is not a pure-ASCII (I mean: 7-bit
>ASCII) character; so I would prefer using the \u00B0 encoding in source
>code, instead of directly using ° .
>
>My 2 cents,
>Giovanni
>

Joseph M. Newcomer

Nov 26, 2009, 3:44:54 PM
Actually, as I pointed out in an earlier reply, I have not encountered a C compiler in the
last 26 years that failed to accept characters in the range of 128-255 in quoted strings.
It may not be "portable" across platforms for the reason I pointed out (different code
pages) but I found the degree symbol is at code point 0xB0 in the following code pages:

1250 Central Europe
1251 Cyrillic (Slavic)
1252 Latin-1
1253 Greek
1254 Turkish
1255 Hebrew
1256 Arabic
1257 Baltic Rim
20261 (T.61)
28591 ISO-8859-1 Latin-1
28592 ISO-8859-2 Latin-2
28594 ISO-8859-4 Baltic
28597 ISO-8859-7 Greek
28599 ISO-8859-9 Latin-5
28605 ISO-8859-15 Latin-9


However, it is not in any of the MS-DOS code pages (which should no longer be of
interest).

Note that since Unicode does not embody any other encoding at any other character set
position for degree, anyone using Unicode would see that same symbol at that position.

I also tried it in my Locale Explorer. Of course, it did not convert properly from 8-bit
code 0xB0 to anything for any MAC or OEM code page, but it did for all the ANSI and
Windows code pages. It failed for the following code pages:

1 CP_OEMCP (U2591, medium gray block)
2 CP_MACCP (U221E, infinity symbol)
37 IBM EBCDIC (U005E, circumflex accent)
42 CP_SYMBOL (UF0B0, x inside black circle)
437 OEM - US (MS-DOS; U2591, medium gray block)
500 IBM EBCDIC International (U00A2, cent sign)
737 OEM Greek (MS-DOS; U2591, medium gray block)
775 OEM Baltic (ditto)
850 OEM Multilingual (ditto)
852 OEM Latin II (ditto)
855 OEM Cyrillic (ditto)
857 OEM Turkish (ditto)
860 OEM Portuguese (ditto)
861 OEM Icelandic (ditto)
863 OEM Canadian French (ditto)
865 OEM Nordic (ditto)
866 OEM Russian (ditto)
869 OEM Modern Greek (ditto)
874 OEM Thai (converted to Unicode U0E10, THO THAN)
875 IBM EBCDIC Modern (Pound Sterling symbol, U00A3)
932 ANSI/OEM Japanese (UFF70)
936 ANSI/OEM Simplified Chinese (translated as the empty string!)
949 ANSI/OEM Korean (ditto)
950 ANSI/OEM Traditional Chinese (ditto)
1026 IBM EBCDIC Turkish (U00A2, cent symbol)
10000 MAC Roman (U221E, infinity symbol)
10006 MAC Greek 1 (U0391, Alpha)
10007 MAC Cyrillic (U221E, infinity symbol)
10010 MAC Romania (ditto)
10017 MAC Ukraine (ditto)
10029 MAC Latin II (U012F, Small Letter I with Ogonek)
10079 MAC Icelandic (U221E, infinity symbol)
10081 MAC Turkish (ditto)
10082 MAC Croatia (ditto)
20127 US-ASCII (U0030, zero)
20866 Russian KOI8-R (U255F, box drawing, double vertical with centered single horizontal)
21866 Ukrainian KOI8-U (ditto)
28595 ISO-8859-5 Cyrillic (U0410, CYRILLIC CAPITAL LETTER A)

Note that these are the indicated translations from 0xB0 using MultiByteToWideChar. OTOH,
if the application is written in Unicode and compiled by the Microsoft compiler, the
character I show is translated into U00B0 in the binary. THEN, if the code pages are used
to translate back (WideCharToMultiByte), the results are:

1 CP_OEMCP => 0xF8, degree symbol
2 CP_MACCP => 0xA1, degree symbol
37 IBM EBCDIC => 0x90, degree symbol
42 CP_SYMBOL => error, no translation, no character appears
437 OEM United States => 0xF8, degree symbol
500 IBM EBCDIC International => 0x90, degree symbol
737 OEM Greek => 0xF8, OEM degree symbol
775 OEM Baltic (ditto)
850 OEM Multilingual Latin (ditto)
852 OEM Latin II (ditto)
855 OEM Cyrillic => 0x6F
857 OEM Turkish => 0xF8
860 OEM Portuguese => 0xF8
861 OEM Icelandic => 0xF8
863 OEM Canadian French => 0xF8
865 OEM Nordic => 0xF8
866 OEM Russian => 0xF8
869 OEM Modern Greek => 0xF8
874 ANSI/OEM Thai => 0x3F (?)
875 IBM/EBCDIC Modern => 0x90 (degree)
932 ANSI/OEM Japanese => 0x81 0x8B
936 ANSI/OEM Simplified Chinese => 0xA1 0xA3
949 ANSI/OEM Korean => 0xA1 0xC6
950 ANSI/OEM Traditional Chinese => 0xA2 0x58
1026 IBM EBCDIC - Turkish => 0x90 (degree)
1250 ANSI-Central Europe => 0xB0
1251 ANSI-Cyrillic => 0xB0
1252 ANSI-Latin-1 => 0xB0
1253 ANSI-Greek => 0xB0
1254 ANSI-Turkish => 0xB0
1255 ANSI-Hebrew => 0xB0
1256 ANSI-Arabic => 0xB0
1257 ANSI-Baltic => 0xB0
1258 ANSI-Viet Nam => 0xB0
10000 MAC Roman => 0xA1
10006 MAC Greek 1 => 0xAE
10007 MAC Cyrillic => 0xA1
10010 MAC Romania => 0xA1
10017 MAC Ukraine => 0xA1
10029 MAC Latin II => 0xA1
10079 MAC Icelandic => 0xA1
10081 MAC Turkish => 0xA1
10082 MAC Croatia => 0xA1
20127 US-ASCII => 0x3F (?)
20261 T.61 => 0xB0
20866 Russian KOI8-R => 0x9C
21866 Ukrainian KOI8-U => 0x9C
28591 ISO-8859-1 Latin I => 0xB0
28592 ISO-8859-2 Central Europe => 0xB0
28594 ISO-8859-4 Baltic => 0xB0
28595 ISO-8859-5 Cyrillic => 0x3F (?)
28597 ISO-8859-7 Greek => 0xB0
28599 ISO-8859-9 Latin 5 => 0xB0
28605 ISO-8859-15 Latin 9 => 0xB0

So if I were concerned about "portability" to any other platform using any editor using
any other code page, there is some slight concern about the appearance. But I'm writing
an MFC program. The chances that someone using other than a Windows platform with the
Windows fonts is vanishingly small. So it isn't worth worrying about. I use the
characters, so I can see what they are in my text and don't have to keep looking them up.

In the days when I cared in the slightest, I would write

#define COPYRIGHT "\xA9"

and then write

"This product is Copyright " COPYRIGHT " 1987, Flounder International, Ltd. All Rights
Reserved"

(in those days I programmed only 8-bit apps because Unicode was not available on
platforms) then I might have a variety of #ifdefs around the #define for different
platforms, e.g.,

#ifdef MS_DOS
#define COPYRIGHT "(c)"
#else
#ifdef WINDOWS
#define COPYRIGHT "\xA9"
#else
#ifdef MACINTOSH
#define COPYRIGHT "\xA9"
#endif
#endif
#endif

but I no longer care about "portability" in that sense; today I program one kind of app,
Windows apps. I don't care if they don't run on linux, or a Mac, or MS-DOS, and I don't
care if someone using the Atari editor under OS/2 and using the Small-C compiler can't
read the character or compile it for an IBM Mainframe.
joe

Joseph M. Newcomer

Nov 26, 2009, 3:58:35 PM
The last major program I wrote using temperatures allowed the user to see °F, °C and °K.
Of course, I had to be careful on a copy-and-paste to deal with copying a temperature from
a °C edit control and pasting it into a °F edit control. Since the sensors were using
calibrations of 1/64°C, I converted the temperatures from the text form to a new clipboard
format, CF_JMN_TEMPERATURE. A paste first looked for CF_JMN_TEMPERATURE format to paste
that, and did the conversion to the selected base of the paste control. Because it was
safety related, we did NOT allow pasting of text values into these controls unless the
text contained the ° symbol (optional) followed by F, C or K. That is, I could accept
text of the form

46°F
46F
46°C
46C
46°K
46K

but I would not accept "46". Therefore, to enable the "paste" button, the program first
checked for CF_JMN_TEMPERATURE format; if it was present, the paste was enabled. But if
that format was missing, I actually had to read the CF_TEXT format data, and if it was not
in the form

<whitespace>*<digit>+ <whitespace>* '°'? <whitespace>* {'C' | 'F' | 'K'} <whitespace>*

then the control was not enabled. (The device was a liquid CO2 gas chromatograph, and the
metal pressure vessel was subjected to 10000 PSI (~70 Megapascals or ~680 Atmospheres),
and if an incorrect temperature was used it could explode (they were behind 1-inch thick
Lexan plastic). So knowing the right temperature was important!)
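
The acceptance rule above is small enough to sketch as a hand-written matcher. This is not the original program's code: isTemperatureText is a hypothetical name, whitespace is assumed to mean space or tab, and the text is assumed to be already converted to a wide string.

```cpp
#include <cstddef>
#include <string>

// Sketch of the rule
//   <ws>* <digit>+ <ws>* '°'? <ws>* {'C'|'F'|'K'} <ws>*
// Assumption: "whitespace" means space or tab only.
bool isTemperatureText(const std::wstring& s)
{
    std::size_t i = 0;
    auto skipSpaces = [&] {
        while (i < s.size() && (s[i] == L' ' || s[i] == L'\t')) ++i;
    };

    skipSpaces();
    const std::size_t digitsStart = i;
    while (i < s.size() && s[i] >= L'0' && s[i] <= L'9') ++i;
    if (i == digitsStart) return false;           // at least one digit

    skipSpaces();
    if (i < s.size() && s[i] == L'\u00B0') ++i;   // optional degree sign
    skipSpaces();

    if (i >= s.size()) return false;              // unit letter is mandatory
    if (s[i] != L'C' && s[i] != L'F' && s[i] != L'K') return false;
    ++i;

    skipSpaces();
    return i == s.size();                         // nothing else may follow
}
```

With this sketch a bare "46" is rejected because the unit letter is missing, which is the property the paste button relied on.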

Note this was a Windows app. I would not care in the slightest if it could not be
compiled for an IBM mainframe using EBCDIC encoding.
joe

On Wed, 25 Nov 2009 23:12:14 -0800, "Mihai N." <nmihai_y...@yahoo.com> wrote:

>
>> (or load the string from resources as well...).
>++1
>
>In your example you have a space between the number and the degree.
>In some countries that should not be there.
>And in some other countries they might actually use the degree
>sign *and* the "C" for Celsius.
>So it is localizable thing.

Giovanni Dicanio

Nov 26, 2009, 3:56:25 PM
"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:iektg590eflsall7t...@4ax.com...

> Actually, the Microsoft C compiler can accept in a quoted string any
> character in the
> range of 32-255, and historically every C compiler I have worked with
> since the 1980s has
> worked this way.

Joe: I don't doubt that, of course.

But the problem (I think...) is that if you give your source code with the
0xB0 ° character to someone like a Japanese programmer who uses e.g. Windows
code page 932, instead of the degree symbol he finds a different symbol, in
fact, B0 in Windows code page 932 maps to U+FF70:

http://www.fileformat.info/info/unicode/char/ff70/index.htm

So, I think the problem is not the fact that you can embed characters in
range 128-255 in your strings in your source code, but the *portability* of
that source code to different code pages.

> I have grown to despise
> the tab character in my career, and only use it in very limited contexts,
> usually using it
> as a field-delimiting token rather than as something that controls layout.
> I always,
> without exception, turn off tabs in source code and ONLY use spaces for
> doing layout. Any
> source file that passes through my editor is automatically detabified on
> save)

I completely agree with you here!
Unfortunately I started using the tabs, but I discovered that there were
problems because some editors used 4 spaces for a tab, other editors used 8
spaces, other 2 spaces... so the layout of the source code appeared in a
wrong way on different editors. Instead, using spaces instead of tabs works
fine.


> (there are °F and °C in the 2100 area, but
> surprisingly, not a °K).

Because I think that °K is wrong notation. The kelvin unit for temperatures
is not accompanied by the degree symbol ° (unlike Fahrenheit and Celsius
units).

0°C = 273.15 K (no °K)

http://en.wikipedia.org/wiki/Kelvin


Giovanni
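
In code, the notation point above comes down to which suffix each scale gets. A minimal sketch, assuming scales are identified by the letters C, F and K; unitSuffix is a hypothetical helper, not something from the thread:

```cpp
#include <string>

// Returns the unit suffix for a temperature scale.
// Celsius and Fahrenheit take the degree sign (spelled \u00B0 so the source
// stays ASCII-only); kelvin is written without it.
std::wstring unitSuffix(wchar_t scale)
{
    switch (scale) {
    case L'C': return L"\u00B0C";
    case L'F': return L"\u00B0F";
    case L'K': return L"K";
    default:   return L"";    // unknown scale: no suffix
    }
}
```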

Joseph M. Newcomer

Nov 26, 2009, 6:23:01 PM
See below...

On Thu, 26 Nov 2009 21:56:25 +0100, "Giovanni Dicanio"
<giovanniD...@REMOVEMEgmail.com> wrote:

>"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
>news:iektg590eflsall7t...@4ax.com...
>
>> Actually, the Microsoft C compiler can accept in a quoted string any
>> character in the
>> range of 32-255, and historically every C compiler I have worked with
>> since the 1980s has
>> worked this way.
>
>Joe: I don't doubt that, of course.
>
>But the problem (I think...) is that if you give your source code with the
>0xB0 ° character to someone like a Japanese programmer who uses e.g. Windows
>code page 932, instead of the degree symbol he finds a different symbol, in
>fact, B0 in Windows code page 932 maps to U+FF70:
>
> http://www.fileformat.info/info/unicode/char/ff70/index.htm
>
>So, I think the problem is not the fact that you can embed characters in
>range 128-255 in your strings in your source code, but the *portability* of
>that source code to different code pages.

****
If the Japanese had similar concerns I might be somewhat more motivated. But some years
ago I had the pleasure of working with a program that was written in Japan, and not only
were none of the strings readable, none of the variables were readable. I sent it back to
the client and asked for an English translation. It took them six months, and they sent
it back to me in the middle of a major contract, so I had to send it back with "try me
again in six months, I'll schedule you a slot around February" and they decided not to
wait, now that they had a translation, so did it in-house.

If the worst a Japanese programmer has to deal with is one character in one code page, I
don't see it as a major crisis. I see things like this all the time; for example, people
who used the "character drawing" set to do their pictures in comments in MS-DOS, and I get
these things with funny accented characters. I don't freak out. So I consider it far
more important that I can read the code on my machine, and my customers can read the code
on their machines, than if someone with an unusual code page can read my code.

Terribly ethnocentric, I admit, but I got tired of using \x (now \u) notations. Perhaps
someday we will be able to have C code in Unicode, with portable descriptors of what
characters are legal in identifiers so I can just add someone's "charset.h" and see all
the characters in the native character set (which, admittedly, would not be terribly
useful to me because I can only read one character set, but it would help everyone else)
joe


>
>> I have grown to despise
>> the tab character in my career, and only use it in very limited contexts,
>> usually using it
>> as a field-delimiting token rather than as something that controls layout.
>> I always,
>> without exception, turn off tabs in source code and ONLY use spaces for
>> doing layout. Any
>> source file that passes through my editor is automatically detabified on
>> save)
>
>I completely agree with you here!
>Unfortunately I started using the tabs, but I discovered that there were
>problems because some editors used 4 spaces for a tab, other editors used 8
>spaces, other 2 spaces... so the layout of the source code appeared in a
>wrong way on different editors. Instead, using spaces instead of tabs works
>fine.
>
>
>> (there are °F and °C in the 2100 area, but
>> surprisingly, not a °K).
>
>Because I think that °K is wrong notation. The kelvin unit for temperatures
>is not accompanied by the degree symbol ° (unlike Fahrenheit and Celsius
>units).
>
> 0°C = 273.15 K (no °K)
>
>http://en.wikipedia.org/wiki/Kelvin

****
Interesting...oh well, the program has been in use for a decade, I doubt if they'd want to
change it...
joe
****

Mihai N.

Nov 27, 2009, 4:54:15 AM

874: 0xB0 U+0E10 # THAI CHARACTER THO THAN
932: 0xB0 U+FF70 # HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
936: 0xB0 # DBCS LEAD BYTE
949: 0xB0 # DBCS LEAD BYTE
950: 0xB0 # DBCS LEAD BYTE

When you compile, the .cpp file is interpreted as being ANSI
(unless there is a BOM).
If the string is L"...", then it is converted from ANSI to Unicode
and then stored in the binary.
If the ANSI code page is 932 (Japanese), the stuff in the binary will
be U+FF70.
If the ANSI code page is Chinese (Traditional or Simplified) or Korean,
then B0 is a lead byte, the trailing byte is the ", and the compilation
fails.


So if the code will ever be compiled on a Chinese/Japanese/Korean Windows
system (same version, compiler, etc.), it is worth worrying about.
And with the current world


> #define COPYRIGHT "\xA9"
> and then write

> "This product is Copyright " COPYRIGHT " 1987, Flounder International..."

The problem remains the same if you compile to Unicode on a CCJK/Thai
machine.

Mihai N.

Nov 27, 2009, 5:15:12 AM

> not only
> were none of the strings readable, none of the variables were readable.

It is not about being readable, it is about getting the wrong result in
the binary, or even compilation failure.
If the string is Unicode (L"..") and the source file is not Unicode,
then the Unicode value to store in the executable is obtained by
converting your string using the ANSI code page.
So you get the wrong Unicode value in the binary.
Or (depending on the character and the string) the compilation fails.
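
The "wrong Unicode value in the binary" effect can be pictured as a lookup keyed by code page. The table below is only an illustration built from values reported in this thread; decodeByteB0 is a hypothetical function, not a Windows API, and the real conversion is done by the compiler (or by MultiByteToWideChar at run time).

```cpp
#include <map>

// What the single byte 0xB0 turns into under a few code pages, using only
// values reported in this thread. A result of 0 means "DBCS lead byte: not a
// character on its own". Code pages not in the table yield U+FFFD.
char32_t decodeByteB0(unsigned codePage)
{
    static const std::map<unsigned, char32_t> table = {
        {1252,  0x00B0},  // degree sign
        {437,   0x2591},  // medium gray block (OEM US)
        {874,   0x0E10},  // THAI CHARACTER THO THAN
        {932,   0xFF70},  // halfwidth prolonged sound mark
        {10000, 0x221E},  // infinity (MAC Roman)
        {936, 0}, {949, 0}, {950, 0},
    };
    const auto it = table.find(codePage);
    return it == table.end() ? U'\uFFFD' : it->second;
}
```

The same source byte therefore lands in the executable as a degree sign on one build machine and as a katakana mark on another, with no warning in between.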

Mihai N.

Nov 27, 2009, 5:24:37 AM

> B0 is a lead byte, the trailing byte is the ", and the compilation fails.

Checked: the " is invalid as trailing byte, so the code page conversion
fails and you get '?'
But if you are not at the end of the string the next character might
be a valid lead byte. So it is consumed and you get a Kanji.
In fact, 0x5C ('\') is a valid trail byte in Japanese.

Joseph M. Newcomer

Nov 27, 2009, 12:09:41 PM
Interesting problem! No, I had not considered how the compiler determines the code page.
I now see the problem.
joe

On Fri, 27 Nov 2009 01:54:15 -0800, "Mihai N." <nmihai_y...@yahoo.com> wrote:

>
>874: 0xB0 U+0E10 # THAI CHARACTER THO THAN
>932: 0xB0 U+FF70 # HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
>936: 0xB0 # DBCS LEAD BYTE
>949: 0xB0 # DBCS LEAD BYTE
>950: 0xB0 # DBCS LEAD BYTE
>
>When you compile, the .cpp file is interpreted as being ANSI
>(unless there is a BOM).
>If the string is L"...", then it is converted from ANSI to Unicode
>and then stored in the binary.
>If the ANSI code page is 932 (Japanese), the stuff in the binary will
>be U+FF70.
>If the ANSI code page is Chinese (Traditional or Simplified) or Korean,
>then B0 is a lead byte, the trailing byte is the ", and the compilation
>fails.
>
>
>So if the code will ever be compiled on a Chinese/Japanese/Korean Windows
>system (same version, compiler, etc.), it is worth worrying about.
>And with the current world
>
>
>> #define COPYRIGHT "\xA9"
>> and then write
>> "This product is Copyright " COPYRIGHT " 1987, Flounder International..."
>
>The problem remains the same if you compile to Unicode on a CCJK/Thai
>machine.

Joseph M. Newcomer

Nov 27, 2009, 12:12:22 PM
Actually, this seems to suggest that there ought to be a

#pragma(encoding="1252")
or
#pragma(encoding="en-us")

or similar feature available to indicate what code page should be used to make the
conversion!
joe

On Thu, 26 Nov 2009 10:55:35 -0800, "Mihai N." <nmihai_y...@yahoo.com> wrote:

>> Yes, but that should still work with at least MS's compiler if the file is
>> saved in Unicode (e.g. UTF-8). I used "Save with encoding" with
>> success, or so I think ;-).
>
>It works OK if you save as UTF-8 (or UTF-16) with a BOM
>(and if you compile your app as Unicode).
>
>> No idea what would happen on other compilers/systems, though...
>
>On Linux you will have to set the LANG to a UTF-8 locale
>(like en_US.UTF-8 or ja_JP.UTF-8, but not en_US.ISO-8859-1 or
>ja_JP.Shift_JIS). The .UTF-8 locales are quite often the default,
>but there are no guarantees.
>
>I have no clue about Mac OS.
>
>
>But the thing uses TCHAR, and _T, so it is not too portable anyway.
>In this case the danger is just moving between different Windows machines.

Mihai N.

Nov 28, 2009, 12:54:02 AM
> Actually, this seems to suggest that there ought to be a
>
> #pragma(encoding="1252")
> or
> #pragma(encoding="en-us")
>
> or similar feature available to indicate what code page should be
> used to make the conversion!

For VS there is #pragma setlocale( "english" )
And the code page is the ansi code page for that locale.
Nothing to force a code page, so no utf-8
(for that you need a BOM).

Nothing similar in the gcc (Linux/UNIX) world.

Mihai N.

Nov 28, 2009, 1:03:51 AM
> I now see the problem.

Well, it is just some experience.
This (like many other lessons) is based on something
that happened for some software I have seen.

That software stored the error messages in .h files,
each one in the ANSI code page for its translation.
(I know, horrible).

And all was nice and dandy, but somewhere at version 6
the Japanese version stopped compiling. All were puzzled "but
that's how we did it for years!"

Well, it just happened that one of the new
Japanese strings ended in a kanji with the second
byte 5c ('\'). Since the next character was the " to
end the string, and they compiled on an English
system (where '\' was not a trailing byte, but a
stand-alone character, resulting in \" ) ...
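
The failure in this story can be reproduced with a toy scanner. The sketch below is not how any real compiler is implemented. It looks for the closing quote of a string literal either byte by byte or with Shift-JIS (code page 932) lead bytes taken into account. The names are hypothetical, and the lead-byte ranges 0x81-0x9F and 0xE0-0xFC are the usual ones for code page 932.

```cpp
#include <cstddef>
#include <string>

// True for Shift-JIS (code page 932) lead bytes.
bool isShiftJisLeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Returns the index of the quote that closes a literal whose opening quote
// is at src[0], or std::string::npos if the literal never closes.
// With dbcsAware == false every byte is treated as its own character, as a
// compiler reading the file in a single-byte code page would do.
std::size_t findClosingQuote(const std::string& src, bool dbcsAware)
{
    for (std::size_t i = 1; i < src.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(src[i]);
        if (dbcsAware && isShiftJisLeadByte(b)) {
            ++i;                 // skip the trail byte, whatever it is
        } else if (b == '\\') {
            ++i;                 // escape sequence: skip the escaped byte
        } else if (b == '"') {
            return i;
        }
    }
    return std::string::npos;
}
```

Take the four bytes `"`, 0x95, 0x5C, `"`, where 0x95 0x5C is a kanji whose trail byte is the backslash. The Shift-JIS-aware scan finds the closing quote at index 3, while the byte-by-byte scan reads 0x5C as the start of an escape, swallows the quote and never finds the end of the literal.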
