Traditional Chinese vs. Unicode

csm...@us.ibm.com

Dec 4, 2002, 6:15:04 PM
to icu-ch...@www-126.southbury.usf.ibm.com
[re-submitted by srlo...@us.ibm.com]

Hello,

I recently ran into an issue with a customer where they were using some
sort of "extended" characters in Traditional Chinese that were not
displaying correctly in their browser (Internet Explorer). After some
digging, I discovered that:

(1) The Big5 -> Unicode and the Cp950 -> Unicode mappings that Java uses
are different from each other. The extended characters the customer was
complaining about do not exist in Java's Big5 mapping, but do exist in
the Cp950 mapping.

(2) Internet Explorer and Netscape under Windows treat the Big5 charset
as if it is Cp950 underneath, but they do not understand a charset
explicitly set to Cp950.

(3) Microsoft's Unicode -> Cp950 mapping is different than Java's
Unicode -> Cp950 mapping. I believe this is confirmed by the 95%
roundtrip rating found here:
http://oss.software.ibm.com/icu/charset/roundtripIndex.html
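
The kind of difference described in (1) can be reproduced outside Java. The sketch below is only an analogy: it uses Python's built-in `big5` and `cp950` codecs, which are not Java's or ICU's tables, and the helper name `decode_or_none` is made up for illustration. It decodes the same two bytes under both charset names and reports whether each codec has a mapping for them.

```python
def decode_or_none(data: bytes, codec: str):
    """Decode data with the named codec; return None if the codec has no mapping."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# 0xF9DC is the Cp950 code point used as the example later in this message.
extended = b"\xf9\xdc"

for codec in ("big5", "cp950"):
    text = decode_or_none(extended, codec)
    label = extended.hex().upper()
    if text is None:
        print(f"{codec}: no mapping for {label}")
    else:
        points = " ".join(f"U+{ord(ch):04X}" for ch in text)
        print(f"{codec}: {label} -> {points}")
```

Which code point a given codec reports depends entirely on the table that codec ships with, which is the point of the comparison.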

It would appear that IBM (and hence Java) are mapping these extended
characters to the private use area in Unicode, whereas Microsoft maps
them to the typical CJK range.

I was wondering which mapping is supposedly "correct". For example, the
EBCDIC 937 codepoint 0xE2DB maps to the Unicode codepoint 0xF819. Java
then maps this to the Cp950 codepoint 0xF9DC. However, Microsoft maps
the Unicode codepoint 0x5AFA to the same Cp950 codepoint. Is 0x5AFA the
correct codepoint, or is 0xF819? At first, it seems strange that the
characters would be in the private use area, but I imagine they exist
there for backwards compatibility to a time before the 0x5AFA mapping
was defined...?

Unicode        Cp950
F819    <-->   F9DC    // IBM and Java
5AFA    <-->   F9DC    // Microsoft
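
To make the table concrete, here is a small check of which kind of code point each side lands on. It relies only on Python's standard `unicodedata` module; the `describe` helper is a name invented for this sketch.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Summarize a code point's general category ('Co' means private use)."""
    ch = chr(code_point)
    category = unicodedata.category(ch)
    name = unicodedata.name(ch, "<no name>")  # private use code points have no name
    return f"U+{code_point:04X} category={category} name={name}"


print(describe(0xF819))  # the IBM/Java target for Cp950 0xF9DC
print(describe(0x5AFA))  # the Microsoft target for Cp950 0xF9DC
```

A category of `Co` marks a private use code point, whose meaning is defined only by agreement between sender and receiver; `Lo` marks an ordinary letter such as a CJK ideograph.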

In any case, what this means is that a Microsoft client that tries to
display Traditional Chinese data from a web server that has sent it in
UTF-8 will display substitution characters for many codepoints. And the
reverse is true: if a Microsoft application sends Traditional Chinese
data encoded in UTF-8 to an IBM or Java application, substitution
characters can readily appear.
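
The failure mode above can be modelled with a toy example. The two one-entry tables below contain only the mappings quoted in this message; real converters have thousands of entries, and the names `IBM_JAVA`, `MICROSOFT`, and `send_via_utf8` are invented for the sketch. A sender decodes a Cp950 code with its own table and transmits UTF-8; the receiver looks the code point up in its own table and substitutes `?` when there is no entry.

```python
# One-entry mapping tables (Cp950 code -> Unicode code point), from this message.
IBM_JAVA = {0xF9DC: 0xF819}
MICROSOFT = {0xF9DC: 0x5AFA}


def send_via_utf8(cp950_code: int, sender: dict, receiver: dict) -> str:
    """Convert with the sender's table, ship as UTF-8, convert back with the receiver's."""
    wire = chr(sender[cp950_code]).encode("utf-8")   # UTF-8 carries the code point intact
    code_point = ord(wire.decode("utf-8"))
    reverse = {u: b for b, u in receiver.items()}    # Unicode -> Cp950
    received = reverse.get(code_point)
    return f"{received:04X}" if received is not None else "?"


print(send_via_utf8(0xF9DC, IBM_JAVA, IBM_JAVA))    # same table on both ends
print(send_via_utf8(0xF9DC, IBM_JAVA, MICROSOFT))   # IBM/Java -> Microsoft
print(send_via_utf8(0xF9DC, MICROSOFT, IBM_JAVA))   # Microsoft -> IBM/Java
```

In this model UTF-8 itself loses nothing; the substitution arises only where the two ends disagree about which code point stands for the character.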

We would definitely like to solve this problem, as UTF is intended to
be an encoding that allows disparate systems to share their data, not
lose it to conversion issues. I also think this problem ties in closely
with this bulletin:
http://w3.gcoc.yamato.ibm.com/library/bulletin/unicodeweb/UTF-8Guide.htm

Any explanation and recommended courses of action to avoid the above
discrepancies are much appreciated!


Christopher R. Smith
JTOpen - http://oss.software.ibm.com/developerworks/projects/jt400
IBM Toolbox for Java - http://www.ibm.com/eserver/iseries/toolbox
iSeries Access for Web - http://www.ibm.com/eserver/iseries/access/web
csm...@us.ibm.com




George Rhoten

Dec 4, 2002, 7:15:55 PM
to Christopher Smith, icu-ch...@www-126.southbury.usf.ibm.com
Christopher,

Alas, there is no mapping that is really correct for big5. Every platform
vendor that has a big5 implementation (ibm-950, windows-950, Java's CP950,
etc.) has a mapping table that it views as the correct mapping to and
from Unicode. You can have a big5 mapping table that is correct for a
specific platform, but there isn't one that is known as being the correct
mapping table. This problem is not unique to big5. When you mix
platforms and standards together, the problem gets worse. Before Unicode
existed, the problem was even bigger than the one you're seeing now.

Netscape, Mozilla and Internet Explorer all eventually convert HTML pages
into Unicode for internal processing. The best solution would be to keep
all data in a Unicode encoding, and then these conversion problems would
_not_ occur. This means storing, manipulating, transmitting and
displaying all the data as Unicode.

That said, unless you have the correct font to display \uF819, you
probably won't be able to display that PUA character. PUA characters
aren't very portable either.

If you were using ICU4C or ICU4JNI, you could always download the
windows-950 UCM file on that roundtripIndex.html page and add it to your
copy of ICU. When you want to use big5, you could use that converter
instead. This would help you get around this problem temporarily.

FYI Sun's JDK CP950 loosely means IBM's implementation of big5 (ibm-950),
and big5 in Sun's JDK means Solaris big5. CP950 is closer to the Windows
implementation (according to roundtripIndex.html).

I hope that information helps.

George Rhoten
IBM Globalization Center of Competency/ICU San Jose, CA, USA




Christopher Smith/Rochester/IBM@IBMUS
Sent by: icu-chars...@www-124.southbury.usf.ibm.com
12/04/2002 03:15 PM


To: icu-ch...@www-124.southbury.usf.ibm.com
cc:
Subject: Traditional Chinese vs. Unicode
_______________________________________________
icu-charsets mailing list
icu-ch...@oss.software.ibm.com
http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu-charsets

