[re-submitted by srlo...@us.ibm.com]
Hello,
I recently ran into an issue with a customer where they were using some
sort of "extended" characters in Traditional Chinese that were not
displaying correctly in their browser (Internet Explorer). After some
digging, I discovered that:
(1) The Big5 -> Unicode and the Cp950 -> Unicode mappings that Java uses
are different from each other. The extended characters the customer was
complaining about do not exist in Java's Big5 mapping, but do exist in
the Cp950 mapping.
(2) Internet Explorer and Netscape under Windows treat the Big5 charset
as if it is Cp950 underneath, but they do not understand a charset
explicitly set to Cp950.
(3) Microsoft's Unicode -> Cp950 mapping is different from Java's
Unicode -> Cp950 mapping. I believe this is confirmed by the 95%
roundtrip rating found here:
http://oss.software.ibm.com/icu/charset/roundtripIndex.html
It would appear that IBM (and hence Java) maps these extended
characters to the private use area in Unicode, whereas Microsoft maps
them to the typical CJK range.
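The Big5 versus Cp950 difference from point (1) can also be observed outside Java. The sketch below is only an illustration: it uses Python's built-in `big5` and `cp950` codecs, which are Python's own tables and not the ones Java or IBM ship, and the helper name `decode_or_none` is made up for this example. It probes the double-byte codepoint 0xF9DC, one of the extended characters in question.

```python
def decode_or_none(raw: bytes, codec: str):
    """Return the decoded text, or None if the codec has no mapping for raw."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


extended = b"\xf9\xdc"  # the double-byte codepoint 0xF9DC

for codec in ("big5", "cp950"):
    text = decode_or_none(extended, codec)
    if text is None:
        print(f"{codec}: no mapping for 0xF9DC")
    else:
        print(f"{codec}: U+{ord(text):04X}")
```

In Python's tables, as far as I can tell, `cp950` follows the Microsoft assignment for this codepoint while plain `big5` has no entry for it. The same kind of probe could be run against any other converter to see which table it follows.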
I was wondering which mapping is supposedly "correct". For example, the
EBCDIC 937 codepoint 0xE2DB maps to the Unicode codepoint 0xF819. Java
then maps this to the Cp950 codepoint 0xF9DC. However, Microsoft maps
the Unicode codepoint 0x5AFA to the same Cp950 codepoint. Is 0x5AFA the
correct codepoint, or is 0xF819? At first, it seems strange that the
characters would be in the private use area, but I imagine they exist
there for backwards compatibility with a time before the 0x5AFA mapping
was defined...?
    Unicode        Cp950
    F819    <-->   F9DC    // IBM and Java
    5AFA    <-->   F9DC    // Microsoft
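To make the difference between the two Unicode codepoints concrete, here is a small sketch that asks Python's `unicodedata` module how each one is classified. The helper names `describe` and `is_bmp_private_use` are invented for this example; the range U+E000 to U+F8FF is the private use area of the Basic Multilingual Plane.

```python
import unicodedata


def is_bmp_private_use(cp: int) -> bool:
    """True if cp lies in the BMP private use area (U+E000..U+F8FF)."""
    return 0xE000 <= cp <= 0xF8FF


def describe(cp: int) -> str:
    ch = chr(cp)
    category = unicodedata.category(ch)        # 'Co' = private use, 'Lo' = other letter
    name = unicodedata.name(ch, "<no name>")   # private use codepoints have no name
    return f"U+{cp:04X} category={category} name={name}"


for cp in (0xF819, 0x5AFA):
    label = "private use" if is_bmp_private_use(cp) else "not private use"
    print(describe(cp), "->", label)
```

A private use codepoint carries no agreed meaning between systems, so a receiver that does not share the sender's private table has no way to know which character was meant.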
In any case, what this means is that a Microsoft client that tries to
display Traditional Chinese data that a web server has sent in UTF-8
will display substitution characters for many codepoints. The reverse
is also true: if a Microsoft application sends Traditional Chinese
data encoded in UTF-8 to an IBM or Java application, substitution
characters can readily appear.
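The failure can be simulated with a toy model. The sketch below is not either vendor's converter: the two one-entry tables contain only the example pair from this post, the function names are hypothetical, and the substitution byte 0x3F ('?') is an assumption. It shows that UTF-8 itself carries the codepoint faithfully and that the loss happens when the receiver looks it up in a different table.

```python
# One-entry tables built only from the example in this post; real tables are far larger.
IBM_JAVA_CP950_TO_UNICODE = {0xF9DC: 0xF819}
MICROSOFT_UNICODE_TO_CP950 = {0x5AFA: 0xF9DC}
SUBSTITUTE = 0x3F  # '?', assumed substitution byte for unmapped codepoints


def server_send(cp950_code: int) -> bytes:
    """IBM/Java side: Cp950 -> Unicode -> UTF-8."""
    return chr(IBM_JAVA_CP950_TO_UNICODE[cp950_code]).encode("utf-8")


def client_receive(payload: bytes) -> list:
    """Microsoft side: UTF-8 -> Unicode -> Cp950, substituting unmapped codepoints."""
    return [MICROSOFT_UNICODE_TO_CP950.get(ord(ch), SUBSTITUTE)
            for ch in payload.decode("utf-8")]


payload = server_send(0xF9DC)
print(payload.hex())                              # UTF-8 bytes of U+F819
print([hex(c) for c in client_receive(payload)])  # what the client ends up with
```

Swapping the two tables models the reverse direction, where a Microsoft sender emits U+5AFA and an IBM or Java receiver has no entry for it.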
We would definitely like to solve this problem, as UTF is intended to
be an encoding that allows disparate systems to share their data, not
lose it to conversion issues. I also think this problem ties in closely
with this bulletin:
http://w3.gcoc.yamato.ibm.com/library/bulletin/unicodeweb/UTF-8Guide.htm
Any explanation and recommended courses of action to avoid the above
discrepancies are much appreciated!
Christopher R. Smith
JTOpen - http://oss.software.ibm.com/developerworks/projects/jt400
IBM Toolbox for Java - http://www.ibm.com/eserver/iseries/toolbox
iSeries Access for Web - http://www.ibm.com/eserver/iseries/access/web
csm...@us.ibm.com