Re: Trouble with GB2312


George Rhoten

Jan 30, 2002, 1:47:03 PM
to Paul Deuter, icu-ch...@www-126.southbury.usf.ibm.com
In ICU, GB2312 is an alias for ibm-1383. The byte sequence "0xA1 0xA1" is the <subchar> of that code page, that is, its double-byte substitution character.

Here is what it looks like in our ibm-1383.ucm file.

<code_set_name> "IBM-1383"
<subchar> \xA1\xA1
<subchar1> \x1A

It can be considered the equivalent of \uFFFD in Unicode
(depending on your converter options). These "0xA1 0xA1" byte pairs
show up in GB2312 text that was converted from another codepage or
encoding, wherever the GB2312 converter had no mapping for a
character.
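
To illustrate how such bytes can end up in a document, here is a rough sketch in Python. It is not ICU code and does not use the ICU converter. It assumes Python's built-in "gb2312" codec as a stand-in for the GB2312 converter, and the helper name encode_with_subchar is made up for this example. The only value taken from the .ucm excerpt above is the subchar itself.

```python
# Sketch only: imitates subchar substitution with Python's stdlib codec.
# This is NOT the ICU converter; encode_with_subchar is a hypothetical helper.

# Double-byte substitution character, as listed in the ibm-1383.ucm excerpt.
SUBCHAR = b"\xa1\xa1"


def encode_with_subchar(text: str, encoding: str = "gb2312") -> bytes:
    """Encode text one character at a time, writing SUBCHAR for any
    character that the target encoding cannot represent."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # No mapping for this character: emit the substitution bytes.
            out += SUBCHAR
    return bytes(out)


if __name__ == "__main__":
    # U+0E01 (a Thai letter) has no GB2312 mapping, so it gets substituted.
    print(encode_with_subchar("A\u0e01B"))
```

Once the substitution has happened, the original character is gone, so a later conversion of that document back to Unicode cannot recover it.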

George Rhoten
IBM Globalization Center of Competency/ICU San Jose, CA, USA




"Paul Deuter" <Pa...@plumtree.com>
Sent by: icu-chars...@www-126.southbury.usf.ibm.com
01/28/2002 04:53 PM


To: <icu-ch...@www-126.southbury.usf.ibm.com>
cc:
Subject: Trouble with GB2312



We have a document in GB2312 that we are trying to convert to
Unicode (UCS-2). We are having trouble with the character
"0xA1 0xA1" which does not seem to have a representation in
Unicode.

Is anyone familiar with this character in GB2312? Is this
a bug in the ICU converter?

Paul Deuter
Internationalization Manager
Plumtree Software
paul....@plumtree.com

_______________________________________________
icu-charsets mailing list
icu-ch...@oss.software.ibm.com
http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu-charsets


