In ICU, GB2312 is an alias for ibm-1383. In that code page the byte sequence "0xA1 0xA1" is the subchar, that is, the double-byte substitution character.
Here are the relevant header lines from our ibm-1383.ucm file:
<code_set_name> "IBM-1383"
<subchar> \xA1\xA1
<subchar1> \x1A
The subchar can be considered the equivalent of \uFFFD in Unicode
(depending on your converter options). "0xA1 0xA1" typically shows up in
GB2312 text that was converted from another code page or encoding, at
places where the GB2312 converter had no mapping for the source
character.
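
If you want to see where these sequences occur in your document before
converting it, here is a rough sketch in Python (not ICU code). The helper
name find_subchar_offsets is made up for illustration, and it assumes the
usual EUC-style layout of GB2312: bytes below 0x80 are single-byte, and a
byte in 0xA1..0xFE starts a two-byte character. Walking the data on
character boundaries avoids false hits that straddle two characters. Keep
in mind that 0xA1 0xA1 is also the GB2312 code for the ideographic space
(U+3000), so each offset is only a candidate for a substitution, not proof
of one.

```python
from typing import List


def find_subchar_offsets(data: bytes, subchar: bytes = b"\xA1\xA1") -> List[int]:
    """Return byte offsets of double-byte characters equal to `subchar`.

    Assumption: EUC-style GB2312, where bytes below 0x80 are single-byte
    and a byte in 0xA1..0xFE starts a two-byte character.
    """
    offsets = []
    i = 0
    while i < len(data):
        if 0xA1 <= data[i] <= 0xFE and i + 1 < len(data):
            if data[i:i + 2] == subchar:
                offsets.append(i)
            i += 2  # skip the whole double-byte character
        else:
            i += 1  # ASCII, or a stray byte we do not try to interpret
    return offsets


# "ab", 0xA1 0xA1, "cd", 0xB0 0xA1, 0xA1 0xA1
sample = b"ab\xA1\xA1cd\xB0\xA1\xA1\xA1"
print(find_subchar_offsets(sample))  # prints [2, 8]
```

A plain substring search on the same sample would also report offset 7,
which is the trail byte of 0xB0 0xA1 followed by the lead byte of the next
character, so the boundary-aware walk is the safer choice.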
George Rhoten
IBM Globalization Center of Competency/ICU San Jose, CA, USA
"Paul Deuter" <Pa...@plumtree.com>
Sent by: icu-chars...@www-126.southbury.usf.ibm.com
01/28/2002 04:53 PM
To: <icu-ch...@www-126.southbury.usf.ibm.com>
cc:
Subject: Trouble with GB2312
We have a document in GB2312 that we are trying to convert to
Unicode (UCS-2). We are having trouble with the character
"0xA1 0xA1", which does not seem to have a representation in
Unicode.
Is anyone familiar with this character in GB2312? Is this
a bug in the ICU converter?
Paul Deuter
Internationalization Manager
Plumtree Software
paul....@plumtree.com
_______________________________________________
icu-charsets mailing list
icu-ch...@oss.software.ibm.com
http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu-charsets