Trouble with GB2312


Paul Deuter

Jan 28, 2002, 7:53:59 PM
to icu-ch...@www-126.southbury.usf.ibm.com
We have a document in GB2312 that we are trying to convert to
Unicode (UCS-2). We are having trouble with the character
"0xA1 0xA1" which does not seem to have a representation in
Unicode.

Is anyone familiar with this character in GB2312? Is this
a bug in the ICU converter?

Paul Deuter
Internationalization Manager
Plumtree Software
paul....@plumtree.com

Markus Scherer

Jan 30, 2002, 12:46:18 PM
to Paul Deuter, icu-ch...@www-126.southbury.usf.ibm.com
Hi Paul,

I took a look at our convrtrs.txt and ibm-1383.ucm files.

GB 2312 is implemented in ICU with the IBM 1383 codepage. There, A1A1 is
the substitution character, which is used when mapping from a Unicode code
point that has no representation in the codepage. I suppose one could
roundtrip-map U+FFFD<->A1A1, but the mapping table does not do that.
However, the default converter callback treats A1A1 as unassigned and
substitutes U+FFFD for it anyway.

Do you have any evidence that A1A1 is treated as a graphic character in
other implementations of GB 2312?

Best regards,
markus

Markus Scherer IBM GCoC-Unicode/ICU San José, CA
markus....@us.ibm.com (also for SameTime)



Paul Deuter

Jan 30, 2002, 5:03:01 PM
to Markus Scherer, George Rhoten, icu-ch...@www-126.southbury.usf.ibm.com

Hi Markus, George,

You both responded to my email, and I am very grateful for the help.
I have attached the HTML file that contains the A1 A1 in GB2312.
When you display this HTML in Internet Explorer, the A1 A1 is rendered
as if it were a space.

However, since we are gatewaying this page and converting the text from
GB2312 to Unicode, we see the common ?? where the original just shows
spaces. Of course, the customer sees this as a bug because the original
does not have question marks. We are using ICU 1.8.1 to do the
transcoding, and we are seeing this problem. I have been told that the
transcoding in ICU 1.6 did *not* result in question marks. I guess there
must be a difference in the tables between 1.6 and 1.8.1?

I am not sure where the HTML originally came from. It might have been
a Word document. This issue is not the end of the world: we can simply
document it. But I would like to try to understand the failure as
thoroughly as possible before deciding how to handle it.
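
The space-like rendering described above can be cross-checked outside ICU. The following is a minimal sketch that assumes Python's built-in `gb2312` codec as an independent reference; it is not the ICU converter discussed in this thread, and the variable names are only illustrative.

```python
import unicodedata

# The two bytes from the problem document.
raw = b"\xa1\xa1"

# Decode with Python's built-in GB2312 codec (an independent
# implementation, used here only as a point of comparison).
decoded = raw.decode("gb2312")

# Show the resulting code point and its Unicode character name.
print(f"U+{ord(decoded):04X}", unicodedata.name(decoded))

# Encode back to GB2312 to check that the mapping round-trips.
roundtrip = decoded.encode("gb2312")
print(roundtrip == raw)
```

If a converter returns U+FFFD for the same bytes, comparing its output against a reference like this is one quick way to narrow the problem down to the mapping table.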

Thanks for your help.
Paul






zh.html

Markus Scherer

Jan 30, 2002, 8:11:18 PM
to Paul Deuter, George Rhoten, icu-ch...@www-126.southbury.usf.ibm.com
Paul, I investigated this some more. It turns out that we do have a
defective mapping table file in ICU.

The file that a different team generated for us for ICU 1.7 (I think)
incorrectly marks the mapping of U+3000 to A1A1 as a "substitution
mapping", which ICU ignores because it has its own user-customizable
handling of substitution etc.

The file in our mapping table repository has this pair marked properly as
a roundtrip mapping. Please download the correct file from
http://oss.software.ibm.com/cvs/icu/~checkout~/charset/data/ucm/ibm-1383_P110-2000.ucm?rev=1.1&content-type=text/plain
and replace the ibm-1383.ucm file in your ICU build.

This does not show up when converting from Unicode to GB 2312 because,
using the default converter fallback, you still get A1A1 as output for
U+3000 because that is the substitution character.
The problem is when converting from GB 2312 to Unicode, where A1A1 is
marked as unassigned and the default callback writes the Unicode
substitution character U+FFFD.
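
The asymmetry described above can be illustrated with a toy model. This is a sketch under stated assumptions, not ICU code: the table layout, the `"roundtrip"`/`"substitution"` labels, and the helpers `build_tables`, `encode`, and `decode` are all hypothetical, and the input is assumed to consist only of two-byte sequences.

```python
SUBCHAR = b"\xa1\xa1"   # substitution bytes for the codepage, per the thread
REPLACEMENT = "\ufffd"  # Unicode substitution character

def build_tables(entries):
    """Build lookup tables from (character, bytes, kind) entries.

    Entries marked "substitution" are skipped, mimicking a converter that
    ignores such mappings and handles substitution itself.
    """
    from_unicode, to_unicode = {}, {}
    for ch, bs, kind in entries:
        if kind == "roundtrip":
            from_unicode[ch] = bs
            to_unicode[bs] = ch
    return from_unicode, to_unicode

def encode(text, from_unicode):
    # Unmapped characters become the codepage substitution bytes.
    return b"".join(from_unicode.get(ch, SUBCHAR) for ch in text)

def decode(data, to_unicode):
    # Toy assumption: the input is a sequence of two-byte pairs.
    pairs = [data[i:i + 2] for i in range(0, len(data), 2)]
    return "".join(to_unicode.get(p, REPLACEMENT) for p in pairs)

# Hypothetical two-entry tables: the defective one marks U+3000 <-> A1A1
# as a substitution mapping, the corrected one marks it as roundtrip.
defective = build_tables([("\u3000", b"\xa1\xa1", "substitution"),
                          ("\u3001", b"\xa1\xa2", "roundtrip")])
corrected = build_tables([("\u3000", b"\xa1\xa1", "roundtrip"),
                          ("\u3001", b"\xa1\xa2", "roundtrip")])

# Encoding hides the defect: both tables produce A1A1 for U+3000.
print(encode("\u3000", defective[0]) == encode("\u3000", corrected[0]))

# Decoding exposes it: the defective table yields U+FFFD for A1A1.
print(hex(ord(decode(b"\xa1\xa1", defective[1]))))
print(hex(ord(decode(b"\xa1\xa1", corrected[1]))))
```

In this model the defect is invisible in the Unicode-to-GB 2312 direction only because the substitution bytes happen to coincide with the bytes the missing mapping would have produced.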

I am going to submit a bug report, and we will fix the mapping table for
ICU 2.1.

Thank you very much for bringing this to our attention!

Best regards,
markus


#### zh.html has been removed from this note on January 30 2002 by Markus
Scherer


Paul Deuter

Jan 30, 2002, 8:50:35 PM
to Markus Scherer, George Rhoten, icu-ch...@www-126.southbury.usf.ibm.com
Thank you very much for your help. I am really happy to
get such a quick answer to this problem. The workaround
is a big help to us.

-Paul


