RTF \fcharset??? to Unicode mapping tables

Christian Roth

Apr 4, 2002, 6:21:42 PM

Hello,

I am working on a Java application that implements an RTF reader capable
of reading international RTF (Rich Text Format) documents.

Basically, I am looking for definitive tables that describe the
\fcharsetN to Unicode mapping used by Word, especially for the East Asian
character sets (N = 128, 129, 130, 134, 136, 163, 222, but also for
the remaining values, if available). Any pointers? I'm looking for
something like the tables downloadable from e.g.

ftp://ftp.unicode.org//Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT

My specific problem at hand:

I have a document which defines in the font table a font like this:

{\f17\fnil\fcharset134\fprq2{\*\panose
02010600030101010101}\'cb\'ce\'cc\'e5{\*\falt SimSun};}

i.e. using fcharset 134, which is GB2312. Later on, I have the following
document text excerpt:

{\b\fs36\lang1033\langfe1028\langfenp1028 \loch\af0\hich\af0\dbch\f17
\'d6\'d0\'87\'f8\'9a\'76\'ca\'b7\'b5\'c4\'c3\'d8\'c3\'dc}

i.e. it uses GB2312 for all the text. The table I am using for
converting the GB2312 to Unicode is the one now located at

ftp://ftp.unicode.org//Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT

Unfortunately, the second and third two-byte characters in the above text
(0x87f8 and 0x9a76) are not part of that mapping table: they are not
valid two-byte characters (neither 0x87 nor 0x9a is a valid DBCS
lead byte), nor do the individual bytes fall in the one-byte range of
GB2312 (character positions 0-127).
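
To look at the raw byte values in a run like the one above, the \'hh escapes can be pulled out into a byte array first. This is only a minimal sketch: the class and method names are made up for illustration, and it deliberately ignores the rest of RTF syntax (control words, groups, escaped backslashes), so it is not a real RTF tokenizer.

```java
import java.io.ByteArrayOutputStream;

final class RtfHexEscapes {
    /**
     * Collects, in order, the bytes written as \'hh escapes in an RTF text run.
     * Simplification (assumption): everything that is not a \'hh escape is skipped,
     * and an escaped backslash directly before an apostrophe is not handled.
     */
    static byte[] extract(String rtf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < rtf.length()) {
            boolean isEscape = i + 3 < rtf.length()
                    && rtf.charAt(i) == '\\'
                    && rtf.charAt(i + 1) == '\''
                    && Character.digit(rtf.charAt(i + 2), 16) >= 0
                    && Character.digit(rtf.charAt(i + 3), 16) >= 0;
            if (isEscape) {
                out.write(Integer.parseInt(rtf.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                i++;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String run = "\\'d6\\'d0\\'87\\'f8\\'9a\\'76\\'ca\\'b7\\'b5\\'c4\\'c3\\'d8\\'c3\\'dc";
        StringBuilder hex = new StringBuilder();
        for (byte b : extract(run)) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(hex.toString().trim());
    }
}
```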

Now, the document itself has a default encoding of

{\rtf1\ansi\ansicpg936

i.e. it uses CP936 as the default codepage. I found that the above
characters (0x87f8 and 0x9a76) actually are two-byte characters and part
of the CP936 codepage, and they also map to the correct Unicode
characters in that encoding.
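
The observation can be reproduced with the JDK's own charsets by decoding the same bytes strictly, so that an invalid sequence is reported instead of silently replaced. This is a sketch under assumptions: it uses the JDK charset names "GB2312" and "GBK", with "GBK" standing in for CP936 (the two are close but may not be identical in every code point), and both charsets must be present in the runtime. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecode {
    // The bytes from the \'hh run quoted above.
    static final byte[] SAMPLE = {
        (byte) 0xd6, (byte) 0xd0, (byte) 0x87, (byte) 0xf8, (byte) 0x9a, (byte) 0x76,
        (byte) 0xca, (byte) 0xb7, (byte) 0xb5, (byte) 0xc4, (byte) 0xc3, (byte) 0xd8,
        (byte) 0xc3, (byte) 0xdc
    };

    /** Returns the decoded text, or null if any byte sequence is invalid in the charset. */
    static String decodeStrict(byte[] bytes, String charsetName) {
        try {
            return Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // at least one sequence is not valid in this charset
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"GB2312", "GBK"}) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this runtime");
                continue;
            }
            String text = decodeStrict(SAMPLE, name);
            System.out.println(name + ": " + (text == null ? "rejected" : "decoded " + text.length() + " characters"));
        }
    }
}
```

If the strict GB2312 decode is rejected while the GBK decode succeeds, that matches what is described above: the bytes are valid in the larger code page but not in the plain GB2312 table.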

So I might be led to the assumption that the specified default encoding
for an RTF document serves as a "fallback" encoding when the font
encoding itself does not contain a specific character encountered in
body text. Is this correct (and always the case for any combination of
default and font encoding)? Or is this just a coincidence in this
special case?

I'd appreciate any light you might be able to shed on this...

Regards, Christian.

--
ro...@infinity-loop.de
http://www.infinity-loop.de

--
Christian Roth
http://www.visualclick.de/

Mihai N.

Apr 5, 2002, 6:41:14 AM

> i.e. using fcharset 134, which is GB2312.

Here is the problem. \fcharset 134 is not GB2312. It is Windows 936,
which is "almost" gb2312, with some Microsoft improvements.

Same as Windows 1252 which is "almost" iso-8859-1,
same as Windows 932 which is "almost" shift-jis, and so on.

But MS is not the only one doing this. I also compared the Mac Japanese
code page: it is also "almost" shift-jis, slightly different (but not the
same as the MS 932 code page).

As a general rule, I think it is safer to consider that all MS products
will use the MS codepages and not the standard ones:

1252, not iso-8859-1 (Latin1)
1250, not iso-8859-2 (Latin2)
...
932, not shift-jis
936, not gb2312
950, not big5
...
etc.
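
In a reader, this rule of thumb could be expressed as a lookup from the \fcharset value to a Windows code page number. The sketch below is my own illustration, not an official table: the pairs are the commonly documented Windows charset-to-code-page correspondences as far as I know them, the class and method names are hypothetical, and the list should be checked against the RTF specification before relying on it.

```java
import java.util.HashMap;
import java.util.Map;

final class FcharsetCodePages {
    private static final Map<Integer, Integer> CODE_PAGE = new HashMap<>();

    static {
        CODE_PAGE.put(0, 1252);   // ANSI (Western)
        CODE_PAGE.put(128, 932);  // Japanese (Shift-JIS based)
        CODE_PAGE.put(129, 949);  // Korean (Hangul)
        CODE_PAGE.put(130, 1361); // Korean (Johab)
        CODE_PAGE.put(134, 936);  // Simplified Chinese (GB2312 based)
        CODE_PAGE.put(136, 950);  // Traditional Chinese (Big5 based)
        CODE_PAGE.put(161, 1253); // Greek
        CODE_PAGE.put(162, 1254); // Turkish
        CODE_PAGE.put(163, 1258); // Vietnamese
        CODE_PAGE.put(177, 1255); // Hebrew
        CODE_PAGE.put(178, 1256); // Arabic
        CODE_PAGE.put(186, 1257); // Baltic
        CODE_PAGE.put(204, 1251); // Cyrillic
        CODE_PAGE.put(222, 874);  // Thai
        CODE_PAGE.put(238, 1250); // Central European
    }

    /** Returns the Windows code page for an \fcharset value, or -1 if it is not in this table. */
    static int codePageFor(int fcharset) {
        Integer cp = CODE_PAGE.get(fcharset);
        return cp == null ? -1 : cp;
    }

    public static void main(String[] args) {
        System.out.println("fcharset 134 -> code page " + codePageFor(134));
        System.out.println("fcharset 136 -> code page " + codePageFor(136));
    }
}
```

The code page number can then be used to pick the matching Microsoft mapping table (such as CP936.TXT) rather than the standard's own table.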

Mihai

mimos...@137.189.151.233

Apr 7, 2002, 6:05:33 AM

Dear Roth,

※ Quoting ro...@visualclick.de (Christian Roth):
: {\b\fs36\lang1033\langfe1028\langfenp1028 \loch\af0\hich\af0\dbch\f17

: \'d6\'d0\'87\'f8\'9a\'76\'ca\'b7\'b5\'c4\'c3\'d8\'c3\'dc}
: i.e. it uses GB2312 for all the text. The table I am using for
: converting the GB2312 to Unicode is the one now located at
: ftp://ftp.unicode.org//Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT
: Unfortunately, the second and third two-byte characters in above text (=
: 0x87f8 and 0x9a76) are not part of that very mapping table: they are
: neither valid two-byte characters (neither 0x87 nor 0x9a are valid DBCS
: start characters) nor are the single characters in the one-byte range of
: GB2312 (character positions 0-127).

I believe I can answer your question, as I can read Chinese and am
currently involved in an RTF processing project. This has nothing to do
with the RTF standard; what puzzled you is a change in the Chinese
national standard.

If you know some Chinese, you will know that there are two scripts:
the traditional script (used in Taiwan and Hong Kong) and the
simplified script (used in Mainland China and Singapore). People in
Macau and the Chinese in Malaysia use either. GB2312 used to be the
standard for Simplified Chinese, but that is not the complete story.
There are several versions of GB2312, although the later one has a
different standard number. For the newer version, the authorities in
Mainland China tried to incorporate Traditional Chinese, so that all
Chinese-speaking people could use a single standard. In the Chinese
text you quoted, the second character (0x87f8) and the third character
(0x9a76) are in the traditional script, while the other characters look
the same in both scripts. The new standard therefore needs no extra
code space for those characters, only for these two.

The oldest version of GB2312 uses only lead bytes 0xA1 to 0xFE, but in
the extended version, lead bytes 0x81 to 0xA0 are used as well.
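
The lead-byte ranges just mentioned can be turned into a quick check that tells which bytes of a run only make sense under the extended encoding. This is a rough sketch: the boundaries are taken directly from the sentence above, the class and enum names are invented for illustration, and real mapping tables leave some positions inside these ranges unassigned.

```java
final class GbLeadByte {
    enum Kind { SINGLE_BYTE, ORIGINAL_RANGE, EXTENDED_RANGE, OTHER }

    /** Classifies one byte value (0-255) by the lead-byte ranges described above. */
    static Kind classify(int b) {
        if (b < 0x80) {
            return Kind.SINGLE_BYTE;    // one-byte range, positions 0-127
        }
        if (b >= 0xA1 && b <= 0xFE) {
            return Kind.ORIGINAL_RANGE; // lead bytes of the oldest version
        }
        if (b >= 0x81 && b <= 0xA0) {
            return Kind.EXTENDED_RANGE; // lead bytes added by the extended version
        }
        return Kind.OTHER;              // 0x80 and 0xFF are in neither range
    }

    public static void main(String[] args) {
        int[] leadBytes = {0xD6, 0x87, 0x9A};
        for (int b : leadBytes) {
            System.out.printf("0x%02X -> %s%n", b, classify(b));
        }
    }
}
```

For the quoted text, 0xD6 falls in the original range while 0x87 and 0x9A fall in the extended range, which is why a table for the oldest version has no entries for them.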

One additional thing: if you intend to sell your application in
Mainland China, it will be required to support the newest GB18030
standard, which is a superset of GB2312 and the existing Unicode
standard. You can check this on the Microsoft website, but I believe
the content is all in Chinese. Tell us if you need any further help on
this topic.

mimosa

--
※ Source: Hong Kong Mimosa Private Club (香港含羞草私人會所), 137.189.151.233 ※
[FROM: 127.0.0.1]

Jack

Apr 9, 2002, 2:30:40 PM

Hello Christian,

Just wanted to let you know that the hexadecimal codes in your email are not
Simplified Chinese (code page 936, GB2312 character set). They are Traditional
Chinese (BIG5 encoding). It seems to me the RTF was generated by Word 6 on
MS NT 4.0 with the Simplified Chinese font SimSun, but the contents and
encoding are Traditional Chinese. In English it means "The Secret of
Chinese History". I'll send the screen dumps to you directly, since I cannot
post attached image files to the newsgroup.

BIG5 encodings from you: {\b\fs36\lang1033 \loch\af0\hich\af0\dbch\f17

\'d6\'d0\'87\'f8\'9a\'76\'ca\'b7\'b5\'c4\'c3\'d8\'c3\'dc}

This is GB 2312 encodings for that term: {\b\dbch\af17

\loch\af0\hich\af0\dbch\f17

\'d6\'d0\'b9\'fa\'c0\'fa\'ca\'b7\'b5\'c4\'c3\'d8\'c3\'dc}

Some time ago I did a lot of work on converting between hexadecimal codes
and characters to handle Simplified/Traditional Chinese, Japanese, and
Korean in RTF and other text formats. At that time RTF was at version
1.2/1.3/1.4, for Word 6.0 and Word 97. I think RTF is now at version 1.6.
Your code is still RTF 1.3, if I am not mistaken. I think you should get
the RTF 1.6 specification from Microsoft. It includes all new control words
introduced by Microsoft Word for Windows 95 version 7.0, Word 97 for
Windows, Word 98 for the Macintosh, and Word 2000 for Windows, as well as
other Microsoft products.

Hope it helps,

Jack
"Christian Roth" <ro...@visualclick.de> wrote in message
news:1fa568x.ryao5g8mw1z4N%ro...@visualclick.de...