Getting Chinese characters from RTF

Tony Duff

Sep 21, 2009, 3:26:01 AM

I've got an RTF import method in my word-processing program. I'm trying to
get Chinese characters out of the RTF and display them in the program.

I have found that RTF generated by English versions of Word always gives the
Unicode equivalent of the Chinese, which makes the Chinese easy to deal with.
However, to my amazement, Chinese Word under Chinese Windows, even
Word 2007, generates RTF without Unicode equivalents; it only contains the
composite font method for representing the characters. Perhaps there is a
way to change this (please let me know if there is), but I still have to parse
the older method of representing Chinese in RTF and display it in my program.

For instance, the RTF code typically contains

\loch \afxxxx \hich \afyyy \dbch \'nn \'zz

where the represented character is the double-byte code nnzz. I can get the
code from the RTF and push it out to the display using TextOutW. However,
the Chinese character that results is not the correct character. I tried
reversing the order to zznn, but that still produced the wrong character.

I have searched high and low for a clear description of how to deal with
this RTF code for composite Chinese and also how to display it but can find
nothing. I'd like to start a discussion that will delve through all of this
and provide all of the needed answers.

Thanks for your help,
Tony Duff

Mihai N.

Sep 21, 2009, 4:28:58 AM

> I have searched high and low for a clear description of how to deal with
> this RTF code for composite Chinese and also how to display it but can find
> nothing.

The RTF reference has everything you need.
It is the ultimate reference.
http://www.microsoft.com/downloads/details.aspx?FamilyID=dd422b8d-ff06-4207-b476-6b5396a18a2b&displaylang=en

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Tony Duff

Sep 24, 2009, 5:50:10 AM

"Mihai N." wrote:

> The RTF reference has everything you need.
> It is the ultimate reference.
> http://www.microsoft.com/downloads/details.aspx?FamilyID=dd422b8d-ff06-4207-b476-6b5396a18a2b&displaylang=en

Well, it might be. However, I read it from top to bottom before
sending my post, and my question remains. I have grabbed the pair of characters
according to the RTF reference. However, I don't know which order the
pair should be displayed in. Should they be displayed one after another in
the same order as they are encoded in RTF? I have displayed them in both
possible orders but still get the wrong Chinese character, even though I apply
the appropriate font. Are the characters meant to be displayed one after
another using something like TextOut, or are their values to be combined
somehow and the combined value then displayed with TextOutW? I can find
no documentation of the required procedure in the reference
you suggested, in earlier versions of it, or anywhere on the net (at
least in English). Therefore, your assistance with this
would be appreciated.

Thanks,
Tony Duff

Remy Lebeau

Sep 24, 2009, 2:42:10 PM

"Tony Duff" <Tony...@discussions.microsoft.com> wrote in message
news:DB1F4DA1-7D0B-4540...@microsoft.com...

> For instance, the RTF code typically contains
>
> \loch \afxxxx \hich \afyyy \dbch \'nn \'zz
>
> where the represented character is the double-byte code nnzz. I
> can get the code from the RTF and push it out to the display using
> TextOutW. However, the Chinese character that results is not the
> correct character. I tried reversing the order to zznn, but that still
> produced the wrong character.

The Win32 API is based on Unicode (UTF-16 or UCS-2, depending on your Windows
version). The RTF you showed uses the older DBCS (Double-Byte Character
Set) mechanism rather than the newer Unicode mechanism defined in
the current RTF spec. DBCS is not Unicode. In DBCS, a 2-byte value
consists of a special lead byte and a trail byte. If you want to process the
DBCS manually, you have to know which lead bytes are defined for the
particular codepage being used (specified in the RTF header) in order to
translate DBCS into UTF-16/UCS-2, which you can then pass to the Win32 API.
You can use GetCPInfo() to retrieve a given codepage's characteristics,
including its lead byte ranges. Better, just use the
MultiByteToWideChar() function to convert the raw bytes from the RTF
(whether they be single-byte Ansi, multi-byte Ansi, or DBCS) into
UTF-16/UCS-2 Unicode for the rest of the Win32 API to use. For example:

char RTFBytes[2];         // 2 = maximum bytes in a DBCS character
wchar_t UnicodeChars[2];  // 2 to handle UTF-16 surrogates
RTFBytes[0] = ...;        // the \'nn value
RTFBytes[1] = ...;        // the \'zz value
int len = MultiByteToWideChar(CodePageFromRTFHeader, 0, RTFBytes, 2,
                              UnicodeChars, 2);
if (len > 0)              // 0 means the conversion failed
    TextOutW(..., UnicodeChars, len);
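
For anyone who does want to walk the DBCS bytes by hand, here is a minimal portable sketch of the lead-byte test. It is an illustration, not code from this thread: the names is_lead_byte and count_dbcs_chars are made up, and the two range tables are hard-coded in the layout that GetCPInfo() returns in CPINFO.LeadByte (inclusive first/last pairs ending with a 0,0 pair) so that the sketch compiles without the Windows headers.

```c
#include <assert.h>
#include <stddef.h>

/* Lead-byte ranges in the CPINFO.LeadByte layout: inclusive (first, last)
   pairs, terminated by a 0,0 pair. Hard-coded here for illustration; on
   Windows you would read them from GetCPInfo() instead. */
static const unsigned char LEAD_CP932[] = { 0x81, 0x9F, 0xE0, 0xFC, 0, 0 };
static const unsigned char LEAD_CP936[] = { 0x81, 0xFE, 0, 0 };

/* Return 1 if b starts a two-byte character in the given code page. */
static int is_lead_byte(const unsigned char *ranges, unsigned char b)
{
    size_t i;
    assert(ranges != NULL);
    for (i = 0; ranges[i] != 0 || ranges[i + 1] != 0; i += 2) {
        if (b >= ranges[i] && b <= ranges[i + 1])
            return 1;
    }
    return 0;
}

/* Count the characters in a byte run: a lead byte consumes the next byte too.
   A lead byte at the very end (truncated input) is counted as one character. */
static size_t count_dbcs_chars(const unsigned char *ranges,
                               const unsigned char *bytes, size_t n)
{
    size_t i = 0, count = 0;
    assert(bytes != NULL);
    while (i < n) {
        i += (is_lead_byte(ranges, bytes[i]) && i + 1 < n) ? 2 : 1;
        count++;
    }
    return count;
}
```

The point of the sketch is that the byte order is never swapped: the lead byte comes first and the trail byte second, exactly as they appear in the RTF, and the pair is converted as a unit rather than drawn as a 16-bit value.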

--
Remy Lebeau (TeamB)

Mihai N.

Sep 25, 2009, 6:37:13 AM

> However, I don't know which order the
> pair should be displayed in. Should they be displayed one after another in
> the same order as they are encoded in RTF?

Yes.

> Are the characters meant to be displayed one after
> another using something like TextOut, or are their values to be combined
> somehow and the combined value then displayed with TextOutW?

Characters can be represented in RTF in 2 different ways:
- if you see \u#### that is a Unicode value (written in decimal), use it
as a code unit with TextOutW
- if you see \'##\'## those are bytes (written in hex) of the text
represented in a certain code page (determined by the font entry in the
font table)

So if you take a Chinese RTF and try to render the \'## encoded characters,
they are probably in cp936 (simplified) or cp950 (traditional), and
you will have to convert them to Unicode and use TextOutW
(you can't display Chinese on non-Chinese OS without Unicode)

Example

{\fonttbl
{\f0\froman\fprq1\fcharset128 MS PGothic;}
{\f1\fswiss\fcharset1 Mangal;}
{\f2\fswiss\fcharset177 Arial;}
{\f3\fnil\fcharset0 Courier New;}
}

...
\ltrpar\lang1041\f0\fs20\'93\'fa\'96\'7b\'8c\'ea\par
\lang1081\f1\u2347?\u2327?\u2354?\u2381?\u2327?\par
\lang1037\f2\rtlch\'e9\'f7\'f0\'f8\'f7'\par

\ltrpar\lang1041\f0\fs20\'93\'fa\'96\'7b\'8c\'ea\par
\f0 : Font 0: fcharset128 = Japanese, cp932
Take the 6 bytes (93 fa 96 7b 8c ea), convert them to
Unicode using MultiByteToWideChar with code page 932,
and you will obtain 3 Unicode characters: U+65E5 U+672C U+8A9E
that you can put on screen with TextOutW
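
Getting those 6 bytes out of the RTF text can be sketched roughly as below. This is a hypothetical helper (collect_hex_run and hex_val are names invented here), and it only handles an unbroken run of \'hh escapes; a real reader also has to cope with control words and groups in between.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Collect a run of consecutive \'hh escapes starting at rtf into out.
   Stops at the first thing that is not a \'hh escape or when out is full.
   Returns the number of bytes stored; *consumed gets the characters read. */
static size_t collect_hex_run(const char *rtf, unsigned char *out,
                              size_t out_cap, size_t *consumed)
{
    size_t pos = 0, n = 0;
    assert(rtf != NULL && out != NULL && consumed != NULL);
    while (n < out_cap && rtf[pos] == '\\' && rtf[pos + 1] == '\'') {
        int hi = hex_val(rtf[pos + 2]);
        int lo = (hi < 0) ? -1 : hex_val(rtf[pos + 3]);
        if (hi < 0 || lo < 0)
            break;                      /* malformed escape: stop here */
        out[n++] = (unsigned char)(hi * 16 + lo);
        pos += 4;
    }
    *consumed = pos;
    return n;
}
```

The resulting buffer is what gets handed, in the same order, to MultiByteToWideChar together with the code page of the current font.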

\lang1081\f1\u2347?\u2327?\u2354?\u2381?\u2327?\par
You have Unicode values directly. \u takes a decimal number, so these are
U+092B U+0917 U+0932 U+094D U+0917.
Don't care about the charset of the font, just use TextOutW
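
A small sketch of reading the \u parameter, assuming the caller has already matched the "\u" itself. The function name rtf_u_to_utf16 is invented for illustration. The one subtlety is that RTF stores the number as a signed 16-bit decimal, so code units above 32767 are written as negative numbers.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse the decimal parameter that follows "\u" and return it as a UTF-16
   code unit. RTF writes the value as a signed 16-bit decimal number, so
   code units above 32767 appear as negative values (e.g. \u-10179).
   If end is not NULL it receives the position just past the digits. */
static unsigned short rtf_u_to_utf16(const char *param, const char **end)
{
    char *stop;
    long v;
    assert(param != NULL);
    v = strtol(param, &stop, 10);
    if (end != NULL)
        *end = stop;
    if (v < 0)
        v += 65536;                 /* map -32768..-1 onto 0x8000..0xFFFF */
    return (unsigned short)(v & 0xFFFF);
}
```

For the line above, the parameter text "2347?" yields the code unit 0x092B and leaves the position on the "?" fallback character.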

\lang1037\f2\rtlch\'e9\'f7\'f0\'f8\'f7'\par
\f2 : Font 2: fcharset177 : Hebrew, cp1255
Convert bytes to Unicode using MultiByteToWideChar with code page 1255,
put on screen with TextOutW

You can get a list of all charsets from WinGDI.h
But you don't need that. Just use TranslateCharsetInfo with TCI_SRCCHARSET
to map the charset to a code page.
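
If you are not on Windows, or just want to see what that mapping looks like, a table-based stand-in can be sketched as follows. The function name charset_to_codepage is made up; the numbers are the usual *_CHARSET values from WinGDI.h and the code pages they correspond to. On Windows, TranslateCharsetInfo remains the simpler route.

```c
#include <assert.h>
#include <stddef.h>

/* Map an RTF \fcharsetN value to a Windows code page.
   Portable stand-in for TranslateCharsetInfo, for illustration only.
   Returns 1 and stores the code page on success, 0 if the charset is
   not in the table (e.g. DEFAULT_CHARSET = 1). */
static int charset_to_codepage(int fcharset, unsigned *codepage)
{
    static const struct { int charset; unsigned codepage; } map[] = {
        {   0, 1252 },  /* ANSI_CHARSET        */
        { 128,  932 },  /* SHIFTJIS_CHARSET    */
        { 129,  949 },  /* HANGUL_CHARSET      */
        { 134,  936 },  /* GB2312_CHARSET      */
        { 136,  950 },  /* CHINESEBIG5_CHARSET */
        { 161, 1253 },  /* GREEK_CHARSET       */
        { 162, 1254 },  /* TURKISH_CHARSET     */
        { 163, 1258 },  /* VIETNAMESE_CHARSET  */
        { 177, 1255 },  /* HEBREW_CHARSET      */
        { 178, 1256 },  /* ARABIC_CHARSET      */
        { 186, 1257 },  /* BALTIC_CHARSET      */
        { 204, 1251 },  /* RUSSIAN_CHARSET     */
        { 222,  874 },  /* THAI_CHARSET        */
        { 238, 1250 }   /* EASTEUROPE_CHARSET  */
    };
    size_t i;
    assert(codepage != NULL);
    for (i = 0; i < sizeof map / sizeof map[0]; i++) {
        if (map[i].charset == fcharset) {
            *codepage = map[i].codepage;
            return 1;
        }
    }
    return 0;
}
```

With the font table from the example, \f0 (fcharset128) maps to 932 and \f2 (fcharset177) maps to 1255, while \f1 (fcharset1) is not in the table, which does no harm there because that line only uses \u.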

Mihai N.

Sep 25, 2009, 6:44:42 AM

It might also be good to read how \uc<n> works, to
understand when you will see \', when \u, and when both
(if you see both, you have to ignore one of them,
usually the \' stuff).
But you have to know what "mode" you are in (\uc),
otherwise you will end up skipping stuff or showing
extra garbage characters.
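
A rough sketch of that bookkeeping, under simplifying assumptions: the input is a flat run with no { } groups (in real RTF the \uc value is scoped to the group), and only \ucN, \uN and plain ASCII text are understood. The name decode_unicode_run is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Decode a flat run of RTF text (no groups) into UTF-16 code units.
   Handles \ucN, \uN plus its fallback, and plain ASCII text; stops at
   anything else. Returns the number of code units written to out. */
static size_t decode_unicode_run(const char *rtf, unsigned short *out,
                                 size_t cap)
{
    size_t n = 0;
    long uc = 1;                        /* RTF default: one fallback byte */
    const char *p = rtf;
    assert(rtf != NULL && out != NULL);

    while (*p != '\0' && n < cap) {
        if (p[0] == '\\' && p[1] == 'u' && p[2] == 'c' &&
            p[3] >= '0' && p[3] <= '9') {
            char *end;
            uc = strtol(p + 3, &end, 10);
            p = (*end == ' ') ? end + 1 : end;  /* a space ends the word */
        } else if (p[0] == '\\' && p[1] == 'u' &&
                   (p[2] == '-' || (p[2] >= '0' && p[2] <= '9'))) {
            char *end;
            long v = strtol(p + 2, &end, 10);
            long skip = uc;
            if (v < 0)
                v += 65536;             /* signed 16-bit decimal in RTF */
            out[n++] = (unsigned short)(v & 0xFFFF);
            p = (*end == ' ') ? end + 1 : end;
            /* Skip the fallback: each \'hh escape or plain character
               counts as one of the uc items to ignore. */
            while (skip > 0 && *p != '\0') {
                if (p[0] == '\\' && p[1] == '\'' &&
                    p[2] != '\0' && p[3] != '\0')
                    p += 4;
                else if (p[0] == '\\')
                    break;              /* another control word: stop */
                else
                    p += 1;
                skip--;
            }
        } else if (p[0] != '\\') {
            out[n++] = (unsigned char)*p++;     /* plain ASCII text */
        } else {
            break;              /* control word this sketch does not handle */
        }
    }
    return n;
}
```

With \uc2 in effect, each \u is followed by two fallback bytes that get dropped. If the reader wrongly assumes the default of one, the second byte of each pair is left over, which is where the extra garbage characters come from.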