Proposed HZ-like 7-bit rendition of Big-5 character set

Ya-Gui Wei

Apr 27, 1994, 11:46:55 PM

Thanks to Fung Fung Lee for helping me recover the following
document. Heeding Steve's suggestion, I am posting it here for
comment without further change.

Nonetheless, there are a few points that I have discussed with
Fung and others and would like to present here separately:

1. The spec probably should be presented as a 7-bit rendition
of Big-5 without any inherent connection to HZ, i.e., a developer
should feel free to implement it separately and independently
of HZ. Therefore the mention of conversion to HZ in the last
paragraph should be deleted.

2. The proposal that Big-5 escape sequences may occur within
GB context (and vice versa) probably should also be deleted.

3. The name of the spec probably should change to "HZ-Big5"
or something similar.

4. Some of the proposed escape sequences should probably be
changed, "~|" in particular, so that users don't inadvertently
execute a Unix command when trying to send mail.

5. The escape mechanism should be kept compatible with HZ and
leave room (there is plenty of room) for future expansion.

BTW, CCNET-L has been rejecting my postings. I haven't been
a reader for a while. If someone deems it proper to forward
this message there, please feel free to do so.

Ya-Gui Wei

-------
Date: Sat, 3 Jul 1993 19:17:11 -0500
From: "Ya-Gui Wei" <ya...@cs.indiana.edu>
To: l...@fritter.stanford.edu
Subject: HZ-2 spec

*HZ2.TXT*

[This is an ASCII document.]

A Proposal For Extending The "HZ" Specification
(Draft 1.1b)

May 13, 1993
Last Revised: July 2, 1993

The "HZ" representation (Fung Fung Lee, 1989) of GB2312 Chinese text
has been implemented in various software packages under different
platforms, and tested in the usenet newsgroup "alt.chinese.text" and
various other forums with considerable success. This proposal seeks
to expand the specification to _simultaneously_ accommodate other
Hanzi coding standards under an umbrella scheme that is analogous to
the original "HZ" scheme (the current proposal only seeks to include
characters in the Big5 character set).

** ENCODING GB2312 **
=====================

The encoding of GB2312 characters will be in accordance with the original
HZ specification (hereafter, HZ-1). All provisions in HZ-1, except as
otherwise amended below, will remain valid.

This proposed extended HZ specification may be referred to as HZ-2.

** NEW ESCAPE SEQUENCES **
==========================

Two new tilde escape sequences will be defined in HZ-2:

~| -- Escape into Big5 block 1;
~+ -- Escape into Big5 block 2;

The above escape sequences may occur anywhere in the ASCII context.
They may also occur within the Hanzi context, provided that the
first byte of the escape sequence aligns with the first byte of the
Hanzi character that would otherwise have occupied that position.

The ~{ escape sequence defined in HZ-1 is extended so that it can
also occur within the Hanzi context, subject to the alignment
requirement described in the previous paragraph. Its definition
of "Escape to 7-bit GB" remains unchanged.

The ~} sequence, which only occurs within the Hanzi context and must
be aligned as described above, serves to escape back to ASCII from
any Hanzi context.

Improperly aligned escape sequences in the Hanzi context will not
be interpreted as escape sequences, but rather, will be regarded
as components of Hanzi characters.

Examples of valid HZ-2 sequences follow:

~{UbJGR;>d;0!#~} => All characters from GB.
~|)})M.A.n~} => All characters from Big5.
~|Ps~{WVTuC4Dn#?~} => First character is Big5.
~{KDJ.SV3F~+!m~} => Last character is Big5 (block 2).
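
The alignment rule can be sketched as a small decoder. This is only an
illustration under stated assumptions, not part of the proposal: the
names ctx_t, hz2_char, escape_target and hz2_decode are made up here,
the Big5 arithmetic is the backward conversion given in the ENCODING
Big5 section, and of the HZ-1 ASCII-mode escapes only "~~" is handled
(line continuation is left out).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CTX_ASCII, CTX_GB, CTX_BIG5_1, CTX_BIG5_2 } ctx_t;

/* One decoded character. For CTX_ASCII, b1 is the character and b2 is 0;
   otherwise (b1, b2) is the 8-bit GB2312 or Big5 code. */
typedef struct { ctx_t set; uint8_t b1, b2; } hz2_char;

/* Map the byte after '~' to the context it selects, or return -1 if
   "~c" is not an escape in the current context. */
static int escape_target(ctx_t cur, char c)
{
    switch (c) {
    case '{': return CTX_GB;
    case '|': return CTX_BIG5_1;
    case '+': return CTX_BIG5_2;
    case '}': return cur == CTX_ASCII ? -1 : CTX_ASCII;
    default:  return -1;
    }
}

/* Decode a NUL-terminated HZ-2 string into at most cap characters.
   Returns the number of characters written to out. */
static size_t hz2_decode(const char *s, hz2_char *out, size_t cap)
{
    ctx_t ctx = CTX_ASCII;
    size_t i = 0, n = 0;

    assert(s != NULL && out != NULL);
    while (s[i] != '\0' && n < cap) {
        /* The scanner only ever stops on a character boundary, so a
           "~x" seen here is aligned. A '~' in the second half of a
           Hanzi pair is consumed below and never reaches this test. */
        if (s[i] == '~' && s[i + 1] != '\0') {
            int t = escape_target(ctx, s[i + 1]);
            if (t >= 0) { ctx = (ctx_t)t; i += 2; continue; }
            if (ctx == CTX_ASCII && s[i + 1] == '~') {  /* HZ-1: "~~" is '~' */
                out[n++] = (hz2_char){ CTX_ASCII, '~', 0 };
                i += 2;
                continue;
            }
        }
        if (ctx == CTX_ASCII) {
            out[n++] = (hz2_char){ CTX_ASCII, (uint8_t)s[i], 0 };
            i += 1;
            continue;
        }
        if (s[i + 1] == '\0') break;          /* dangling half character */

        uint8_t a1 = (uint8_t)s[i], a2 = (uint8_t)s[i + 1];
        hz2_char c = { ctx, 0, 0 };
        if (ctx == CTX_GB) {                  /* HZ-1: set the high bits */
            c.b1 = (uint8_t)(a1 | 0x80);
            c.b2 = (uint8_t)(a2 | 0x80);
        } else {                              /* Big5 block 1 or block 2 */
            uint8_t base = (ctx == CTX_BIG5_1) ? 0xA1 : 0xD0;
            c.b1 = (uint8_t)(((a1 - 0x21) >> 1) + base);
            c.b2 = (uint8_t)(a2 | (((a1 - 0x21) & 1) << 7));
        }
        out[n++] = c;
        i += 2;
    }
    return n;
}
```

Fed the third example, hz2_decode yields one Big5 block 1 character
(0xB8F3) followed by five GB characters. Fed "~{!~{!~}", it yields two
GB characters, because the inner "~{" straddles a character boundary
and is therefore not treated as an escape.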

** ENCODING Big5 **
====================

The Big5 character set is divided into two blocks, with block 1
consisting of the following:

- the 471 symbols;
- the 5401 frequently used characters;
- the first 1099 infrequently used characters;
- and the unused code space between frequently used and
infrequently used characters.

Block 2 consists of the remaining infrequently used characters
and the unused code space that follows.

Big5 characters in Block 1 are encoded, in a HZ-2 document, by a
pair of printable ASCII characters as follows. Note that the encoded
block 1 characters must always follow the ~| escape sequence.

Let: b1 = 1st byte of Big5 input (range 0xA1-0xCF)
     b2 = 2nd byte of Big5 input (range 0x40-0x7E, 0xA1-0xFE)
     a1 = 1st byte of HZ-2 output (range 0x21-0x7E)
     a2 = 2nd byte of HZ-2 output (range 0x21-0x7E)

Then, C code for Big5 -> HZ-2 Conversion:
     a1 = ((b1 - 0xA1) << 1) + (b2 >> 7) + 0x21;
     a2 = b2 & 0x7F;

Backward Conversion:
     b1 = ((a1 - 0x21) >> 1) + 0xA1;
     b2 = a2 | (((a1 - 0x21) & 1) << 7);

Characters in Big5 block 2 are encoded in a similar fashion as
described below. Note that the encoded block 2 characters must
always follow the ~+ escape sequence.

Let: b1, b2, a1, a2 same as for block 1, except that b1 now
     ranges 0xD0-0xFE.

Then, Big5->HZ-2 Conversion:
     a1 = ((b1 - 0xD0) << 1) + (b2 >> 7) + 0x21;
     a2 = b2 & 0x7F;

Backward Conversion:
     b1 = ((a1 - 0x21) >> 1) + 0xD0;
     b2 = a2 | (((a1 - 0x21) & 1) << 7);

In words, the high bit of the second byte is shifted into the low
bit of the first byte, while the relevant bits of the first byte
(after conversion to a zero-based ordinal) are shifted to the left
by 1 position. Adding back 0x21 converts it into printable ASCII.
Finally, byte 2 is converted to ASCII by simply stripping off
the high bit.
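
The formulas for the two blocks differ only in the base value
subtracted from b1, so they can be folded into one pair of routines.
The following is a minimal sketch along those lines. The names
hz2_pair, big5_block_base, big5_to_hz2 and hz2_to_big5 are made up for
illustration; the arithmetic is copied from the formulas above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t hi, lo; } hz2_pair;

/* Base of the block a Big5 lead byte falls in:
   0xA1 for block 1 (0xA1-0xCF), 0xD0 for block 2 (0xD0-0xFE). */
static uint8_t big5_block_base(uint8_t b1)
{
    return (b1 >= 0xD0) ? 0xD0 : 0xA1;
}

/* Big5 (b1, b2) -> HZ-2 (a1, a2), relative to the given block base. */
static hz2_pair big5_to_hz2(uint8_t b1, uint8_t b2, uint8_t base)
{
    hz2_pair a;

    assert(b1 >= base && b1 - base <= 0x2E);   /* 47 lead bytes per block */
    /* Zero-based lead byte shifted left once; the high bit of b2
       drops into the freed low bit; 0x21 makes it printable. */
    a.hi = (uint8_t)(((b1 - base) << 1) + (b2 >> 7) + 0x21);
    a.lo = (uint8_t)(b2 & 0x7F);               /* strip the high bit */
    return a;
}

/* HZ-2 (a1, a2) -> Big5 (b1, b2), relative to the given block base. */
static hz2_pair hz2_to_big5(uint8_t a1, uint8_t a2, uint8_t base)
{
    hz2_pair b;

    assert(a1 >= 0x21 && a1 <= 0x7E);
    b.hi = (uint8_t)(((a1 - 0x21) >> 1) + base);
    b.lo = (uint8_t)(a2 | (((a1 - 0x21) & 1) << 7));
    return b;
}
```

For instance, Big5 0xB8F3 in block 1 becomes the pair "Ps", and Big5
0xD06D in block 2 becomes "!m". Converting either pair back with the
same base returns the original two bytes.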

[Note: An alternative would be to leave the bits of the first
byte unshifted and move the 8th bit of the second byte to
the 7th bit of the first byte. This has the disadvantage of
creating a discontinuous range for the output, and the characters
would not remain in the same order as in the Big5 table. Moreover,
in some languages, shifting the high bit of byte 2 into the low
bit of byte 1 and shifting the rest of byte 1 can
all be accomplished by a single rotate instruction, while the
alternative is less straightforward. In C, the number of
operations is the same for both schemes.]

** CONSIDERATIONS IN CONVERSION **
==================================

A simple conversion application using the above algorithms can be
implemented very easily. However, to maintain maximal compatibility
with HZ-1, a Big5 -> HZ-2 conversion program should
include a Big5->GB conversion step, so that all Big5 characters
that can be reversibly converted to GB are converted to GB before
the conversion to HZ-2. In this way, the resulting HZ-2 file can,
for the most part, be read with existing HZ-1 compliant software,
except for the few characters that cannot be reversibly converted
to GB.
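
One way to picture the recommended Big5->GB step is an encoder that
asks a lookup routine first and falls back to the Big5 blocks only
when no reversible GB equivalent exists. The sketch below assumes such
a lookup is available as a callback; big5_to_gb_fn, hz2_encode and
toy_big5_to_gb are hypothetical names, and the toy table knows a
single pair (Big5 0xA440 and GB 0xD2BB, the character for "one"). A
real converter would need a full, carefully vetted table. Switching
directly between Hanzi contexts follows the proposal as drafted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CTX_ASCII, CTX_GB, CTX_BIG5_1, CTX_BIG5_2 } ctx_t;

/* Hypothetical lookup: returns 1 and stores the 8-bit GB2312 code in
   *gb when the Big5 code converts to GB reversibly, 0 otherwise. */
typedef int (*big5_to_gb_fn)(uint16_t big5, uint16_t *gb);

/* Toy stand-in for a real conversion table (one entry only). */
static int toy_big5_to_gb(uint16_t big5, uint16_t *gb)
{
    if (big5 == 0xA440) { *gb = 0xD2BB; return 1; }
    return 0;
}

/* Encode count Big5 codes as a NUL-terminated HZ-2 string in out.
   Stops early if out (cap bytes) would overflow. Returns the length. */
static size_t hz2_encode(const uint16_t *big5, size_t count,
                         big5_to_gb_fn to_gb, char *out, size_t cap)
{
    static const char esc[] = { '}', '{', '|', '+' };  /* indexed by ctx_t */
    ctx_t ctx = CTX_ASCII;
    size_t n = 0;

    assert(big5 != NULL && to_gb != NULL && out != NULL && cap >= 1);
    for (size_t i = 0; i < count; i++) {
        uint8_t b1 = (uint8_t)(big5[i] >> 8);
        uint8_t b2 = (uint8_t)(big5[i] & 0xFF);
        uint16_t gb;
        ctx_t want;
        char a1, a2;

        if (to_gb(big5[i], &gb)) {        /* prefer GB for HZ-1 readers */
            want = CTX_GB;
            a1 = (char)((gb >> 8) & 0x7F);
            a2 = (char)(gb & 0x7F);
        } else {                          /* fall back to a Big5 block */
            uint8_t base = (b1 >= 0xD0) ? 0xD0 : 0xA1;
            want = (b1 >= 0xD0) ? CTX_BIG5_2 : CTX_BIG5_1;
            a1 = (char)(((b1 - base) << 1) + (b2 >> 7) + 0x21);
            a2 = (char)(b2 & 0x7F);
        }
        /* Worst case per step: escape (2) + pair (2), plus the final
           "~}" (2) and the terminating NUL (1). */
        if (n + 7 > cap) break;
        if (want != ctx) {
            out[n++] = '~';
            out[n++] = esc[want];
            ctx = want;
        }
        out[n++] = a1;
        out[n++] = a2;
    }
    if (ctx != CTX_ASCII) { out[n++] = '~'; out[n++] = '}'; }
    out[n] = '\0';
    return n;
}
```

With the toy table, the three Big5 codes 0xA440, 0xB8F3 and 0xD06D
encode as "~{R;~|Ps~+!m~}": the first character goes out as GB, and the
other two stay in Big5 blocks 1 and 2.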

[EOF]

Kenichi Handa

Apr 28, 1994, 6:51:34 AM

Could someone tell me which Big5 is now standard, ETen or
HKU?

Long ago, I read an article in alt.chinese.text that ETen is
the correct one and HKU should not be used.

But I just found that the `big5' table in `csmaps' (which gives
maps between many CJK character sets and Unicode) is based
on HKU. You can get the map from:
METIS.COM [140.186.33.40]: /pub/csmaps
Map of CNS->Unicode is also in the directory.

The difference of HKU and ETen is that HKU starts Level2
from 0xC740 whereas ETen starts Level2 from 0xC940.

---
Ken'ichi HANDA
ha...@etl.go.jp

Pinghua Young

Apr 28, 1994, 8:41:51 AM

I took the liberty of forwarding Ya-Gui's recovered proposal, posted
on alt.chinese.text, to CCNET-L. Readers of ACC, please excuse
me for the repetition of the post.

*********************** begin of forwarding **************************
From: ya...@ifcss.org (Ya-Gui Wei)

Ya-Gui Wei

*HZ2.TXT*

[EOF]
*********************** end of forwarding **************************
--

Stephen G Simpson

Apr 28, 1994, 9:39:10 AM

Kenichi Handa writes:
> The difference of HKU and ETen is that HKU starts Level2
> from 0xC740 whereas ETen starts Level2 from 0xC940.

In my HZ+S 0.3 posting, I chose C940 as the starting point of Block 2,
instead of C740 or C6A1. There were two reasons for this choice.
First, I wanted to stay compatible with ETen, the most popular Big5
software, and in ETen the 7652 less frequently used Chinese characters
start at C940. Another reason, which I didn't state previously, is
that in this way HZ+S can accommodate both ETen Big5 and HKU Big5.
The point is that HZ+S doesn't care whether you are using HKU Big5 or
ETen Big5, because in both cases the less frequently used Chinese
characters start at 2121 of HZ+S Block 2. In this way HZ+S can
perhaps heal or unify the split between ETen Big5 and HKU Big5.

-- Steve

Ken Lui

Apr 28, 1994, 12:21:07 PM

In article <940428133...@cs.utexas.edu>,

Stephen G Simpson <sim...@math.psu.edu> wrote:
> In this way HZ+S can
>perhaps heal or unify the split between ETen Big5 and HKU Big5.
>

Is this the reason why when I view some documents in Big5, I get
characters that are "undefined"? This happens on my workstation
at work (HP9000) and on my Mac using the Chinese Language Kit.

Ken
--
Kenneth K.F. Lui, kl...@corp.hp.com 3000 Hanover Street M/S 20BJ
Corporate Financial Systems Palo Alto, CA 94304-1112 USA
Core Application Technologies 1.415.857.3230 Fax 1.415.852.8026

Ricky Yeung

Apr 28, 1994, 3:19:21 PM

>Could someone tell me which Big5 is now standard, ETen or
>HKU?

>Long ago, I read an article in alt.chinese.text that ETen is
>the correct one and HKU should not be used.

Let me clarify the so-called "HKU standard" once more. It's a perfect
case of ~{RT6o4+6o~},~{O07GJ$JG~} (an error passed along until it is taken as right). Maybe if I tell the entire story, it
would settle that once and for all.

It goes back to the June 4th period in 1989. One of the student
societies of HKU was feeding June 4th related news in Big5 code to the
outside world. An HKU student released a DOS viewer program for reading
that news. He bundled a bitmap font with his program and hardwired the
Big5 code range information in the code.

I was then a graduate student, and was requested to put an earlier
version of the font on my departmental machine for ftp. I examined the
font and discovered its original source. I raised the question of its
legitimacy and refused to archive it on my site. Later on, another
version of the font was created and archived at other sites.

Afterward, someone converted the bitmap file to BDF (originally intended
to be used with my xhzview program) using the incorrect code range
information. Many software authors henceforth thought that it was a new
Big5 standard by "HKU", and started "supporting" it.

This is NOT a Big5 coding scheme released by HKU.

Let me appeal to the Chinese software authors, PLEASE DON'T support this
so-called "HKU" standard. It's too expensive to let this mistake
propagate. Let's stop it now before this wrong becomes "standard".

-Ricky