On a recent trip to Taipei, I had the opportunity to meet some of the
people involved in the design and implementation of CCCII and asked them
some of the questions I had accumulated over time. Before I report what
I learned, I will give a short introduction to CCCII for the benefit of
those who are seeing this acronym for the first time.
CCCII stands for Chinese Character Code for Information Interchange.
As this name suggests, one of its design goals was to create one
single codeset which would include *all* characters in existence, or
at least all characters defined by other codesets, so that CCCII could
serve as an intermediate code between any two particular codesets. A
second design goal was to encode what is today generally referred to
as the character-glyph model: the notion that one (abstract) character
can have more than one concrete glyph representing it. In the case of
Chinese characters this goes even
further and includes not only differences in form of the glyph, but
also completely different characters, commonly known as borrowed
characters, which at certain times or places came to stand in place of
a certain character.
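To illustrate the first design goal, here is a minimal sketch in Python
of what using one code as an intermediate (pivot) between two codesets
amounts to. The function name, the table names and all code values are
invented for illustration; real tables would have to come from the
CCCII documentation, which this sketch does not reproduce.

```python
from typing import Dict, List, Optional


def convert_via_pivot(
    text: List[int],
    to_pivot: Dict[int, int],
    from_pivot: Dict[int, int],
) -> List[Optional[int]]:
    """Convert codes of codeset A to codeset B by way of a pivot code.

    Characters that cannot be mapped in either step come out as None,
    so that gaps in the tables stay visible instead of being hidden.
    """
    result: List[Optional[int]] = []
    for code in text:
        pivot = to_pivot.get(code)
        if pivot is None:
            result.append(None)
        else:
            result.append(from_pivot.get(pivot))
    return result


# Placeholder tables (hypothetical values, not real code assignments).
A_TO_PIVOT = {0xA1: 1001, 0xA2: 1002}
PIVOT_TO_B = {1001: 0xB1}

converted = convert_via_pivot([0xA1, 0xA2, 0xA3], A_TO_PIVOT, PIVOT_TO_B)
```

With a pivot that really is a superset of all other codes, each codeset
needs only one pair of tables (to and from the pivot) instead of one
table for every other codeset.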
Both of these design goals are unique to CCCII and were appealing to
our team, which was looking for a convenient way to encode a
considerable number of texts from the Chinese Buddhist Canon and other
sources. It seemed especially fit for the use of researchers working
with historical materials that contain a large number of premodern
characters.
Alas, we met a number of unexpected problems:
First of all, only one company was selling an implementation, which
was a clumsy hardware-based solution, expensive and only available for
the PC. We bought it and came to dislike it, because it didn't even
allow us to use a decent editor in place of the PE2-like tool that
came with it.
Second, we constantly came across characters that were defined
several times in the codespace, and we could not find a convenient and
consistent way to decide which codepoint to use (BTW, this was one of
the reasons for the RLG group to part ways with CCCII and create their
own variation, which is now widely used in US research libraries
under the name EACC: they followed the principle one glyph = one
code).
Third, we accumulated a fair number of characters which we came across
in our texts, but which were missing from CCCII. We also found that
several hundred characters from the CJK UniHan Repertoire in Unicode
were not in CCCII. Later we learned of the amendments to CNS
11643-1992, which also included a fair number of characters
(approx. 2000 out of a total of 48000) not included in CCCII.
Information on CCCII was hard to get and we did not know whether it was
still supported at all. After a year of continuing effort, we could
not even get complete printed documentation...
These were some of the questions I posed to the people I met in
Taipei. I will summarize here some of the things I heard:
1. CCCII is not dead; it is currently being revised to bring it up to
date as a true interchange code. After this revision is complete
(scheduled for mid 1995) it will once again be a superset of all
existing codes, compatible with CNS, the UniHan Repertoire and EACC.
It will also include the ca. 7000 CNS characters currently under revision,
which will be added to CNS as the 8th level, bringing the number of
characters there to ca. 55000. I was told that the main work of this
revision is done (with public Taiwanese funding) by a team working in
Beijing. Currently assigned codepoints will in general not be altered.
2. The possible multiple encoding of glyphs will stay, as this is the
basis the code is built on. If it is not appropriate or possible to
use this feature (for example, when encoding modern text in simplified
characters, it would be undesirable always to assign a simplified glyph
to one specific traditional form where several are possible), the
following ways are recommended to handle this problem:
- Unify all encodings to one specific codepoint and give this codepoint
the relevant semantics in the context of the project.
- Take whatever is convenient and take appropriate precautions when
searching or otherwise using the text. For printing it will not matter
:-)
The best way would of course be to encode it at the right position
according to the context. This requires the greatest effort and will
usually be done in the process of proofreading. All of these
strategies require knowledge of which glyphs are encoded at multiple
places, and where. This information has not yet been available to us,
but the revision of CCCII will also produce such a list and hopefully
will make it available. A database produced at our institute will also
provide this kind of cross-reference.
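The first two strategies can be sketched in a few lines of Python,
assuming a list of multiply encoded glyphs is at hand. This is only an
illustration: the names DUPLICATE_GROUPS, build_canonical_map, unify
and find are hypothetical, and the code values are invented
placeholders, not real CCCII codepoints.

```python
from typing import Dict, List, Sequence, Tuple

# Placeholder data: each tuple lists codepoints that encode the same
# glyph. The numbers are invented for illustration only.
DUPLICATE_GROUPS: List[Tuple[int, ...]] = [(1001, 2001, 3001), (1002, 2002)]


def build_canonical_map(groups: Sequence[Tuple[int, ...]]) -> Dict[int, int]:
    """Map every codepoint of a group to the group's first member."""
    canonical: Dict[int, int] = {}
    for group in groups:
        head = group[0]
        for code in group:
            canonical[code] = head
    return canonical


def unify(text: Sequence[int], canonical: Dict[int, int]) -> List[int]:
    """Strategy 1: rewrite the text so each glyph uses one codepoint."""
    return [canonical.get(code, code) for code in text]


def find(text: Sequence[int], pattern: Sequence[int],
         canonical: Dict[int, int]) -> List[int]:
    """Strategy 2: leave the text alone, but fold duplicates when searching.

    Returns the start offsets of all matches.
    """
    folded_text = unify(text, canonical)
    folded_pattern = unify(pattern, canonical)
    if not folded_pattern:
        return []
    width = len(folded_pattern)
    return [
        i
        for i in range(len(folded_text) - width + 1)
        if folded_text[i:i + width] == folded_pattern
    ]


CANONICAL = build_canonical_map(DUPLICATE_GROUPS)
hits = find([3001, 5, 1002, 2001, 5], [1001, 5], CANONICAL)
```

Note that this kind of folding deliberately throws away the contextual
distinction that the third, most laborious strategy tries to record.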
3. With the increasing processing power of personal computers,
purely software-based implementations of CCCII become possible. I saw
prototypes at two companies, and some implementations are available. Be
aware, though, of the following facts:
- Some products implement only a part of CCCII (e.g. 22000 / 33000
characters, where the "full" implementation currently has ca. 58000),
and some are only network-server based implementations.
- Everything I have seen was PC-based, although I was told that UNIX
implementations exist. Ports to Windows (as an add-in similar to
TWINBRIDGE) are on their way; the Macintosh seems to be a possible
target only further down the road.
- The prices are a fraction of what the hardware-based systems cost.
Reasonable site licenses are available.
Information is available from:
Global Ventures (Asia) S.A.
Teresa Ju
Tel +886-3-352-9010
Fax +886-3-352-0891
(I have no connection whatsoever with this company)
- Some of the implementations let you use your favorite English
editor, but you will have to put up with the fact that the characters
are somehow stretched to fill the same space as four single-byte
characters.
Some bitmaps for CCCII are available as part of the public domain
CCDB, available at NCTUCCCA.EDU.TW in /CHINESE/CCDB. We are looking
into the possibility of using this material for porting CCCII to mule.
Advice in this matter would be highly appreciated. The bitmaps are of
rather different styles and the size is 64 by 64 dots. Input tables
for input by pronunciation (= Pinyin), FourCorner and Cangjie will
become available.
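For a rough idea of what a 64 by 64 dot bitmap means in practice, here
is a small Python sketch. It assumes one bit per dot, stored row by row
with the leftmost dot in the highest bit of each byte. That layout is
purely an assumption for illustration; the actual file format used in
the CCDB is not described here and may well differ. The name glyph_rows
is likewise hypothetical.

```python
from typing import List

GLYPH_SIZE = 64                                # dots per side, as stated above
BYTES_PER_ROW = GLYPH_SIZE // 8                # 8 bytes per row at 1 bit per dot
BYTES_PER_GLYPH = GLYPH_SIZE * BYTES_PER_ROW   # 512 bytes per glyph


def glyph_rows(data: bytes) -> List[str]:
    """Render one glyph as 64 text rows, '#' for a set dot, '.' for a clear one.

    Assumes (hypothetically) row-major storage, most significant bit first.
    """
    if len(data) != BYTES_PER_GLYPH:
        raise ValueError("expected %d bytes, got %d" % (BYTES_PER_GLYPH, len(data)))
    rows = []
    for r in range(GLYPH_SIZE):
        row = data[r * BYTES_PER_ROW:(r + 1) * BYTES_PER_ROW]
        bits = "".join(format(byte, "08b") for byte in row)
        rows.append(bits.replace("1", "#").replace("0", "."))
    return rows


# A test glyph with only the top-left dot set.
sample = bytes([0x80] + [0] * (BYTES_PER_GLYPH - 1))
rendered = glyph_rows(sample)
```

Under this assumption each glyph takes 512 bytes, so a repertoire in
the tens of thousands of characters runs to tens of megabytes of
uncompressed bitmap data.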
4. One of the options we were considering was switching to the use of
CNS instead of CCCII, since its design seemed to be somewhat more
consistent. Given the fact that CCCII will be a superset of all codes
including CNS, and that nobody seemed to be interested in supporting
CNS, we dropped that option, but we will continue to use
cross-references to CNS to validate our information on the characters
defined by CCCII.
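As an illustration of how such cross-references can be used for
validation, the following Python sketch inverts a CCCII-to-CNS table.
It reports CNS characters reached from more than one CCCII codepoint
(candidates for multiple encoding) and CNS characters not reached at
all (candidates for missing characters). The function names and all
code values are hypothetical placeholders; this is not a description
of our actual database.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def invert_crossref(cccii_to_cns: Dict[int, str]) -> Dict[str, List[int]]:
    """Group CCCII codepoints by the CNS code they are mapped to."""
    by_cns: Dict[str, List[int]] = defaultdict(list)
    for cccii, cns in cccii_to_cns.items():
        by_cns[cns].append(cccii)
    return dict(by_cns)


def check_crossref(
    cccii_to_cns: Dict[int, str],
    cns_repertoire: Iterable[str],
) -> Tuple[Dict[str, List[int]], List[str]]:
    """Return (duplicates, missing) for a cross-reference table.

    duplicates: CNS codes reached from several CCCII codepoints.
    missing:    CNS codes in the repertoire reached from none.
    """
    by_cns = invert_crossref(cccii_to_cns)
    duplicates = {
        cns: sorted(codes) for cns, codes in by_cns.items() if len(codes) > 1
    }
    missing = sorted(set(cns_repertoire) - set(by_cns))
    return duplicates, missing


# Invented placeholder data, not real code assignments.
CROSSREF = {1001: "C1", 2001: "C1", 1002: "C2"}
duplicates, missing = check_crossref(CROSSREF, ["C1", "C2", "C3"])
```

Entries in the duplicates list still need a human decision, since two
CCCII codepoints may map to the same CNS code by design rather than by
mistake.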
Christian Wittern
g53...@sakura.kudpc.kyoto-u.ac.jp
International Research Institute for Zen Buddhism
Hanazono University
8-1 Tsubonouchi-cho Nishinokyo Nakakyo-ku
Kyoto, Japan
Tel (075) 811-5181 ext. 208
Fax (075) 811-9664