
Why Unicode is a bad standard


Ben Bullock

Aug 9, 1998

I've been reading the thread here on internationalization
with some interest, but I am dismayed that no one has come forward
to challenge some blatantly wrong assertions about Unicode, notably
by Per Bothner. Perhaps they are put off by the racist abuse flying
around here from the likes of Naggum.

I have tried to explain why Unicode is such a bad standard here
and I hope that this furthers understanding of Far Eastern
languages among some contributors here.

===========

Why Unicode is a bad standard.

Unicode is supposed to be an all-encompassing human language encoding
standard for computers that will make all others redundant. A flaw in
the design of Unicode's encoding of Chinese characters means that it
cannot do this.

The largest subset of characters in Unicode is the Chinese-derived
characters. Chinese characters, also known as hanzi (from the
Chinese) and kanji (from the Japanese), are used in the People's
Republic of China (PRC), Taiwan, and Japan; in South Korea, where
they are now somewhat archaic; to a lesser extent in south-east Asian
countries such as Malaysia, Singapore, and Vietnam; and in
many countries around the world where Chinese, Japanese or Korean
communities exist. Chinese characters represent ideas rather than
sounds, and so the number of characters that people need to
communicate is proportional to the number of ideas that they need, and
hence is rather large. Chinese character based standards naturally
require sixteen bit encodings.

If there were no Chinese characters, then almost any human language
could be expressed using one eight-bit byte. A Unicode-like standard
would be wasteful. In a Unicode-like standard, each character takes
up two bytes instead of one. That means twice as much consumption of
resources such as memory and disk space for Unicode-encoded text.
Without Chinese characters there would be little reason for Unicode.

Before Unicode appeared, there were well-established ways of using
Chinese-character based languages. Each Chinese character language
could be switched in and out by an `escape code'. Unicode promised
to eliminate the need for them and to allow the various forms of
Chinese, Japanese, Korean, and other languages to be expressed in one
unified form. However, it did not do as promised. The reason was
something called `Han unification'.
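
To make the escape-code mechanism concrete, here is a minimal sketch,
assuming a Python 3 interpreter with its standard codecs (the codec
name and sample text are illustrative, not part of any standard's
text):

    # Stateful escape-code switching (ISO-2022-JP, via Python's
    # standard codecs).  The escapes, not the bytes alone, carry
    # the meaning.
    text = "ABC \u65e5\u672c DEF"      # ASCII, two kanji, ASCII
    wire = text.encode("iso2022_jp")
    print(wire)
    # b'ABC \x1b$BF|K\\\x1b(B DEF'
    # ESC $ B shifts the stream into JIS X 0208 (two bytes per
    # character); ESC ( B shifts it back to ASCII.  A reader must
    # track this state to interpret any byte that follows.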

Han unification means the mapping of variant forms of Chinese
characters onto one number in Unicode. The people who did this
apparently thought that the variant forms of the Chinese characters
could be treated as typographic differences. This was a disastrous
error.

The differences in the forms of Chinese characters arose relatively
recently, after the Second World War. Before then, Chinese, Japanese
and Korean people all used basically the same forms of the characters.
However, postwar reforms meant that the form of the characters used
diverged into three branches. The People's Republic of China (PRC)
changed its characters by drastic, regular simplification based on
handwritten abbreviations. Japan changed its characters by less
drastic and less regular simplifications. North Korea abandoned
Chinese characters completely, and South Korea abandoned them
partially.
Taiwan and Hong Kong kept the old forms. Singapore followed the PRC.
Thus there were three branches, the Japanese, Chinese and Traditional
branches. That the differences in the characters are not merely
typographic can be seen, for instance, from the inability of young
Japanese people to read the traditional character forms that were in
common use in Japan less than fifty years ago.

Because Unicode is merely a computer encoding standard, it does not
have the authority to either reverse the process of simplification or
to drag it in either the Chinese or Japanese directions. The notion
of Han unification is a fantasy of very little practical use. For
example, in Japanese encoding standards, the traditional forms of the
characters and their simplified forms both have their own code. Thus
Unicode's Han unification cannot work. The notion of a Unicode
typeface is also absurd, since it will be more-or-less unreadable to
users of two of the three character groups.
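
As a small check of the point about duplicate codes, assuming a
Python 3 interpreter (the character pair is only one example), the
postwar Japanese form and the traditional form of `country' each have
their own code point even in Unicode:

    # simplified/Japanese form vs. traditional form of "country":
    print(f"U+{ord('国'):04X}")   # U+56FD
    print(f"U+{ord('國'):04X}")   # U+570B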

There are ways around this problem, but the sum of Unicode and these
additions is more complex than what Unicode was meant to replace: the
use of escape codes and national encoding standards.

The sixteen bit Unicode standard is silly and wasteful for most human
languages, and has almost no benefit even for those languages which
actually do need to be encoded in sixteen bits.

The only group which Unicode satisfies is people who think that by
including `Unicode support' in their software it will become
internationalized.

--
Ben Bullock

http://www.hayamasa.demon.co.uk/

Erik Naggum

Aug 9, 1998

* Ben Bullock <b...@hayamasa.demon.co.uk>

| Perhaps they are put off by the racist abuse flying around here from the
| likes of Naggum.

if you hope to scare people off from reading a response to your pathetic
collection of misguided, unfounded, and counterfactual opinions, I hope
the people here aren't as likely to be put off by your rhetoric as you
hope they would be. if anything, I think they are perhaps a little more
scared of the few people who think that everything they have trouble
understanding must be racism. it's not particularly enriching to try to
put forth an opinion or even historic facts when the likes of Ben Bullock
and Bill Richter eagerly spring forth and call it "racism" whenever it
contains some words they dislike.

I find the extremely unfair, and often bordering on insane, charges of
racism to be worse than what is being called "racism". everything under
the sun that has to do with colliding cultures is "racism" in some
people's shallow minds. I wonder why _they_ think it is racism when
nobody else does. a person who is obsessing about race _must_ be much
more likely to be a racist than those who think race is _utterly_
irrelevant except for a very small number of problems. (such as in
medicine, where there _are_ important differences between races.)
"racism" is a label mostly used by people who _think_ they see it in
other people's minds. psychologists call that "projection".

but enough of that shit. Ben Bullock is presenting a whole slew of
amazingly untrue and unfounded opinions in the guise of arguments against
Unicode. now, I'm _no_ die-hard fan of Unicode myself, since Unicode
killed the first ISO 10646 standard, which I thought was very good, but I
do know this standard well, and I know its history and its people well.

| I have tried to explain why Unicode is such a bad standard here and I
| hope that this furthers understanding of Far Eastern languages among some
| contributors here.

let's see, first you accuse people of being racists because you have a
hard time reading stuff you don't like and then you expect to further
_understanding_? yeah, this is _really_ promising.

| Unicode is supposed to be an all-encompassing human language encoding
| standard for computers that will make all others redundant.

this is in fact not true. exaggerating your enemy is a well-known, yet
pretty obvious, rhetorical device used by people who know that they
couldn't hope to win a fight against the _real_ enemy, so they make up
one that is easier to argue against or, as is the case with charges of
"racism", hate or dislike for irrelevant reasons. strawmen are stupid.

| A flaw in the design of Unicode's encoding of Chinese characters means
| that it cannot do this.

this is your opinion, not fact. it appears to be based on a severe lack
of correspondence between your other opinions and the facts, in fact so
grave a lack of correspondence as bordering on conscious lies, just like
the unfounded _opinions_ on "racism" that started this whole nonsense.

| Chinese character based standards naturally require sixteen bit
| encodings.

this is in fact not true. the standards _require_ 14 bits, as in two
bytes of 7 bits each, allowing 94×94 = 8836 characters per set. there
are 5 such sets in the ISO 2375 Registry, for a total code space of 44180
characters, but _far_ fewer characters are allotted to them. in fact,
there are _no_ existing encoding schemes that encompass all of these sets
in a single code space. (the encoding schemes that use the high bit do
in fact only represent 7 bits of payload per byte. the high bit is not
payload.) Unicode does not even try, and acknowledges its limitations.
however, additions to make Unicode a 20-bit standard do have room for
all of them, as well as the world's historic scripts, such as runes.
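
for the arithmetic, a minimal sketch (Python 3; the helper merely
illustrates the EUC-style use of the high bit and is not any
normative algorithm):

    # each byte of an ISO-2022-style two-byte set ranges over the 94
    # printable values 0x21..0x7E.
    per_set = 94 * 94
    print(per_set)       # 8836 characters per registered set
    print(5 * per_set)   # 44180 across the five registered sets

    def with_high_bit(b1, b2):
        # set the high bit on each byte; the payload is still 7 bits
        return bytes([b1 | 0x80, b2 | 0x80])

    print(with_high_bit(0x46, 0x7C))   # b'\xc6\xfc' (EUC-JP kanji)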

| If there were no Chinese characters, then almost any human language
| could be expressed using one eight-bit byte.

sort of true, provided _each_ language had its own, unique character set.
since this is obviously a stupid proposition, albeit the way we did
things before larger character sets such as ISO 8859-1 came around, we
need to look instead at how many characters exist in each _script_.
neither Latin, Greek, nor Cyrillic can make do with 190 characters, which
is what we have space for in a single byte along with control characters
and such. Latin has almost 400 identified characters in use. Greek has
more than 300 when we count historic characters that do have standards of
their own. Cyrillic has almost 300 characters in the diverse language
groups that use it (and which also contribute to the size of the Latin
script). Arabic is already well known for needing more than a single
byte (due to the many languages that use this script), but I guess Hebrew
could make do with a single byte. Hawaiian can. however, you might be
_very_ surprised that even English typographers use more than 200
distinct symbols that they would like to include as _characters_. they
_did_ get them into Unicode, ending many years of fighting over how to
encode and represent special characters used in high-quality texts, which
used to need to be represented with commands or long codes in typesetting
languages.

| A Unicode-like standard would be wasteful. In a Unicode-like standard,
| each character takes up two bytes instead of one. That means twice as
| much consumption of resources such as memory and disk space for
| Unicode-encoded text.

this is in fact not true, and betrays a fundamental lack of understanding
that used to be common four years ago, but is by now mostly gone from all
quarters involved in Unicode. more than 99% of the world's textual
information is stored in the Latin alphabet (according to ISO JTC1/SC2
reports of a few years ago -- it might have dropped to 98% if we work
really hard to avoid Ben Bullock's stupid "racist" accusations), of which
an estimated 5% is _not_ the 26 basic characters. if we let UTF-8 encode
the basic Latin characters with one byte, these others would require two
bytes. the remaining 1% would probably be better off with two bytes per
character because UTF-8 is grossly inefficient beyond the Latin script.
(did I hear a "racist" accusation flying by?) the global _cost_ is thus
an increase of 6 or 7 percent over today's costs, and perhaps 10% in 20
years' time. a two-byte Unicode representation is also _shorter_ than
Chinese text encoded using more than one set with only 94 available
codes per byte, as in today's encoding schemes. the Unicode consortium
estimates
to save 20% off the storage costs of Chinese texts with Han unification.
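
to make the byte counts concrete, a minimal sketch (Python 3; the
sample characters are arbitrary picks from each script):

    for name, ch in [("basic Latin", "a"),
                     ("accented Latin", "\u00e9"),   # e with acute
                     ("Greek", "\u03b1"),            # alpha
                     ("Cyrillic", "\u0434"),         # de
                     ("kanji", "\u65e5")]:           # "sun/day"
        print(name, len(ch.encode("utf-8")), "byte(s) in UTF-8")
    # 1, 2, 2, 2, and 3 bytes respectively: UTF-8 needs _three_
    # bytes per East Asian character where a plain two-byte Unicode
    # representation needs two.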

now, since you above implicitly argue _against_ mixing languages in one
file (or you would never have argued anything so stupid as to say all
languages could (each) be expressed using one eight-bit byte), let's not
try to mix Chinese, Japanese or Korean, either, and then we come out with
a 10% global _savings_ when using Unicode over the existing schemes.

| Without Chinese characters there would be little reason for Unicode.

this is your opinion, but it is utter bullshit. the alternative is a
stateful encoding that has proven very hard for programmers to deal with,
although it shouldn't have been: ISO 2022. the _fact_ that this very
general solution has proven to be completely unsuited for computer
consumption is what _started_ the ISO 10646 project. anyone who knows
what he's talking about _also_ knows that it was the Unicode consortium
that introduced Han Unification to ISO 10646 and that the first DIS 10646
(draft) gave each of the East Asian cultures their own code space. it is
_still_ amazingly useful to get rid of the upwards of eight hundred
different character sets in use outside of East Asia. just providing
them with a single encompassing character set that allowed conversion
between two otherwise hard-to-compare sets is a tremendous boon to the
world. (my character set translation system, based on the tables that I
have made available at FTP.NAGGUM.NO in the "chars" directory, makes this
a lot easier than the old tables from one set to another by code.)
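
a minimal sketch of that pivot role (Python 3 standard codecs; the
sample word is arbitrary):

    # convert between two legacy Japanese encodings by pivoting
    # through Unicode instead of a dedicated pairwise table.
    sjis = "\u65e5\u672c\u8a9e".encode("shift_jis")   # "Japanese"
    text = sjis.decode("shift_jis")    # legacy bytes -> Unicode
    euc = text.encode("euc_jp")        # Unicode -> the other set
    print(sjis)   # b'\x93\xfa\x96{\x8c\xea'
    print(euc)    # b'\xc6\xfc\xcb\xdc\xb8\xec'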

anyone who actually believes something as stupid as there being no need
for Unicode is unlikely to understand the issues involved and is only
presenting his false "arguments" as a thin veneer over his hatred towards
Unicode. when such a person has already proved that he is unable to deal
with criticism of other cultures and has to label it "racism", there
really should be no need to debunk the dishonesty and to uncover any more
hidden agendas. it is obvious that this is all about Ben Bullock hating
Unicode and trying to stop _his_ critics by labeling them "racists" in
the hope that nobody would listen to them.

| Before Unicode appeared, there were well-established ways of using
| Chinese-character based languages.

yeah, like about six different ways, none computationally compatible
(i.e., they require *huge* mapping tables), and three more variants that
were computationally compatible (i.e., they can work by algorithms) with
_one_ of the six each.

| The People's Republic of China (PRC) changed its characters by drastic,
| regular simplification based on handwritten abbreviations.

this is a mere typographical change.

| Japan changed its characters by less drastic and less regular
| simplifications.

this is a mere typographical change.

| Thus there were three branches, the Japanese, Chinese and Traditional
| branches. That the differences in the characters are not merely
| typographic can be seen, for instance, from the inability of young
| Japanese people to read the traditional character forms that were in
| common use in Japan less than fifty years ago.

these are _still_ typographical changes. you have proved _nothing_ if
your goal was to show that the differences were more than typographical.

the same "differences" can in fact be used about the Latin alphabet in
its several modern forms (all typographical) and an older form known as
Fraktur, widely used in Germany and many European newspapers heads. it
is _still_ nothing but a question of fonts and typography, no matter how
many younger people cannot read it. (whoever wants to use younger people
as an authority? "because we are ignorant" doesn't convince me of much.)

| Because Unicode is merely a computer encoding standard, it does not
| have the authority to either reverse the process of simplification or
| to drag it in either the Chinese or Japanese directions. The notion
| of Han unification is a fantasy of very little practical use.

both the Chinese and the Japanese authorities on their own languages
have expressed views identical to this one, only they stopped doing so
three
to four years ago, and only people who do not understand their objections
or their reasons for retracting them continue to voice their arguments.

the "fantasy of very little practical use" is directly responsible for a
big industry in Japan, and Unicode's adoption rate is very high in Japan.

| For example, in Japanese encoding standards, the traditional forms of the
| characters and their simplified forms both have their own code. Thus
| Unicode's Han unification can not work.

a classic non sequitur with a classic instance of begging the question!
I'm amazed. these are errors of reasoning _so_ easy to avoid! except,
of course, to one who consciously uses them to confuse people whom he
hopes would not spot them.

in other words: this simply does not follow, but serves to "prove" a
point the "evidence" rests on in order to be evidence in the first place.
do go and learn some theory of argumentation, Ben Bullock, and maybe
you'll understand why accusations of "racism" have no place in a
reasonable argument, either.

| The notion of a Unicode typeface is also absurd, since it will be
| more-or-less unreadable to users of two of the three character groups.

well, it _is_ exceptionally helpful to the other 4 billion people (or
however many we are on this planet these days -- seems this "5 billion"
number has been around for far too long). willfully ignoring that many
people just because something works well in the Far East might look like
an argument of racial superiority. is that the case, Ben Bullock?

| There are ways around this problem, but the sum of Unicode and these
| additions is more complex than what Unicode was meant to replace: the use
| of escape codes and national encoding standards.

Unicode was never intended to _replace_ anything. it was, is, and will
remain an _additional_ character set standard that tries very hard to be
able to _subsume_ all the others for computational purposes, but there
are obvious advantages to continue to use other standards in certain
areas. nobody is opposed to that, except the fucking morons who call
people who say something to the contrary "racists" and present a whole
bunch of stupid non-arguments in the hopes that they won't be exposed.

| The sixteen bit Unicode standard is silly and wasteful for most human
| languages, and has almost no benefit even for those languages which
| actually do need to be encoded in sixteen bits.

this is your opinion. it is _not_ a fact. you are entitled to your
opinions, but when they are based solely on your irrational hatred of
Unicode, which is again based on your emotional problems with Han
Unification, who _cares_ what you feel?

| The only group which Unicode satisfies is people who think that by
| including `Unicode support' in their software it will become
| internationalized.

sadly, this is the only valid point you might have, misguided though
its foundation is. Unicode does not _solve_ any problems. if you
believed it did, your frustration is understandable, but it should also
go away when you understand that that never was their goal. Unicode
merely ameliorates the pain of solving _some_ of the problems in
internationalization, but it _does_ ameliorate them, even to the extent
that it is now _possible_ to solve the remaining problems, instead of
being intractably hard. by removing Unicode, the problems _remain_
intractably hard.

if we ever should need evidence of the pain and the intractably hard
problems involved, try to use MULE in cooperation with some _other_
proprietary solution to the _apparently_ same problem. this is why MULE
is bad and _this_ is why _any_ standard is better than none, despite
technical flaws and valid concerns about its problems, because the mere
agreement on _something_ removes a whole category of problems, and a very
large category at that. we can fix the problems. we cannot fix the lack
of agreement except with something we _can_ agree on. however, there
will always be reactionary morons like Ben Bullock who regurgitate
already retracted arguments in order to destabilize the sometimes
fragile agreement that exists in the world of standardization. there
will also be
annoying morons who think that if they can find a flaw, they should scrap
the whole thing and start over. and then there's the morons who will ask
people to go reinvent the wheel because they won't fix a simple flaw or
find ways to cooperate. if we keep these morons at bay with rat poison
or whatever else works on such vermin, the rest of us can perhaps try to
_solve_ the _remaining_ problems, now that it has become possible.

the ISO 10646 standard provides some history of itself, but you need to
buy the standard to get access to that information. however, the Unicode
consortium has made every effort to help people use its work, and has
put morons like Ben Bullock and their stupid "opinions" to shame through
actual improvements in the human condition. see www.unicode.org for
their _extensive_ material.

please call this "racism", too, Ben Bullock, so we can know for certain
that you call everything you don't like racism.

#:Erik
--
http://www.naggum.no/spam.html is about my spam protection scheme and how
to guarantee that you reach me. in brief: if you reply to a news article
of mine, be sure to include an In-Reply-To or References header with the
message-ID of that message in it. otherwise, you need to read that page.

Boris Schaefer

Aug 9, 1998

Ben Bullock <b...@hayamasa.demon.co.uk> writes:

| branches. That the differences in the characters are not merely
| typographic can be seen, for instance, from the inability of young
| Japanese people to read the traditional character forms that were in
| common use in Japan less than fifty years ago.

I don't know much about Unicode, but this part of your article contains
seriously flawed logic.

You assert the following:

a) Many young Japanese cannot read traditional texts
b) The difference between traditional and new characters is NOT merely
in form.

You go on to say that a) implies b).

This is just plain flawed logic.

Look at German, for example. The typography of all German characters
changed when Germans moved from the German alphabet (Fraktur) to the
Latin alphabet. People here often have a hard time reading old texts.

But if people know the mapping from the old German alphabet to the
Latin alphabet, then they can read old German texts.

The change in the German alphabet was purely in form.

--
Boris Schaefer -- s...@psy.med.uni-muenchen.de

Never delay the ending of a meeting or the beginning of a cocktail hour.


Yair Friedman

Aug 10, 1998

Erik Naggum <cle...@naggum.no> writes:

> Arabic is already well known for needing more than a single
> byte (due to the many languages that use this script), but I guess Hebrew
> could make do with a single byte.

With the introduction of vowel and cantillation marks into the new
revision of ISO 8859-8, it is possible that encoding a single glyph
will need up to 7 bytes.
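
A minimal sketch of how one glyph spreads over several bytes, assuming
a Python 3 interpreter and a Unicode representation (the particular
letter and points are only an example):

    # one displayed Hebrew glyph = base letter plus combining marks
    glyph = "\u05d1\u05bc\u05b0"         # bet + dagesh + sheva
    print(len(glyph))                    # 3 code points
    print(len(glyph.encode("utf-8")))    # 6 bytes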


Speaking of Hebrew and other Right-to-left (Semitic) languages, I think
it's a very good example of why the MULE design is somewhat
problematic, at least for rtl languages. Mule 2.3 did support Hebrew: I
could create and read files in Hebrew, but that's all. I could not
export them so that other programs could read them, print them, or do
anything outside the scope of Mule. Despite what Erik's pages say,
Emacs 20 does _not_ support rtl
languages. There is a plan to add it after 20.3 is out, but the support
will still be limited because "We cannot add direction awareness to all
the text-processing primitives now".

I can just hope that things will go in the right direction.
--
Yair Friedman <y...@JohnBryce.Co.Il>
Please make sure you REPLY to this message.

Lars Magne Ingebrigtsen

Aug 10, 1998

Yair Friedman <y...@JohnBryce.Co.Il> writes:

> Speaking of Hebrew and other Right-to-left (Semitic) languages,

[...]

> I can just hope that things will go in the right direction.

"Left direction", you mean. :-)

--
(domestic pets only, the antidote for overdose, milk.)
la...@ifi.uio.no * Lars Magne Ingebrigtsen

Thomas C Lofgren

Aug 10, 1998

>>>>> "Lars" == Lars Magne Ingebrigtsen <l...@gnus.org> writes:

Lars> Yair Friedman <y...@JohnBryce.Co.Il> writes:
>> Speaking of Hebrew and other Right-to-left (Semitic) languages,

Lars> [...]

>> I can just hope that things will go in the right direction.

Lars> "Left direction", you mean. :-)

Yeah, he got it all backwards.

Tom
--
Wherever I lay my .emacs, that's my ${HOME}

Norbert Koch

Aug 11, 1998

Thomas C Lofgren <lof...@lectura.CS.Arizona.EDU> writes:

> >> I can just hope that things will go in the right direction.
>
> Lars> "Left direction", you mean. :-)
>
> Yeah, he got it all backwards.
>

upside-down perhaps :-)

--
Dr Norbert Koch, DELTA Industrie Informatik GmbH, Fellbach, Germany.
+49.711.57.151.37

Alan Shutko

Aug 12, 1998

>>>>> "B" == Ben Bullock <b...@hayamasa.demon.co.uk> writes:

B> That the differences in the characters are not merely typographic
B> can be seen, for instance, from the inability of young Japanese
B> people to read the traditional character forms that were in common
B> use in Japan less than fifty years ago.

Do the pre and post simplification characters mean the same things,
but look different?

Sounds like a font change to me...

--
Alan Shutko <a...@acm.org> - By consent of the corrupted
Skiing Halted, Operator Broken, Out of Service for Repair

Xah

Aug 13, 1998

>>>>>> "B" == Ben Bullock <b...@hayamasa.demon.co.uk> writes:
>
>B> That the differences in the characters are not merely typographic
>B> can be seen, for instance, from the inability of young Japanese
>B> people to read the traditional character forms that were in common
>B> use in Japan less than fifty years ago.

As Erik Naggum and others have said, the changes are purely
typographical. At least this is the case in Chinese.

(I'm Chinese. I know both simplified characters and traditional characters.)

Xah, x...@best.com
http://www.best.com/~xah/PageTwo_dir/more.html
Mountain View, CA, USA

David Lloyd-Jones

Aug 24, 1998

Alan Shutko wrote in message ...
>>>>>> "B" == Ben Bullock <ben@hayamasa.demon.co.uk> writes:
>B> That the differences in the characters are not merely typographic
>B> can be seen, for instance, from the inability of young Japanese
>B> people to read the traditional character forms that were in common
>B> use in Japan less than fifty years ago.
>
>Do the pre and post simplification characters mean the same things,
>but look different?
>
>Sounds like a font change to me...
>

Wey-yull, sorta.

In Japanese there is a custom of writing in "formal" Japanese for some
purposes. Wedding invitations, for instance, will be engraved in the kinds
of, uh, script that was in use before WWII. This will be clearly recognised
by anyone -- and in most cases identical to the forms in use in Taiwan
today. Bronze plaques on university buildings do not recognize the
post-WWII simplifications.

Ben Bullock is probably right that these habits are fading fast from common
use. Certainly the business generation who refused in their handwriting to
bow to modernity, in part as a way of asserting authority through
conservatism, have now almost all retired. (Bath-houses, however, will
continue until eternity to use the ancient Manyoshu forms on their
"open for business" flags.)

Still, the universal Japanese, Chinese and Korean joy in humour, skill,
and one-upmanship in the obscurities of writing will not disappear. One
hopes
that this set of joys will continue to be possible in electronic form.
Visions of drunken Japanese in bars showing off on their PIMs come to mind.

-dlj.

Piet van Oostrum

Aug 25, 1998, to David Lloyd-Jones

>>>>> "David Lloyd-Jones" <d...@pobox.com> (DL) writes:

DL> Alan Shutko wrote in message ...
DL> +AD4APgA+AD4APgA+- +ACI-B+ACI- +AD0APQ- Ben Bullock +ADw-ben+AEA-hayamasa.demon.co.uk+AD4- writes:
DL> +AD4-B+AD4- That the differences in the characters are not merely typographic
DL> +AD4-B+AD4- can be seen, for instance, from the inability of young Japanese
DL> +AD4-B+AD4- people to read the traditional character forms that were in common
DL> +AD4-B+AD4- use in Japan less than fifty years ago.
DL> +AD4-
DL> +AD4-Do the pre and post simplification characters mean the same things,
DL> +AD4-but look different?
DL> +AD4-
DL> +AD4-Sounds like a font change to me...
DL> +AD4-

Why do you sprinkle all these +AD4- +ACI- etc in your message?
--
Piet van Oostrum <pi...@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: Piet.van...@gironet.nl
