Unicode in Ruby now?

Tobias Peters

Jul 31, 2002, 7:50:21 AM

I've read the thread "Unicode in Ruby's Future?" [ruby-talk: 40016]. It
remains a bit vague.

Of course you can already translate strings between multiple encodings
with one of the existing character encoding libs right now. A problem
that I am currently facing:

When I export a string to an utf-8 encoded stream, how can I possibly know
its current encoding. Strings do not have an "encoding" tag. Will they
have in future?

Wouldn't it be a better solution to store strings in memory in a canonical
format (be it utf-8 for space savings, ucs-4 for O(1) indexing operations,
whatever) and let string sources and sinks have an "encoding" property,
and do the transformation on the fly?

We would need to identify possible sinks and sources of character strings
and how to determine their encoding anyway. Anyone interested? Perhaps
I'll create a Wiki Page at rubygarden for that.

Examples: stdin and stderr would be influenced by the user's locale.
Literal strings in ruby source code are a string source. There should be a
mechanism to state the character encoding for ruby source files, with a
reasonable default (which?). Filesystem names returned by Dir objects have
a charset encoding. How to determine that?

You get the picture. If you do Data Serialization to formats that restrict
the character encoding (be it xml or yaml), you have to know the encoding
of strings in memory. It would be helpful if ruby determined the character
encoding right when a string was created. Later on, there is no chance to
do that (except for error-prone heuristics).

Tobias

Yukihiro Matsumoto

Jul 31, 2002, 10:30:56 AM

Hi,

In message "Unicode in Ruby now?"

on 02/07/31, Tobias Peters <tpe...@invalid.uni-oldenburg.de> writes:

|When I export a string to an utf-8 encoded stream, how can I possibly know
|its current encoding. Strings do not have an "encoding" tag. Will they
|have in future?

Yes.

|Wouldn't it be a better solution to store strings in memory in a canonical
|format (be it utf-8 for space savings, ucs-4 for O(1) indexing operations,
|whatever) and let string sources and sinks have an "encoding" property,
|and do the transformation on the fly?

No. Considering the existence of "big character set" like Mojikyo
(charset developed in Japan, which is bigger than Unicode), there
cannot be any ideal canonical format. In addition, from my
estimation, the cost penalty from code conversion to/from the
canonical character set is intolerable if one processes mainly on
non-ASCII, non-Unicode text data, like we do in Japan.

matz.

Clifford Heath

Jul 31, 2002, 8:00:19 PM

Matz,

Is Mojikyo a superset of Unicode? If not, how hard is the translation to
UCS-4?

I designed the UCS-4 string class we use here in C++, with a UTF-8 storage
format (up to 31-bit with a six-byte UTF-8 sequence). The string class
remembers which character you last accessed and at what byte offset it
started, so that when you ask for another character, it can decide whether
heuristically to search forward from the start, forward or backward from
the remembered point (most common), or if it has ever counted the
characters, backward from the end. This minimises the search cost since
most string processing is largely sequential.
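
The remembered-point idea can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not the C++ class described above: the class name RememberedPointString and its method char_at are hypothetical, and the "backward from the end" case is left out for brevity.

```ruby
# Hypothetical sketch: character indexing over UTF-8 bytes with a
# "remembered point" (last character index and its byte offset).
class RememberedPointString
  def initialize(text)
    @bytes = text.encode("UTF-8").bytes
    @char_index = 0    # index of the character last accessed
    @byte_offset = 0   # byte offset at which that character starts
  end

  # Return the character at index i, walking from the remembered point
  # or from the start of the string, whichever is closer.
  def char_at(i)
    raise IndexError, "index #{i} out of range" if i < 0
    if i < (i - @char_index).abs
      @char_index = 0
      @byte_offset = 0
    end
    step_forward while @char_index < i
    step_backward while @char_index > i
    raise IndexError, "index #{i} out of range" if @byte_offset >= @bytes.size
    finish = next_boundary(@byte_offset)
    @bytes[@byte_offset...finish].pack("C*").force_encoding("UTF-8")
  end

  private

  # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
  def continuation?(byte)
    (byte & 0xC0) == 0x80
  end

  # Byte offset of the character that starts after the one at offset.
  def next_boundary(offset)
    offset += 1
    offset += 1 while offset < @bytes.size && continuation?(@bytes[offset])
    offset
  end

  def step_forward
    raise IndexError, "index out of range" if @byte_offset >= @bytes.size
    @byte_offset = next_boundary(@byte_offset)
    @char_index += 1
  end

  def step_backward
    @byte_offset -= 1
    @byte_offset -= 1 while continuation?(@bytes[@byte_offset])
    @char_index -= 1
  end
end

s = RememberedPointString.new("a\u00E9\u65E5b")   # 1-, 2-, 3- and 1-byte chars
third  = s.char_at(2)   # walks forward from the start
second = s.char_at(1)   # walks one character back from the remembered point
```

Sequential access, the common case, costs one short step per call; only a jump to a distant index pays for a longer scan.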

With the "remembered point" feature, I think UTF-8 has been a good tradeoff,
so much so that although I implemented the class using a pure interface and
a factory to allow alternate formats, we haven't needed to do it.

BTW, re "style", I like the definition I heard from a fashion figure:
"quirkiness with confidence". I guess the definition doesn't hold so well
for software though :-).

--
Clifford Heath

Curt Sampson

Jul 31, 2002, 11:31:30 PM

On Thu, 1 Aug 2002, Clifford Heath wrote:

> Is Mojikyo a superset of Unicode?

Yes.

Basically, Ruby being a Japanese product, it is unlikely ever to be able
to standardise on Unicode. The Japanese (well, enough of them to matter)
hate Unicode. (They seem to hate I18N in general, for that matter; they
certainly seem to pay less attention to it than, say, Americans.) Maybe
they just like doing their own thing rather than interoperating with the
rest of the world.

(Sorry about the minor flame. I spent two years in the U.S. and another
year here in Japan doing I18N work, and it was pretty darn frustrating.
And it was my great disappointment with ruby to find out that the I18N
support is so poor.)

cjs
--
Curt Sampson <c...@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

Hal E. Fulton

Jul 31, 2002, 11:52:32 PM

----- Original Message -----
From: "Curt Sampson" <c...@cynic.net>
To: "ruby-talk ML" <ruby...@ruby-lang.org>
Sent: Wednesday, July 31, 2002 10:30 PM
Subject: Re: Unicode in Ruby now?

> (Sorry about the minor flame. I spent two years in the U.S. and another
> year here in Japan doing I18N work, and it was pretty darn frustrating.
> And it was my great disappointment with ruby to find out that the I18N
> support is so poor.)

Well, here is your chance to contribute to
the I18N support... :)

Seriously, since you have some expertise, I'm
sure your knowledge will be valuable in
improving Ruby... talk to vruz also.

Hal

Curt Sampson

Aug 1, 2002, 12:50:44 AM

On Thu, 1 Aug 2002, Hal E. Fulton wrote:

> Seriously, since you have some expertise, I'm sure your knowledge will
> be valuable in improving Ruby... talk to vruz also.

I doubt it. My opinion of the matter is that the correct way to
do things is to go with Unicode internally. (This does not rule
out processing non-Unicode things, but you process them as binary
byte-strings, not as character strings.) You lose a little bit of
functionality this way, but overall it's easy, fast, and gives you
everything you really need.

Unfortunately, a lot of Japanese programmers disagree with this. They
feel the need, for example, to have separate code points for a single
character, simply because one stroke is slightly different between the
way Japanese and Chinese people write it. (The meaning is exactly the
same.)

They sometimes even feel the need to have the source language encoded
within strings, rather than having only applications that need this
information deal with it in their data formats. (It's not that there
aren't uses for these sorts of features, but they are not useful enough
to put the burden and overhead of them on every single program that
wants to deal with a bit of text.)

Basically, if I18N is not going to be completely impossible, you're
going to have to live with a bit of lossage when it comes to putting
data into a computer, especially kanji data. But everybody suffers this
loss: even in English we lived through all the days of ASCII without the
ability to spell co-operate properly (with a diaeresis over the second
'o', instead of the hyphen). Or naive (diaeresis over the 'i'), for that
matter. We lived.

Anyway, I've had it with that battle. Ruby gets what it gets, and maybe
one day I'll be able easily to use it for I18N work, maybe not. In the
mean time there's perl and Java.

Yukihiro Matsumoto

Aug 1, 2002, 12:58:19 AM

Hi,

In message "Re: Unicode in Ruby now?"

on 02/08/01, Curt Sampson <c...@cynic.net> writes:

|Basically, Ruby being a Japanese product, it is unlikely ever to be able
|to standardise on Unicode. The Japanese (well, enough of them to matter)
|hate Unicode. (They seem to hate I18N in general, for that matter; they
|certainly seem to pay less attention to it than, say, Americans.) Maybe
|they just like doing their own thing rather than interoperating with the
|rest of the world.

I don't think we hate I18N in general, but I admit many Japanese hate
Unicode-centric I18N.

|(Sorry about the minor flame. I spent two years in the U.S. and another
|year here in Japan doing I18N work, and it was pretty darn frustrating.
|And it was my great disappointment with ruby to find out that the I18N
|support is so poor.)

It's because no one but me is working on it. We need power.

matz.

Yukihiro Matsumoto

Aug 1, 2002, 1:22:29 AM

Hi,

In message "Re: Unicode in Ruby now?"

on 02/08/01, Clifford Heath <cjh_n...@managesoft.com> writes:

|Is Mojikyo a superset of Unicode? If not, how hard is the translation to
|UCS-4?

Mojikyo character set contains all the CJK characters in ISO 10646.
But codepoint number for each character is different, so that we can
only do Unicode -> Mojikyo conversion using table lookup. The
translation to UCS-4 from Mojikyo is nearly impossible. (How can one
assign non-Unicode codepoint number in UCS-4?)

http://www.mojikyo.org/

|I designed the UCS-4 string class we use here in C++, with a UTF-8 storage
|format (up to 31-bit with a six-byte UTF-8 sequence). The string class
|remembers which character you last accessed and at what byte offset it
|started, so that when you ask for another character, it can decide whether
|heuristically to search forward from the start, forward or backward from
|the remembered point (most common), or if it has ever counted the
|characters, backward from the end. This minimises the search cost since
|most string processing is largely sequential.

The remembered point technique is very interesting. When we meet
performance problems with Ruby I18N, I will use it.

matz.

Yukihiro Matsumoto

Aug 1, 2002, 1:34:21 AM

Hi,

In message "Re: Unicode in Ruby now?"

on 02/08/01, Yukihiro Matsumoto <ma...@ruby-lang.org> writes:

||And it was my great disappointment with ruby to find out that the I18N
||support is so poor.)
|
|It's because no one but me is working on it. We need power.

By the way, what are the required I18N features to satisfy you?
Enlighten me.

matz.

Curt Sampson

Aug 1, 2002, 4:35:00 AM

On Thu, 1 Aug 2002, Yukihiro Matsumoto wrote:

> By the way, what are the required I18N features to satisfy you?
> Enlighten me.

Me? Not so much, actually. I need to be able to read in ISO-8859-1,
ISO-2022-JP, EUC-JP, Shift_JIS and UTF-8, convert them to a common
internal format so I don't have to worry about where the data I am
manipulating came from, and do the various conversions again for output.
And use your basic regexps and such as well, on the internal format.
The internal format should treat a character as a single character,
regardless of whether it's represented by one, two or more bytes in any
particular encoding.

I'll probably need to convert to/from various Chinese and/or Korean
character sets at some point in the future, but I don't need that right
now.

cjs

Yukihiro Matsumoto

Aug 1, 2002, 5:03:15 AM

In message "Re: Unicode in Ruby now?"

on 02/08/01, Curt Sampson <c...@cynic.net> writes:

|Me? Not so much, actually. I need to be able to read in ISO-8859-1,
|ISO-2022-JP, EUC-JP, Shift_JIS and UTF-8, convert them to a common
|internal format so I don't have to worry about where the data I am
|manipulating came from, and do the various conversions again for output.

Reasonable. How do you want to specify the reading/writing charset?

matz.

Tobias Peters

Aug 1, 2002, 5:35:06 AM

On Wed, 31 Jul 2002, Yukihiro Matsumoto wrote:
> In message "Unicode in Ruby now?"
> on 02/07/31, Tobias Peters <tpe...@invalid.uni-oldenburg.de> writes:
>
> |When I export a string to an utf-8 encoded stream, how can I possibly know
> |its current encoding. Strings do not have an "encoding" tag. Will they
> |have in future?
>
> Yes.

Nice. I still think sources and sinks of characters also need an
"encoding" property. Strings originating from some character source will
then have the source's encoding. Strings exported to character sinks will
have to be converted on the fly if the sink uses a different encoding. We
could make the behaviour for unconvertible characters a property of
character sinks.

We also need rules how to combine strings with different encoding then.
concatenating two strings encoded in koi8-r and iso-8859-1, respectively,
may only be possible when the result is encoded in some unicode
representation.
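
A small sketch of that rule, written against the String encoding API that later Ruby versions (1.9 and up) provide. The sample words and variable names are made up for illustration; the point is only that the two byte strings cannot be joined as they are, while their Unicode transcodings can.

```ruby
# "привет" as KOI8-R bytes and "café" as ISO-8859-1 bytes (sample data).
koi8   = "\xD0\xD2\xC9\xD7\xC5\xD4".b.force_encoding("KOI8-R")
latin1 = "caf\xE9".b.force_encoding("ISO-8859-1")

# Direct concatenation is refused: there is no common encoding.
direct_concat_failed =
  begin
    koi8 + latin1
    false
  rescue Encoding::CompatibilityError
    true
  end

# Converting both operands to a Unicode representation first works.
joined = koi8.encode("UTF-8") + latin1.encode("UTF-8")

puts direct_concat_failed   # true
puts joined.encoding        # UTF-8
```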

Are there any other character sets of relevance that are not part of
unicode yet? Otherwise, we could probably live with just two possible
canonical character encodings. With canonical, here I mean the encoding of
a string that is the result of a combination of other strings with
different encodings.

> No. Considering the existence of "big character set" like Mojikyo
> (charset developed in Japan, which is bigger than Unicode), there
> cannot be any ideal canonical format.

I understand that Mojikyo will not be folded into unicode due to
political reasons. Combining ruby strings in some unicode encoding with
ruby strings encoded in some Mojikyo encoding might result in a runtime
error then.

> In addition, from my
> estimation, the cost penalty from code conversion to/from the
> canonical character set is intolerable if one processes mainly on
> non-ASCII, non-Unicode text data, like we do in Japan.

I understand that. It would affect all countries that use non-ascii
encodings.

Due to ruby's dynamic nature we could probably implement most of what is
required for international strings with the current ruby version. The
biggest problems that I see are:

- Determining the encoding of string literals in source code. This should
  be specified in the source file itself. Perhaps it's possible to
  implement this by overriding "require" and "load": read the whole file
  into memory and convert it to a user/system-specific default character
  set before calling eval on it.
- Determining the encoding of strings that describe file system paths. I
  have no idea if operating systems provide this information to user space
  applications.

Anyone interested in working on it?

Tobias

Curt Sampson

Aug 1, 2002, 6:45:14 AM

On Thu, 1 Aug 2002, Yukihiro Matsumoto wrote:

> Reasonable. How do you want to specify the reading/writing charset?

Well, what Java does works fine for me. Basically, Java has "streams"
which do binary I/O, and "readers/writers" which do character I/O. So to
do character I/O, you hand your InputStream or OutputStream to a class
that extends Reader or Writer (such as InputStreamReader or
OutputStreamWriter) and does the character encoding conversion.
Typically methods that open files or whatever and return a reader or
writer will have two signatures: one which uses the "system default"
character encoding (set by the locale when you start the JVM), and the
other where you can explicitly specify the character encoding as a
parameter.
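
For comparison, here is roughly how the same two-signature idea looks with the I/O encoding support that later Ruby versions (1.9 and up) ship with. It is a sketch, not a proposal from this thread: the temporary file name and the sample text are made up, and the mode string "r:EUC-JP:UTF-8" means "external encoding EUC-JP, convert to UTF-8 on read".

```ruby
require "tempfile"

japanese = "\u65E5\u672C\u8A9E"   # sample text
converted = tagged = raw = nil

Tempfile.create("eucjp-demo") do |file|
  # Write the sample as raw EUC-JP bytes.
  file.binmode
  file.write(japanese.encode("EUC-JP"))
  file.close

  # Explicit encoding, converted to a common internal form while reading.
  converted = File.open(file.path, "r:EUC-JP:UTF-8") { |io| io.read }

  # Explicit encoding, no conversion: the string is only tagged as EUC-JP.
  tagged = File.open(file.path, "r:EUC-JP") { |io| io.read }

  # Binary I/O, the counterpart of a plain Java InputStream.
  raw = File.binread(file.path)
end

puts converted.encoding   # UTF-8
puts tagged.encoding      # EUC-JP
puts raw.bytesize         # 6
```

Opening a file without any encoding in the mode string falls back to Encoding.default_external, which Ruby derives from the locale, much like the JVM's system default.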

This makes it easy to write "auto-sensing" InputReaders, too; so
long as they can read enough in advance of the first read from the
consumer, they can buffer it and look to see if it's, for example,
Shift_JIS versus EUC-JP.

But I'd be open to other ways of doing this, too.

BTW, I'd prefer to use the term "character encoding" rather than
"character set," as technically, the character set is just the
characters themselves, and not their assignment to binary numbers
or number sequences. Also, it would probably be best to use the
IANA standard names for character encodings (though they call them
"character sets"), available at

http://www.iana.org/assignments/character-sets

Curt Sampson

Aug 1, 2002, 6:54:08 AM

On Thu, 1 Aug 2002, Tobias Peters wrote:

> ...

> We also need rules how to combine strings with different encoding then.
> concatenating two strings encoded in koi8-r and iso-8859-1, respectively,
> may only be possible when the result is encoded in some unicode
> representation.

Yeah. This is getting into complex nightmare city. That's why I'd prefer
to have the basic system just work completely in Unicode. One could have
a separate character system (character and string classes, byte-stream
to char converters, etc.) to work with this tagged format if one wished.

> Are there any other character sets of relevance that are not part of
> unicode yet?

Yeah. There are tons of obscure Japanese characters that are not in and
will never be in Unicode, some of which exist in various other character
sets. In particular there's Mojikyo (http://www.mojikyo.org/) which is
at 80,000 characters and growing.

> I understand that Mojikyo will not be folded into unicode due to
> political reasons.

Not just political reasons, but practical reasons. Unicode is designed
to work if you restrict yourself to using only 16-bit chars, and I
expect most programs are going to limit themselves to that. So even if
it were folded in to the extension space, most people wouldn't use it.

Jan Witt

Aug 1, 2002, 7:00:47 AM

As someone who just wants to use a set of different character sets, I do
not find the current discussions about Unicode, I18N and all the
different standards very helpful. Some of the discussions also display
some unpleasant parochialism, maybe traceable to some conceptual
inconsistencies. I suspect that "character" is an ambiguous concept as
used in the standards. To clarify matters, we must distinguish between
meaning, representation and encodings.

Let us try and agree on a few basic issues:

What you see on the screen, analyze via OCR, write on paper or see in a
book are elementary shapes or "glyphs" (from ancient Greek glyphe = the
carving). These glyphs may be scaled or otherwise deformed, emboldened
etc., without introducing interpretation problems on that level. (Size
and emphasis play important roles on the level of words and sentences,
but not typically on the encoding level of single characters.)

Just as on the level of textual terms there are overlappings of terms
and meanings, namely synonyms and homonyms, so on the glyph level there
are synmorphs and homomorphs when we produce mappings of glyphs and
[language] characters.

A trivial case of homomorphism is the relationship between the glyphs
representing an uppercase Latin H and the Greek letter Eta.

The classic case of synmorphism is the use of a variety of fonts (sets
of glyph-character pairings) within a single language.

A not so simple case is synmorphism from context dependency, as in all
languages written with Arabic characters, where the choice of glyph for
a given character depends on the position of the character in the word
(beginning, middle, end).

So far, I have assumed that we are talking about characters/glyphs that
follow linear textual orderings like:

Most modern European languages, and Japanese when written horizontally:
glyphs left to right, lines top down

Japanese (traditional vertical layout): glyphs top down, lines right to
left

Arabic, Hebrew (printed forms): glyphs right to left, lines top down

There are 2 exceptions to this:
diacritics (small marks above, below, beside or within a glyph)
ligatures (intertwining of 2 or more characters, e.g. in Indian languages)

Since the advent of Adobe PostScript and MS TrueType, glyphs have become
manageable as geometric objects, which means that they can be encoded
and decoded as such.

Also, for all languages, people use dictionaries. Doing this, they know
what language they are dealing with. Also, they know about the standard
collating sequence of that language. (In classical Spanish, do not
search for 'chorro' under 'c'.)

To produce an online or hard copy dictionary, you must choose a
"canonical representative" amongst the available fonts and select a
collating sequence. (There may be several to choose from.)

As a multilingual reader, I want my browser to find out what language a
piece of text is written in and to display the appropriate glyphs for
the different HTML elements as defined by the text author.

As a multilingual writer or OCR reader, I will tell my word processor
what language I want to use currently and it will give me a virtual
keyboard, provide me with a dictionary, a spelling checker etc.

Assuming that all glyphs are uniquely numbered, and knowing the
language, it should always be possible to retrieve text from the glyph
representation and to deal with it on any linguistic level desired.

The end user has to be given a list of standardized names of languages,
nothing more. He/she should not need to know any glyph or "character"
encodings explicitly.

For someone who wants to write a multilingual information system in
Ruby, the question is how the ideal situation just described can best be
approximated.

Jan

- Más sabe el diablo por viejo que por diablo. ( The
devil knows more from being old than from being the
devil.)


Tobias Peters

Aug 1, 2002, 7:23:21 AM

On Thu, 1 Aug 2002, Curt Sampson wrote:
> On Thu, 1 Aug 2002, Tobias Peters wrote:
> > Are there any other character sets of relevance that are not part of
> > unicode yet?
>

> Yeah. [... Mojikyo ...]

Sorry, with "other" I meant besides Mojikyo.

> Not just political reasons, but practical reasons. Unicode is designed
> to work if you restrict yourself to using only 16-bit chars, and I
> expect most programs are going to limit themselves to that.

I thought this was a design error, and Unicode has now roughly 32 bits for
code points, and I read somewhere it is expected that 21 bits will suffice
to encode all possible glyphs.

Tobias

Philipp Meier

Aug 1, 2002, 8:15:18 AM

On Thu, Aug 01, 2002 at 07:53:07PM +0900, Curt Sampson wrote:
> On Thu, 1 Aug 2002, Tobias Peters wrote:
>
> > ...
> > We also need rules how to combine strings with different encoding then.
> > concatenating two strings encoded in koi8-r and iso-8859-1, respectively,
> > may only be possible when the result is encoded in some unicode
> > representation.
>
> Yeah. This is getting into complex nightmare city. That's why I'd prefer
> to have the basic system just work completely in Unicode. One could have
> a separate character system (character and string classes, byte-stream
> to char converters, etc.) to work with this tagged format if one wished.

Can't we follow an approach like the way Ruby or Java handles numbers?
It's naturally possible to add a floating point number and an integer.
The result is a floating point number. I will sketch some ideas:

    class String
      alias_method :plain_concat, :+    # keep the built-in String#+

      def +(other)
        if other.instance_of?(self.class)
          plain_concat(other)                            # same encoding class
        else
          plain_concat(self.class.convert_from(other))   # convert first
        end
      end
    end

    class ISO88591 < String
      def self.convert_from(other)
        if other.is_a?(ASCII)
          new(other)                  # ASCII is a subset of ISO-8859-1
        elsif other.is_a?(UTF8)
          from_utf8(other)
        else
          convert_from(UTF8.convert_from(other))   # go via UTF-8
        end
      end

      def self.from_utf8(utf8)
        # ...
      end

      def to_utf8
        # ...
      end
    end

    class UTF8 < String
      def self.convert_from(other)
        other.to_utf8
      end
    end

The convert method can of course try to delegate everything to a UTF8
class, so that ascii -> iso-8859-1 will result in ascii -> utf8 ->
iso8859-1. That means when introducing a new encoding one must only
provide a method to convert from / to utf8.

-billy.

--
Meisterbohne Söflinger Straße 100 Tel: +49-731-399 499-0
eLösungen 89077 Ulm Fax: +49-731-399 499-9

Alexander Bokovoy

Aug 1, 2002, 8:31:41 AM

On Thu, Aug 01, 2002 at 07:53:07PM +0900, Curt Sampson wrote:

> > I understand that Mojikyo will not be folded into unicode due to
> > political reasons.
>
> Not just political reasons, but practical reasons. Unicode is designed
> to work if you restrict yourself to using only 16-bit chars, and I
> expect most programs are going to limit themselves to that. So even if
> it were folded in to the extension space, most people wouldn't use it.

Unicode 3.1 is 32-bit wide. I see no reason for projects like Mojikyo to
exist when the same thing can be done perfectly well in 32-bit Unicode.
Also, the Mojikyo Institute's restrictions on Mojikyo fonts are
counterproductive if they want to make their encoding system more
widespread and useful.
--
/ Alexander Bokovoy
---
Maintence window broken

Tobias Peters

Aug 1, 2002, 8:31:18 AM

On Thu, 1 Aug 2002, Yukihiro Matsumoto wrote:

> Reasonable. How do you want to specify the reading/writing charset?

How about

    class IO
      USER_DEFAULT_ENCODING = nl_langinfo(CODESET) # libc function, probably
                                                   # not available on windows
      attr_accessor :encoding
    end

Have read/write look at @encoding. It should be added to the IO
constructors in some way, but made optional.

Tobias

Curt Sampson

Aug 1, 2002, 8:53:11 AM

On Thu, 1 Aug 2002, Tobias Peters wrote:

> > Not just political reasons, but practical reasons. Unicode is designed
> > to work if you restrict yourself to using only 16-bit chars, and I
> > expect most programs are going to limit themselves to that.
>
> I thought this was a design error, and Unicode has now roughly 32 bits for
> code points....

No, and no. As specified in section 2.2 of the standard (I quote
from 3.0, but 3.1 and 3.2 do not appear to have updated these
sections):

Plain Unicode text consists of sequences of 16-bit Unicode character
codes.... From the full range of 65,536 code values, 63,486 are
available to represent characters with single 16-bit code values,
and 2,048 code values are available to represent an additional
1,048,544 characters through paired 16-bit code values.

So note that the character codes are still 16-bit, though sometimes
it takes two character codes to represent a character.
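
The pairing arithmetic is easy to check in Ruby. The sketch below uses the standard UTF-16 formula; the constant and method names are made up for illustration, and the remark about the excluded values is my reading of the numbers, not something the quoted text states.

```ruby
HIGH_SURROGATES = (0xD800..0xDBFF)   # 1,024 code values
LOW_SURROGATES  = (0xDC00..0xDFFF)   # 1,024 code values

# Split a code point above U+FFFF into its two 16-bit code values.
def surrogate_pair(code_point)
  unless (0x10000..0x10FFFF).cover?(code_point)
    raise ArgumentError, "no surrogate pair needed or possible"
  end
  offset = code_point - 0x10000
  [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
end

pair = surrogate_pair(0x1D11E)   # MUSICAL SYMBOL G CLEF

# Cross-check against Ruby's own UTF-16 encoder.
units = [0x1D11E].pack("U").encode("UTF-16BE").unpack("n*")

surrogate_values = HIGH_SURROGATES.size + LOW_SURROGATES.size   # 2,048
pair_count       = HIGH_SURROGATES.size * LOW_SURROGATES.size   # 1,048,576

# The standard's figures are slightly smaller (63,486 and 1,048,544),
# apparently because it leaves out the two values ending in FFFE and
# FFFF in each of the 17 planes.
single_values = 65_536 - surrogate_values - 2
paired_chars  = pair_count - 2 * 16

puts pair.map { |u| format("%04X", u) }.join(" ")   # D834 DD1E
```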

Section 2.9, "Conforming to the Unicode Standard" states as its
first precept that, "An implementation that conforms to the Unicode
Standard... treats characters as 16-bit units." It goes on to say
that the standard, "does not require that an application be capable
of interpreting and rendering all Unicode characters so as to be
conformant."

Section 5.4 really gets into the details of handling surrogate pairs.
There are essentially three levels at which you can do this: "none",
where you treat each 16-bit code as an individual, unknown character and
do not guarantee integrity of the pairs (i.e., you don't guarantee that
you won't split them); "weak", where you interpret some pairs, but treat
the others as "none" treats them; and "strong", where you interpret some
pairs, and guarantee the integrity of them.

It continues with:

As with text-element boundaries, the lowest-level string-handling
routines (such as wcschr) do not necessarily need to be modified
to prevent surrogates from being damaged. In practice, it is
sufficient that only certain higher-level processes...be aware of
surrogate pairs; the lowest-level routines can continue to function
on sequences of 16-bit Unicode code values without having to treat
surrogates specially.

So from all of this, it's pretty obvious that the expectation is that
only those systems that really need to work with surrogates are going to
be doing more than the bare minimum to support them, and those that do
support it will be doing it at something above the basic language level.

I should also point out that Java has no special support for surrogates
in the String class; String.length() returns the number of code values
in the string, not the number of characters. It's not a problem in practice.

Curt Sampson

Aug 1, 2002, 8:57:03 AM

On Thu, 1 Aug 2002, Alexander Bokovoy wrote:

> Unicode 3.1 is 32-bit wide.

I have just looked at my 3.0 standard and the 3.1 and 3.2 updates on the
web site, and I do not see any evidence of this. Did I miss something?
See the message I just posted for the details as I know them.

> I see no reason for projects like Mojikyo to exist when the same
> thing can be done perfectly well in 32-bit Unicode.

Mojikyo is doing things like setting code points for characters that
will never exist in Unicode, because those characters are combined due
to the character combining rules. Mojikyo has a different purpose from
Unicode: Unicode wants to make doing standard, day-to-day work easy;
Mojikyo wants to give maximum flexibility in the display of Chinese
characters. Given the number and complexity of kanji, these two aims are
basically incompatible.

Alexander Bokovoy

Aug 1, 2002, 9:42:57 AM

On Thu, Aug 01, 2002 at 09:55:48PM +0900, Curt Sampson wrote:
> On Thu, 1 Aug 2002, Alexander Bokovoy wrote:
>
> > Unicode 3.1 is 32-bit wide.
>
> I have just looked at my 3.0 standard and the 3.1 and 3.2 updates on the
> web site, and I do not see any evidence of this. Did I miss something?
> See the message I just posted for the details as I know them.

http://www.unicode.org/unicode/reports/tr19/tr19-9.html :

--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--

3 Relation to ISO/IEC 10646 and UCS-4

ISO/IEC 10646 defines a 4-byte encoding form called UCS-4. Since UTF-32 is
simply a subset of UCS-4 characters, it is conformant to ISO/IEC 10646 as
well as to the Unicode Standard.

As of the recent publication of the second edition of ISO/IEC 10646-1,
UCS-4 still assigns private use codepoints (0xE00000..0xFFFFFF and
0x60000000..0x7FFFFFFF) that are not in the range of valid Unicode
codepoints. To promote interoperability among the Unicode encoding forms
JTC1/SC2/WG2 has approved a motion removing those private use assignments:

Resolution M38.6 (Restriction of encoding space) [adopted unanimously]

"WG2 accepts the proposal in document N2175 towards removing the provision
for Private Use Groups and Planes beyond Plane 16 in ISO/IEC 10646, to
ensure internal consistency in the standard between UCS-4, UTF-8 and
UTF-16 encoding formats, and instructs its project editor [to] prepare
suitable text for processing as a future Technical Corrigendum or an
Amendment to 10646-1:2000."

While this resolution must still be approved as an Amendment to
10646-1:2000, the Unicode Technical Committee has every expectation that
once the text for that Amendment completes its formal balloting it will
proceed smoothly to publication as part of that standard.

Until the formal balloting is concluded, the term UTF-32 can be used to
refer to the subset of UCS-4 characters that are in the range of valid
Unicode code points. After it passes, UTF-32 will then simply be an alias
for UCS-4 (with the extra requirement that Unicode semantics are observed)

--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--

> > I see no reason for projects like Mojikyo to exist when the same
> > thing can be done perfectly well in 32-bit Unicode.
>
> Mojikyo is doing things like setting code points for characters that
> will never exist in Unicode, because those characters are combined due
> to the character combining rules. Mojikyo has a different purpose from
> Unicode: Unicode wants to make doing standard, day-to-day work easy;
> Mojikyo wants to give maximum flexibility in the display of Chinese
> characters. Given the number and complexity of kanji, these two aims are
> basically incompatible.

I still don't see why both goals should be incompatible a priori. But this
is possibly off-topic here. :)

--
/ Alexander Bokovoy
---

I went to a Grateful Dead Concert and they played for SEVEN hours. Great song.
-- Fred Reuss

Curt Sampson

Aug 1, 2002, 10:25:01 AM

On Thu, 1 Aug 2002, Alexander Bokovoy wrote:

> On Thu, Aug 01, 2002 at 09:55:48PM +0900, Curt Sampson wrote:
> > On Thu, 1 Aug 2002, Alexander Bokovoy wrote:
> >
> > > Unicode 3.1 is 32-bit wide.
> >
> > I have just looked at my 3.0 standard and the 3.1 and 3.2 updates on the
> > web site, and I do not see any evidence of this. Did I miss something?
> > See the message I just posted for the details as I know them.
>
> http://www.unicode.org/unicode/reports/tr19/tr19-9.html :

> [section 3, Relation to ISO/IEC 10646 and UCS-4]

Actually, I was looking for someone to attack my argument, not
support it. :-)

What this says is that they are removing some private code areas
in the ISO 10646 UCS-4 encoding so that it will become smaller and
compatible with UTF-32. And, as it says at the beginning of that
document:

UTF-32 is restricted in values to the range 0..0x10FFFF, which
precisely matches the range of characters defined in the Unicode
Standard (and other standards such as XML), and those representable
by UTF-8 and UTF-16.

So Unicode is not 32-bit in any sense of the word. A character in
the UTF-32 encoding of Unicode takes up 32 bits, but many of those
bits are unused.
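The arithmetic behind that can be checked directly. This is only a minimal sketch, and it assumes a present-day Ruby (1.9 or later, which postdates this thread) with its built-in Encoding support; the constant and variable names are made up for the illustration.

```ruby
# Largest code point allowed by the quoted range 0..10FFFF (hexadecimal).
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to store that value.
bits_needed = MAX_CODE_POINT.bit_length

# A UTF-32 code unit always occupies 4 bytes, whatever the character.
utf32_bytes = "a".encode("UTF-32BE").bytesize

# Values above the range are rejected when building a UTF-8 string.
out_of_range_rejected =
  begin
    0x110000.chr(Encoding::UTF_8)
    false
  rescue RangeError
    true
  end

puts "bits needed: #{bits_needed} of #{utf32_bytes * 8}"
puts "0x110000 rejected: #{out_of_range_rejected}"
```

So 21 of the 32 bits are enough to cover the whole range.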

> > Mojikyo wants to give maximum flexability in the display of Chinese
> > characters. Given the number and complexity of kanji, these two aims are
> > basically incompatable.
>
> I still don't see why both goals should be incompatible a priori. But this
> is possible an offtopic here. :)

Partly efficiency concerns. As the speed of CPUs increases relative
to memory, the relative cost of string handling (which is pretty
memory intensive) gets higher and higher. And also things like ease
of use; avoiding duplications makes things like pattern matching
and use of dictionaries much easier. (Imagine, for example, that
ASCII had two 'e's in it, and people used one or the other randomly,
as they liked. Now instead of writing s/feet/fleet/, you have to
write at least s/f[ee][ee]t/fleet/, or in certain fussy cases even
s/f([ee][ee])t/fl\1t/. Ouch.)
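To make the "two e's" problem concrete, here is a hedged sketch in a present-day Ruby. The pairing of the Latin "e" with the look-alike Cyrillic letter U+0435 is an assumption chosen for the illustration, not something taken from Mojikyo or Unicode unification.

```ruby
# Hypothetical stand-ins for the "two e's": Latin e and Cyrillic U+0435,
# which look the same in most fonts.
LATIN_E    = "e"
CYRILLIC_E = "\u0435"

# A word in which the writer mixed the two look-alike letters.
word = "f#{LATIN_E}#{CYRILLIC_E}t"

# The plain pattern no longer matches, so nothing is replaced.
plain = word.sub(/feet/, "fleet")

# Accept either letter in each position and keep whichever one was used.
tolerant = word.sub(/f([e\u0435]{2})t/, 'fl\1t')

puts plain == word   # true: the first substitution did nothing
puts tolerant
```

Every pattern that touches such a letter needs the same treatment, which is the ease-of-use cost described above.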

MikkelFJ

Aug 1, 2002, 11:04:10 AM

"Curt Sampson" <c...@cynic.net> wrote in message
news:Pine.NEB.4.44.02080...@angelic.cynic.net...

> On Thu, 1 Aug 2002, Tobias Peters wrote:
>
> > ...
> > We also need rules how to combine strings with different encoding then.
> > concatenating two strings encoded in koi8-r and iso-8859-1, respectively,
> > may only be possible when the result is encoded in some unicode
> > representation.
>
> Yeah. This is getting into complex nightmare city. That's why I'd prefer
> to have the basic system just work completely in Unicode. One could have
> a separate character system (character and string classes, byte-stream
> to char converters, etc.) to work with this tagged format if one wished.

But isn't this what matz suggests?
Each stream is tagged, that is the same as having different types. It's
basically just a different way to store the type while having a lot of
common string operations.
The only major issue I see is fixed-size characters versus variable-length
characters, for example UCS-4 versus UTF-8.
I think a string class for each type might make sense.

I think UTF-8 is universal enough to warrant its own string class, but for
fixed size formats, why should it matter what encoding is used? The
important thing is that the encoding is stored, possibly along with the byte
width of the character.

BTW: Unicode is not a fixed-width format. It is an almost fixed-width format,
but there are escape codes and options for future extensions. Hence UCS-4 is
a strategy with limited timespan. That's why I prefer UTF-8: it recognizes
the variable-length issue up front, yet usually takes up less space than
UCS-2 or UCS-4, except if you are from Asia, in which case you probably want a
different encoding.

One other detail about UTF-8: it's really nice to work with as raw 8-bit
characters. When writing a lexer for UTF-8 you can define anything above 127
to be a letter and A-Za-z to also be a letter. Then you can forget
everything about UTF-8 and still parse everything correctly, including
identifiers. UCS-2 and UCS-4 lexers get bloated or impossible due to the huge
lookup tables they need.
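A rough sketch of that byte-level lexing idea, assuming a present-day Ruby. The `identifiers` helper and the choice to allow digits after the first byte are additions made for this illustration.

```ruby
# Identifier pattern over raw bytes: ASCII letters, underscore, or any byte
# above 127. The /n flag makes the regexp work on bytes, not characters.
# Allowing digits after the first byte is an assumption of this sketch.
IDENTIFIER = /[A-Za-z_\x80-\xFF][A-Za-z0-9_\x80-\xFF]*/n

# Scan a UTF-8 source string as plain bytes, then relabel each token as UTF-8.
def identifiers(utf8_source)
  bytes = utf8_source.b   # binary copy of the input
  bytes.scan(IDENTIFIER).map { |token| token.force_encoding("UTF-8") }
end

tokens = identifiers("caf\u00E9 = na\u00EFve + x1")
puts tokens.inspect
```

The lexer never decodes a character, yet the multi-byte letters stay inside their tokens.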

For practical string processing, you very seldom actually need to index a
character sequence. Normally you simply break the string into pieces instead.
You could have an iterator object for the string class which does not care
about variable-length issues. It cannot index, but it can remember a
position even after a string is changed.
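Something close to such an iterator exists in present-day Ruby as StringScanner from the standard library. The sketch below only shows walking a variable-length string while remembering a position; it makes no attempt to survive changes to the string, so it covers just part of the idea.

```ruby
require "strscan"

# Three kanji (3 bytes each in UTF-8) followed by ASCII text.
scanner = StringScanner.new("\u65E5\u672C\u8A9E text")

first  = scanner.getch   # one whole character, however many bytes it takes
offset = scanner.pos     # byte offset reached so far, kept by the scanner
rest   = scanner.rest    # everything not yet consumed

puts "read #{first.bytesize} bytes, now at byte #{offset}"
```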

Mikkel

Yukihiro Matsumoto

Aug 1, 2002, 12:27:18 PM

Hi,

In message "Re: Unicode in Ruby now?"
on 02/08/01, Curt Sampson <c...@cynic.net> writes:

|Well, what Java does works fine for me. Basically, java has "streams"
|which do binary I/O, and "readers/writers" which do character I/O. So to
|do character I/O, you hand your InputStream or OutputStream to a class
|such as InputStreamReader or OutputStreamWriter, which does the
|character encoding conversion. Typically methods that open files or
|whatever and return a reader or writer will have two signatures: one
|which uses the "system default" character encoding (set by the locale
|when you start the JVM), and the other where you can explicitly specify
|the character encoding as a parameter.
|
|This makes it easy to write "auto-sensing" InputReaders, too; so
|long as they can read enough in advance of the first read from the
|consumer, they can buffer it and look to see if it's, for example,
|Shift_JIS versus EUC-JP.

Thank you for the information.
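For comparison, the stream/reader split described in the quoted text can be approximated in a present-day Ruby, where the open mode names the external encoding and the encoding to convert to. This is a sketch under that assumption; the file name is hypothetical and the feature did not exist when this thread was written.

```ruby
require "tmpdir"

ka = "\u304B"   # hiragana "ka"

decoded = Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.txt")   # hypothetical file name

  # "Stream" level: raw Shift_JIS bytes go to disk untouched.
  File.binwrite(path, ka.encode("Shift_JIS"))

  # "Reader" level: the mode string gives the external encoding and the
  # internal encoding to convert to while reading.
  File.open(path, "r:Shift_JIS:UTF-8") { |f| f.read }
end

puts decoded.encoding
```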

|BTW, I'd prefer to use the term "character encoding" rather than
|"character set," as technically, the character set is just the
|characters themselves, and not their assignment to binary numbers
|or number sequences. Also, it would probably be best to use the
|IANA standards for character encoding (though they call them
|"character sets") names, available at
|
| http://www.iana.org/assignments/character-sets

I use the following definitions:

character set:
a set of characters with a number assigned to each character.

code point:
the number assigned to a character in a particular
character set.

character encoding scheme:
the way to represent a sequence of code points.

Mojikyo is a character set. Unicode is a character set. UTF-8 is a
character encoding scheme for Unicode. Shift_JIS is a character
encoding scheme. ISO 10646 defines both a character set and encoding
schemes.

matz.

Alexander Bokovoy

Aug 1, 2002, 12:36:50 PM

On Thu, Aug 01, 2002 at 11:23:38PM +0900, Curt Sampson wrote:
> > > I have just looked at my 3.0 standard and the 3.1 and 3.2 updates on the
> > > web site, and I do not see any evidence of this. Did I miss something?
> > > See the message I just posted for the details as I know them.
> >
> > http://www.unicode.org/unicode/reports/tr19/tr19-9.html :
> > [section 3, Relation to ISO/IEC 10646 and UCS-4]
>
> Actually, I was looking for someone to attack my argument, not
> support it. :-)

:) The reason I pointed to that section is that very soon there will be no
difference between ISO/IEC 10646 and Unicode in terms of covered code
space. This means that as soon as that is accomplished, uniform expansion into
the unused bits of the 32-bit space will start, if the CJK community has an
interest in it, of course. As you may remember, there were some complaints
in the past about the 'small' code space for covering CJK in Unicode. That
will no longer be the case relatively soon.

> > > Mojikyo wants to give maximum flexability in the display of Chinese
> > > characters. Given the number and complexity of kanji, these two aims are
> > > basically incompatable.
> >
> > I still don't see why both goals should be incompatible a priori. But this
> > is possible an offtopic here. :)
>
> Partly efficiency concerns. As the speed of CPUs increases relative
> to memory, the relative cost of string handling (which is pretty
> memory intensive) gets higher and higher. And also things like ease
> of use; avoiding duplications makes things like pattern matching
> and use of dictionaries much easier. (Imagine, for example, that
> ASCII had two 'e's in it, and people used one or the other randomly,
> as they liked. Now instead of writing s/feet/fleet/, you have to
> write at least s/f[ee][ee]t/fleet/, or in certain fussy cases even
> s/f([ee][ee])t/fl\1t/. Ouch.

Well, it raises a completely different set of problems. It attacks the
foundation upon which the current meaning of character encoding is built.
Remember that 'character encoding' is usually understood as a way to address
and differentiate 'characters' in a 'string' using one property:
position in some abstract 'alphabet' which has little to do with real-life
language properties. For example, CP1251, which is used in Belarus and
other Slavic countries, has two 'i's: one from ASCII and another (with
_exactly_ the same glyph in fonts) for the Belarusian and Ukrainian languages.
There is no information in the CP1251 encoding to differentiate those two
except by attaching an external property table (which is done in the IANA
proposal by mapping all positions of the encoding to Unicode code points, which,
in turn, have all the needed properties assigned).
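The CP1251 case can be seen by mapping both byte values to Unicode. A minimal sketch, assuming a present-day Ruby, whose transcoder knows this code page under the name Windows-1251; the helper name is made up for the illustration.

```ruby
# Map a single CP1251 byte to its Unicode code point.
def cp1251_to_code_point(byte)
  [byte].pack("C").force_encoding("Windows-1251").encode("UTF-8").ord
end

latin_i    = cp1251_to_code_point(0x69)   # the 'i' inherited from ASCII
cyrillic_i = cp1251_to_code_point(0xB3)   # the Belarusian/Ukrainian 'i'

printf("0x69 -> U+%04X\n", latin_i)
printf("0xB3 -> U+%04X\n", cyrillic_i)
```

The two bytes land on different code points (U+0069 and U+0456), and it is those code points that carry the properties needed to tell the letters apart.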

What you are showing above is a need to perform operations on these 'external'
properties, as is done in ICU, for example. Actually, it would be much
more productive to implement a kind of Mojikyo inside ICU.

--
/ Alexander Bokovoy
---

Ever notice that even the busiest people are never too busy to tell you
just how busy they are?

Jan Witt

Aug 1, 2002, 2:51:43 PM

I beg your pardon,
you have all the wonderful standards at your
fingertips.
What is meant by "representing" a character?
How are glyphs mapped to code points?
Is that a many to many mapping?
What are the attributes of a glyph?
What are the attributes of a code point?
Outside of natural language text processing,
are there areas where the parsing of
non-Latin-1 strings is relevant? If so,
what are they?
Please help my ignorance
Jan

-
Ecclesiastes 1:9 The thing that hath been, it is that
which shall be; and that which is done is that which
shall be done: and there is no new thing under the
sun.

The King James Version (Authorized)

Yukihiro Matsumoto

Aug 1, 2002, 4:43:24 PM

Hi,

In message "Re: Unicode in Ruby now?"

on 02/08/02, Jan Witt <ontolog...@yahoo.com> writes:

|I beg your pardon,
|you have all the wonderful standards at your
|fingertips.

Probably you're asking Curt, but I will answer what I can.

|What is meant by "representing" a character?

|What are the attributes of a code point?

A code point is a numeric index assigned to a character. "Representing" a
character means encoding it, for example:

Japanese Hiragana "Ka" has code point 9259 (0x242B) in JIS;
EUC-encoded "Ka" is "\xa4\xab".

Japanese Hiragana "Ka" has code point 12363 (0x304B) in Unicode;
UTF-8-encoded "Ka" is "\xe3\x81\x8b".
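These values can be reproduced in a present-day Ruby, which is an assumption here since the Encoding support used below postdates this thread. The JIS value is derived from the EUC-JP bytes by clearing the high bit of each byte, which gives 0x242B (9259 in decimal).

```ruby
hiragana_ka = "\u304B"

unicode_code_point = hiragana_ka.ord                      # Unicode code point
utf8_bytes         = hiragana_ka.bytes                    # UTF-8 encoding
euc_bytes          = hiragana_ka.encode("EUC-JP").bytes   # EUC-JP encoding

# EUC-JP stores a JIS X 0208 code with the high bit set on both bytes,
# so clearing that bit gives the JIS code point back.
jis_code_point = ((euc_bytes[0] & 0x7F) << 8) | (euc_bytes[1] & 0x7F)

puts "Unicode: #{unicode_code_point}"
puts "UTF-8:   #{utf8_bytes.map { |b| format('%02x', b) }.join(' ')}"
puts "EUC-JP:  #{euc_bytes.map { |b| format('%02x', b) }.join(' ')}"
puts "JIS:     #{jis_code_point}"
```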

|Outside of natural language text processing,
|are there areas where the parsing of
|non-Latin-1 strings is relevant? If so,
|what are they?

Because some people in the world need it to represent their daily
text. My mail, memos, journal, and almost everything else are written in
non-Latin-1 strings (EUC-JP).

matz.

TAKAHASHI Masayoshi

Aug 1, 2002, 7:37:14 PM

Hi,

ma...@ruby-lang.org (Yukihiro Matsumoto) wrote:
> Mojikyo is a character set. Unicode is a character set.

IMHO, Mojikyo is a character-glyph set. It defines details
of glyph design (HANE, TOME, HARAI) which are unified in other
character sets (JIS and UCS).
That's the reason why Mojikyo is not (and should not be) unified into
Unicode, I think.

Regards,

TAKAHASHI 'Maki' Masayoshi E-mail: ma...@rubycolor.org

Dan Sugalski

Aug 2, 2002, 12:04:00 AM

At 1:50 PM +0900 8/1/02, Curt Sampson wrote:
>On Thu, 1 Aug 2002, Hal E. Fulton wrote:
>
>> Seriously, since you have some expertise, I'm sure your knowledge will
>> be valuable in improving Ruby... talk to vruz also.
>
>I doubt it. My opinion of the matter is that the correct way to
>do things is to go with Unicode internally. (This does not rule
>out processing non-Unicode things, but you process them as binary
>byte-strings, not as character strings.) You lose a little bit of
>functionality this way, but overall it's easy, fast, and gives you
>everything you really need.
>
>Unfortunately, a lot of Japanese programmers disagree with this. They
>feel the need, for example, to have separate code points for a single
>character, simply because one stroke is slightly different between the
>way Japanese and Chinese people write it. (The meaning is exactly the
>same.)

This is just a comment from an interested but mildly uninvolved
bystander (though I'm dealing with similar issues with Parrot) but...
Given that the people who've made these decisions have made them
about their native language (a language that is neither your nor my
native language) perhaps it's a bit presumptuous to decide that what
they've done is wrong and some other way is better. It'd be about the
same as someone else deciding that there's no need for a character
set to deal with upper and lowercase roman letters since, after all,
they represent the same thing. Or that you're only supporting
whatever Esperanto needs since that should be good enough for anyone.

This is someone's *language* you're dealing with. It existed long
before computers did, it's deeply rooted in culture, and is by far
more important than any computer issue. Language is important--it
conveys meaning and culture, and is the data. The computer is a tool.
If the tool can't deal with the language, it means the tool is
broken, not the language.
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Clifford Heath

Aug 2, 2002, 12:51:38 AM

TAKAHASHI Masayoshi wrote:
> It's the reason why Mojikyo is not (should not) unified in
> Unicode, I think.

That sounds fair, but computers still need to process such symbols.
IMO the ISO 10646 folk should be approached to allocate a 24-bit
block inside the UCS-4 encoding, but outside the Unicode space.
That way the UCS-4 Mojikyo characters can be encoded using either
the 4/5/6 byte extension of UTF-8, or using the UTF-8 style of
encoding with 1/2/3/4 bytes (i.e. with an assumed UCS-4 top byte
that isn't zero, as with Unicode).

I agree with Dan's comments, and think this would be the best way
to resolve the issue.

--
Clifford Heath

Tobias Peters

Aug 2, 2002, 3:38:40 AM

On Fri, 2 Aug 2002, Clifford Heath wrote:
> IMO the ISO 10646 folk should be approached to allocate a 24-bit

[...]

> I agree with Dan's comments, and think this would be the best way
> to resolve the issue.

Then we would delay i18n string processing until the iso 10646 people have
made such a decision? This will probably never happen! And even if it
does, then not in the next few years. We need unicode in ruby *now*. It
seems we can get by with a choice of two possible canonical encodings to
be used for the result of concatenating strings with different encodings:
utf-8 and some Mojikyo encoding, based on the encodings of the original
strings. Let's implement it.

Tobias

Curt Sampson

Aug 2, 2002, 4:15:37 AM

On Fri, 2 Aug 2002, Dan Sugalski wrote:

> Given that the people who've made these decisions have made them
> about their native language (a language that is neither your nor my
> native language) perhaps it's a bit presumptuous to decide that what
> they've done is wrong and some other way is better.

I don't think it's so presumptuous.

First, I do know some Japanese and Chinese, including kanji, so this
stuff isn't a complete mystery to me. Second, through experience
building I18N web sites and suchlike, I'd say I have a better
understanding of I18N issues than many Japanese programmers do.
Certainly Japanese systems builders more often than not do not
take I18N issues into account. (And for many of them, why should they?
They're not interested in anything outside of Japan, so it's not worth
spending time, effort and money on it.)

Also, note that the Unicode-haters in Japan, while noisy amongst
programmers, are far from representative of the users. Most Japanese
couldn't care less whether you even have 薔薇 (bara--rose) available in
kanji, much less anything in the Unicode surrogate area.

For those who really do need support for all the kanji, rather than just
those generally used in modern life, there are solutions that
are much better than Unicode will ever be, and they should use those.
Those solutions are also much higher in overhead (for both programming
and machine resources), though, and that burden shouldn't be put on all
software.

An analogy might be text files versus DTP. ASCII doesn't have things
like font sizes, kerning information, and so on, so it alone isn't
useful for DTP. For that you use another, customized software system
that adds the capabilities you need. But this is a good thing: it means
that all those systems out there that don't care about font, size,
kerning, etc. (such as your local database server) don't deal with the
overhead of it.

> If the tool can't deal with the language, it means the tool is
> broken, not the language.

No tool can deal with everything in the language. ASCII, or even
ISO-8859-1, certainly doesn't deal with a huge number of issues in
English. Yet ASCII does a good job for a lot of everyday needs,
and doesn't cost too much, so it serves us well. (Certainly seems
to be working ok in this e-mail message, anyway!)

Curt Sampson

Aug 2, 2002, 4:54:13 AM

On Fri, 2 Aug 2002, Alexander Bokovoy wrote:

> :) The reason I've pointed to that section, is that there will be no
> difference in ISO/IEC 10646 and Unicode very soon in sense of covered code
> space.

Right. They're *reducing* the ISO/IEC code space to match Unicode.

> This means as soon as it will be acomplished, uniform expansion to
> unused bits in 32-bit space will start.

I very, very much doubt that. Remember, Unicode uses 16-bit code values,
and all high and low surrogate characters are immediately identifiable.
Breaking this would result in much, much pain.

> If CJK community will have
> interest in it, of course. As you may remember, there were some complaints
> in past about 'small' code space for covering CJK in Unicode. It wouldn't
> be so relatively soon.

Yeah, but for day to day use, nobody even uses the surrogate pairs. This
is part of the whole point of Unicode; you can safely ignore them or do
only very minimal processing to deal with them, and all but specialized
applications will still work.

Curt Sampson

Aug 2, 2002, 5:25:44 AM

On Fri, 2 Aug 2002, MikkelFJ wrote:

> > Yeah. This is getting into complex nightmare city. That's why I'd prefer
> > to have the basic system just work completely in Unicode. One could have
> > a separate character system (character and string classes, byte-stream
> > to char converters, etc.) to work with this tagged format if one wished.
>
> But isn't this what matz suggest?
> Each stream is tagged, that is the same as having different types. It's
> basically just a different way to store the type while having a lot of
> common string operations.

No, because then you have to deal with conversions. Most popular
character sets are convertible to Unicode and back without loss. That is
not true of any arbitrary pair of character sets, though, even if you go
through Unicode.

The reason for this is as follows. Say character set Foo has split
a unified hanji, "a", and also has "A". When converting to Unicode,
that "A" will be preserved because it's assigned a code point in
a compatibility area, and when you convert back from Unicode, that
"A" will be translated to "A" in Foo. However, if character set Bar
does not have "A", just "a", the "A" will be converted to "a". When you
go from Bar back to Unicode, you end up with "a" again because there's
no way to tell that it was originally "A" when you converted out.
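The Foo/Bar round trip above can be simulated with toy lookup tables. Everything in this sketch is hypothetical: the table contents, the symbols standing in for Unicode code points, and the `convert` helper exist only to show where the information is lost.

```ruby
# Hypothetical mapping tables. :split_form stands for the variant ("A") that
# only Foo distinguishes; :unified_form stands for the ordinary character ("a").
FOO_TO_UNICODE = { "A" => :split_form, "a" => :unified_form }.freeze
UNICODE_TO_FOO = FOO_TO_UNICODE.invert.freeze
UNICODE_TO_BAR = { split_form: "a", unified_form: "a" }.freeze   # Bar folds both
BAR_TO_UNICODE = { "a" => :unified_form }.freeze

# Convert a sequence of items through one lookup table.
def convert(items, table)
  items.map { |item| table.fetch(item) }
end

original = ["A", "a"]

# Foo -> Unicode -> Foo keeps the distinction.
via_unicode = convert(convert(original, FOO_TO_UNICODE), UNICODE_TO_FOO)

# Foo -> Unicode -> Bar -> Unicode -> Foo loses it.
in_bar  = convert(convert(original, FOO_TO_UNICODE), UNICODE_TO_BAR)
via_bar = convert(convert(in_bar, BAR_TO_UNICODE), UNICODE_TO_FOO)

puts via_unicode.inspect
puts via_bar.inspect
```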

But there's an even better reason than this for converting to
Unicode on input, rather than doing internal tagging. If you don't
have conversion tables for a particular character encoding, it's
much better to find out at the time you try to get the information
into the system than at some arbitrary later point when you try
to do a conversion. That way you know where the problem information
is coming from.
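A sketch of what converting on input looks like, assuming a present-day Ruby with String#encode (not available when this was written). The `read_as_unicode` helper is a made-up name; the point is only that bad or unconvertible data raises at the boundary, while its source is still known.

```ruby
# Decode external bytes to UTF-8 at the point of input. Anything Ruby cannot
# convert raises here, rather than at some later concatenation or output.
def read_as_unicode(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

good = read_as_unicode("\xA4\xAB".b, "EUC-JP")   # a complete EUC-JP character

begin
  read_as_unicode("\xA4".b, "EUC-JP")            # truncated two-byte sequence
  failed_early = false
rescue EncodingError => e
  failed_early = true
  warn "rejected on input: #{e.class}"
end

puts good.encoding
```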

In terms of interface, I would say:

1. Continue to use String as it is for "binary" data. This is
efficient, if you don't need to do much processing.

2. Add a UString or similar for dealing with UTF-16 data. There's
no need for surrogate support in this, for reasons I will get into
below, so this is straight fixed width. Reasonably efficient (almost
maximally efficient for those of us using Asian languages :-)) and
very easy to use.

3. Add other, specialized classes when you need to do special
purpose things. No need for this in the standard distribution.

> BTW: Unicode is not a fixed-width format.

In terms of code values, it is fixed width. However, some characters are
represented by pairs of code values.

> ...but there are escape codes...

No, there are no escape codes. The high and low code values for
surrogate characters have their own special areas, and so are easily
identifiable.

> and options for future extensions.

Not that I know of. Can you explain what these are?

> Hence UCS-4 is a strategy with limited timespan.

Not at all, unless they decide to change Unicode to the point where it
no longer uses 16-bit code values, or add escape codes, or something
like that. That would be backward-incompatible, severely complicate
processing, and generally cause all hell to break loose. So I'd rate
this as "Not Likely."

Here are a few points to keep in mind about Unicode processing:

1. The surrogate pairs are almost never used. Two years ago
there weren't even any characters assigned to those code points.

2. There are many situations where, even if surrogate pairs
are present, you don't know or care, and need do nothing to
correctly deal with them.

3. Broken surrogate pairs are not a problem; the standard says you
must be able to ignore broken pairs, if you interpret surrogate
pairs at all.

3. The surrogate pairs are extremely easy to distinguish, even
if you don't interpret them.

4. The code for dealing with surrogate pairs well (basically,
not breaking them) is very simple.
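As a rough illustration of how little code the last two points need: high surrogates occupy D800..DBFF and low surrogates DC00..DFFF, so a range check on the 16-bit code values is enough. The sketch assumes a present-day Ruby for the UTF-16 conversion, and the method name is made up.

```ruby
HIGH_SURROGATES = 0xD800..0xDBFF
LOW_SURROGATES  = 0xDC00..0xDFFF

# True if cutting the array of 16-bit code values at `index` would separate a
# high surrogate from the low surrogate that follows it.
def splits_pair?(units, index)
  index > 0 && index < units.length &&
    HIGH_SURROGATES.cover?(units[index - 1]) &&
    LOW_SURROGATES.cover?(units[index])
end

# "a", U+1D11E (outside the 16-bit range), "b" as UTF-16 code values.
units = "a\u{1D11E}b".encode("UTF-16BE").unpack("n*")

puts units.map { |u| format("%04X", u) }.join(" ")
puts (0..units.length).map { |i| splits_pair?(units, i) }.inspect
```

Only the cut between the two surrogate values is flagged; every other split point is safe.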

The implication of point 1 is that one should not spend a lot of effort
dealing with surrogate pairs, as very few users will ever use them. Very
few Asian users will ever use them in their lifetimes, in fact.

The implication of points 2 and 3 is that not everything that deals
with Unicode has to deal with, or even know about, surrogate pairs. If
you are writing a web application, for example, your typical fields you
just take as a whole from the web browser or database, and give as a whole
to the web browser or database. Thus only the web browser really has any
need at all to deal with surrogate pairs.

If you take a substring of a string and in the process end up with
a surrogate pair half on either end, that's no problem. It just
gets ignored by whatever facilities deal with surrogate pairs, or
treated as an unknown character by those that don't (rather than
two unknown characters for an unsplit surrogate pair).

The only time you really run into a problem is if you insert
something into a string; there's a chance you might split the
surrogate pair, and lose the character. This is pretty uncommon
except in interactive input situations, though, where you know how
to handle surrogate pairs and can avoid doing this, or where you
don't know and the user can't see the characters anyway.

Well, another area you can run into problems with is line wrapping, but
there's no single algorithm for that anyway, and plenty of algorithms
break on languages for which they were not designed. So there you should
add some very simple code that avoids splitting surrogate pairs. (This
code is much simpler than the line wrapping code anyway, so it's hardly
a burden.) That shows the advantages of points 3 and 4 (essentially the
same point).

So I propose just what the Unicode standard itself proposes in
section 5.4: UString (or whatever we call it) should have the
Surrogate Support Level "none"; i.e., it completely ignores the
existence of surrogate pairs. Things that use UString that have
the potential to encounter surrogate pair problems or wish to
interpret them can add simple or complex code, as they need, to
deal with the problem at hand. (Many users of UString will need to
do nothing.)

Note that there's a big difference between this and your UTF-8
proposal: ignoring multibyte stuff in UTF-8 is going to cause much,
much more lossage because there's a much, much bigger chance of
breaking things when using Asian languages. With UTF-16, you probably
won't even encounter surrogates, whereas with Japanese in UTF-8,
pretty much every character is multibyte.

MikkelFJ

Aug 2, 2002, 6:09:11 PM

"Curt Sampson" <c...@cynic.net> wrote in message
news:Pine.NEB.4.44.02080...@angelic.cynic.net...

> > But isn't this what matz suggest?

> > Each stream is tagged, that is the same as having different types. It's
> > basically just a different way to store the type while having a lot of
> > common string operations.
>
> No, because then you have to deal with conversions. Most popular
> character sets are convertable to Unicode and back without loss. That is
> not true of any arbitrary pair of character sets, though, even if you go
> through Unicode.

Not really, you just produce a runtime error that the data cannot be - say -
concatenated. You just use the same vehicle to carry different cargo.
You can have special functions for to-Unicode and from-Unicode conversion,
and similar ones for popular Asian scripts.

> But there's an even better reason than this for converting to
> Unicode on input, rather than doing internal tagging. If you don't
> have conversion tables for a particular character encoding, it's
> much better to find out at the time you try to get the information
> in to the system than at some arbitrary later point when you try
> to do a conversion. That way you know where the problem information
> is coming from.

That is essentially static versus dynamic typing. I'd say in most situations
the application would be pretty well aware of what it is doing. The
tagging allows a generic routine to handle multiple formats if it so chooses,
or it could raise an error if it got anything but Unicode (or whatever).

> 2. Add a UString or similar for dealing with UTF-16 data. There's

Obviously you know more about Unicode than most. What is the practical
difference between UCS-4, UCS-2 and UTF-16? Is it that "extended
characters" - or surrogates - will take up more space than UCS-4 but
typically take up the same space as UCS-2?

> > and options for future extensions.
>
> Not that I know of. Can you explain what these are?

I guess you know more about this than I do. I can't give you details; I only
have it from memory. It is possible that it is covered by reserved ranges of
code pairs.
In that case UCS-4 should be sufficient.

>
> > Hence UCS-4 is a strategy with limited timespan.

> Not at all, unless they decide to change Unicode to the point where it
> no longer uses 16-bit code values, or add escape codes, or something
> like that. That would be backward-incompatable, severely complicate
> processing, and generally cause all hell to break lose. So I'd rate
> this as "Not Likely."

It wouldn't be the first time hell breaks loose in this area though.

> 2. There are many situations where, even if surrogate pairs
> are present, you don't know or care, and need do nothing to
> correctly deal with them.

Does this mean that UCS-2 is the best format?

> So I propose just what the Unicode standard itself proposes in
> section 5.4: UString (or whatever we call it) should have the
> Surrogate Support Level "none"; i.e., it completely ignores the
> existence of surrogate pairs. Things that use UString that have
> the potential to encounter surrogate pair problems or wish to
> interpret them can add simple or complex code, as they need, to
> deal with the problem at hand. (Many users of UString will need to
> do nothing.)

Is that UCS-2 or UTF-16 then?

>
> Note that there's a big difference between this and your UTF-8
> proposal: ignoring multibyte stuff in UTF-8 is going to cause much,
> much more lossage because there's a much, much bigger chance of
> breaking things when using Asian languages. With UTF-16, you probably
> won't even encounter surrogates, whereas with Japanese in UTF-8,
> pretty much every character is multibyte.

I did not mean that you should ignore the content. But you can process it
as if it were ASCII because in many languages everything that is not text is
found in the ASCII range. Due to the way UTF-8 is encoded you never risk
getting a spurious ASCII character following this path. For example, you can
find delimited text simply by scanning from one double quote to the next.
Everything in between is sound text, possibly in UTF-8 - you do not need to
care about this. Subsequently you may wish to convert the delimited string
into UCS-2 or whatever.

This approach avoids complex character type lookups when parsing text. It
will for example work for XML and Ruby.

In order to add international identifier support to a UTF-8 stream processed
as ASCII you can use the following pattern (I believe Ruby already does
something similar).

identifier = /[A-Za-z_\x80-\xfd][A-Za-z_\x80-\xfd]*/
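The delimiter-scanning claim can be tried out as follows. This is a sketch under the assumption of a present-day Ruby; `first_quoted` is a made-up helper and it assumes both double quotes are present in the input.

```ruby
# Bytes of a multi-byte UTF-8 character are all 0x80 or above, so none of
# them can be mistaken for an ASCII delimiter such as the double quote.
all_high = "\u65E5\u672C\u8A9E".bytes.all? { |b| b >= 0x80 }

# Find the first quoted string by scanning raw bytes for two double quotes.
# Assumes the input really contains an opening and a closing quote.
def first_quoted(utf8_source)
  bytes   = utf8_source.b
  open_q  = bytes.index('"')
  close_q = bytes.index('"', open_q + 1)
  bytes[(open_q + 1)...close_q].force_encoding("UTF-8")
end

text = first_quoted("name = \"\u65E5\u672C\u8A9E\" # comment")

puts all_high
puts text.valid_encoding?
```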

Mikkel

Yukihiro Matsumoto

Aug 3, 2002, 5:52:00 AM

Hi,

In message "Re: Unicode in Ruby now?"

on 02/08/02, Tobias Peters <tpe...@invalid.uni-oldenburg.de> writes:

|Then we would delay i18n string processing until the iso 10646 people have
|made such a decision? This will probably never happen! And even if it
|does, then not in the next few years. We need unicode in ruby *now*. It
|seems we can get by with a choice of two possible canonical encodings to
|be used for the result of concatenating strings with different encodings:
|utf-8 and some Mojikyo encoding, based on the encodings of the original
|strings. Let's implement it.

I'm not against Unicode or any other charset. I just want
applications written in Ruby to be able to choose their canonical encodings.
Many of them will choose Unicode in the future. But I don't want to force
Unicode in any way, when EUC-JP is good enough. And I'm implementing
it now.

matz.

Curt Sampson

Aug 4, 2002, 4:59:13 AM

On Sat, 3 Aug 2002, MikkelFJ wrote:

> Obviously you know more about Unicode than most. What is the practical
> difference between UCS-4, UCS-2 and UTF-16.

I don't have my spec. handy, so I'm going from memory here; someone with
the spec in front of him should correct me if I'm wrong.

UCS-4 is a 4-byte encoding, and UCS-2 is a two-byte encoding for ISO-10646.
UCS-2 is similar to UTF-16, which is a Unicode encoding.

> Is it that "extended
> characerts" - or surrogates - will take on more space than UCS-4 but
> typically take up the same space as UCS-2?

All characters take up 4 bytes in UCS-4. Each code value takes up
two bytes in UCS-2 and UTF-16; some characters need two code values.
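The sizes can be compared side by side in a present-day Ruby (an assumption, as is the use of UTF-32 as a stand-in for UCS-4, which Ruby does not offer under that name). The two sample characters and the `sizes` helper are chosen just for this sketch.

```ruby
bmp_char  = "\u304B"      # hiragana "ka", inside the 16-bit range
supp_char = "\u{1D11E}"   # musical G clef, outside the 16-bit range

# Storage needed for one character in each encoding form.
def sizes(char)
  {
    utf32_bytes: char.encode("UTF-32BE").bytesize,
    utf16_units: char.encode("UTF-16BE").bytesize / 2,
    utf8_bytes:  char.bytesize
  }
end

puts sizes(bmp_char).inspect
puts sizes(supp_char).inspect
```

The first character needs one 16-bit code value and the second needs two, while both take 4 bytes in UTF-32.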

> > Not at all, unless they decide to change Unicode to the point where it
> > no longer uses 16-bit code values, or add escape codes, or something
> > like that. That would be backward-incompatable, severely complicate
> > processing, and generally cause all hell to break lose. So I'd rate
> > this as "Not Likely."
>
> It wouldn't be the first time hell breaks loose in this area though.

It would for Unicode. I don't think they're likely to completely
break backwards compatibility.

> > 2. There are many situations where, even if surrogate pairs
> > are present, you don't know or care, and need do nothing to
> > correctly deal with them.
>
> Does this means that UCS-2 is the best format?

In my opinion, yes.

> I did not mean so that you should ignore the content. But you can process it
> as if it were ASCII because in many languages everthing that is not text is
> found in the ASCII range. Due to the way UTF-8 is encoded you never risc
> getting a spurious ASCII character following this path. For example, you can
> find delimited text simply scanning from one double quote to the next.
> Everything in between is a sound text possibly in UTF-8 - you do not need to
> care about this.

Right. The same is true of UTF-16. However, UTF-16 has the advantage
that it's more compact when representing Japanese or other Asian
languages, and it's easier to manipulate individual characters.

Clifford Heath

Aug 4, 2002, 7:21:38 PM

Curt Sampson wrote:
> Remember, Unicode uses 16-bit code values,

No. Unicode uses UCS-4 characters, 32 bits. It also provides UCS-2,
which has surrogates, which don't allow easy extension to encoding
all UCS-4 characters. However that's not a good argument why programs
should deal with characters as anything less than 32-bit. UCS-2 has
always been a broken encoding and should be avoided, but UTF-8
resolves the issue (up to 31 bits anyway).

--
Clifford Heath

Curt Sampson

Aug 4, 2002, 9:42:23 PM

On Mon, 5 Aug 2002, Clifford Heath wrote:

> Curt Sampson wrote:
> > Remember, Unicode uses 16-bit code values,
>
> No. Unicode uses UCS-4 characters, 32 bits. It also provides UCS-2,
> which has surrogates, which don't allow easy extension to encoding
> all UCS-4 characters.

These statements are both very wrong. Please consult the Unicode
specification or read previous messages here under this subject
line.

Clifford Heath

Aug 4, 2002, 11:47:44 PM

Curt Sampson wrote:
> These statements are both very wrong.

I was deliberately being "reinterpretive", but what I said is the effective
truth. If you want to do Unicode-3 correctly and simply, then a 32 bit
internal representation is the right one - surrogates simply recreate
exactly the same problems of variable-length encoding that plagued
earlier efforts at providing a simple way (for the programmer) to code
correctly. Externally, a more compact encoding is needed (utf-8 or utf-16
are valid choices), but internally, UTF-16 is bogus in the extreme.
Even internally, *if appropriately hidden* behind an API that only exposes
whole characters, a more compact encoding (such as I've recently
described) can be worthwhile.

You seem to be so wedded to the Java/Unicode model that you can't see
out of the hole into which you've dug yourself.

--
Clifford Heath

Curt Sampson

Aug 5, 2002, 2:00:02 AM

On Mon, 5 Aug 2002, Clifford Heath wrote:

> Curt Sampson wrote:
> > These statements are both very wrong.
>
> I was deliberately being "reinterpretive", but what I said is the effective
> truth.

What part of "UTF-32 is restricted in values to the range 0..10FFFF₁₆,
which precisely matches the range of characters defined in the
Unicode Standard (and other standards such as XML), and those
representable by UTF-8 and UTF-16." (Unicode Standard Annex #19)
don't you understand?

Also note that UTF-32 is still "variable length" in some senses,
in that it can still have combining characters that you need to
interpret.
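A small Ruby sketch of the combining-character point, assuming Ruby 2.5 or later for `String#grapheme_clusters` (and 2.2 or later for `unicode_normalize`). `code_point_count` is a hypothetical helper.

```ruby
# Hypothetical helper: count 32-bit units in the UTF-32 form of a string.
def code_point_count(str)
  str.encode("UTF-32BE").bytesize / 4
end

decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

puts code_point_count(decomposed)             # two UTF-32 units...
puts decomposed.grapheme_clusters.length      # ...but one displayed character
puts decomposed.unicode_normalize(:nfc).length
```

Normalizing helps here only because a precomposed U+00E9 exists; many combining sequences have no single-code-point form.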

> If you want to do Unicode-3 correctly and simply, then a 32 bit
> internal representation is the right one...

You have addressed none of the points I made in my previous message,
where I said that one can do Unicode 3 correctly and simply in UTF-16.
Please address them.

> - surrogates simply recreate
> exactly the same problems of variable-length encoding that plagued
> earlier efforts at providing a simple way (for the programmer) to code
> correctly.

If you consider variable length a real problem, UTF-32 doesn't fix
it, since it still has combining characters.

> Externally, a more compact encoding is needed (utf-8 or utf-16
> are valid choices), but internally, UTF-16 is bogus in the extreme.

This is completely wrong.

> You seem to be so wedded to the Java/Unicode model that you can't see
> out of the hole into which you've dug yourself.

No. I'm going by stuff out of the Unicode 3 standard here, not just
the Java model.

If you'd actually work through some typical cases of string use and
see what happens when they encounter surrogate pairs, you'd see
that your analysis of the problem is not at all correct.

Clifford Heath

Aug 5, 2002, 7:18:28 PM

Curt Sampson wrote:
> What part of "UTF-32 is restricted in values to the range 0..10FFFF₁₆,
> which precisely matches the range of characters defined in the
> Unicode Standard (and other standards such as XML), and those
> representable by UTF-8 and UTF-16." (Unicode Standard Annex #19)
> don't you understand?

I see you have built a new 21-bit computer which is going to conquer
the world, have you? If not, what size memory word *do* you store that
21-bit value in? So you see that 32 bits is *effectively* the minimum
that gets used to store a single UTF-32 codepoint value.

The limit of 0..10FFFF₁₆ is to protect the broken UTF-16 encoding. If
you must deal with a variable-length encoding, at least UTF-8 can be
extended to encompass all the world's character sets.

> Also note that UTF-32 is still "variable length" in some senses,
> in that it can still have combining characters that you need to
> interpret.

Yes, and aren't they a bugger! But unavoidable, since rooted in real
languages. The only saving grace is that many applications don't have
to worry about them.

> You have addressed none of the points I made in my previous when
> I said that one can do Unicode 3 correctly and simply in UTF-16.

You're quite right, and I never had an issue with that. I *do* have an
issue with the way the standard shuts out a large number of characters
in order to protect a broken encoding. Unicode 3 doesn't go far enough
not because it panders to the majority but because it shuts out the
minority who need those characters.

> If you'd actually work through some typical cases of string use

I have. Optimising for typical string use shouldn't come at the expense
of representational completeness. How would you feel if some European
body decided that the letters J and Y were to be dropped from English
because they didn't fit in some six-bit encoding? (I once worked on a
CDC computer which used six-bit characters - no lower case!)

--
Clifford Heath

Curt Sampson

Aug 5, 2002, 8:53:39 PM

On Tue, 6 Aug 2002, Clifford Heath wrote:

> I see you have built a new 21-bit computer which is going to conquer
> the world, have you? If not, what size memory word *do* you store that
> 21-bit value in?

If I want to store 21-bit values, I store them in a 32-bit word, wasting
11 bits. That does not change the value to be 32 bits in range.

But as far as Unicode goes, I store that in 16-bit words, for reasons I
have gone into in detail before.

> > Also note that UTF-32 is still "variable length" in some senses,
> > in that it can still have combining characters that you need to
> > interpret.
>
> Yes, and aren't they a bugger! But unavoidable, since rooted in real
> languages. The only saving grace is that many applications don't have
> to worry about them.

Many applications need not worry about surrogate pairs, either.
Also, surrogate pairs are much, much easier to deal with than
combining characters. So removing surrogate pairs does little to
solve this problem.
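To illustrate why surrogates are mechanically simple to handle: each half of a pair falls in its own fixed range, so a code value can be classified without looking at its neighbours. The sketch below assumes well-formed UTF-16 input; the helper names are hypothetical.

```ruby
# High (leading) surrogates occupy D800..DBFF, low (trailing) DC00..DFFF.
def high_surrogate?(unit)
  unit.between?(0xD800, 0xDBFF)
end

def low_surrogate?(unit)
  unit.between?(0xDC00, 0xDFFF)
end

# Count characters in an array of 16-bit code values by skipping the
# second half of every pair (assumes well-formed input).
def utf16_char_count(units)
  units.count { |u| !low_surrogate?(u) }
end

units = "a\u{1D11E}b".encode("UTF-16BE").unpack("n*")
p units.map { |u| format("%04X", u) }
p utf16_char_count(units)
```

No comparable range test identifies where a combining sequence ends; that requires consulting character properties.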

> I *do* have an
> issue with the way the standard shuts out a large number of characters
> in order to protect a broken encoding.

It doesn't shut them out in the slightest; surrogate pairs work
just fine. Please re-read my previous posts.

> Unicode 3 doesn't go far enough
> not because it panders to the majority but because it shuts out the
> minority who need those characters.

Hello! It does not shut them out! What part of "surrogate pairs
work fine without special processing" don't you understand? Why
don't you point out how they break, if you think that they do,
rather than making unsubstantiated claims?

Also, I would like to know just who is using these characters, and
for what.

In Japan, I have *never* met anyone outside of someone testing Unicode
or Unicode-based applications that has ever used a surrogate character.
Aside from a few kanji specialists, Japanese people would be hard
pressed even to identify the readings and meanings of more than a
handful of these surrogate characters if you showed them the whole lot.
The Japanese have demonstrated that they have no pressing need for these
characters in their computer character sets by sticking for years to
standards (EUC-JP, Shift-JIS, ISO-2022-JP) that do not contain any of
these characters. (Unicode handles the characters from all of these
character sets without using surrogate pairs.)

> I have. Optimising for typical string use shouldn't come at the expense
> of representational completeness.

Of course not. But UTF-16 is representationally complete. There is
no Unicode character you can represent in UTF-32 that you cannot
represent in UTF-16.
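The arithmetic behind that claim can be sketched in a few lines of Ruby. `to_surrogates` and `from_surrogates` are hypothetical helper names; the constants are the standard surrogate range boundaries.

```ruby
# Split a supplementary code point (U+10000..U+10FFFF) into a
# high/low surrogate pair.
def to_surrogates(cp)
  unless cp.between?(0x10000, 0x10FFFF)
    raise ArgumentError, "not a supplementary code point"
  end
  v = cp - 0x10000                         # 20 bits remain
  [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
end

# Recombine a surrogate pair into the original code point.
def from_surrogates(high, low)
  0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
end

high, low = to_surrogates(0x10FFFF)        # the top of the Unicode range
puts format("%04X %04X", high, low)
puts format("%X", from_surrogates(high, low))
puts 0x10FFFF.bit_length                   # bits needed for any code point
```

Two 10-bit halves cover exactly the 16 supplementary planes, which is why the 0..10FFFF range and the UTF-16 range coincide.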

> How would you feel if some European
> body decided that the letters J and Y were to be dropped from English
> because they didn't fit in some six-bit encoding? (I once worked on a
> CDC computer which used six-bit characters - no lower case!)

NOTHING IS BEING DROPPED! ALL CHARACTERS ARE REPRESENTABLE IN UTF-16! HELLO!

The proper parallel here is, "how would you feel if some European
body decided that the thorn character was, only under certain
circumstances, going to be marginally harder to deal with."

BTW, when was the last time you used thorn? That's the equivalent of the
kanji that are in the surrogate range.

Clifford Heath

Aug 5, 2002, 9:47:59 PM

Curt Sampson wrote:
> NOTHING IS BEING DROPPED! ALL CHARACTERS ARE REPRESENTABLE IN UTF-16! HELLO!

Hey, no need to shout. Why is there a Mojikyo standard if no-one's being
shut out? I can only assume that they want to represent characters that
are excluded from Unicode but could have been easily included via UCS-4
and UTF-8. Am I wrong, and are the Mojikyo people just pig-headed?

> Aside from a few kanji specialists,...

Those few also need to use computers - linguistic study is a valid use for
computers - and there's no need to adopt a standard that forces them to
use an encoding other than that used by the rest of the world.

--
Clifford Heath

Curt Sampson

Aug 5, 2002, 10:19:31 PM

On Tue, 6 Aug 2002, Clifford Heath wrote:

> Curt Sampson wrote:
> > NOTHING IS BEING DROPPED! ALL CHARACTERS ARE REPRESENTABLE IN UTF-16! HELLO!
>
> Hey, no need to shout. Why is there a Mojikyo standard if no-one's being
> shut out?

Because certain very specialized applications, mainly to do with
academic research into Kanji, need to represent more kanji than Unicode
cares to represent.

If you need it, you need it. If you don't, you don't want to get
near it because it's extremely unwieldy and expensive to use. Also,
note that it's only useful for kanji, not for other writing systems.

> I can only assume that they want to represent characters that
> are excluded from Unicode but could have been easily included via UCS-4
> and UTF-8.

One doesn't "include" things in Unicode via UCS-4 and UTF-8.
Those are merely encoding schemes for Unicode; different ways of
representing it. All Unicode characters can be represented in UTF-8,
UTF-16 and UTF-32.

You can encode other character sets in these encoding as well, but then
(tautology here) you're not encoding Unicode. From the spec, section 3.8:

The definition of UTF-8 in Amendment 2 to ISO/IEC 10646 also
allows for the use of five- and six-byte sequences to encode
characters that are outside the range of the Unicode character
set; those five- and six-byte sequences are illegal for the
use of UTF-8 as a transformation of Unicode characters.
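As an illustration, later Ruby versions (1.9 and up, which postdate this thread) follow the same rule: their UTF-8 validity check accepts the four-byte form of U+10FFFF and rejects a five-byte sequence. The helper name below is hypothetical.

```ruby
# Hypothetical helper: tag raw bytes as UTF-8 and ask Ruby to validate them.
def valid_utf8?(byte_values)
  byte_values.pack("C*").force_encoding("UTF-8").valid_encoding?
end

top_of_unicode = [0xF4, 0x8F, 0xBF, 0xBF]        # U+10FFFF, four bytes
five_byte      = [0xF8, 0x88, 0x80, 0x80, 0x80]  # ISO 10646-style form of 0x200000

puts valid_utf8?(top_of_unicode)
puts valid_utf8?(five_byte)
```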

> > Aside from a few kanji specialists,...
>
> Those few also need to use computers - linguistic study is a valid use for
> computers - and there's no need to adopt a standard that forces them to
> use an encoding other than that used by the rest of the world.

Yes there is. They have completely different requirements that are
basically impossible to meet with something also intended for general
purpose use by the rest of the world. (One of these is the ability
to change the standard very quickly.) And they're not interested
in supporting many of the things required by other languages
(right-to-left text, combining characters, etc.)

Would you make the entire world stop using regular filesystems and
start using Lotus Notes just because a few people need to have
full-text search and additional indexed fields in some applications?

Anyway, it appears to me that you do not have a very good understanding
of Unicode, so I am going to drop this argument until you go read
the specification carefully and can point out which parts of it
you want to argue with. Nothing I've been saying here is my own
original idea; it's all in the Unicode specification itself.

Curt Sampson

Aug 11, 2002, 11:39:02 PM

On Sat, 10 Aug 2002, MikkelFJ wrote:

> It is far less common to index strings by number of characters. But when you
> do, UCS-4 is better. A typical application is text formatting where you
> want the nearest linebreak to the given width of say 80 characters. UCS-2
> is probably not good enough because combining (or surrogates or whatever)
> will only take up one display unit. But then you probably need to take
> proportional spacing into account anyway.

Well, in a lot of cases it's no big deal, because you just want to
limit the length of a string. For example, I may want to truncate
a display field to twenty characters, so it doesn't overflow. With
UTF-16, I can safely just truncate. If I break a surrogate, no
problem; it doesn't display. If I break a combining character, it's
a bit more of a problem (because only part of it displays), but
nothing most people can't live with.

This is one of the big advantages of UTF-16 over UTF-8; you can do
simple operations the simple way and still produce valid UTF-16 output.
(There's no explicit rule, as far as I know at least, that states that
UTF-8 parsers *must* ignore broken characters, as there is with UTF-16.)
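A sketch of what truncating by code values does to a surrogate pair, assuming Ruby 1.9 or later. Note that Ruby's own `valid_encoding?` check is stricter than the display behaviour described above and reports the split string as invalid, even though nothing is lost and the halves rejoin cleanly.

```ruby
text = "ab\u{1D11E}".encode("UTF-16BE")   # 4 code values: a, b, high, low
cut  = text.byteslice(0, 6)               # first 3 code values; pair is split
rest = text.byteslice(6, text.bytesize - 6)

puts cut.valid_encoding?                  # how Ruby judges the lone high surrogate
puts((cut + rest) == text)                # the two pieces rejoin into the original
puts text[0, 2].encode("UTF-8")           # slicing by characters never splits a pair
```

`String#[]` with a start and length counts characters rather than code values, so it is the safer choice when a hard limit on 16-bit units is not actually required.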

Curt Sampson

Aug 11, 2002, 11:44:59 PM

On Sat, 10 Aug 2002, Bret Jolly wrote:

> Unicode is no longer something that can be squeezed into two
> bytes, even for practical purposes. There are over 40 000 CJK
> characters outside the "BMP", that require surrogates in UTF-16.

I agree. What we may not agree on is that surrogates work very well
in UTF-16, and require minimal or no processing in many cases.

> For example, the scandalous situation where many Chinese and
> Japanese cannot write their names in unicode will have to be fixed
> eventually...

How many people is this, really? I note that Japanese people have been
putting up with this "scandalous" situation for years now, and will
continue to do so for a long time, as Shift_JIS and EUC-JP encodings of
JIS X 0208 and JIS X 0212 are showing no signs of declining in use, and
they are both fully present in the Unicode BMP.

> But UTF-16 was a mistake from the beginning. It is no longer fixed-
> width, and it is sure to grow much less fixed-width in practice....

UTF-32 is not fixed width, either, and never has been. Nothing can
be fixed width in Unicode due to combining characters.

The only "extra" problem that UTF-16 presents over UTF-32 is dealing
with surrogates, and that is a very easy problem to deal with.

> Yet it is just long enough to introduce an
> endianness nightmare. The UTF-16 folks try to fix this with a kluge,
> the byte-order mark, but the kluge is an abomination. It is non-local,
> and hence screws string processing. It breaks unix's critical shebang
> hack.

Actually, that would not be hard to fix. It's pretty trivial to
modify the kernel to deal with that.

> ...and maybe unicode will never be right for
> everybody, so I think Ruby should support other character sets as well,
> including some which are not compatible with unicode.

I certainly agree with that!

Bret Jolly

Aug 12, 2002, 3:43:52 PM

Curt Sampson <c...@cynic.net> wrote in message news:<Pine.NEB.4.44.02081...@angelic.cynic.net>...

> This is one of the big advantages of UTF-16 over UTF-8; you can do
> simple operations the simple way and still produce valid UTF-16 output.
> (There's no explicit rule, as far as I know at least, that states that
> UTF-8 parsers *must* ignore broken characters, as there is with UTF-16.)

UTF-8 parsers must ignore "broken" characters because, as I pointed
out in a previous message, "broken" characters are never valid UTF-8,
due to the UTF-8 design. The standard now only allows parsing of
valid characters (the loopholes that existed in unicode version 3.0
were eliminated by updates in versions 3.1 and 3.2). The unicode
standard expressly forbids the interpretation of illegal UTF-8
sequences.

There are also advantages to a fixed-width encoding, such as the
recently introduced UTF-32, which can often outweigh the endianness
issues. (Encodings which are not byte-grained, such as UTF-32 and
UTF-16, need two variants, big-endian and little-endian.)
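The byte-order issue is easy to see in a short Ruby sketch (Ruby 1.9 or later assumed). `utf16_byte_order` is a hypothetical helper that inspects a leading byte-order mark.

```ruby
ch = "\u65E5"                        # U+65E5

p ch.encode("UTF-16BE").bytes        # high byte first
p ch.encode("UTF-16LE").bytes        # low byte first
p ch.encode("UTF-8").bytes           # one byte order only

# Hypothetical helper: guess the byte order from a leading BOM (U+FEFF).
def utf16_byte_order(bytes)
  case bytes.first(2)
  when [0xFE, 0xFF] then :big_endian
  when [0xFF, 0xFE] then :little_endian
  else :unknown
  end
end

p utf16_byte_order([0xFF, 0xFE, 0xE5, 0x65])
```

UTF-8 needs no such marker because its unit is a single byte.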

UTF-16 was not thought through very well. It is an encoding
following the mental line of least resistance -- encode the
character points by their numbers. There was no reason the encoding
should have included 0-bytes, thus sabotaging byte-grained string
processing by C programs. And of course it was thought that all
characters of interest to any significant community could fit in
the two-byte "Basic Multilingual Plane". This is not attainable
even with current unicode unless you consider Chinese, Japanese,
mathematicians, and musicians to be insignificant communities.
Also, important further expansion outside the BMP is inevitable.

But UTF-16 in both big-endian and little-endian variants is sure to
be one of those technical blunders which far outlives its
excusability, due to inertia and corporate politics, so Ruby should
probably provide direct support. Failing that, Ruby could provide
indirect support via invisible translation to some other unicode
encoding.

Some people use UTF-16 as a disk storage format and expand to
UTF-32 in memory. This allows one to directly access characters
by index for unicode strings in memory, while avoiding crass
inefficiency in disk usage. But for general multilingual processing,
UTF-8 seems more efficient and handier as a disk storage format.
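A rough Ruby sketch of the store-compact, expand-in-memory approach, assuming Ruby 1.9 or later. `load_code_points` is a hypothetical helper, and the "stored" string stands in for bytes read from a file.

```ruby
# Hypothetical helper: expand stored UTF-16BE bytes into an array of
# 32-bit code points, giving constant-time access by index.
def load_code_points(utf16be_bytes)
  utf16be_bytes.dup.force_encoding("UTF-16BE")
               .encode("UTF-32BE")
               .unpack("N*")
end

stored = "a\u{1D11E}\u65E5".encode("UTF-16BE")   # what would sit on disk
cps    = load_code_points(stored)

puts stored.bytesize                 # bytes on disk
puts cps.length * 4                  # bytes once expanded
puts format("U+%04X", cps[1])        # direct access to the second code point
```

Indexing here is by code point; combining sequences still span several array entries.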

The unicode consortium has recently promulgated *yet another*
encoding form, CESU-8, intended only for internal use by programs,
and not for data transfer between applications. CESU-8 is byte-
grained and similar to UTF-8, but CESU-8 has been designed so CESU-8
text will have the same binary collation as equivalent UTF-16 text.
I don't know if there is a reason for Ruby to support this.
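The collation difference can be demonstrated with a hand-rolled encoder. `cesu8_bytes` below is a hypothetical helper written for illustration: it encodes each surrogate of a supplementary character as its own three-byte sequence, which is the CESU-8 scheme as described above.

```ruby
# Hypothetical helper: CESU-8 bytes for one code point.
def cesu8_bytes(cp)
  return [cp].pack("U").bytes if cp < 0x10000   # BMP: same as UTF-8
  v = cp - 0x10000
  [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)].flat_map do |s|
    [0xE0 | (s >> 12), 0x80 | ((s >> 6) & 0x3F), 0x80 | (s & 0x3F)]
  end
end

bmp  = 0xFF21     # a BMP character above the surrogate range
supp = 0x1D11E    # a supplementary character

utf8  = ->(cp) { [cp].pack("U").bytes }
utf16 = ->(cp) { [cp].pack("U").encode("UTF-16BE").bytes }

p utf16.(supp) <=> utf16.(bmp)            # binary order in UTF-16
p cesu8_bytes(supp) <=> cesu8_bytes(bmp)  # same order in CESU-8
p utf8.(supp) <=> utf8.(bmp)              # opposite order in UTF-8
```

For this pair, UTF-16 and CESU-8 both sort the supplementary character first, while UTF-8 sorts it last (code point order).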

Though notoriously unwise myself, I'd like to make a plea for
some wisdom. Many people here have a great deal of experience with
internationalization, and rightly consider themselves experts. But
expertise comes in many flavors, and one should think twice before
making assertions about what *other* people need. The need for
internationalization, M17n, and so forth by a maker of corporate web
sites is different from the need of a mathematician, musician, or
someone trying to computerize Akkadian tablets. We should avoid the
parochial thought that our interests are the only important or
"practical" ones.

Regards, Bret

http://www.rexx.com/~oinkoink/
oinkoink
at
rexx
dot
com

Bret Jolly

Aug 12, 2002, 4:11:26 PM

Curt Sampson <c...@cynic.net> wrote in message news:<Pine.NEB.4.44.02081...@angelic.cynic.net>...

> On Sat, 10 Aug 2002, Bret Jolly wrote:
>
>
> > For example, the scandalous situation where many Chinese and
> > Japanese cannot write their names in unicode will have to be fixed
> > eventually...

> How many people is this, really? I note that Japanese people have been
> putting up with this "scandalous" situation for years now, and will
> continue to do so for a long time, as Shift_JIS and EUC-JP encodings of
> JIS X 0208 and JIS X 0212 are showing no signs of declining in use, and
> they are both fully present in the Unicode BMP.

What is in use is determined not only by what character sets and
encodings are available, but also by how much software is available
to use them, and how aware the users are of the availability of this
software. The software and the awareness are continually expanding.
If people know they *can* write their names correctly, they will.

This is not just an issue of vanity. (Nor is it personal: my Chinese
name *can* be written in the BMP. :-)) It is not just the person whose
name cannot be written who is affected, but anyone who wants to talk
about him. For example, I understand that one of the top 5 politicians
in China has a name which does not appear in unicode. This makes life
hard for journalists and political scholars writing in Chinese.

Regards, Bret
http://www.rexx.com/~oinkoink

Dave Thomas

Aug 12, 2002, 4:42:39 PM

oinkoi...@rexx.com (Bret Jolly) writes:

> For example, I understand that one of the top 5 politicians in China
> has a name which does not appear in unicode. This makes life hard
> for journalists and political scholars writing in Chinese.

Which leads to some interesting Orwellian possibilities: not only
could people be removed from history, but we could arrange the
character set so that their names cannot even be expressed.

D*ve

Curt Sampson

Aug 15, 2002, 12:55:56 AM

On Tue, 13 Aug 2002, Bret Jolly wrote:

> UTF-8 parsers must ignore "broken" characters because, as I pointed
> out in a previous message, "broken" characters are never valid UTF-8,
> due to the UTF-8 design. The standard now only allows parsing of
> valid characters (the loopholes that existed in unicode version 3.0
> were eliminated by updates in versions 3.1 and 3.2). The unicode
> standard expressly forbids the interpretation of illegal UTF-8
> sequences.

Ah. So does this mean that if I break a String into two in the
middle of a UTF-8 sequence both broken sequence parts will be
preserved, so that the character reappears if I put the two strings
back together again? This, to my mind, is one of the big advantages of
the UTF-16 surrogate character specification.
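For what it is worth, this is how later Ruby versions (1.9 and up) behave when a UTF-8 string is split mid-sequence; a minimal sketch, not a statement about what the Unicode standard requires of parsers:

```ruby
s    = "\u65E5\u672C"                       # two kanji, three bytes each
head = s.byteslice(0, 2)                    # cut inside the first character
tail = s.byteslice(2, s.bytesize - 2)

puts head.valid_encoding?                   # each fragment on its own
puts tail.valid_encoding?
puts((head + tail) == s)                    # bytes are preserved across the split
puts((head + tail).valid_encoding?)
```

Both fragments are flagged as invalid on their own, but no bytes are discarded, so concatenating them restores the original, valid string.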

> There are also advantages to a fixed-width encoding, such as the
> recently introduced UTF-32....

I think I've already said this about eight million times, but:

UTF-32 is not fixed width, due to combining characters.

> But UTF-16 in both big-endian and little-endian variants is sure to
> be one of those technical blunders which far outlives its
> excusability....

Well, we'll just have to agree to differ on this. I deal with a lot of
Japanese text in my various programs, and at the lowest level (String
objects and suchlike) I find UTF-16 to be by far the most convenient
way of dealing with it. It's small, efficient, lets me do basic
handling of stuff with ease, and lets me push up some of the harder
issues into just the classes that really need it, rather than having to
deal with them everywhere.

> Though notoriously unwise myself, I'd like to make a plea for
> some wisdom. Many people here have a great deal of experience with
> internationalization, and rightly consider themselves experts. But
> expertise comes in many flavors, and one should think twice before
> making assertions about what *other* people need. The need for
> internationalization, M17n, and so forth by a maker of corporate web
> sites is different from the need of a mathematician, musician, or
> someone trying to computerize Akkadian tablets. We should avoid the
> parochial thought that our interests are the only important or
> "practical" ones.

Well, I've said all along that Unicode just is not suitable for a
lot of very technical purposes. My argument is that it's *impossible*
for a single character set to deal with everything, and even dealing
with most of it is completely impractical. Thus, use a simple
character set like Unicode and its relatively simple accompanying
algorithms for day to day work, and do something custom when you
have requirements beyond that.

chen_le...@yahoo.com

Dec 23, 2004, 8:47:31 AM

Take a look at http://www.muftah-alhuruf.com, where you can find an
Arabic virtual keyboard.