I also invite others to reply.
Doug Gwyn writes:
> If I may add a general observation about code set issues, particularly
> multibyte encodings: It seems to me that the people designing software
> facilities, hardware, and standards concerning the issues generally fail
> to appreciate a crucial design point: The sooner you can map everything
> into a uniform format with simple, clean properties, the better off you
> are.
I assume that you are referring to a character encoding where all
characters occupy the same number of bits.
> Instead, we keep seeing designs that require the users of the
> services to face algorithmic complexity, because the data being operated
> upon has been left in a complex encoded form instead of being turned into
> the previously mentioned uniform format with nice properties. Algorithms
> naturally reflect the underlying structure of the data. If you'd like to
> be able to code programs that deal with text in a simple manner, as seen
> in early UNIX utilities such as "wc", you need to keep the form in which
> text is seen by program code as simple as possible; for example, all text
> characters must be handled as one "character" type, a complete unit of
> which would be returned per call to getchar(), obviating the need for
> wchar_t and the (rapidly growing) library of functions for helping
> applications deal with nonunitized, fragmented, and stateful characters.
I assume that you are referring to the wide character (wchar_t)
routines being proposed by a Japanese working group as an addendum to
ANSI/ISO C, and those defined by X/Open (which overlap with the
Japanese proposal to some extent).
According to the C standard, getchar() returns an int. As far as I can
tell, an int must be at least 16 bits. So it can be argued that
getchar's return value is large enough to accommodate "most"
characters, including the numerous Japanese Kanji.
There are many routines that take character strings as arguments, e.g.
fopen(), strcpy(), etc. How do we convert from a string of getchar'ed
ints to a string of chars?
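For concreteness, here is a minimal sketch of the conversion being asked
about, assuming ordinary 8-bit chars; the narrowing in the loop body is
all there is to it (the buffer name and size are made up for illustration):

    #include <stdio.h>

    int main(void)
    {
        char name[256];
        size_t i = 0;
        int c;
        FILE *fp;

        while ((c = getchar()) != EOF && c != '\n' && i < sizeof name - 1)
            name[i++] = (char)c;    /* the int-to-char narrowing step */
        name[i] = '\0';

        fp = fopen(name, "r");      /* char-string interface */
        if (fp == NULL)
            perror(name);
        else
            fclose(fp);
        return 0;
    }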
If chars and ints are the same size, it is easy to convert between the
two. This might be done by setting CHAR_BIT to 16 (if ints are 16
bits). However, I suspect that many existing programs will fail to
compile or fail to run properly if CHAR_BIT is increased in this way.
(If I suspect wrongly, please correct me.)
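As one illustration of the breakage, here is a sketch (not from the
original posts) of the kind of code that quietly assumes CHAR_BIT == 8;
with 16-bit chars, half of buf would go unused and any file or protocol
built on this layout would change silently:

    /* Packs a 32-bit value into four "bytes"; the on-disk/on-wire
     * format this produces is correct only if CHAR_BIT == 8. */
    void pack32(unsigned long v, unsigned char buf[4])
    {
        buf[0] = (unsigned char)(v >> 24);
        buf[1] = (unsigned char)(v >> 16);
        buf[2] = (unsigned char)(v >>  8);
        buf[3] = (unsigned char)(v);
    }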
On the other hand, it can be argued that the wchar_t approach is one
where CHAR_BIT is not changed from its common value (8). One of the
disadvantages of the wchar_t approach is that many new routines have
to be defined to mirror the existing char-based routines (fopen,
strcpy, etc.). Also, programs need to be modified extensively to take
advantage of these routines.
From a migration viewpoint, it might be argued that the wchar_t
approach is more practical, since changing CHAR_BIT will break many
programs instantly, while providing wchar_t facilities would allow
application developers to upgrade their software at their leisure.
On the other hand, doing everything with chars would keep the
library's specs relatively small and simple.
(A third approach would be to provide routines that convert between
getchar'ed ints and strings of 8-bit chars, but then one would
probably also want versions of fopen, strcpy, etc. that deal with
strings of ints, so this is quite close to the wchar_t approach.)
--
EvdP
>From a migration viewpoint, it might be argued that the wchar_t
>approach is more practical, since changing CHAR_BIT will break many
>programs instantly, while providing wchar_t facilities would allow
>application developers to upgrade their software at their leisure.
The problem with changing CHAR_BIT is that it is not compatible with the
existing *DATA*. Conversion is not simple if the data is not a pure text
file. For example, a file may consist of fixed-size 80-byte records whose
14th through 72nd bytes are character data and whose other bytes are pure
binary.
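A sketch of the kind of record Ohta describes (the field positions follow
his example; the struct itself is illustrative):

    /* An 80-byte record whose layout is defined in terms of 8-bit
     * bytes. If char grew to 16 bits, sizeof(struct record) and every
     * offset within it would change, and existing files in this format
     * could no longer be read back with fread(). */
    struct record {
        unsigned char key[13];    /* bytes  0-12: pure binary     */
        char          text[59];   /* bytes 13-71: character data  */
        unsigned char tail[8];    /* bytes 72-79: pure binary     */
    };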
>On the other hand, doing everything with chars would keep the
>library's specs relatively small and simple.
Introducing wchar_t will double the number of character-related functions
in the library. That's all. It won't make the library spec large or complex.
The ideal solution seems to be to use the (now disapproved) DIS 10646
one-octet compaction externally and the four-byte form internally as
wchar_t.
Masataka Ohta
If the implementation so chooses, it could certainly allot 16 bits for
the representation of a "char" datum, which is big enough to represent
all character sets that I know of.
>There are many routines that take character strings as arguments, e.g.
>fopen(), strcpy(), etc. How do we convert from a string of getchar'ed
>ints to a string of chars?
Ah, now that may be a misunderstanding. While it is true that the
return type of getchar() is int, whenever the returned value is other
than EOF it is guaranteed to be assignable to an unsigned char datum
without loss of information. In other words, getchar() doesn't return
wchar_t encodings. No conversion from getchar() result to char is
necessary; just stash the value received.
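A minimal sketch of the idiom Gwyn describes: fgetc() and friends are
specified to return each byte as an unsigned char converted to an int,
so the result is non-negative whenever it isn't EOF and can be stashed
without loss:

    #include <stdio.h>

    void copy_stream(FILE *in, FILE *out)
    {
        int c;                 /* int, so EOF stays distinguishable */

        while ((c = getc(in)) != EOF)
            putc(c, out);      /* value already fits in an unsigned char */
    }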
If you (Doug, or anyone else) were responsible for overall decisions
about the C compiler within a company such as Sun, would you be for or
against changing the size of a char to 16 bits? Please give reasons.
--
EvdP
>whenever the returned value is other than EOF it is guaranteed to be
>assignable to an unsigned char datum
Yes, this follows from several of the standard's requirements.
>without loss of information.
I cannot find this requirement in the standard.
If the file is binary, then I think this is "nearly" true(*), but if the
file is text, then it need not be true. If a char is an 8-bit byte, and
a program writes bytes into a text file which are halves of printable wide
characters but not printable by themselves, then 4.9.2 deliberately avoids
defining whether they can be read back correctly.
(*) On a one's-complement machine, a character might have a value which
is encoded as -0. The character cannot be one of the required C locale
characters, I think. However, when it is assigned to an unsigned char,
its value would be converted to +0.
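A sketch of the caveat, with a made-up byte value for illustration.
4.9.2 guarantees a faithful round trip through a text stream only for
complete lines of printing characters (plus horizontal tab and newline);
an isolated half of a multibyte character falls outside that guarantee:

    #include <stdio.h>

    void risky_write(const char *path)
    {
        FILE *fp = fopen(path, "w");   /* text stream, not "wb" */

        if (fp == NULL)
            return;
        putc(0x93, fp);   /* half of a multibyte character on its own; */
        fclose(fp);       /* not guaranteed to read back unchanged     */
    }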
>In other words, getchar() doesn't return wchar_t encodings.
If a char is an 8-bit byte, then this is usually true but not necessarily
helpful. If a char is as wide as wchar_t, then getchar() can return
wchar_t encodings, though it doesn't have to and this still doesn't help.
>No conversion from getchar() result to char is
>necessary; just stash the value received.
Certainly the value can be stashed and then used, but whether the value
will be "correct" is problematic.
--
Norman Diamond dia...@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.
Permission is granted to feel this signature, but not to look at it.
As "anyone else": as an implementor, I would keep C chars byte-sized,
because lots of system software has to work with bytes, and it is nice if
that software can be written in C.
In application-oriented languages, I would make char be the size of a
printable character (probably a static size equal to the largest length
of all printable characters) -- though you didn't really ask this one.
If I were in such a position, I'd have to weigh various factors, and it
is impossible for me to predict what my recommendation would actually be.
If X3J11 had adopted either the "long char" or "short char" proposals,
then the decision would be easy (although different in the two cases).
As it is, though, byte==character and thus the concept is overloaded;
there IS no clearly "correct" solution when the overload is mandated.
The whole point of my question was to get people to weigh the various
factors here and now. If you are too busy to take part, I can
understand.
> If X3J11 had adopted either the "long char" or "short char" proposals,
> then the decision would be easy (although different in the two cases).
I don't actually know the "long char" and "short char" proposals, but
I assume that a long char is at least as big as a char, while a short
char is at most as big as a char.
So are you saying that if the "long char" proposal had been adopted,
you would not change the size of a char? In this case, what's the big
difference between long char and wchar_t? Both would require
conversions to use the char-based interfaces at the library and system
call level.
Also, are you saying that if the "short char" proposal had been
adopted, you would change the size of a char to 16 bits? If you did
that, a huge amount of software would have to be updated just to get
it to work again, let alone internationalize it.
> As it is, though, byte==character and thus the concept is overloaded;
> there IS no clearly "correct" solution when the overload is mandated.
Are you saying that the concept of "character" is overloaded? The C
standard clearly distinguishes between "character" and "multibyte
character".
--
EvdP
It's not a matter of being busy, it's the fact that I'm not in a position
to judge for Sun or any other vendor what relative weights to assign the
various factors that must enter into such a decision.
>> If X3J11 had adopted either the "long char" or "short char" proposals,
>> then the decision would be easy (although different in the two cases).
>I don't actually know the "long char" and "short char" proposals, but
>I assume that a long char is at least as big as a char, while a short
>char is at most as big as a char.
Ok, I assumed you had followed the earlier proposals. Here's a summary:
"long char": C would support both "char" and "long char" data.
"char" would basically mean a machine byte while "long char"
would mean a datum capable of completely representing a basic
unit of text (a "character"). getchar() would return a "long
char" representation (within its usual int return type); string
literals would be arrays of long char. Most existing
implementations would make "char" and "long char" synonymous,
while forward-looking implementations would make sure that
"long char" was represented with at least 16 bits. (The actual
Japanese proposal was a cross between "long char" as I describe
it and the explicit multibyte method that was actually adopted.)
"short char": C would support both "char" and "short char" data.
"short char" would basically mean a machine byte while "char"
would mean a datum capable of completely representing a basic
unit of text (a "character"). getchar() would return a "char"
representation (within its usual int return type); string
literals would be arrays of char. Few implementations would
much care whether or not "char" and "short char" were the same
size, and forward-looking implementations would make sure that
"char" was represented with at least 16 bits.
X3.159-1989: C must support "char" data, and the implementation
must select one of the existing integral types as "wchar_t".
"char" means both a machine byte and a datum capable of
completely representing a basic unit of text (a "character") --
but only if the character is taken from the portable C character
set; other characters may require a "multibyte" representation.
getchar() returns a "char" representation (within its usual int
return type); string literals are arrays of char. Because that
is insufficient for real character sets, a second kind of string
literal is added to the language, and the compiler must
translate it to array of wchar_t. Because real characters don't
fit within a char, but the standard I/O functions transfer one
char at a time, additional library functions are required to
convert between char streams and character representations (as
wchar_t). Because this makes the byte encoding of character
sets visible to the program, issues of "shift state" and other
encoding artifacts must be explicitly dealt with. Also, as was
predicted, the Japanese are naturally desirous of a standard
"separate but equal" set of standard library functions for
dealing with the wchar_t form of characters. (They also wanted
this for the "long char" alternative, but in that case at least
logically the existing functions could have been respecified to
always deal with "long char". As part of my "short char"
proposal, I identified all the places in the draft standard that
would need editing to fully enforce a distinction between byte
and character; for example, the mem*() functions would deal with
bytes while the str*() functions would deal with characters.
The distinctions could and should be applied to a "long char"
method as well.)
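For readers who haven't seen the adopted machinery, here is a minimal
sketch of what X3.159-1989 actually provides: the second kind of string
literal (L"..."), plus library conversion between the multibyte (char)
form and the wchar_t form, with shift state handled explicitly:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        wchar_t wide[] = L"text";        /* wide string literal       */
        wchar_t buf[64];
        size_t n;

        mbtowc(NULL, NULL, 0);           /* reset any shift state     */
        n = mbstowcs(buf, "text", 64);   /* multibyte form -> wide    */
        if (n != (size_t)-1)
            printf("%lu wide characters\n", (unsigned long)n);
        (void)wide;                      /* quiet "unused" warnings   */
        return 0;
    }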
>So are you saying that if the "long char" proposal had been adopted,
>you would not change the size of a char?
No. If the "long char" proposal OR the "short char" proposal had been
adopted, it would be obvious which type to use for character and which
for basic machine unit ("byte", which could actually be implemented as
"bit" if the system had good support for bit operations). Since we're
taking for granted the need to have more than 8 bits to represent a
character, the type corresponding to character would be assigned 16 bits
in the implementation, and the type corresponding to "byte" would be
left as the smallest addressable machine unit (typically 8 or 9 bits).
>In this case, what's the big difference between long char and wchar_t?
>Both would require conversions to use the char-based interfaces at the
>library and system call level.
No. As I use the notion, "long char" would BE the type supported by
the traditional I/O library functions.
>Also, are you saying that if the "short char" proposal had been
>adopted, you would change the size of a char to 16 bits? If you did
>that, a huge amount of software would have to be updated just to get
>it to work again, let alone internationalize it.
"Internationalization" of code that assumes 8 bits for a char already
requires a tremendous amount of work. The "short char" proposal would
have affected SOME existing code in only two ways:
1) sizeof would return byte sizes, so char arrays would have
size different from number of elements. Careful programmers
already had avoided making the assumption that sizeof(char)==1,
so their code would have been unaffected. Now that the C
standard guarantees sizeof(char)==1, we've pretty much lost the
opportunity to exploit bit-addressable architectures, and coders
are undoubtedly no longer being so careful to distinguish size
units from char units.
2) network code that deals in "octets" using nonportable
assumptions about structure packing, etc. would be even more
widely broken than it already is when ported to an
implementation that chose to make "char" 16 bits. Note that
this situation is still possible under X3.159-1989, although
less likely since everybody is jumping on the wchar_t bandwagon,
reducing the motivation for making "char" bigger than the
minimum supported machine size (typically 8 or 9 bits).
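A sketch of the coding habit point 1) refers to; under the "short char"
proposal only the second form would have yielded an element count:

    #include <stddef.h>

    char line[128];

    size_t n_fragile = sizeof line;                    /* byte count       */
    size_t n_robust  = sizeof line / sizeof line[0];   /* element count,   */
                                                       /* under either rule */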
>> As it is, though, byte==character and thus the concept is overloaded;
>> there IS no clearly "correct" solution when the overload is mandated.
>Are you saying that the concept of "character" is overloaded? The C
>standard clearly distinguishes between "character" and "multibyte
>character".
"Multibyte character" was an invention of the committee necessitated by
their decision to treat program I/O units, even for text streams, as
distinct from character representations. No such concept would be
required were "character" to be the direct object of a text-stream I/O
request.
In fact there has been considerable confusion, even among the
"internationalization" community, about the distinction between these
two kinds of "character". Dave Prosser did a pretty good job of editing
the standard to enforce the distinction throughout, but it is still
difficult to untangle the usages in some contexts.
The principle that should have been followed would have been to have
each character handled as a single item throughout any text-handling
program. There is no advantage to making programs deal with explicit
character set encoding concerns, and considerable disadvantage in so
doing. However, that's the path that "international" vendors were
already following, so that momentum carried over into the C standard.
It doesn't make it the best possible design.
>Since we're
>taking for granted the need to have more than 8 bits to represent a
>character, the type corresponding to character would be assigned 16 bits
>in the implementation,
Please don't assume 16 bits are enough for characters just because you
don't know of more than 65536 characters.
I myself know about 10,000 characters, and I also know that more than
65536 characters exist.
Masataka Ohta
Please take a look at Unicode, as well as the competing ISO character set.
Yes, there are languages for which 64K glyphs are not enough. But each of
those languages has an alphabet that needs only a subset, and some of these
glyphs can be combined.
--
Sean Eric Fagan | "What *does* that 33 do? I have no idea."
s...@kithrup.COM | -- Chris Torek
-----------------+ (to...@ee.lbl.gov)
Any opinions expressed are my own, and generally unpopular with others.
Thanks for the good summary, Doug!
> The principle that should have been followed would have been to have
> each character handled as a single item throughout any text-handling
> program. There is no advantage to making programs deal with explicit
> character set encoding concerns, and considerable disadvantage in so
> doing. However, that's the path that "international" vendors were
> already following, so that momentum carried over into the C standard.
> It doesn't make it the best possible design.
Now, and even then, we didn't have the luxury of choosing the "best"
design. We are constrained by "existing practice". Changing the size
of a char from, say, 8 to 16 would break many pieces of code, and
would reveal some of the places in the source code that need to be
internationalized. This is great for i18n (internationalization), but
I doubt that people would be happy to break code that has nothing to
do with i18n.
Although it is sometimes interesting to talk about a "principle that
should have been followed", I would prefer to discuss what we should
do from now on.
I think that implementations that currently have 8-bit chars should
keep them that way, while wchar_t should be defined as an unsigned
32-bit quantity. New wchar_t-based interfaces should be defined to
supplement the char-based interfaces that can reasonably be determined
to be text-oriented, e.g. file names. (By the way, even the Unicode
Consortium now seems to endorse 32-bit characters for future
expansion.)
--
EvdP
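A sketch of the kind of supplemental interface being proposed here.
wfopen() is hypothetical (no such function exists in the standard), but
it shows how a wide-character file-name interface could sit on top of
the existing char-based one:

    #include <stdio.h>
    #include <stdlib.h>

    FILE *wfopen(const wchar_t *wname, const char *mode)
    {
        char mbname[1024];
        size_t n = wcstombs(mbname, wname, sizeof mbname);

        if (n == (size_t)-1 || n >= sizeof mbname)
            return NULL;      /* unconvertible, or name too long */
        return fopen(mbname, mode);
    }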
It seems to be stretching the notion of "character" much farther than
is reasonable to have so many of them in a single locale.
In any case, the reason I said 16 bits was because all representations
(code sets) in actual use that I have heard about comfortably fit into
16 bits. Certainly if more are really needed they should be used.
The important point is that 8 are not enough.
>I think that implementations that currently have 8-bit chars should
>keep them that way, while wchar_t should be defined as an unsigned
>32-bit quantity.
I think wchar_t should be defined as a *SIGNED* 32-bit int, so that
the common coding practice:
while((ch=getchar())>=0)
also works with wchar_t as:
while((wch=getwchar())>=0)
To do so, ISO 10646 should use only 31 bits, and the most significant
bit of the G octet should always be 0.
This restriction on the G octet allows programmers to use the reserved
bit internally for masking purposes.
Being able to use 1, 2 or even 3 bits for some internal flags is very handy
for many programs (if you need more, you should assign an extra octet for
flags).
Note that the original vi can't handle 8-bit character codes because it used
the most significant bit for an internal flag. Now we have 32 bits, so
reserve some for flags.
That's why the Japanese comments on the DIS 10646 ballot include:
- The G-octet range should be restricted
Masataka Ohta
> while((ch=getchar())>=0)
I don't know how common this is. I always use "while((ch = getchar()) != EOF)",
and use "int ch". Your code won't work on a machine with signed chars and
ISO Latin-1 characters.
> also works with wchar_t as:
> while((wch=getwchar())>=0)
If you make this "while((wch = getwchar()) != WEOF)" then you don't need to
change ISO 10646.
> Note that the original vi can't handle 8 bit character code because it used
> the most significant bit for an internal flag.
Which is as good an argument against restricting wchar_t as any I could make.
--
Peter da Silva; Ferranti International Controls Corporation; +1 713 274 5180;
Sugar Land, TX 77487-5012; `-_-' "Have you hugged your wolf, today?"
>> I think wchar_t should be defined as a *SIGNED* 32-bit int, so that
>> the common coding practice:
>
>> while((ch=getchar())>=0)
>I don't know how common this is.
In the old days of UNIX (V6?), it was common to check the result of function
calls for negativity. I still sometimes code that way.
>I always use "while((ch = getchar()) != EOF)",
>and use "int ch". Your code won't work on a machine with signed chars and
>ISO Latin-1 characters.
I, of course, use "int ch". But beginners are often confused.
>> also works with wchar_t as:
>
>> while((wch=getwchar())>=0)
>
>If you make this "while((wch = getwchar()) != WEOF)" then you don't need to
>change ISO 10646.
As for 10646, yes; but with a general character encoding scheme, can't
getwchar() return WEOF as a legal return value?
In the old days, getchar() could return 0 as a legal value and also
returned 0 on EOF. Such behaviour could not be corrected until the return
type of getchar() was made wider than 8 bits.
Then, if getwchar() can legally return WEOF, we must use a wider (64-bit)
type for getwchar() and wch.
It is much simpler to standardize that getwchar() can't return WEOF or EOF,
or that getwchar()'s result can't be negative.
Though getchar() can be used for general-purpose byte-stream processing,
getwchar() should not be used for general-purpose 32-bit data-stream
processing. So restricting the valid values of getwchar() is meaningful.
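A sketch of why an in-band WEOF would hurt, using the wide-character I/O
of the proposed library additions (getwchar(), WEOF, <wchar.h>). Because
WEOF signals both end-of-file and error, a careful loop already has to
consult ferror() afterwards; if WEOF could also be a legal character
value, even that would not disambiguate, which is the argument for
reserving values outright:

    #include <stdio.h>
    #include <wchar.h>

    long count_wide_chars(void)
    {
        wint_t wc;
        long n = 0;

        while ((wc = getwchar()) != WEOF)
            n++;
        if (ferror(stdin))
            return -1L;   /* error, not end-of-file */
        return n;
    }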
>> Note that the original vi can't handle 8 bit character code because it used
>> the most significant bit for an internal flag.
>
>Which is as good an argument against restricting wchar_t as any I could make.
Do you have any experience in programming?
Masataka Ohta
Oh, damn. Masataka is invading another newsgroup. Time to edit my
kill file....
Masataka has demonstrated his quality on comp.unix.internals; as
far as I can see, he has no redeeming characteristics at all.
Your best bet is to totally ignore this turkey, unless you really
want a several month discussion in which he repeatedly asserts
the absurd, while presenting made up "facts" to bolster these
absurdities, and in which he supplements his arguments by such
valuable comments as the one I quoted.
Followups have been directed to alt.flame.