Why should raw unicode be 32 bits? What is raw Unicode? UCS-16 and UCS-32
exist. Windows NT defines wchar_t as a 16 bit entity (UCS-16) whereas
various *nix platforms use UCS-32 or sometimes UCS-16 extended to 32 bits.
Netware is a bit schizoid and in some places treats it as 16 bits and 32
bits in others (which really confuses some things).
I'd be quite happy to have UTF8 libraries as most of my work nowadays is
primarily written using UTF8 (comms programming) and I translate in and out
for clients. What about all the mbcs support (non-UTF8) in the library. Do
we add UTF-8 as well or instead of?
Then again, I don't develop with Watcom so it's not my call.
Carl
In today's world it would be natural for a compiler to use UTF-8
encoded Unicode for "multibyte" characters and then treat wide
characters as raw Unicode (although in that case they should probably
be 32 bits, not 16 bits). I'm guessing that Open Watcom does not
currently do this. True?
Peter
Jiri
> Why should raw unicode be 32 bits? What is raw Unicode? UCS-16
> and UCS-32 exist.
Well, I'm probably not using the right terminology. My thought was
this: for Unicode code points above 64K, UTF-16 requires two 16 bit
words for every character. This means, for example, that a simple
count of 16 bit words does not necessarily provide a true character
count. Also indexing an array of 16 bit words is not necessarily the
same as indexing a string of characters (the fifth element of the
array is not necessarily the fifth character). If this matters (who
uses Unicode code points above 64K?) it seems like the most
straightforward solution is to store Unicode characters internally
as 32 bit quantities.
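
A minimal sketch of the indexing problem described above, assuming a zero-terminated array of 16 bit units holding UTF-16. The function name utf16_offset_of_char and the helper names are made up for illustration and are not part of any library discussed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* High (leading) surrogates occupy 0xD800..0xDBFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Low (trailing) surrogates occupy 0xDC00..0xDFFF. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Hypothetical helper: returns the array index at which the n-th character
   (0-based) of the zero-terminated UTF-16 string s starts, or (size_t)-1
   if the string holds fewer than n+1 characters. */
size_t utf16_offset_of_char(const uint16_t *s, size_t n)
{
    size_t i = 0;

    while (s[i] != 0) {
        if (n == 0)
            return i;
        /* A well-formed surrogate pair is two units but one character. */
        if (is_high_surrogate(s[i]) && is_low_surrogate(s[i + 1]))
            i += 2;
        else
            i += 1;
        n--;
    }
    return (size_t)-1;
}
```

For the units 0x0041, 0xD834, 0xDD1E, 0x0042 (the letter A, one character above 64K, the letter B), the third character starts at array index 3, not 2. Finding it takes a scan from the start of the string, which is the cost being weighed against 32 bit storage.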
> I'd be quite happy to have UTF8 libraries as most of my work
> nowadays is primarily written using UTF8 (comms programming) and
> I translate in and out for clients. What about all the mbcs
> support (non-UTF8) in the library. Do we add UTF-8 as well or
> instead of?
I guess that's what my question was all about: just what is the
format and character set used by the multibyte character string
support currently in the library? I was not able to determine it by
looking at the Open Watcom documentation.
Peter
Someone. That is all we need to know.
> it seems like the most
> straightforward solution is to store Unicode characters internally
> as 32 bit quantities.
>
> > I'd be quite happy to have UTF8 libraries as most of my work
> > nowadays is primarily written using UTF8 (comms programming) and
> > I translate in and out for clients. What about all the mbcs
> > support (non-UTF8) in the library. Do we add UTF-8 as well or
> > instead of?
>
> I guess that's what my question was all about: just what is the
> format and character set used by the multibyte character string
> support currently in the library? I was not able to determine it by
> looking at the Open Watcom documentation.
My understanding of this subject is far from perfect, but I
will try to convey what I believe to have learned, and then maybe
I will be corrected by someone else :-)
mbcs does not support/specify a specific encoding, it only
specifies general storage space for multibyte characters.
UTF-16 is widely recognized as the most versatile/efficient
encoding for most unicode supporting application software.
MS NT, Pocket PC, Java and MacOS X use UTF-16 internally.
wchar_t is 16 bits on Windows, and 32 bits on *nix.
Linux uses UTF-8 encoding for the shell, I/O and filenames
for backward compatibility.
A portable open source unicode library can be found at:
http://oss.software.ibm.com/icu/
More interesting reading on the subjects above:
http://www.mindspring.com/~markus.scherer/unicode/tn-uni16-20040113.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html
Roald
> mbcs does not support/specify a specific encoding, it only
> specifies general storage space for multibyte characters.
> UTF-16 is widely recognized as the most versatile/efficient
> encoding for most unicode supporting application software.
My understanding on this is also far from perfect. Here's where I was
coming from: the library provides functions to convert from multibyte
characters/strings to wide characters/strings so that seems to suggest
that the encoding methods used are---or potentially could be---different.
That led me to wonder just what encodings, if any, Open Watcom's library
supported. I had trouble finding any specific information about that in
the documentation.
UTF-16 has the annoying property that, technically, the characters are of
variable length. However, that's only true if you need to use the
relatively "exotic" characters with code points above 64K. I believe there
were no such characters defined in Unicode until recently. I get the
impression that many software systems use UTF-16 thinking that by doing so
all characters would be the same size. Yet that is no longer strictly true
and those systems now either a) can't handle all of Unicode or b) can no
longer assume a uniform character size. I suppose one could say that the
choice of 16 bit storage units for characters was shortsighted although at
the time it was made, I'm sure 16 bits were fine.
Anyway my primary question was about the Open Watcom library and what
character encodings it supported in its multibyte and/or wide characters.
> http://www.cl.cam.ac.uk/~mgk25/unicode.html
This document says, for example, that glibc v2.2 or higher on Linux uses
32 bit wide characters and locale specific multibyte characters including
UTF-8 and ISO 8859-1 encodings. It talks about how one can get the various
multibyte character encodings by playing around with locale settings. It
seems like Open Watcom needs documentation along these lines someplace or
another. I couldn't find any, at least.
Peter
> The unicode guy I have had some communication with maintains
> that UTF-16 is the best choice all over, for most systems.
> This is because 32 bits wastes a lot of space/speed in most cases.
Yeah, that is definitely a downside. It almost seems like a UTF-24 would
have been useful... although there are obviously problems with that idea
too.
> But the acknowledged downside is that an index or length operation
> is more costly because of the possible non uniform character
> size. But if you have a String object wrapping the details, things
> like length can be cached, and other speed optimizing tricks.
Definitely. Inside, for example, std::wstring, much magic can be hidden.
I was thinking about this case, though
std::wchar_t buffer[BUFFER_SIZE];
// ...
... buffer[10] // Not necessarily character #10 (or #11, whatever)
However, I'm not going to lose any sleep over it. I wonder if this
matter will rear its ugly head again in the Linux port of Open Watcom. If
glibc is using/expecting 32 bit wchar_t will Open Watcom on Linux want to
play along with that? If so, will that create a problem for Open Watcom's
compatibility with itself on other platforms?
I'm not really looking for answers here... these are just random musings.
Peter
The unicode guy I have had some communication with maintains
that UTF-16 is the best choice all over, for most systems.
This is because 32 bits wastes a lot of space/speed in most cases.
But the acknowledged downside is that an index or length operation
is more costly because of the possible non uniform character
size. But if you have a String object wrapping the details, things
like length can be cached, and other speed optimizing tricks.
> Anyway my primary question was about the Open Watcom library and what
> character encodings it supported in its multibyte and/or wide characters.
Sorry, I probably know less about this than you do.
Roald
I realize that (as should be clear from earlier posts in thread).
The "downside" of this is twice the data to be read/written for each
"character" processed. Will/may blow the RAM cache faster than you
expect ;-) I personally believe that the benefits of a 32 bit wchar_t
are totally dwarfed by the generally increased i/o speed of 16 bits.
99% of world languages need only 16 bits...
> > So except
> > for some rare cases, it will not matter much. (How often does one need
> > to do buffer[42] access on a string that may just as well
> > be in chinese as in english based on where the end user lives?)
>
> When text stored in a "wchar" array is parsed, or a tabular display is
> built. Presumably the character classification functions, such as
> "isletter", "isnumber", "ispunct", etc. will be available for wide
> characters, too.
For the full unicode range? Without a 3rd party library?
I think not.
> > > However, I'm not going to lose any sleep over it. I wonder if this
> > > matter will rear its ugly head again in the Linux port of Open
> > > Watcom. If glibc is using/expecting 32 bit wchar_t will Open Watcom
> > > on Linux want to play along with that? If so, will that create a
> > > problem for Open Watcom's compatibility with itself on other
> > > platforms?
> > >
> > > I'm not really looking for answers here... these are just random
> > > musings.
> >
> > I am not certain, but I believe that OW will just have to go along
> > with what standard the OS platform outlines. That means wchar_t with
> > 16 bits on Microsoft platforms, and wchar_t of 32 bits on Linux.
>
> There goes binary compatibility.
Has there ever been binary compatibility? All types the same size
on all supported platforms? The size of int on the different platforms
indicates to me that this has never been the case...
Since OW supplies its own run time library, it can get away with non
conforming size of wchar_t, as long as a conversion to platform
format is included (reasonably trivial to convert from UTF16 to UTF32).
If one size is chosen for all platforms, I submit that the best
general choice is UTF16 (IMHO).
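
To give an idea of what the UTF16 to UTF32 conversion mentioned above involves, here is a rough sketch. The name utf16_to_utf32 is hypothetical, and the choice to replace an unpaired surrogate with U+FFFD is an assumption of this sketch, not something the thread or Open Watcom specifies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts the zero-terminated UTF-16 string src into
   UTF-32 in dst, which has room for cap elements including the terminator.
   Returns the number of characters written (terminator excluded), or
   (size_t)-1 if dst is too small. */
size_t utf16_to_utf32(const uint16_t *src, uint32_t *dst, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return (size_t)-1;

    while (*src != 0) {
        uint32_t cp = *src++;

        if (cp >= 0xD800 && cp <= 0xDBFF && *src >= 0xDC00 && *src <= 0xDFFF) {
            /* Surrogate pair: ten bits from each half, offset by 0x10000. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*src++ - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            /* Unpaired surrogate: substitute U+FFFD (assumption). */
            cp = 0xFFFD;
        }

        if (n + 1 >= cap)          /* keep one slot for the terminator */
            return (size_t)-1;
        dst[n++] = cp;
    }
    dst[n] = 0;
    return n;
}
```

The units 0xD834, 0xDD1E come out as the single value 0x1D11E. The loop itself is short; most of the care goes into buffer sizing and deciding what to do with malformed input.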
Roald
For UTF32 all "code points" are the same size today, and for the
foreseeable future. Thus if wchar_t is 32b, array indexing and character
indexing are identical.
> So except
> for some rare cases, it will not matter much. (How often does one need
> to do buffer[42] access on a string that may just as well
> be in chinese as in english based on where the end user lives?)
When text stored in a "wchar" array is parsed, or a tabular display is
built. Presumably the character classification functions, such as
"isletter", "isnumber", "ispunct", etc. will be available for wide
characters, too.
>
> > However, I'm not going to lose any sleep over it. I wonder if this
> > matter will rear its ugly head again in the Linux port of Open
> > Watcom. If glibc is using/expecting 32 bit wchar_t will Open Watcom
> > on Linux want to play along with that? If so, will that create a
> > problem for Open Watcom's compatibility with itself on other
> > platforms?
> >
> > I'm not really looking for answers here... these are just random
> > musings.
>
> I am not certain, but I believe that OW will just have to go along
> with what standard the OS platform outlines. That means wchar_t with
> 16 bits on Microsoft platforms, and wchar_t of 32 bits on Linux.
There goes binary compatibility.
--
E. S. (Steve) Fábián 6522 Baja Way
410-799-7972 Elkridge, MD 21075
Right. But I think one just has to get used to the idea that for
unicode character manipulation, one has to use a library. And that
is a fact whether 16 or 32 bits wchar_t storage is used. So except
for some rare cases, it will not matter much. (How often does one need
to do buffer[42] access on a string that may just as well
be in chinese as in english based on where the end user lives?)
> However, I'm not going to lose any sleep over it. I wonder if this
> matter will rear its ugly head again in the Linux port of Open Watcom. If
> glibc is using/expecting 32 bit wchar_t will Open Watcom on Linux want to
> play along with that? If so, will that create a problem for Open Watcom's
> compatibility with itself on other platforms?
>
> I'm not really looking for answers here... these are just random musings.
I am not certain, but I believe that OW will just have to go along
with what standard the OS platform outlines. That means wchar_t with
16 bits on Microsoft platforms, and wchar_t of 32 bits on Linux.
Roald
>
>The unicode guy I have had some communication with maintains
>that UTF-16 is the best choice all over, for most systems.
>This is because 32 bits wastes a lot of space/speed in most cases.
>But the acknowledged downside is that an index or length operation
>is more costly because of the possible non uniform character
>size.
Is strlenUTF16() supposed to return the number of readable characters,
or the number of 16bit words in the string?
I would want this to work, no matter if there were non-uniform
characters:
utf16 *dupstringutf16(utf16 *incoming)
{
    utf16 *mystring;
    mystring = malloc((strlenutf16(incoming)+1) * sizeof(utf16));
    strcpyutf16(mystring, incoming);
    return mystring;
}
Where is the extra overhead of handling a UTF16 escape code?
JimS
> However, the description of wchar_t says that it is "an integer type whose
> range of values can represent distinct codes for all members of the
> largest extended character set specified among the supported locales."
> This seems to say that using UTF-16 surrogate pairs in wchar_t would be
> contrary to the standard. In other words, C99 seems to say that when
> wchar_t is used, one character must fit in exactly one wchar_t object in
> all cases.
>
I think that's exactly right - the whole point of wchar_t is to avoid
dealing with multi-byte character strings (MBCS).
Multi-byte strings have some important advantages: they're compatible
with plain ASCII strings, and they are regular NUL-terminated
strings, hence a multi-byte string will pass unscathed through code
that knows nothing about it. Also multi-byte text tends to require
less storage space.
The disadvantage is the lack of 1:1 string offset:character
relationship, and processing a multi-byte string always requires
parsing it from left to right (strrchr() gets tricky).
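
As an illustration of the missing 1:1 offset:character relationship, here is a small sketch that assumes the multibyte encoding is well-formed UTF-8. That is only an assumption for the example; the thread has not established which encodings Open Watcom's library uses, and other multibyte encodings would need the locale-aware mblen() instead. The name utf8_char_count is made up.

```c
#include <stddef.h>

/* Hypothetical helper: counts the characters in a zero-terminated string
   assumed to be well-formed UTF-8. Every byte that is not a continuation
   byte (bit pattern 10xxxxxx) starts a new character. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

The string "a\xC3\xA9z" is four bytes long but holds three characters, so a byte offset cannot be used as a character index without scanning from the left.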
Using wchar_t pretty much flips the advantages/disadvantages.
Michal
>JimS <so...@not.com> wrote in
>news:ebkc405adrovbbh9b...@4ax.com:
>
>> Is strlenUTF16() supposed to return the number of readable characters,
>> or the number of 16bit words in the string?
>
>There is a function in the C99 library named wcslen. The C99 standard says
>that it returns the number of wide characters in the wide character string
>pointed at by its argument. This suggests that it's supposed to return a
>character count.
>
>However, the description of wchar_t says that it is "an integer type whose
>range of values can represent distinct codes for all members of the
>largest extended character set specified among the supported locales."
>This seems to say that using UTF-16 surrogate pairs in wchar_t would be
>contrary to the standard. In other words, C99 seems to say that when
>wchar_t is used, one character must fit in exactly one wchar_t object in
>all cases. This in turn seems to imply, as far as Unicode goes, that one
>either a) does not support all possible Unicode characters or b) one uses
>a 32 bit wchar_t.
You see, this is what I mean - any utf16 escape code would still be a
utf16 character, with a special meaning, like '\n' for instance.
>> utf16 *dupstringutf16(utf16 *incoming)
>> {
>>     utf16 *mystring;
>>     mystring = malloc((strlenutf16(incoming)+1) * sizeof(utf16));
>>     strcpyutf16(mystring, incoming);
>>     return mystring;
>> }
>>
>> Where is the extra overhead of handling a UTF16 escape code?
>
>Well, if strlenutf16 returned a character count and if the string was UTF-
>16 encoded, there would be a possibility that it might return a value that
>was less than the number of 16 bit units used in the string (meaning the
>memory allocation would be wrong).
What use would that be? How would I find how much memory I needed?
I made my own function names up so everyone would understand I'm not
necessarily talking about 'standard' widechar functions.
>On the other hand if strlenutf16 returns a count of 16 bit units, you
>couldn't necessarily use it to count
>the number of characters in the string without making assumptions about
>the string's contents.
Number of characters must be the same as number of words else none of
this makes any sense. How a program might interpret those characters
for display is another matter entirely (that's where there are
possible performance problems).
I'm playing devil's advocate here, because I can't see the pitfalls
being discussed elsethread.
JimS
> You see, this is what I mean - any utf16 escape code would still be a
> utf16 character, with a special meaning, like '\n' for instance.
UTF-16 doesn't use escape codes, exactly. Code points above 64K are
represented with two 16 bit units called "surrogate pairs" but neither of
them is an escape character. Either of such units, taken alone, is an
invalid character---it is meaningless without the other unit. This is a
good thing. If you randomly access a UTF-16 string and find half of a
surrogate pair, you can tell that you have done so. If there was an escape
code, and you skipped the escape code, you wouldn't necessarily know and
that could be a problem.
> What use would that be? How would I find how much memory I needed?
>
> I made my own function names up so everyone would understand I'm not
> necessarily talking about 'standard' widechar functions.
It seems to me that when processing UTF-16 one needs two functions: one
function to measure the length in characters and another to measure the
length in 16 bit units. The second function would be necessary so that
memory allocations could be done. The first would be for telling the user
how long his/her strings are.
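
The two functions described above might look roughly like this. All names are made up for the sketch (they are not standard library functions), and the duplicate routine is included only to show which of the two lengths memory allocation needs.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Length in 16 bit units, terminator excluded: the number malloc needs. */
size_t utf16_unit_count(const uint16_t *s)
{
    size_t n = 0;

    while (s[n] != 0)
        n++;
    return n;
}

/* Length in characters: a well-formed surrogate pair counts once. */
size_t utf16_char_count(const uint16_t *s)
{
    size_t count = 0;
    size_t i;

    for (i = 0; s[i] != 0; i++) {
        /* Skip a low surrogate that directly follows a high surrogate;
           the pair was already counted at its first unit. */
        if (i > 0 && s[i] >= 0xDC00 && s[i] <= 0xDFFF &&
            s[i - 1] >= 0xD800 && s[i - 1] <= 0xDBFF)
            continue;
        count++;
    }
    return count;
}

/* Duplicates a string; sizes the buffer from the unit count. */
uint16_t *utf16_dup(const uint16_t *s)
{
    size_t units = utf16_unit_count(s) + 1;   /* +1 for the terminator */
    uint16_t *copy = malloc(units * sizeof *copy);

    if (copy != NULL)
        memcpy(copy, s, units * sizeof *copy);
    return copy;
}
```

For the units 0x0041, 0xD834, 0xDD1E the unit count is 3 and the character count is 2. Sizing the allocation from the character count would leave the buffer one unit short.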
> Number of characters must be the same as number of words else none of
> this makes any sense.
Unicode defines a concept of "combining" characters where a single
character on the display is composed of a base character plus various
combining marks that are stored as separate characters in the Unicode
data. I think it's entirely fair to count the combining characters
separately even though they all correspond to a single displayed
character. This is not unlike what happens with the ASCII tab character...
how much space it takes on the screen is different than the one
character's worth of data it consumes in a string.
However, you can't really count each half of a surrogate pair as separate
characters. As I said above, they have no meaning separately. It would be
like counting the upper four bits of the ASCII code for 'x' as one
character and the lower four bits as another (in fact, it's very much like
that... that's essentially how the surrogate pairs are made). If you want
one character per storage unit, UTF-16 doesn't cut it... unless you are
willing to ignore characters with code points above 64K, of course.
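
For anyone curious how the pairs are made, here is a small sketch of the bit splitting. The function name utf16_encode and its return convention are assumptions of this example only.

```c
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-16.
   Returns 2 if a surrogate pair was written to out[0] and out[1],
   1 if the code point fits in out[0] alone,
   0 if cp is not encodable (above 0x10FFFF or itself a surrogate value). */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* upper 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* lower 10 bits */
    return 2;
}
```

Code point 0x1D11E becomes the two units 0xD834 and 0xDD1E. Each unit carries only half of the bits, which is why neither one means anything on its own.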
Peter
>JimS <so...@not.com> wrote in news:473d401tmvdvf9o8g5tudrbsk1m74knn8u@
>4ax.com:
>> What use would that be? How would I find how much memory I needed?
>>
>> I made my own function names up so everyone would understand I'm not
>> necessarily talking about 'standard' widechar functions.
>
>It seems to me that when processing UTF-16 one needs two functions: one
>function to measure the length in characters and another to measure the
>length in 16 bit units. The second function would be necessary so that
>memory allocations could be done. The first would be for telling the user
>how long his/her strings are.
>
>> Number of characters must be the same as number of words else none of
>> this makes any sense.
For anyone else feeling a bit thick on this subject, here's the
current Unicode spec for UTF-8, UTF-16 and UTF-32.
http://www.unicode.org/versions/Unicode4.0.0/bookmarks.html
I've used wide chars before and always treated them as if they were
simply a 16bit alphabet. Silly me.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_9i79.asp
Half of Windows, since Win2K, supports surrogates, before that it was
simply limited to a word. :-/
I can't find anywhere where it says what VisualStudio's CLIB will
support.
JimS
> Is strlenUTF16() supposed to return the number of readable characters,
> or the number of 16bit words in the string?
There is a function in the C99 library named wcslen. The C99 standard says
that it returns the number of wide characters in the wide character string
pointed at by its argument. This suggests that it's supposed to return a
character count.
However, the description of wchar_t says that it is "an integer type whose
range of values can represent distinct codes for all members of the
largest extended character set specified among the supported locales."
This seems to say that using UTF-16 surrogate pairs in wchar_t would be
contrary to the standard. In other words, C99 seems to say that when
wchar_t is used, one character must fit in exactly one wchar_t object in
all cases. This in turn seems to imply, as far as Unicode goes, that one
either a) does not support all possible Unicode characters or b) one uses
a 32 bit wchar_t.
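
One way to check which of the two cases a given compiler falls into is to compare WCHAR_MAX (from <wchar.h>) with the highest Unicode code point, 0x10FFFF. This is only a sketch; the function name is made up, and a large enough wchar_t says nothing about which encoding the library actually stores in it.

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if a single wchar_t is wide enough to hold
   any Unicode code point (up to 0x10FFFF), 0 otherwise. */
int wchar_holds_all_unicode(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

With a 16 bit wchar_t this returns 0, which corresponds to case a) above unless surrogate pairs are used. With a 32 bit wchar_t it returns 1.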
> utf16 *dupstringutf16(utf16 *incoming)
> {
>     utf16 *mystring;
>     mystring = malloc((strlenutf16(incoming)+1) * sizeof(utf16));
>     strcpyutf16(mystring, incoming);
>     return mystring;
> }
>
> Where is the extra overhead of handling a UTF16 escape code?
Well, if strlenutf16 returned a character count and if the string was UTF-
16 encoded, there would be a possibility that it might return a value that
was less than the number of 16 bit units used in the string (meaning the
memory allocation would be wrong). On the other hand if strlenutf16
returns a count of 16 bit units, you couldn't necessarily use it to count
the number of characters in the string without making assumptions about
the string's contents.
Peter
True. IMHO OW should offer 2 options at this time:
1) UTF16 without mapping of 32b codepoints (i.e., without support for
them),
2) UTF32.
The actual string processing for these is simple - all codepoints are
identical size.
A future enhancement might deal with UTF16 with mapping of 32b
codepoints, possibly by converting to and from UTF32 on input and
output.
>
> > > So except
> > > for some rare cases, it will not matter much. (How often does one
> > > need to do buffer[42] access on a string that may just as well
> > > be in chinese as in english based on where the end user lives?)
> >
> > When text stored in a "wchar" array is parsed, or a tabular display
> > is built. Presumably the character classification functions, such as
> > "isletter", "isnumber", "ispunct", etc. will be available for wide
> > characters, too.
>
> For the full unicode range? Without a 3rd party library?
> I think not.
I don't know whether or not the Unicode consortium provides the
appropriate database. I know that most implementations of C use a simple
table-lookup, usually implemented as a macro - in the 8b case there is a
table of 256 entries, indexed by the "codepoint" (using Unicode
terminology), with a different mask for each classification function.
Building the table for either UTF16 or UTF32 would be extremely error
prone without machine-readable tables...
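
A sketch of the 8b table-lookup scheme just described, filled in for the ASCII digits and letters only. All names (ct_table, ct_init, the CT_ masks and the ct_ macros) are invented for this example and do not come from any particular C library; the loops also assume the letters are contiguous, as they are in ASCII.

```c
/* One bit per character class. */
#define CT_DIGIT 0x01
#define CT_UPPER 0x02
#define CT_LOWER 0x04

/* One entry per 8 bit code; each entry is a set of class bits. */
static unsigned char ct_table[256];

/* Fills in the entries for ASCII digits and letters. A real library would
   ship a precomputed table rather than build it at run time. */
void ct_init(void)
{
    int c;

    for (c = '0'; c <= '9'; c++) ct_table[c] |= CT_DIGIT;
    for (c = 'A'; c <= 'Z'; c++) ct_table[c] |= CT_UPPER;
    for (c = 'a'; c <= 'z'; c++) ct_table[c] |= CT_LOWER;
}

/* Each classification macro is the same lookup with a different mask. */
#define ct_isdigit(c) ((ct_table[(unsigned char)(c)] & CT_DIGIT) != 0)
#define ct_isalpha(c) ((ct_table[(unsigned char)(c)] & (CT_UPPER | CT_LOWER)) != 0)
```

The same layout with 16 bit codes needs 65536 entries, and the full Unicode range needs far more, so generating the table by hand is out of the question. The Unicode Character Database is published in machine-readable form (UnicodeData.txt), which is what such a table would normally be generated from.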
...
> > > I am not certain, but I believe that OW will just have to go along
> > > with what standard the OS platform outlines. That means wchar_t
> > > with 16 bits on Microsoft platforms, and wchar_t of 32 bits on
> > > Linux.
> >
> > There goes binary compatibility.
>
> Has there ever been binary compatibility? All types the same size
> on all supported platforms? The size of int on the different platforms
> indicates to me that this has never been the case...
Size of "int" is a problem only if you wish to link (statically or
dynamically) a caller module with a called module from different
platforms. A programmer coding for portability will not use "int" when
"long" is required to represent the data range, nor will "int" be used
when "short" has the required range. The biggest problem is that of
dealing with data imported from another platform - it is coded in terms
of absolute sizes, and C89 permits 32-bit short.
>
> Since OW supplies its own run time library, it can get away with non
> conforming size of wchar_t, as long as a conversion to platform
> format is included (reasonably trivial to convert from UTF16 to
> UTF32).
> If one size is chosen for all platforms, I submit that the best
> general choice is UTF16 (IMHO).
I agree.
Peter C. Chapin 04.03.04 7:09 wrote:
> Unicode defines a concept of "combining" characters where a single
> character on the display is composed of a base character plus various
> combining marks that are stored as separate characters in the Unicode
> data. I think it's entirely fair to count the combining characters
> separately even though they all correspond to a single displayed
> character. This is not unlike what happens with the ASCII tab character...
> how much space it takes on the screen is different than the one
> character's worth of data it consumes in a string.
Don't forget that \t should not necessarily be considered a "multiple
spaces" character. It is a control character, just like \n, and it may
insert many spaces, show one space, or show (dump) one specific character
image (without handling its control functions).