wide-chars and Unicode

Samir Dhume

unread,

Dec 30, 1998, 3:00:00 AM12/30/98

to

Hi.

1. I have a question about wide-chars. I was under the impression
that wide chars used the Unicode character set. But a sizeof(wchar_t)
on my Solaris gives me 4. Could someone clarify please.

2. Could someone direct me to source code for multi-byte support
functions like wctomb, wcstombs etc.

Thank you,
Samir

Lawrence Kirby

unread,

Dec 31, 1998, 3:00:00 AM12/31/98

to

In article <76e80f$m63$1...@jetsam.uits.indiana.edu>
sdh...@cs.indiana.edu "Samir Dhume" writes:

>Hi.
>
>1. I have a question about wide-chars. I was under the impression
>that wide chars used the Unicode character set. But a sizeof(wchar_t)
>on my Solaris gives me 4. Could someone clarify please.

A number of implementations use Unicode but the C standard doesn't specify
the use of a specific character set at any point. However AFAIK
having sizeof(wchar_t)==4 doesn't preclude the use of Unicode.

>2. Could someone direct me to source code for multi-byte support
>functions like wctomb, wcstombs etc.

The details will be platform specific but you can probably find
implementations in source code distribution such as GNU's. However there's
no real need to see the source unless you are building your own
implementation, just use the libraries supplied with your compiler.

--
-----------------------------------------
Lawrence Kirby | fr...@genesis.demon.co.uk
Wilts, England | 7073...@compuserve.com
-----------------------------------------

Message has been deleted

Lawrence Kirby

unread,

Jan 4, 1999, 3:00:00 AM1/4/99

to

In article <76qubs$n3k$1...@nr1.ottawa.istar.net>
rha...@intouchsurvey.com "Russell Harper" writes:

...

>2. My understanding is that mb is fixed at two characters per datum and not
>necessarily the same as a wide character, so if you decide to write the
>functions yourself you'll have to use some sort of conversion:
>
> #include <limits.h>
>
> /* Assumes mb is a two character array */
> mb[0] = wc & UCHAR_MAX;
> mb[1] = (wc >> CHAR_BIT) & UCHAR_MAX;

No, see my earlier response in this thread.

Ross Ridge

unread,

Jan 5, 1999, 3:00:00 AM1/5/99

to

sdh...@cs.indiana.edu "Samir Dhume" writes:
>1. I have a question about wide-chars. I was under the impression
>that wide chars used the Unicode character set. But a sizeof(wchar_t)
>on my Solaris gives me 4. Could someone clarify please.

Solaris doesn't use Unicode for wchar_t. I believe it uses locale
dependent fixed-width EUC character sets.

Lawrence Kirby <fr...@genesis.demon.co.uk> wrote:
>A number of implementations use Unicode but the C standard doesn't specify
>the use of a specific character set at any point. However AFAIK
>having sizeof(wchar_t)==4 doesn't preclude the use of Unicode.

According to the Unicode standard it does. If it's not in 16-bit unsigned
quantities, it's not Unicode.

>>2. Could someone direct me to source code for multi-byte support
>>functions like wctomb, wcstombs etc.
>
>The details will be platform specific but you can probably find
>implementations in source code distribution such as GNU's.

Do GNU's mb routines actually do anything?

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rri...@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/u/rridge/
db //

Szu-Wen Huang

unread,

Jan 5, 1999, 3:00:00 AM1/5/99

to

Ross Ridge (rri...@calum.csclub.uwaterloo.ca) wrote:
: Lawrence Kirby <fr...@genesis.demon.co.uk> wrote:
: >[...] AFAIK

: >having sizeof(wchar_t)==4 doesn't preclude the use of Unicode.

: According to the Unicode standard it does. If it's not in 16-bit
: unsigned quantities, it's not Unicode.

[...]

Please help me understand this statement. Given:

int a;
long b;

a = 'a';
b = 'a';

On an ASCII system, a and b will both evaluate to ASCII 'a',
regardless of how many bits are used to store the ASCII value.
Are you saying the analogue of this code is not permissible when
using Unicode? Why is there such a restriction?

Lawrence Kirby

unread,

Jan 6, 1999, 3:00:00 AM1/6/99

to

In article <F53J4...@undergrad.math.uwaterloo.ca>
rri...@calum.csclub.uwaterloo.ca "Ross Ridge" writes:

...

>According to the Unicode standard it does. If it's not in 16-bit unsigned
>quantities, it's not Unicode.

So, you're claiming that it is impossible to implement Unicode on a platform
that doesn't have a 16 bit integer type, e.g. a Cray that supports 8 bit
char types and all of the other integer types are 64 bits, or a 36 bit system
that supports 9, 18 and 36 bit integer types? What is the rationale for
such a restriction?

>>>2. Could someone direct me to source code for multi-byte support
>>>functions like wctomb, wcstombs etc.
>>
>>The details will be platform specific but you can probably find
>>implementations in source code distribution such as GNU's.
>
>Do GNU's mb routines actually do anything?

I don't know offhand, why not check them?

Ross Ridge

unread,

Jan 6, 1999, 3:00:00 AM1/6/99

to

rri...@calum.csclub.uwaterloo.ca "Ross Ridge" writes:
>According to the Unicode standard it does. If it's not in 16-bit unsigned
>quantities, it's not Unicode.

Lawrence Kirby <fr...@genesis.demon.co.uk> wrote:
>So, you're claiming that it is impossible to implement Unicode on a
>platform that doesn't have a 16 bit integer type, e.g. a Cray that
>supports 8 bit char types and all of the other integer types are 64
>bits, or a 36 bit system that supports 9, 18 and 36 bit integer types?

Not if you want to claim that processing of wchar_t values can be
done in way conforming to the Unicode specification.

> What is the rationale for such a restriction?

The Unicode specification has no rationale. Personally, I think it's
just best ignored, like much of the Unicode specification.

Will Rose

unread,

Jan 6, 1999, 3:00:00 AM1/6/99

to

Ross Ridge (rri...@calum.csclub.uwaterloo.ca) wrote:

: rri...@calum.csclub.uwaterloo.ca "Ross Ridge" writes:
: >According to the Unicode standard it does. If it's not in 16-bit unsigned
: >quantities, it's not Unicode.

: Lawrence Kirby <fr...@genesis.demon.co.uk> wrote:
: >So, you're claiming that it is impossible to implement Unicode on a
: >platform that doesn't have a 16 bit integer type, e.g. a Cray that
: >supports 8 bit char types and all of the other integer types are 64
: >bits, or a 36 bit system that supports 9, 18 and 36 bit integer types?

: Not if you want to claim that processing of wchar_t values can be
: done in way conforming to the Unicode specification.

: > What is the rationale for such a restriction?

: The Unicode specification has no rationale. Personally, I think it's
: just best ignored, like much of the Unicode specification.

Well, the UTF-8 annex seems to be pretty widely implemented. And I'd
have thought, tho' I haven't tried it, that you could pack and unpack
wchar_t types on systems without a 16-bit primitive in the same way
that BCPL handles packed strings. Tedious, but quite possible.

Will
c...@crash.cts.com

Ross Ridge

unread,

Jan 7, 1999, 3:00:00 AM1/7/99

to

Will Rose <c...@cts.com> wrote:
>Well, the UTF-8 annex seems to be pretty widely implemented.

Not really. In practice, on the few systems it's supported, it's used
just to encode ISO-8859-1 characters.

>And I'd have thought, tho' I haven't tried it, that you could pack and
>unpack wchar_t types on systems without a 16-bit primitive in the same
way that BCPL handles packed strings. Tedious, but quite possible.

Again I suggest just ignoring the Unicode specification.

Ross Ridge

unread,

Jan 7, 1999, 3:00:00 AM1/7/99

to

rri...@calum.csclub.uwaterloo.ca "Ross Ridge" writes:
>According to the Unicode standard it does. If it's not in 16-bit unsigned
>quantities, it's not Unicode.

Lawrence Kirby <fr...@genesis.demon.co.uk> wrote:
>So, you're claiming that it is impossible to implement Unicode on a
>platform that doesn't have a 16 bit integer type, e.g. a Cray that
>supports 8 bit char types and all of the other integer types are 64
>bits, or a 36 bit system that supports 9, 18 and 36 bit integer types?

Ross Ridge <rri...@calum.csclub.uwaterloo.ca> wrote:
>Not if you want to claim that processing of wchar_t values can be
>done in way conforming to the Unicode specification.

I didn't have Unicode specification handy at the time I made the
post I'm following up to, but here's what it says about a 32-bit
wchar_t type:

On systems where the native character type or wchar_t are
implemented as 32-bit quantities, an implementation may
transiently use 32-bit quantities to represent Unicode
characters during processing. The internals of this
representation are treated as black box and are not Unicode
conformant. In particular, any API or runtime interfaces that
accept strings of 32-bit characters are not Unicode conformant.
...

The Unicode specification is not what a lot people think it is.

Will Rose

unread,

Jan 7, 1999, 3:00:00 AM1/7/99

to

Ross Ridge (rri...@calum.csclub.uwaterloo.ca) wrote:

: Will Rose <c...@cts.com> wrote:
: >Well, the UTF-8 annex seems to be pretty widely implemented.

: Not really. In practice, on the few systems it's supported, it's used
: just to encode ISO-8859-1 characters.

: >And I'd have thought, tho' I haven't tried it, that you could pack and
: >unpack wchar_t types on systems without a 16-bit primitive in the same
: way that BCPL handles packed strings. Tedious, but quite possible.

: Again I suggest just ignoring the Unicode specification.

The trouble is that firstly you give no reasons for this approach,
and secondly you supply no alternative. How do you suggest handling
multibyte characters?

Will
c...@crash.cts.com

Ross Ridge

unread,

Jan 7, 1999, 3:00:00 AM1/7/99

to

Will Rose <c...@cts.com> wrote:
>And I'd have thought, tho' I haven't tried it, that you could pack and
>unpack wchar_t types on systems without a 16-bit primitive in the same
>way that BCPL handles packed strings. Tedious, but quite possible.

Ross Ridge (rri...@calum.csclub.uwaterloo.ca) wrote:
>Again I suggest just ignoring the Unicode specification.

Will Rose <c...@cts.com> wrote:
>The trouble is that firstly you give no reasons for this approach,
>and secondly you supply no alternative.

Have you been reading this thread? My reasons and the alternative is
obvious. Since the unsigned 16-bit prcoessing requirement is absurdly
difficult to implement on some systems, it's just best ignored as being
an unreasonable requirement. The alternative is just to define wchar_t
as a 36-bit or 64-bit or whatever sized type happens to work.

>How do you suggest handling multibyte characters?

Personally, unless handling multiple incompatable multibyte character
sets *simultaneously* is a requirement, I recommend fixed-width EUC.

Adrian HAVILL

unread,

Jan 8, 1999, 3:00:00 AM1/8/99

to

Ross Ridge wrote:
>
> Will Rose <c...@cts.com> wrote:
> >And I'd have thought, tho' I haven't tried it, that you could pack and
> >unpack wchar_t types on systems without a 16-bit primitive in the same
> >way that BCPL handles packed strings. Tedious, but quite possible.
>
> Ross Ridge (rri...@calum.csclub.uwaterloo.ca) wrote:
> >Again I suggest just ignoring the Unicode specification.
>
> Will Rose <c...@cts.com> wrote:
> >The trouble is that firstly you give no reasons for this approach,
> >and secondly you supply no alternative.
>
> Have you been reading this thread? My reasons and the alternative is
> obvious. Since the unsigned 16-bit prcoessing requirement is absurdly
> difficult to implement on some systems, it's just best ignored as being
> an unreasonable requirement. The alternative is just to define wchar_t
> as a 36-bit or 64-bit or whatever sized type happens to work.

The Unicode Standard does not say that you must process using an
unsigned 16-bit type.

It says that "a process shall interpret Unicode code values as 16-bit
quantities." The internal representation is up to the process... so long
as the process doesn't attempt to treat a single 16-bit value as two or
more atomic 8-bit characters (ie allow a match between a 8-bit value and
the high or low half of a Unicode character), and that the data going in
and out of the process to and from the real world is valid Unicode (or a
valid transformation of it), the black box implementation is up to you--
long, unsigned long, int, short-- char is even possible so long as the
code groups in 16.

You shouldn't use wchar_t for Unicode, because the a) source wouldn't be
portable (wchar_t can be as small as 8-bits in a Standard C environment)
and b) because the internal format of wchar_t is compiler dependent
(with the only condition being that the portable C character set
correspond to the wide character by zero extension), the Standard C wide
character functions (<wchar.h>, <wctype.h>, mbtowc(), etc.) wouldn't
work correctly with the manually stuffed wchar_t... so I don't see the
advantage in using wchar_t when you could just use a typedef or macro to
a unsigned short type, which will be portable and you don't have to
worry about the sign.

> >How do you suggest handling multibyte characters?
>
> Personally, unless handling multiple incompatable multibyte character
> sets *simultaneously* is a requirement, I recommend fixed-width EUC.

The only reason you're want to go to EUC is if: data portability is not
a concern (transfer between all systems now and in the future will be
homogenous and all use EUC), and/or b) your compiler implements wchar_t
as fixed width EUC and char as multi-byte EUC... which saves you the
trouble of re-inventing the wheel. Otherwise, converting between a
particular EUC format for external use and internal use is just as
tedious as implementing Unicode... so you may as well go the more
portable route.
--
Adrian D. Havill; Chief Developer;
Development Section, Service Department, System Division,
InterQ, inc.; "Internet for Everyone!"

Ross Ridge

unread,

Jan 8, 1999, 3:00:00 AM1/8/99

to

Adrian HAVILL <hav...@interq.ad.jp> wrote:
>The Unicode Standard does not say that you must process using an
>unsigned 16-bit type.

The Unicode *specification* (it is *not* a standard) requires
that wchar_t be an unsigned 16-bit type *explicitly*. Read the
relevent section I quoted from the specification in a previous
articl.e

>> Personally, unless handling multiple incompatable multibyte character
>> sets *simultaneously* is a requirement, I recommend fixed-width EUC.
>
>The only reason you're want to go to EUC is if: data portability is not
>a concern (transfer between all systems now and in the future will be
>homogenous and all use EUC), and/or b) your compiler implements wchar_t
>as fixed width EUC and char as multi-byte EUC... which saves you the
>trouble of re-inventing the wheel.

You're mixing up issues. The choice of wchar_t is largely an internal
issue and doesn't affect data portability. Externally, programes use
multibyte character sets and I made no recommendation on the
representation of multibyte character sets. (My recommendation would
be to use the well established character set for the language and
system type, eg. EUC-JP on Japanese Unix machines, Shift-JIS on
Japanese Windows machines, and ISO 8859-1 on American Unix or Windows
machines. Intra-system data portability far outweighs inter-language
data portability.)

> Otherwise, converting between a particular EUC format for external
>use and internal use is just as tedious as implementing Unicode...

I don't know what you're trying to say here, either you're
ignoring my non-simultaneous requirement, or you don't realize
that converting between EUC multi-byte and EUC fixed width is
simple and algorithmic.

>... you may as well go the more portable route.

Given the limitted and highly varaible support for Unicode, I don't see
it as being more portable.

Michael Rubenstein

unread,

Jan 9, 1999, 3:00:00 AM1/9/99

to

On Fri, 8 Jan 1999 19:51:53 GMT, rri...@calum.csclub.uwaterloo.ca
(Ross Ridge) wrote:

>Adrian HAVILL <hav...@interq.ad.jp> wrote:
>>The Unicode Standard does not say that you must process using an
>>unsigned 16-bit type.
>

>The Unicode *specification* (it is *not* a standard) ...

Wrong answer. Try again.

I don't want to get into an argument as to what constitutes a
standard, but the authoritative source of information on Unicode is
the book

The Unicode Consortium; The Unicode Standard, Version 2.0;
Addison-Wesley; 1996; ISBN 0-2-1-48345-9

It is customary to refer to a book by its title -- shortening it to
"The Unicode Standard" is quite reasonable in informal discussion.

The quotation you gave to support your claim that an API that accepts
strings of 32-bit characters is not Unicode conformant is from section
5.1 of this book.
--
Michael M Rubenstein

Ross Ridge

unread,

Jan 12, 1999, 3:00:00 AM1/12/99

to

rri...@calum.csclub.uwaterloo.ca (Ross Ridge) wrote:
>The Unicode *specification* (it is *not* a standard) ...

Michael Rubenstein <mik...@ix.netcom.com> wrote:
>Wrong answer. Try again.
>
>I don't want to get into an argument as to what constitutes a

>standard...

Then don't say I'm wrong.

Michael Rubenstein

unread,

Jan 13, 1999, 3:00:00 AM1/13/99

to

On Tue, 12 Jan 1999 19:55:27 GMT, rri...@calum.csclub.uwaterloo.ca
(Ross Ridge) wrote:

>rri...@calum.csclub.uwaterloo.ca (Ross Ridge) wrote:
>>The Unicode *specification* (it is *not* a standard) ...
>
>Michael Rubenstein <mik...@ix.netcom.com> wrote:
>>Wrong answer. Try again.
>>
>>I don't want to get into an argument as to what constitutes a
>>standard...
>
>Then don't say I'm wrong.

Then don't be wrong. The name of the book is "The Unicode Standard,
Version 2.0." It is reasonable to refer to the book by that name.

Do you want to argue about the name of the book? Or whether one
should refer to it by its name? I will not argue about what
constitutes a standard. I will be glad to argue that one should use
the title of a book when citing it.
--
Michael M Rubenstein