
compression streambuf and wchar_t


John Dill

Feb 18, 2004, 7:06:16 AM
I am now looking at validating my compression streams with respect to
wchar_t characters, but I am completely new to dealing with wchar_t,
unicode, and the like. The main issue seems to be how to translate a
multi-byte character stream into a single character stream for
compression and decompression. Does anyone have experience with this
type of problem? Are there other issues that come up?

Also, can someone give me some basic tutorial links about how to use
wchar_t with unicode or whatever is represented by wchar_t? Maybe
some sample code for declaring strings in wchar_t that I can send to
my compression and decompression streams.

Thanks,
John

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

ka...@gabi-soft.fr

Feb 19, 2004, 8:17:45 PM
john...@uiowa.edu (John Dill) wrote in message
news:<302c79f4.04021...@posting.google.com>...

> I am now looking at validating my compression streams with respect to
> wchar_t characters, but I am completely new to dealing with wchar_t,
> unicode, and the like. The main issue seems to be how to translate a
> multi-byte character stream into a single character stream for
> compression and decompression. Does anyone have experience with this
> type of problem? Are there other issues that come up?

No experience, but...

C++ defines all IO in terms of bytes (char's); there is no such thing as
wide character IO. A wide character filebuf translates to and from bytes
using the imbued locale.

Logically, the translation should be a variant of the filtering
streambuf idiom -- a filtering wstreambuf which forwards to a
streambuf. It isn't, so you can't reuse the translation logic in
wfilebuf; you'll have to reimplement it (and the interface to
std::codecvt is anything but simple). The resulting chain should be:
translating_wstreambuf -> gzip_streambuf -> filebuf

Alternatively, although the gzip library works with bytes, the
underlying algorithms are defined over an "alphabet" -- you could simply
reimplement gzip for wide characters. My personal feeling, however, is
that this is not the way to go. But I've no experience with systems
where the physical type stored on disk was a wchar_t, and not a char;
perhaps under Windows, there would be some argument for implementing a
complete chain (filebuf, compression stream, etc.) using wchar_t
exclusively. In such a case, however, you will 1) have to implement a
new version of wfilebuf, since the standard version converts to bytes in
a locale-dependent manner (which will doubtless make your
compressed data unreadable), and 2) reimplement gzip. There may also be
a problem with memory consumption in the reimplementation of gzip -- to
be effective, you must be able to store strings in the alphabet, and the
number of strings you store must be considerably larger than the number
of characters in the alphabet. The usual 8 bit implementation will
store up to around 2^16 strings; to be effective using a 16 bit
alphabet, I would imagine that something between 2^20 and 2^24 strings
would be necessary.

> Also, can someone give me some basic tutorial links about how to use
> wchar_t with unicode or whatever is represented by wchar_t?

It's implementation defined:-). Even whether wchar_t is larger than
char, or what the code set is.

Seriously, on common platforms, there are two large variants: Windows
and AIX use UTF-16 multibyte encoding, even with wchar_t, and most of
the other Unixes (Solaris, HP-UX, and also Linux) use UCS-4. I'm not
sure what the default translation for external use is for the UTF-16
machines: I suspect that it is UTF-16LE for Windows, and UTF-16BE for
AIX. For the Unixes, it is invariably UTF-8. UTF-8 is also what is
used for transmission over the Internet, so presumably there is some
support for it under Windows and AIX as well.

For more information, I would highly recommend the Unicode site:
http://www.unicode.org/. You might want to start with the glossary (if
you aren't familiar with the terms I used above) -- the technical
reports are also worth reading. (Overall, the site is exceptionally
well done and informative.) There's also a FAQ:
http://www.cl.cam.ac.uk/~mgk25/unicode.html; it is mainly Unix-oriented,
but many of the issues it covers are independent of the machine.

> Maybe some sample code for declaring strings in wchar_t that I can
> send to my compression and decompression streams.

For starters, what's wrong with:

std::wstring s( L"Some text" ) ;

You can use universal character names to specify non-ASCII characters in
the string.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

Richard Smith

Feb 20, 2004, 11:05:26 PM
john...@uiowa.edu (John Dill) wrote in message news:<302c79f4.04021...@posting.google.com>...
> I am now looking at validating my compression streams with respect to
> wchar_t characters, but I am completely new to dealing with wchar_t,
> unicode, and the like.

If you're serious about writing portable code that properly handles
Unicode, forget about wchar_t. The current Standard leaves far too
many things about it unspecified. For example, how big is it? On
some implementations it is 16 bits long, on others it is 32; even 8
bits is legal (I think). Let's assume for the moment that all
implementations choose either 16-bit or 32-bit wchar_ts.

This means that if you put unicode characters in, you can't tell
whether you'll get a UCS-4 [aka UTF-32] or a UCS-2 string out. This
is particularly worrying: UCS-2 has been obsolete for some time
because it was realised that 65,536 characters didn't provide enough
space to fit all the characters they wanted in. UTF-16 [formerly
UCS-2e], the modern 16-bit encoding of Unicode, is no use here because
it can take two 16-bit "characters" (a surrogate pair) to encode a
single Unicode character.

Even if you were happy with UTF-16, wchar_t is still of little use in
portable code, as it's unlikely that you'd be happy using UCS-4 on some
platforms and UTF-16 on others.

Looking at the C++ Standards Committee web page, I see that there are
a couple of papers (N962 and N969) on this. Maybe the next standard
will manage to sort some of this mess out. Until then: avoid wchar_t
if you value portability.

--
Richard Smith

ka...@gabi-soft.fr

Feb 23, 2004, 1:51:47 PM
ric...@ex-parrot.com (Richard Smith) wrote in message
news:<1a0929fa.04022...@posting.google.com>...

> john...@uiowa.edu (John Dill) wrote in message
> news:<302c79f4.04021...@posting.google.com>...

> > I am now looking at validating my compression streams with respect
> > to wchar_t characters, but I am completely new to dealing with
> > wchar_t, unicode, and the like.

> If you're serious about writing portable code that properly handles
> Unicode, forget about wchar_t.

[...]


> Looking at the C++ Standards Committee web page, I see that there are
> a couple of papers (N962 and N969) on this. Maybe the next standard
> will manage to sort some of this mess out. Until then: avoid wchar_t
> if you value portability.

I understand your sentiments, and can sympathize with them, but what do
you propose as an alternative? I played around with defining my own
guaranteed 32 bit Unicode type, and instantiating std::basic_string,
etc. over it. All I can say is that it is an awful lot of work, and I'm
not sure quite what it buys you.

One thing I will say, concerning this issue (and internationalization in
general): as with everything else, start by defining a requirements
specification. What do you want to achieve? Deciding to use wchar_t
because it is the in thing, with the idea that simply by using wchar_t
your programs will be in some way more "international" or less parochial
will not achieve anything. As I pointed out in my response, there are
really only two variants to deal with for mainstream platforms, and for
many applications, the difference between the two isn't important.

One of the first things which needs specifying is what characters you
have to deal with. Many programs use a restricted vocabulary, and very
few need the linear B characters or music symbols. It may just be
possible that you can get by with just UCS-2, in which case, there is
no problem. Or you may not need to support things like operator[] in
your strings -- so the fact that UTF-16 involves multibyte characters
may not be a problem. Until you have exactly specified what you need to
achieve, however, it is difficult to say what the correct solution is.

Finally, be very wary of third party software. The standard string
class (regardless of its instantiation) is locale-free, which means that
any third party software which doesn't accept locale parameters will
probably not be able to handle anything other than pure ASCII. Even in
the best of cases (e.g. something like Boost's regular expressions), I'd
feel more comfortable if there was an explicit statement that they
handle multi-byte encodings (like UTF-8 or UTF-16) correctly.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Ben Hutchings

Feb 24, 2004, 2:33:20 PM
ka...@gabi-soft.fr wrote:
<snip>
> Finally, be very wary of third party software. The standard string
> class (regardless of its instantiation) is locale-free, which means that
> any third party software which doesn't accept locale parameters will
> probably not be able to handle anything other than pure ASCII. Even in
> the best of cases (e.g. something like Boost's regular expressions), I'd
> feel more comfortable if there was an explicit statement that they
> handle multi-byte encodings (like UTF-8 or UTF-16) correctly.

Regex++ deals only with complete characters, so if you wish to use it
with UTF-8 or UTF-16 strings you have to define iterators for it that
read and write complete characters. In Boost 1.30 you also have to
modify reg_expression::assign() because it assumes random-access
iterators (this is a bug). I think this is fixed in Regex++ version 4
which is in Boost 1.31.

Richard Smith

Feb 25, 2004, 6:51:20 PM
James Kanze wrote:
> ric...@ex-parrot.com (Richard Smith) wrote in message
> > avoid wchar_t if you value portability.
>
> I understand your sentiments, and can sympathize with them, but what do
> you propose as an alternative? I played around with defining my own
> guaranteed 32 bit Unicode type, and instantiating std::basic_string,
> etc. over it. All I can say is that it is an awful lot of work, and I'm
> not sure quite what it buys you.

If you need to support code on several implementations, some of which
have 16-bit wchar_ts and some 32-bit wchar_ts, it can buy you quite a
lot. I'm sure that there are situations where you can happily live with
one on some platforms and the other on others, but once you throw
third-party libraries that need a known type of string into the mix,
life can get difficult.

You're quite right when you say that implementing a 32-bit character
type to instantiate std::basic_string over, and then specialising
std::char_traits for it, is a lot of work. In particular, the
requirement in 21/1 that the character type be POD is itself
problematic.


> One of the first things which needs specifying is what characters you
> have to deal with. Many programs use a restricted vocabulary, and very
> few need the linear B characters or music symbols.

Indeed so, however my current primary concern is with the CJK
Extension B area. I don't speak any of the East Asian languages in
question so I don't really know what these characters are, though I
presume they are not in common usage. However, I could imagine (and
would welcome correction if this is not the case) that these
characters might occasionally be encountered in historical quotations
or perhaps proper names. Clearly in some (many?) applications this
isn't relevant; however, in others this might be considered an
unnecessary limitation.

> It may just be
> possible that you can get by with just UCS-2, in which case, there is
> no problem. Or you may not need to support things like operator[] in
> your strings -- so the fact that UTF-16 involves multibyte characters
> may not be a problem. Until you have exactly specified what you need to
> achieve, however, it is difficult to say what the correct solution is.

If I were happy to use UTF-16 and handle the multi-"byte" characters,
I would quite probably be equally happy to use UTF-8. In fact, this
is my usual solution. I've have a generic string class that can
handle multi-byte characters, and I use this with UTF-8. You don't
get random access, but you do get iterators whose value_type is the
logical character type (so a 32-bit quantity for UTF-8). And you get
a valid UTF-8 string when you call .c_str().

ka...@gabi-soft.fr

Feb 25, 2004, 7:01:02 PM
Ben Hutchings <do-not-s...@bwsint.com> wrote in message
news:<slrnc3n4km.p3b....@shadbolt.i.decadentplace.org.uk>...

> ka...@gabi-soft.fr wrote:
> <snip>
> > Finally, be very wary of third party software. The standard string
> > class (regardless of its instantiation) is locale-free, which means that
> > any third party software which doesn't accept locale parameters will
> > probably not be able to handle anything other than pure ASCII. Even in
> > the best of cases (e.g. something like Boost's regular expressions), I'd
> > feel more comfortable if there was an explicit statement that they
> > handle multi-byte encodings (like UTF-8 or UTF-16) correctly.

> Regex++ deals only with complete characters, so if you wish to use it
> with UTF-8 or UTF-16 strings you have to define iterators for it that
> read and write complete characters.

I presume then that it also supposes that the character encoding is the
same in the regular expression definition, the text it is comparing or
searching, and the locale it has been passed.

If I recall correctly, Boost's regular expressions require
bi-directional iterators for the text to be searched. This makes a
mapping iterator somewhat more difficult, although still quite doable
for UTF-8 or UTF-16; there are multibyte encodings, however, where it
will simply not be possible (except maybe by storing a starting point and
an offset, and rescanning from the start each time you do --).

Is there any reason *why* bi-directional iterators are necessary? My
regular expression class (admittedly a lot simpler) builds a DFA
(lazily), and only scans forward. I think that handling (..) as Boost
does (saving the subsequence) does require multiple scans in certain
cases, but I would think that it could still be done with forward
iterators, saving where one is, and coming back to there if necessary.
What have I missed?

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Ben Hutchings

Feb 27, 2004, 1:07:32 PM
ka...@gabi-soft.fr wrote:
> Ben Hutchings <do-not-s...@bwsint.com> wrote in message
> news:<slrnc3n4km.p3b....@shadbolt.i.decadentplace.org.uk>...
>> ka...@gabi-soft.fr wrote:
>> <snip>
>> > Finally, be very wary of third party software. The standard string
>> > class (regardless of its instantiation) is locale-free, which means that
>> > any third party software which doesn't accept locale parameters will
>> > probably not be able to handle anything other than pure ASCII. Even in
>> > the best of cases (e.g. something like Boost's regular expressions), I'd
>> > feel more comfortable if there was an explicit statement that they
>> > handle multi-byte encodings (like UTF-8 or UTF-16) correctly.
>
>> Regex++ deals only with complete characters, so if you wish to use it
>> with UTF-8 or UTF-16 strings you have to define iterators for it that
>> read and write complete characters.
>
> I presume then that it also supposes that the character encoding is the
> same in the regular expression definition, the text it is comparing or
> searching, and the locale it has been passed.

The character encodings can be different but the character set
must be the same.

> If I recall correctly, Boost's regular expressions require
> bi-directional iterators for the text to be searched. This makes a
> mapping iterator somewhat more difficult, although still quite doable
> for UTF-8 or UTF-16; there are multibyte encodings, however, where it
> will simply not be possible (except maybe by storing a starting point and
> an offset, and rescanning from the start each time you do --).

Yes, that's correct, except that it does not require writability and
unary operator* need not return a reference. In terms of the proposed
new iterator requirements, the iterators must be bidirectional traversal
iterators and readable iterators.

> Is there any reason *why* bi-directional iterators are necessary.
> My regular expression class (admittedly a lot simpler) builds a DFA
> (lazily), and only scans forward. I think that handling (..) as
> Boost does (saving the subsequence) does require multiple scans in
> certain cases,

As does any expression that may involve backtracking over multiple
characters, surely.

> but I would think that it could still be done with forward
> iterators, saving where one is, and coming back to there if
> necessary.

Exactly.

> What have I missed?

Look-behind assertions require backwards traversal, but Regex++ isn't
documented as supporting them.

ka...@gabi-soft.fr

Feb 27, 2004, 1:07:55 PM
ric...@ex-parrot.com (Richard Smith) wrote in message
news:<1a0929fa.04022...@posting.google.com>...

> James Kanze wrote:
> > ric...@ex-parrot.com (Richard Smith) wrote in message
> > > avoid wchar_t if you value portability.

> > I understand your sentiments, and can sympathize with them, but what
> > do you propose as an alternative? I played around with defining my
> > own guaranteed 32 bit Unicode type, and instantiating
> > std::basic_string, etc. over it. All I can say is that it is an
> > awful lot of work, and I'm not sure quite what it buys you.

> If you need to support code on several implementations, some of which
> have 16-bit wchar_ts and some 32-bit wchar_ts, it can buy you quite a
> lot. I'm sure that there are situations when you can happily live one
> on some platforms and the other on others, but once you throw
> third-party libraries that need a known type of string into the mix,
> life can get difficult.

> You're quite right when you say that implementing a 32-bit character
> type to instantiate std::basic_string over, and then specialising
> std::char_traits for it, is a lot of work. In particular, the
> requirement in 21/1 that the character type be POD is itself
> problematic.

That's not so difficult in itself: "typedef uint_32 UnicodeChar ;". Of
course, uint_32 isn't portable, but it shouldn't be too difficult to
manage just a single typedef:-). The real problems come from the fact
that to be useful, you also need to specialize a certain number of
templates in the standard library, and you are not allowed (formally, at
least) to specialize a standard template over a type that you didn't
define. So the typedef solution doesn't work. (I'm not sure if I'm
actually correct here. My impression of the way things like operator<<
and such work is that you need to specialize std::numpunct. But I'll
admit that I find most of <locale> completely incomprehensible, and
could easily be mistaken.)

> > One of the first things which needs specifying is what characters
> > you have to deal with. Many programs use a restricted vocabulary,
> > and very few need the linear B characters or music symbols.

> Indeed so, however my current primary concern is with the CJK
> Extension B area. I don't speak any of the East Asian languages in
> question so I don't really know what these characters are, though I
> presume they are not in common usage. However, I could imagine (and
> would welcome correction if this is not the case) that these
> characters might occasionally be encountered in historical quotations
> or perhaps proper names. Clearly in some (many?) applications this
> isn't relevant; however, in others this might be considered an
> unnecessary limitation.

I don't speak any of the CJK languages either, but presumably, if they
weren't in the older versions of Unicode, they aren't that frequent:-).
This is what I meant when I said a "restricted vocabulary": if your text
only consists of a fixed number of predefined messages, you can have
them translated, and know in advance whether you need the extension B
area. If your text deals with user-provided input, of course, or if you
are concerned about quality typeset output, you should probably consider
that you do need it, until someone can prove the contrary.

Note too that the term "internationalization" is a misnomer. A lot of
the software written here (a French bank) has to conform to French
law with regards to bookkeeping practices and fiscal issues. And
software written to respect French tax laws is NOT international --
there is no way, short of a complete rewrite, that it can be used
anywhere else. We still need <locale> (or <locale.h>, as it still is
with the compilers we use), we may even choose to use wchar_t, for the
occasional foreign character -- a customer might be Czech, and have an
r with háček in his name, for example. But it will always be Latin script;
Arabic and Japanese customer names will be transliterated. As far as we
are concerned, the fact that wchar_t is 16 bits on our clients, and 32
bits on our servers, isn't a problem. (Transmission between the two,
obviously, will take place over the network. Using a protocol, which
will doubtlessly use UTF-8, like every other protocol.)

My last project was international -- at least in theory, anyone in the
world could log in, and use anything they could get out of their
keyboard as a password, which we had to verify. The protocol (RADIUS)
used UTF-8, so the Extension B area was certainly a possibility. On the
other hand, all we had to do was encrypt the byte stream, and compare
the results for equality with a previously encrypted byte stream.
Totally international, but not a single instance of <locale>, and we
just left the characters in UTF-8, in char's internally.

As I said: first define clearly what you need, then implement it.
Generic solutions are fine up front for things like vector, where you
know in advance that they will be needed, and you know pretty well what
the needed functionality will be. (You have to be fully perverted by
Java to think that a vector doesn't need a [] operator:-).) In the case
of internationalization, however, I don't think anyone really knows what
is needed generically. We're all just stabbing in the dark a bit.

> > It may just be possible that you can get by with just UCS-2, in
> > which case, there is no problem. Or you may not need to support
> > things like operator[] in your strings -- so the fact that UTF-16
> > involves multibyte characters may not be a problem. Until you have
> > exactly specified what you need to achieve, however, it is difficult
> > to say what the correct solution is.

> If I were happy to use UTF-16 and handle the multi-"byte" characters,
> I would quite probably be equally happy to use UTF-8.

It depends. First, as I mention above, there are some cases where no 8
bit code is sufficient, but where you can be sure that your UTF-16 will
always be single code units (one 16-bit unit per character). Secondly,
when you look at a single unit in UTF-16, you know immediately where
you are: single-unit character, first surrogate, or second surrogate.
With UTF-8, you might have
to look around a bit. But I'll admit that the difference isn't that
great, and except in cases where you have a restricted set of text, and
you can consider the presence of a surrogate as an error, UTF-8 probably
isn't much more work. Of the three possibilities: 8 bit UTF-8, 16 bit
UTF-16, and 32 bit UCS-4, UTF-16 is by far the least useful. Unless, of
course, that is what your target system uses -- you rarely win fighting
the system.

> In fact, this is my usual solution. I have a generic string class
> that can handle multi-byte characters, and I use this with UTF-8. You
> don't get random access, but you do get iterators whose value_type is
> the logical character type (so a 32-bit quantity for UTF-8). And you
> get a valid UTF-8 string when you call .c_str().

Are the iterators bi-directional, or forward? Forward iterators with
UTF-8 are trivial. Bi-directional are a bit more work -- not that much,
but a bit.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Eugene Gershnik

Feb 27, 2004, 11:21:44 PM
Richard Smith wrote:
> john...@uiowa.edu (John Dill) wrote in message
> news:<302c79f4.04021...@posting.google.com>...
>> I am now looking at validating my compression streams with respect to
>> wchar_t characters, but I am completely new to dealing with wchar_t,
>> unicode, and the like.
>
> If you're serious about writing portable code that properly handles
> Unicode, forget about wchar_t. The current Standard leaves far too
> many things about it unspecified. For example, how big is it? On
> some implementations it is 16 bits long, on others it is 32; even 8
> bits is legal (I think). Let's assume for the moment that all
> implementations choose either 16-bit or 32-bit wchar_ts.
>
> This means that if you put unicode characters in, you can't tell
> whether you'll get a UCS-4 [aka UTF-32] or a UCS-2 string out.

AFAIK wchar_t doesn't have to be in Unicode at all. On some platforms
(notably Solaris) it has locale-dependent encoding. The funny thing is that
apart from poorly supported __STDC_ISO_10646__ there just isn't any good way
to know what the actual encoding is.

Eugene

P.J. Plauger

Feb 28, 2004, 7:24:32 AM
"Eugene Gershnik" <gers...@nospam.hotmail.com> wrote in message
news:_e6dnZCet_w...@speakeasy.net...

> Richard Smith wrote:
> > john...@uiowa.edu (John Dill) wrote in message
> > news:<302c79f4.04021...@posting.google.com>...
> >> I am now looking at validating my compression streams with respect to
> >> wchar_t characters, but I am completely new to dealing with wchar_t,
> >> unicode, and the like.
> >
> > If you're serious about writing portable code that properly handles
> > Unicode, forget about wchar_t. The current Standard leaves far too
> > many things about it unspecified. For example, how big is it? On
> > some implementations it is 16 bits long, on others it is 32; even 8
> > bits is legal (I think). Let's assume for the moment that all
> > implementations choose either 16-bit or 32-bit wchar_ts.
> >
> > This means that if you put unicode characters in, you can't tell
> > whether you'll get a UCS-4 [aka UTF-32] or a UCS-2 string out.
>
> AFAIK wchar_t doesn't have to be in Unicode at all. On some platforms
> (notably Solaris) it has locale-dependent encoding. The funny thing is
> that apart from poorly supported __STDC_ISO_10646__ there just isn't
> any good way to know what the actual encoding is.

How do you determine the encoding used by string literals, printf, or
scanf? If you put Ebcdic characters in, you can't tell whether you'll
get an ISO-646 [aka ASCII] or a RAD50 string out. And yet, programming
languages have survived for half a century with little or no machinery
for detecting the execution character set or changing it at the whim
of the programmer.

The real problem with wide characters is that C89 *suggested* the
possibility that the behavior of mbtowc/wctomb might change with
a change in locale. Left unanswered is the effect of such a change
on the validity of wide string literals such as L"abc", or a host
of other matters. For more on the delicate business of changing
wide-character encodings on the fly, see our on-line CoreX manual,
in particular:

http://www.dinkumware.com/manuals/reader.aspx?b=cx/&h=multibyte.html

The C committee introduced wchar_t and its minimal set of conversion
functions as a way for programs to *portably* manipulate large
character sets. But this is a different kind of portability than
"writing portable programs that properly handle Unicode" -- it
is scrupulously wide character-set neutral, just as C has been
byte character-set neutral for decades.

If you really want to manipulate Unicode, in all its variant
forms, you need something like our CoreX library. And you indeed
must be wary of using wchar_t to hold your wide characters, since
it is not necessarily large enough. Our CoreX conversions will
work with a variety of integer types, so you can indeed write
very portable code that works with Unicode.

On a loosely related note, the C committee has recently approved
TR 19769, which adds the type definitions char16_t and char32_t.
It also adds to the language string literals of the form u"abc"
and U"abc", along with library functions for converting between
these wide forms and multibyte sequences. It is anticipated, but
not required, that vendors will use these to supply UTF-16 and
UCS-4 support.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Eugene Gershnik

Feb 28, 2004, 11:00:30 PM
P.J. Plauger wrote:
>> "Eugene Gershnik" <gers...@nospam.hotmail.com> wrote in message
>> news:_e6dnZCet_w...@speakeasy.net...
>>
>> AFAIK wchar_t doesn't have to be in Unicode at all. On some platforms
>> (notably Solaris) it has locale-dependent encoding.
>> The funny thing is that
>> apart from poorly supported __STDC_ISO_10646__ there just isn't
>> any good way
>> to know what the actual encoding is.
>
> How do you determine the encoding used by string literals, printf,
> or scanf? If you put Ebcdic characters in, you can't tell whether
> you'll
> get an ISO-646 [aka ASCII] or a RAD50 string out. And yet,
> programming languages have survived for half a century with little
> or no machinery
> for detecting the execution character set or changing it at the whim
> of the programmer.

I am talking about a different kind of problem. Suppose I get a UTF-8 string
from the network. The fact that it is UTF-8 is specified in some standard
and is entirely outside the scope of C or C++. Now I need to convert this
string to some fixed width representation (to do regular expression searches
for example). The first urge is to use wchar_t but it is not going to work.
First the conversion could be lossy (on Solaris), second the result may not
be fixed width (on Windows or AIX) and third how am I supposed to perform
the conversion? On Unix there is the iconv facility but I need to know the
encoding used for wchar_t. With narrow chars there is at least
nl_langinfo(CODESET) but there is nothing similar for the wide characters.
So to work with a fixed width representation one needs to abandon wchar_t
and define some proprietary utf32_char_t. This isn't a panacea either
because doing so means that you lose the wcsxxx functions optimized for
each platform, character classification, and other goodies.
The end result of the above and other problems mentioned in the thread
"troubled by std::wifstream::open(const char*)" is that on Unix wchar_t is
almost useless while on Windows it does the job of char and should be used
instead. As for a portable fixed-width character type there just isn't any.


> The real problem with wide characters is that C89 *suggested* the
> possibility that the behavior of mbtowc/wctomb might change with
> a change in locale. Left unanswered is the effect of such a change
> on the validity of wide string literals such as L"abc", or a host
> of other matters. For more on the delicate business of changing
> wide-character encodings on the fly, see our on-line CoreX manual,
> in particular:
>
> http://www.dinkumware.com/manuals/reader.aspx?b=cx/&h=multibyte.html
>
> The C committee introduced wchar_t and its minimal set of conversion
> functions as a way for programs to *portably* manipulate large
> character sets. But this is a different kind of portability than
> "writing portable programs that properly handle Unicode" -- it
> is scrupulously wide character-set neutral, just as C has been
> byte character-set neutral for decades.

Very true. However, wouldn't you agree that both kinds of portability are
needed? The C standard had already bowed to the convergence of the hardware
architectures by standardizing uint32_t and such. Isn't this a good time to
take notice of convergence of character sets?

> If you really want to manipulate Unicode, in all its variant
> forms, you need something like our CoreX library. And you indeed
> must be wary of using wchar_t to hold your wide characters, since
> it is not necessarily large enough. Our CoreX conversions will
> work with a variety of integer types, so you can indeed write
> very portable code that works with Unicode.

Yes today the only choice is to use a 3rd party library. Explaining this to
the management is almost impossible however. Isn't it the common knowledge
that all you need is wchar_t? :-(

> On a loosely related note, the C committee has recently approved
> TR 19769, which adds the type definitions char16_t and char32_t.
> It also adds to the language string literals of the form u"abc"
> and U"abc", along with library functions for converting between
> these wide forms and multibyte sequences. It is anticipated, but
> not required, that vendors will use these to supply UTF-16 and
> UCS-4 support.

Why is it not required??? IMHO, if just one of Sun/Microsoft/whatever
decides that they don't want these types to be UCS-2 and UCS-4, they
will render them useless for everybody.

--
Eugene
