deriving from std::moneypunct facet

Old Wolf

Apr 27, 2004, 6:40:17 AM

I am attempting to derive a facet from std::moneypunct that uses different
characters for the separators and so on. I have posted the code below.
I am expecting it to display "1c23c45c67p89". But instead, my
linux/gcc 3.3.1 system displays "123456789" regardless of locale, and my
winnt/bcc 5.5.1 system displays "1,234,567.89" if the locale is "C",
and segfaults in the constructor of std::moneypunct<> if the locale
is anything else. The segfault still occurs if I comment out the
virtual methods in my class.

My compiler documentation included an example of doing the same thing
for "numpunct", which worked correctly on both my systems for locale de_DE
(the code for that is posted below my non-working code, for comparison). I
have tried to copy this working example as closely as possible.

Some other questions:
- is there a book that teaches locales and facets well?
(so far I'm just learning from the compiler documentation)
- what does the Intl template parameter on moneypunct signify exactly?
- how can I find out what locale names (eg. "de_DE") are supported
on my system?

#include <iostream>
#include <exception>
#include <string>
#include <locale>

template<typename charT, bool Intl = false>
class change_sep : public std::moneypunct_byname<charT, Intl>
{
public:
    explicit change_sep(const char *name, size_t refs = 0)
        : std::moneypunct_byname<charT, Intl>(name, refs) {}
protected:
    virtual charT do_thousands_sep() const { return 'c'; }
    virtual charT do_decimal_point() const { return 'p'; }
    virtual std::string do_grouping() const { return "\2"; }
};

template<typename NumType>
std::ostream &price_put(std::ostream &os, NumType num)
{
    typedef std::money_put<char> facet_t;
    const facet_t &fac = std::use_facet<facet_t>(os.getloc());
    fac.put(std::ostreambuf_iterator<char>(os), true, os, os.fill(), num);
    return os;
}

int main()
{
    try {
        std::locale loc(std::locale("de_DE"),
                        new change_sep<char, false>("de_DE"));
        std::cout.imbue(loc);
        price_put(std::cout, 123456789);
    }
    catch (std::exception &e)
    {
        std::cout << "Error: " << e.what() << std::endl;
    }
    return 0;
}


Here is the code example from my compiler documentation that works OK:

#include <iostream>
#include <string>
#include <locale>
using namespace std;

template <class charT>
class change_bool_names : public numpunct_byname<charT>
{
public:
    typedef basic_string<charT> string_type;
    explicit change_bool_names (const char* name,
                                const charT* t, const charT* f, size_t refs=0)
        : numpunct_byname<charT> (name,refs),
          true_string(t), false_string(f) { }
protected:
    string_type do_truename () const { return true_string; }
    string_type do_falsename () const { return false_string; }
private:
    string_type true_string, false_string;
};

int main(int argc, char **)
{
    locale loc(locale("de_DE"),
               new change_bool_names<char>("de_DE","Ja.","Nein."));
    cout.imbue(loc);
    cout << "Argumente vorhanden? "
         << boolalpha << (argc > 1) << endl;
}

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Paolo Carlini

Apr 28, 2004, 5:15:51 AM

Hi!

Old Wolf wrote:

> - is there a book that teaches locales and facets well?
> (so far I'm just learning from the compiler documentation)

Langer & Kreft, "Standard C++ IOStreams and Locales" is pretty good.

> - what does the Intl template parameter on moneypunct signify exactly?

international currency symbol or domestic currency symbol.

> - how can I find out what locale names (eg. "de_DE") are supported
> on my system?

localedef --list-archive for glibc.

> fac.put(std::ostreambuf_iterator<char>(os), true, os, os.fill(), num);

^^^^

Just change it consistently to /false/ and it works! Rather nice
example, by the way!

Paolo.

Old Wolf

Apr 29, 2004, 7:38:48 AM

Paolo Carlini <pcar...@suse.de> wrote:
>
> > - how can I find out what locale names (eg. "de_DE") are supported
> > on my system?
>
> localedef --list-archive for glibc.

[OT] Is there anything vaguely portable for this? or some system
calls for common operating systems?

> > fac.put(std::ostreambuf_iterator<char>(os), true, os, os.fill(), num);
> ^^^^
>
> Just change it consistently to /false/ and it works! Rather nice
> example, by the way!

Thanks - I understand now: there are two different facets,
moneypunct<charT, false> and moneypunct<charT, true>. Since I imbued
a moneypunct<charT, false>, I have to pass 'false' as the parameter if I
want to invoke that facet.
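
For example, here is a minimal sketch of the corrected call; the second
argument now matches the Intl parameter of the change_sep<char, false>
facet that was imbued:

#include <iostream>
#include <iterator>
#include <locale>

int main()
{
    typedef std::money_put<char> facet_t;
    std::ostream &os = std::cout;
    const facet_t &fac = std::use_facet<facet_t>(os.getloc());
    // false selects moneypunct<char, false>, the specialization that
    // change_sep<char, false> replaces in the imbued locale
    fac.put(std::ostreambuf_iterator<char>(os), false, os, os.fill(),
            123456789.0);
    os << '\n';
    return 0;
}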

I still get my segfault in Windows though, so I suppose that is a
bug in my library implementation (it could at least throw an exception
about an invalid locale).

ka...@gabi-soft.fr

Apr 29, 2004, 5:59:29 PM

Paolo Carlini <pcar...@suse.de> wrote in message
news:<408E512C...@suse.de>...

> Old Wolf wrote:

> > - is there a book that teaches locales and facets well?
> > (so far I'm just learning from the compiler documentation)

> Langer & Kreft, "Standard C++ IOStreams and Locales" is pretty good.

> > - what does the Intl template parameter on moneypunct signify
> > exactly?

> international currency symbol or domestic currency symbol.

Where does it say that in the standard? About the closest I could find
is a non-normative footnote which says that "for international
instantiations (second template parameter true) this is always four
characters long, usually three letters and a space", talking about the
return value of do_curr_symbol(). I've probably missed it, but I can
find no normative text whatsoever concerning the meaning of the second
template parameter. It is named International, which is suggestive, but
I'm not sure of what.

In practice, of course, two cases can occur. If I'm working in a closed
environment, with a single currency, then I have no problem (and
presumably, the facet which interests me is the one with International ==
false). As soon as the possibility of multiple currencies raises its
head, however, I need a type which contains not just the amount, but also
the currency. And part of the facet becomes pretty irrelevant with
regards to the formatting, since the actual format will depend on the
actual currency -- part of the "value" of what is being formatted: the
currency symbol, obviously, but also the number of fractional digits.

> > - how can I find out what locale names (eg. "de_DE") are supported
> > on my system?

> localedef --list-archive for glibc.

It's very system dependent. Posix defines a standard format for naming,
<language_code>_<country_code>.<encoding_name>. But even Posix
compliant systems tend to support a lot of "traditional" names not in
this format, Posix allows for defaults (so that "de" might mean the same
thing as "de_DE.iso_8859_1"), and of course, knowing the format doesn't
tell you whether it is actually available on a given machine. (On my
Posix compliant Solaris machine, the only locales I have available are
"C", "POSIX", "common", "en_US.UTF-8", and "iso_8859_1"; someone removed
the others to make more space on the disk. But it does give you an idea
concerning the usefulness of the standard naming format :-).)

I think that it is usual on Unix machines for the locales to be placed
in a directory called, somewhat strangely, "locale", somewhere under
/usr. I've seen /usr/lib/locale and /usr/share/locale -- the latter is
somewhat surprising, and the locale specific directories contain shared
objects, which are not sharable in the sense used in the directory name,
i.e. between machines running different hardware. Anyway, a little work
with find should do the trick. But I wouldn't like to have to do it
from within a program.

I'm less familiar with Windows. Locale names there tend to correspond
with usual English use: "French", "German", etc. I'm not sure how this
works in practice -- a quick check showed me that in the "French"
locale, the decimal character was '.', which may be true in Quebec, but
is certainly not the usual use in France. So how would you specify the
equivalent of "ch_DE.utf-8" -- Swiss German encoded in UTF-8?

The documentation for Windows says that there are some 100 locales, and
that all of them are always installed. I've been unable to find a
complete list, but I would imagine that it is in the documentation
somewhere.

All in all, we're still not at the point where you can write portable
internationalized code.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Michael Karcher

Apr 30, 2004, 7:22:51 AM

ka...@gabi-soft.fr wrote:
>> international currency symbol or domestic currency symbol.
> Where does it say that in the standard? About the closest I could find
> is a non-normative footnote which says that "for international
> instantiations (second template parameter true) this is always four
> characters long, usually three letters and a space",

It's all about "USD " vs. "$". Or, as we had in Germany before the euro
introduction, "DEM " (Intl) vs. "DM" (domestic). Now it is "EUR " (Intl)
vs. the euro symbol, if supported in the character set.
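
A small sketch showing the difference (this assumes a "de_DE" locale is
installed; the exact symbols are implementation-defined):

#include <iostream>
#include <locale>

int main()
{
    std::locale loc("de_DE");
    // Intl == true: the international symbol, e.g. "EUR "
    std::cout << std::use_facet<std::moneypunct<char, true> >(loc).curr_symbol()
              << '\n';
    // Intl == false: the domestic symbol, e.g. the euro sign if available
    std::cout << std::use_facet<std::moneypunct<char, false> >(loc).curr_symbol()
              << '\n';
    return 0;
}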

Michael Karcher

Ben Hutchings

Apr 30, 2004, 11:23:54 PM

ka...@gabi-soft.fr wrote:
<snip>
> I'm less familiar with Windows. Locale names there tend to correspond
> with usual English use: "French", "German", etc. I'm not sure how this
> works in practice -- a quick check showed me that in the "French"
> locale, the decimal character was '.', which may be true in Quebec, but
> is certainly not the usual use in France. So how would you specify the
> equivalent of "ch_DE.utf-8" -- Swiss German encoded in UTF-8.

Internally Windows normally uses numeric locale IDs assigned by
Microsoft, though they do also have names.

The VC++ 7.1 documentation says you can use something similar to the
POSIX format:
lang ["_" country/region ["." code-page]]
So I suppose you would use "German_Switzerland.65001", though UTF-8
doesn't seem to be as fully supported in Windows as the older code
pages that use a maximum of 2 bytes per character.

I have a sneaking suspicion that the country and language names may
themselves be localised according to the system locale, though.

> The documentation for Windows says that there are some 100 locales, and
> that all of them are always installed.

<snip>

This is incorrect. Each version of Windows should recognise all the
locale IDs that were assigned at the time it was built, but the data
for those locales are only selectively installed. The same goes for
code pages and their IDs. There are some Win32 functions that allow
you to enumerate recognised or installed locales, code pages etc.
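
For example, a rough sketch using one of those functions
(EnumSystemLocalesA with LCID_INSTALLED; note that it reports locale IDs
as hex strings, not readable names):

#include <windows.h>
#include <iostream>

BOOL CALLBACK print_locale(LPSTR lcid_string)
{
    std::cout << lcid_string << '\n';
    return TRUE;   // keep enumerating
}

int main()
{
    EnumSystemLocalesA(print_locale, LCID_INSTALLED);
    return 0;
}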

P.J. Plauger

May 3, 2004, 5:23:38 AM

"Eugene Gershnik" <gers...@nospam.hotmail.com> wrote in message
news:Fu6dnQ-xUfi...@speakeasy.net...

> Ben Hutchings wrote:
> > ka...@gabi-soft.fr wrote:
> > <snip>


> >> So
> >> how would you specify the equivalent of "ch_DE.utf-8" -- Swiss
> >> German encoded in UTF-8.
> >
> > Internally Windows normally uses numeric locale IDs assigned by
> > Microsoft, though they do also have names.
> >
> > The VC++ 7.1 documentation says you can use something similar to the
> > POSIX format:
> > lang ["_" country/region ["." code-page]]
> > So I suppose you would use "German_Switzerland.65001", though UTF-8
> > doesn't seem to be as fully supported in Windows as the older code
> > pages that use a maximum of 2 bytes per character.

> This wouldn't work for a variety of reasons. One is that Microsoft's
> standard library cannot handle more than 2-byte multibyte encodings (at least
> according to asserts in its code). Another is that Windows itself doesn't
> allow arbitrary combinations of languages and codepages.
> The original question about ch_DE.utf-8 is meaningless on Windows. It
> doesn't have anything like Unix UTF-8 locales nor does it need them.

This is getting murkier by the minute.

1) A "code page" essentially defines a 256-byte character set.
You can treat that set of single-byte codes as a multibyte
encoding for a (very small) subset of Unicode/ISO-10646.
IIRC, the Swiss German code page to Unicode is one of the
conversions we also provide with our CoreX library.

2) UTF-8 is yet another multibyte encoding. It differs from
a code page in that it can represent *all* Unicode characters.
It takes up to three bytes to represent a character from
the 16-bit subset (aka UCS-2), up to six bytes to represent
all possible values that can be represented in 32 bits (roughly
aka UCS-4) -- somewhere in between for the current "maximum
number of characters that will ever be defined" (aka UTF-16).
We of course provide all these variant conversions in our
CoreX library.

3) Conversions are defined between a multibyte encoding and
a wide-character encoding. The latter often is some flavor
of Unicode, but it doesn't have to be.

So if you want the UTF-8 equivalent of the Swiss German
256-character encoding, you first convert it to 16-bit
Unicode and then convert that to UTF-8.
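
As a rough illustration of that second step, here is one code point of
the 16-bit subset encoded as UTF-8 by hand, in one to three bytes (the
code-page-to-Unicode step would come from a conversion table and is not
shown):

#include <string>

std::string ucs2_to_utf8(unsigned short c)
{
    std::string out;
    if (c < 0x80) {                     // ASCII: one byte
        out += static_cast<char>(c);
    } else if (c < 0x800) {             // two bytes
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {                            // three bytes (surrogates ignored)
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
    return out;
}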

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

ka...@gabi-soft.fr

May 3, 2004, 11:09:30 AM

Michael...@writeme.com (Michael Karcher) wrote in message
news:<c6svsa$g1m20$1...@uni-berlin.de>...

> ka...@gabi-soft.fr wrote:
> >> international currency symbol or domestic currency symbol.
> > Where does it say that in the standard? About the closest I could
> > find is a non-normative footnote which says that "for international
> > instantiations (second template parameter true) this is always four
> > characters long, usually three letters and a space",

> It's all about "USD " vs. "$". Or, as we had in Germany before the euro
> introduction, "DEM " (Intl) vs. "DM" (domestic). Now it is "EUR " (Intl)
> vs. the euro symbol, if supported in the character set.

But where does it say this? The non-normative footnote sort of hints
at it, but only vaguely, and I can find nothing else.

(The other issue I'm wondering about is having the currency symbol
determined by the locale in an international environment, since I would
imagine that you are likely to be dealing with several different
currencies. For that matter, even without being international -- in
France today, it is frequent to display monetary values both in Euros
and in Francs.)

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

ka...@gabi-soft.fr

May 3, 2004, 11:17:42 AM

"Eugene Gershnik" <gers...@nospam.hotmail.com> wrote in message
news:<Fu6dnQ-xUfi...@speakeasy.net>...
> Ben Hutchings wrote:
> > ka...@gabi-soft.fr wrote:
> > <snip>
> >> So
> >> how would you specify the equivalent of "ch_DE.utf-8" -- Swiss
> >> German encoded in UTF-8.

> > Internally Windows normally uses numeric locale IDs assigned by
> > Microsoft, though they do also have names.

> > The VC++ 7.1 documentation says you can use something similar to the
> > POSIX format:
> > lang ["_" country/region ["." code-page]]
> > So I suppose you would use "German_Switzerland.65001", though UTF-8
> > doesn't seem to be as fully supported in Windows as the older code
> > pages that use a maximum of 2 bytes per character.

The Unix format uses the ISO 639 codes, in lower case, for the language,
and the ISO 3166 two-letter codes, in upper case, for the country. (And
German-speaking Switzerland should have been "de_CH", not "ch_DE".)
As far as I know, the encoding names are ad hoc.

> This wouldn't work for a variety of reasons. One is that Microsoft's
> standard library cannot handle more than 2-byte multibyte encodings
> (at least according to asserts in its code).

It was just meant as an example -- it isn't reasonable to expect any
machine to support every possible locale. (On the other hand, any
machine connected to the Internet really should be able to support
UTF-8, since that is pretty much the standard international encoding
used.)

And of course, if the encoding in the locale doesn't correspond to that
of the font being used for display, what you see won't be what the
program thinks it is displaying.

> Another is that Windows itself doesn't allow arbitrary combinations of
> languages and codepages.

What I specified was the *format* of the names. Obviously, no machine
can support all combinations, nor should it. Who would use
"eu_KE.shift_jis" (Basque, used in Kenya and writing with Shift JIS
encoding) even if it existed? With the exception of various Unicode
representation formats, I think that most encodings are only valid for
certain languages, and not every language will be spoken in every
country in the world.

It would be nice to somehow be able to separate the three aspects, with
e.g. monetary formatting dependent only on the country, messages only on
the language, and encoding only on the encoding (or the fonts being
used). But I don't have any simple solutions to propose. (If I'm
formatting French Francs for an English language publication, I will
probably use . as the decimal, but standard French formatting for the
rest, for example, and a function like toupper mixes language,
country and encoding intimately.)

> The original question about ch_DE.utf-8 is meaningless on Windows. It
> doesn't have anything like Unix UTF-8 locales nor does it need them.

Anything which connects to the Internet needs some sort of support for
UTF-8, since it is pretty much the standard international codeset on the
Internet.

> > I have a sneaking suspicion that the country and language names may
> > themselves be localised according to the system locale, though.

> They are not. The link below explains why
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_8rse.asp
> The standard library uses LOCALE_SABBREVCTRYNAME, LOCALE_SENGCOUNTRY,
> LOCALE_SABBREVLANGNAME and LOCALE_SENGLANGUAGE to build C++ locale names.

But the initial poster's question remains unanswered: how to find a list of
all supported locales. Except of course, for the somewhat vague answer:
it depends on the implementation. But I suspect that that is the best
we can do, since it does depend very strongly on the implementations.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Eugene Gershnik

May 4, 2004, 1:36:11 PM

ka...@gabi-soft.fr wrote:
> "Eugene Gershnik" <gers...@nospam.hotmail.com> wrote

>> The original question about ch_DE.utf-8 is meaningless on Windows.
>> It
>> doesn't have anything like Unix UTF-8 locales nor does it need them.
>
> Anything which connects to the Internet needs some sort of support for
> UTF-8, since it is pretty much the standard international codeset on
> the Internet.

True, but this has nothing to do with UTF-8 locales. A system can support
conversions from an internal character set to UTF-8 (either directly or
through an intermediate format as Windows does) and be able to use the
Internet without knowing what a UTF-8 locale is.
Windows never uses UTF-8 as the encoding for narrow strings in any locale.
Instead it guarantees that the wchar_t encoding is locale-independent and is
always UTF-16. Conversions between UTF-16 and UTF-8 are pretty
straightforward. Thus, I'd say that a correct way to deal with UTF-8 in C++
on Windows is to work with wide streams and perform conversions in the
streambuf. This way the normal locale machinery deals with converting
between internal narrow encoding and "Unicode" while streambuf is
responsible for the "Unicode" representation on the wire i.e. UTF-8. Note
that it is very different from Unix, where you must use some manual iconv()
wizardry if your user doesn't work in a UTF-8 locale to begin with.

--
Eugene

ka...@gabi-soft.fr

May 5, 2004, 3:42:56 PM

"Eugene Gershnik" <gers...@nospam.hotmail.com> wrote in message
news:<Q-OdndRI7N1...@speakeasy.net>...

> ka...@gabi-soft.fr wrote:
> > "Eugene Gershnik" <gers...@nospam.hotmail.com> wrote
> >> The original question about ch_DE.utf-8 is meaningless on Windows.
> >> It doesn't have anything like Unix UTF-8 locales nor does it need
> >> them.

> > Anything which connects to the Internet needs some sort of support
> > for UTF-8, since it is pretty much the standard international
> > codeset on the Internet.

> True, but this has nothing to do with UTF-8 locales.

According to the standard, conversion on input and output depend on the
locale embedded in the filebuf. If you want to encode to or from UTF-8,
you need a locale which supports it.

> A system can support conversions from an internal character set to
> UTF-8 (either directly or through an intermediate format as Windows
> does) and be able to use the Internet without knowing what a UTF-8
> locale is.

A system can support just about anything, in addition to what the
standard requires. The standard provides a more or less standard way of
specifying the transcoding between internal and external format: the
codecvt facet of the locales. IMHO, this part of the standard library
wasn't particularly well designed, but it is what the standard says, and
I would be very unhappy about an implementation that provided support
for the functionality, but didn't offer it as well through the standard
mechanism, given that they exist.

> Windows never uses UTF-8 as the encoding for narrow strings in any
> locale.

There are two separate issues here: what comes with a given compiler
(VC++, Borland, etc.), and what can be added. From a posting by
Plauger, I gather that it IS possible to at least add such support to
VC++. What I do know is that imbuing an [iofstream] with a UTF-8
locale under Windows works. What I don't know is how to name the
locale, nor whether the locale comes packaged with the compiler, or must
be acquired separately.

In general, of course, the fact that you need a specific locale doesn't
mean that your system provides it. And while I don't know about
Windows, under Unix, what is available will depend on the particular
installation of the system -- you simply cannot know beforehand. I
think that it is also possible to acquire locales not normally provided
from third party sources; I would be very surprised if Dinkumware didn't
have some to cover cases the system provider didn't think of, or didn't
think necessary.

> Instead it guarantees that the wchar_t encoding is locale-independent and
> is always UTF-16. Conversions between UTF-16 and UTF-8 are pretty
> straightforward. Thus, I'd say that a correct way to deal with UTF-8
> in C++ on Windows is to work with wide streams and perform conversions
> in the streambuf.

This is exactly what we are talking about. And the conversion in
streambuf (actually in filebuf) depends on the locale imbued in the
streambuf.

> This way the normal locale machinery deals with converting between
> internal narrow encoding and "Unicode" while streambuf is responsible
> for the "Unicode" representation on the wire i.e. UTF-8. Note that it
> is very different from Unix where you must use some manual iconv()
> wizardry if your user doesn't work in a UTF-8 locale to begin with.

What's available on any given Unix machine will depend on what the
sysadmin decided to install -- by default, Solaris gives you just about
everything, but the Solaris systems I work on have small disks, and the
sysadmin stripped a lot of it out. Every Unix system I've seen recently
has had at least one UTF-8 locale installed -- by default, both Solaris
and Linux have UTF-8 versions of all of the national or language based
locales. With a conforming implementation (Sun CC, for example), you
imbue the filebuf (or the [io]fstream, which then imbues the filebuf)
with the desired locale, exactly like under Windows -- if the locale is
present (and it usually is under Unix), then it works.
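
A minimal sketch of that imbue step (the locale name is only an example,
and must actually be installed, or the constructor throws
std::runtime_error):

#include <fstream>
#include <locale>

int main()
{
    std::wifstream in;
    in.imbue(std::locale("en_US.UTF-8"));   // before any I/O is done
    in.open("input.txt");
    // reads now convert UTF-8 bytes to wchar_t via the imbued codecvt
    return 0;
}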

The one thing that is different under Unix is that you often have to
work with older compilers -- even the latest version of Sun CC isn't as
conformant as VC++ 6.0, the 3.x branch of g++ has only recently
become stable, and the 2.95.x branch didn't support standard iostreams
at all. With older compilers, you often have to deal with the C level
locales, and set the locale globally, via setlocale, rather than imbuing
the stream. But it still worked. I've input and output UTF-8 under
both Solaris and Linux, and I've never heard of iconv.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Eugene Gershnik

May 7, 2004, 8:08:44 AM

ka...@gabi-soft.fr wrote:
> "Eugene Gershnik" <gers...@nospam.hotmail.com> wrote in message
> news:<Q-OdndRI7N1...@speakeasy.net>...
>> ka...@gabi-soft.fr wrote:
>>> "Eugene Gershnik" <gers...@nospam.hotmail.com> wrote
>>>> The original question about ch_DE.utf-8 is meaningless on Windows.
>>>> It doesn't have anything like Unix UTF-8 locales nor does it need
>>>> them.
>
>>> Anything which connects to the Internet needs some sort of support
>>> for UTF-8, since it is pretty much the standard international
>>> codeset on the Internet.
>
>> True, but this has nothing to do with UTF-8 locales.
>
> According to the standard, conversion on input and output depend on
> the locale embedded in the filebuf. If you want to encode to or from
> UTF-8, you need a locale which supports it.

Alternatively I can use a custom streambuf. One will probably be required
anyway for real-life networking.

>> Windows never uses UTF-8 as the encoding for narrow strings in any
>> locale.
>
> There are two separate issues here: what comes with a given compiler
> (VC++, Borland, etc.), and what can be added.

There is a third issue. If an underlying platform supports its own concept
of locales, the C++ ones had better play nice and be interoperable with
them. The C++ library cannot encompass all possible needs, and resorting to
system specific calls may sometimes be necessary. If there is no good
mapping between a system locale and a C++ one, this may make it hard if not
impossible. There is no such thing as a UTF-8 system locale on Windows, and a
C++ library should IMHO reflect this fact.

>> Instead it guarantees that the wchar_t encoding is locale-independent
>> and is always UTF-16. Conversions between UTF-16 and UTF-8 are
>> pretty straightforward. Thus, I'd say that a correct way to deal
>> with UTF-8 in C++ on Windows is to work with wide streams and
>> perform conversions in the streambuf.
>
> This is exactly what we are talking about. And the conversion in
> streambuf (actually in filebuf) depends on the locale imbued in the
> streambuf.

What I had in mind was to use a custom streambuf.

>> This way the normal locale machinery deals with converting between
>> internal narrow encoding and "Unicode" while streambuf is responsible
>> for the "Unicode" representation on the wire i.e. UTF-8. Note that
>> it is very different from Unix where you must use some manual iconv()
>> wizardry if your user doesn't work in a UTF-8 locale to begin with.

> With a conforming implementation
> (Sun CC, for example), you imbue the filebuf (or the [io]fstream,
> which then imbues the filebuf) with the desired locale, exactly like
> under Windows -- if the locale is present (and it usually is under
> Unix), then it works.

[...]

> I've input and
> output UTF-8 under both Solaris and Linux, and I've never heard of
> iconv.

Here is the scenario I meet quite often. Suppose you have a text file in the
user's default encoding which is _not_ UTF-8 (say EUC or Shift-JIS). You
need to read and save it in another file ('network') encoded in UTF-8. I may
be dead wrong but I don't think you can generally avoid iconv() in this
case.

--
Eugene

P.J. Plauger

May 9, 2004, 8:15:23 AM

"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:BJSdnQonebx...@speakeasy.net...

> Here is the scenario I meet quite often. Suppose you have a text file
> in the user's default encoding which is _not_ UTF-8 (say EUC or
> Shift-JIS). You need to read and save it in another file ('network')
> encoded in UTF-8. I may be dead wrong but I don't think you can
> generally avoid iconv() in this case.

Unless you have a collection of handy codecvt facets that do all these
conversions, with supporting classes that make them easy to use. That's the
approach we took with our CoreX library.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

ka...@gabi-soft.fr

May 10, 2004, 6:03:21 AM

"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:<BJSdnQonebx...@speakeasy.net>...

> ka...@gabi-soft.fr wrote:
> > "Eugene Gershnik" <gers...@nospam.hotmail.com> wrote in message
> > news:<Q-OdndRI7N1...@speakeasy.net>...
> >> ka...@gabi-soft.fr wrote:
> >>> "Eugene Gershnik" <gers...@nospam.hotmail.com> wrote
> >>>> The original question about ch_DE.utf-8 is meaningless on
> >>>> Windows. It doesn't have anything like Unix UTF-8 locales nor
> >>>> does it need them.

> >>> Anything which connects to the Internet needs some sort of
> >>> support for UTF-8, since it is pretty much the standard
> >>> international codeset on the Internet.

> >> True, but this has nothing to do with UTF-8 locales.

> > According to the standard, conversion on input and output depend on
> > the locale embedded in the filebuf. If you want to encode to or
> > from UTF-8, you need a locale which supports it.

> Alternatively I can use a custom streambuf. One will probably be
> required anyway for a real-life networking.

The logical solution to this would be to use a separate filtering
streambuf for the code translation. It's a bit of a shame that the
standard merged this (logically separate) concept into filebuf, instead
of making it generally available. I believe that some third party
libraries do provide this as an extension. Even without it, however, it
would be foolish not to leverage off the existing library code (e.g. the
codecvt facet).
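
As an illustration, a rough sketch of driving the codecvt facet directly,
outside of any filebuf (error handling omitted, and it assumes the
conversion succeeds in a single pass):

#include <locale>
#include <string>
#include <vector>

std::string externalize(const std::wstring &ws, const std::locale &loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_t;
    const cvt_t &cvt = std::use_facet<cvt_t>(loc);

    if (ws.empty())
        return std::string();
    std::mbstate_t state = std::mbstate_t();
    std::vector<char> buf(ws.size() * cvt.max_length());
    const wchar_t *from_next;
    char *to_next;
    // convert from the internal (wchar_t) encoding to the external bytes
    cvt.out(state, ws.data(), ws.data() + ws.size(), from_next,
            &buf[0], &buf[0] + buf.size(), to_next);
    return std::string(&buf[0], to_next);
}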

> >> Windows never uses UTF-8 as the encoding for narrow strings in any
> >> locale.

> > There are two separate issues here: what comes with a given
> > compiler (VC++, Borland, etc.), and what can be added.

> There is a third issue. If an underlying platform supports its own
> concept of locales, the C++ ones had better play nice and be
> interoperable with them. The C++ library cannot encompass all
> possible needs, and resorting to system specific calls may sometimes be
> necessary. If there is no good mapping between a system locale and a
> C++ one, this may make it hard if not impossible. There is no such
> thing as a UTF-8 system locale on Windows, and a C++ library should IMHO
> reflect this fact.

I'm not quite sure what you are saying here. That we should ignore the
standard anytime the local platform has a different way of doing
something? That the C++ library should not attempt to furnish behaviors
that the local platform doesn't furnish directly and in a compatible
form?

The C++ way of handling different file encodings is by means of the
codecvt facet. The Microsoft compiler, at least since 5.0, has had a
very good implementation of this -- for whatever reasons, Microsoft was
very much in advance of most C++ implementations in this regard. But
this has nothing to do with how the locales work in C++; you can easily
create a locale based on a language specific locale, and then embed your
UTF-8 specific facets in it. This would seem to be the most logical and
the simplest way to do things.

> >> Instead it guarantees that the wchar_t encoding is locale-independent
> >> and is always UTF-16. Conversions between UTF-16 and UTF-8 are
> >> pretty straightforward. Thus, I'd say that a correct way to deal
> >> with UTF-8 in C++ on Windows is to work with wide streams and
> >> perform conversions in the streambuf.

> > This is exactly what we are talking about. And the conversion in
> > streambuf (actually in filebuf) depends on the locale imbued in the
> > streambuf.

> What I had in mind was to use a custom streambuf.

Fine. But it would be very strange, or at least, very un-C++ish, if it
didn't use the codecvt facet for the code translation. No sense in
reinventing the wheel.

> >> This way the normal locale machinery deals with converting between
> >> internal narrow encoding and "Unicode" while streambuf is
> >> responsible for the "Unicode" representation on the wire
> >> i.e. UTF-8. Note that it is very different from Unix where you
> >> must use some manual iconv() wizardry if your user doesn't work in
> >> a UTF-8 locale to begin with.

> > With a conforming implementation (Sun CC, for example), you imbue
> > the filebuf (or the [io]fstream, which then imbues the filebuf)
> > with the desired locale, exactly like under Windows -- if the
> > locale is present (and it usually is under Unix), then it works.

> [...]
> > I've input and
> > output UTF-8 under both Solaris and Linux, and I've never heard of
> > iconv.

> Here is the scenario I meet quite often. Suppose you have a text file
> in the user's default encoding which is _not_ UTF-8 (say EUC or
> Shift-JIS). You need to read and save it in another file ('network')
> encoded in UTF-8. I may be dead wrong but I don't think you can
> generally avoid iconv() in this case.

You're dead wrong. The standard idiom would be to open the source file
with a locale using the user's default encoding, and to open the
destination file with a locale supporting UTF-8, and then copy.
Something like:

std::ifstream source( sourceFilename.c_str() ) ;
std::ofstream dest( destFilename.c_str() ) ;
dest.imbue( std::locale( std::locale(),
                         "en_US.utf-8",
                         std::locale::ctype ) ) ;
dest << source.rdbuf() ;

(Modulo error handling, of course. You would normally verify that the
opens worked, for example.)

In theory, anyway -- in practice, Unix compilers tend to be far behind
Microsoft in terms of standard conformance.

IMHO, this is also the preferred solution under Windows. It should
work, provided you change the locale name to whatever the Windows
conventions require.

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Eugene Gershnik

May 11, 2004, 5:01:21 PM

ka...@gabi-soft.fr wrote:
> "Eugene Gershnik" <gers...@hotmail.com> wrote

>> There is a third issue. If an underlying platform supports its own
>> concept of locales, the C++ ones had better play nice and be
>> interoperable with them. The C++ library cannot encompass all
>> possible needs and resorting to system specific calls may sometimes be
>> necessary. If there is no good mapping between a system locale and a
>> C++ one this may make it hard if not impossible. There is no such
>> thing as UTF-8 system locale on Windows and a C++ library should IMHO
>> reflect this fact.

> I'm not quite sure what you are saying here. That we should ignore the
> standard anytime the local platform has a different way of doing
> something?

No.
(But then the area where any platform I ever worked on intersected with the C
or C++ standard library is so small that I never formed a firm opinion about
that. FWIW, working with Java taught me that it is very hard to override the
platform.)

> That the C++ library should not attempt to furnish behaviors
> that the local platform doesn't furnish directly and in a compatible
> form.

No. What I am saying is that the additions should not cause problems. For
example, a non-standard UTF-8 codecvt facet is a good thing that doesn't
pose any problems for anyone. A special set of UTF-8 locales that don't map
to platform locales (when there are platform locales) is a bad thing. It
creates a problem ('how to map?') where there wasn't one.

>> Here is the scenario I meet quite often. Suppose you have a text
>> file in the user's default encoding which is _not_ UTF-8 (say EUC
>> or Shift-JIS). You need to read and save it in another file
>> ('network') encoded in UTF-8. I may be dead wrong but I don't
>> think you can generally avoid iconv() in this case.
>
> You're dead wrong. The standard idiom would be to open the source
> file with a locale using the user's default encoding, and to open the
> destination file with a locale supporting UTF-8, and then copy.
> Something like:
>
> std::ifstream source( sourceFilename.c_str() ) ;
> std::ofstream dest( destFilename.c_str() ) ;
> dest.imbue( std::locale( std::locale(),
> "en_US.utf-8",
> std::locale::ctype ) ) ;
> dest << source.rdbuf() ;

I am missing something here I think. What is the encoding of data read from
'source' and what is the encoding that 'dest' expects to see? My practical
observations tell me that whatever you read from 'source' will keep its
encoding and 'dest' would expect its input to be in UTF-8 to begin with.
However, in order for the above to work 'source' and 'dest' must agree on
the same internal character set.
The above _may_ work if wide streams were used, and then only if in the
user's locale the encoding of wchar_t was the same as in the .utf-8 one. The
only way to fix that is to use a special codecvt like those mentioned by
P.J. Plauger in his post.

--
Eugene

ka...@gabi-soft.fr

May 12, 2004, 12:39:32 PM

"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:<c8-dnaGd87w...@speakeasy.net>...

[...]


> > That the C++ library should not attempt to furnish behaviors that
> > the local platform doesn't furnish directly and in a compatible
> > form.

> No. What I am saying is that the additions should not cause
> problems. For example a non-standard UTF-8 codecvt facet is a good
> thing that doesn't impose any problems on anyone. A special set of
> UTF-8 locales that don't map to platform locales (when there are
> platform locales) is a bad thing. It creates a problem ('how to map?')
> where there weren't any.

I don't see the contradiction. There is no problem creating special
locales by replacing the existing codecvt in an existing locale. This
is the standard way of handling foreign codesets in C++. (Standard in
the sense of: supported by the C++ standard.)
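
Something like the following, where utf8_codecvt stands in for whatever
codecvt implementation you have obtained elsewhere (it is not a standard
facet):

#include <locale>

std::locale make_utf8_locale(
    std::codecvt<wchar_t, char, std::mbstate_t> *utf8_codecvt)
{
    // keep everything from the user's default locale,
    // but replace the codecvt facet
    return std::locale(std::locale(""), utf8_codecvt);
}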

> >> Here is the scenario I meet quite often. Suppose you have a text
> >> file in the user's default encoding which is _not_ UTF-8 (say EUC
> >> or Shift-JIS). You need to read and save it in another file
> >> ('network') encoded in UTF-8. I may be dead wrong but I don't think
> >> you can generally avoid iconv() in this case.

> > You're dead wrong. The standard idiom would be to open the source
> > file with a locale using the user's default encoding, and to open
> > the destination file with a locale supporting UTF-8, and then copy.
> > Something like:

> > std::ifstream source( sourceFilename.c_str() ) ;
> > std::ofstream dest( destFilename.c_str() ) ;
> > dest.imbue( std::locale( std::locale(),
> > "en_US.utf-8",
> > std::locale::ctype ) ) ;
> > dest << source.rdbuf() ;

> I am missing something here I think. What is the encoding of data read
> from 'source' and what is the encoding that 'dest' expects to see?

Externally, source expects to see the encoding specified by the default
locale on the system. Dest expects to see UTF-8.

Internally, both expect to see the default encoding. In fact, to be
really useful, such a program should probably use wifstream and
wofstream, since these normally guarantee a non-lossy code translation.
But as long as one of the streams is using the standard encoding
externally, it shouldn't make a difference.

> My practical observations tell me that whatever you read from 'source'
> will keep its encoding and 'dest' would expect its input to be in
> UTF-8 to begin with.

Dest is an ostream -- it doesn't handle input, only output.

Internally, both streams expect to use the default encoding.

> However, in order for the above to work 'source' and 'dest' must agree
> on the same internal character set.

They normally do: whatever is default for the platform.

A better implementation would use wiostream, since there is normally
only one encoding for wchar_t.
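
A rough wide-stream sketch of the copy idiom (the locale names are
examples and system dependent):

#include <fstream>
#include <locale>

int main()
{
    std::wifstream source;
    source.imbue(std::locale(""));           // user's default encoding
    source.open("source.txt");

    std::wofstream dest;
    dest.imbue(std::locale("en_US.utf-8"));  // an assumed UTF-8 locale name
    dest.open("dest.txt");

    // both streams share wchar_t internally; each filebuf's codecvt
    // handles its own external encoding
    dest << source.rdbuf();
    return 0;
}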

> The above _may_ work if wide streams were used, and then only if in the
> user's locale the encoding of wchar_t was the same as in the .utf-8 one.
> The only way to fix that is to use a special codecvt like those
> mentioned by P.J. Plauger in his post.

The intent of the standard streams, and the use of imbued locales, is
that the locale specify the external code set, and that the program use
a constant internal code set. This is almost guaranteed in the case of
wide character streams. In the case of narrow streams, it will normally
work as long as one of the streams uses the same encoding as is used
internally. This is less certain, however, since many applications will
probably read the streams in "C" locale, in order to have the actual
external encoding internally (and to avoid any risk of information
loss). The standard seems a little ambiguous about this, however, since
on one hand, it will imbue all new streams to use the current global
locale (and thus, translate the code to some common codeset), but will
also, by default, use the global locale for such things as determining
what letters are upper case. The safest solution anytime you have to
deal with several different codesets is doubtlessly to do everything in
wchar_t.

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Eugene Gershnik

May 13, 2004, 7:35:54 PM

ka...@gabi-soft.fr wrote:
> "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> news:<c8-dnaGd87w...@speakeasy.net>...
>
> [...]
>>> That the C++ library should not attempt to furnish behaviors that
>>> the local platform doesn't furnish directly and in a compatible
>>> form.
>
>> No. What I am saying is that the additions should not cause
>> problems. For example a non-standard UTF-8 codecvt facet is a good
>> thing that doesn't impose any problems on anyone. A special set of
>> UTF-8 locales that don't map to platform locales (when there are
>> platform locales) is a bad thing. It creates a problem ('how to
>> map?') where there weren't any.
>
> I don't see the contradiction. There is no problem creating special
> locales by replacing the existing codecvt in an existing locale.
> This is the standard way of handling foreign codesets in C++.
> (Standard in the sense of: supported by the C++ standard.)

But such a locale wouldn't have a name, would it?

>>>> Here is the scenario I meet quite often. Suppose you have a text
>>>> file in the user's default encoding which is _not_ UTF-8 (say EUC
>>>> or Shift-JIS). You need to read and save it in another file
>>>> ('network') encoded in UTF-8. I may be dead wrong but I don't
>>>> think you can generally avoid iconv() in this case.
>
>>> You're dead wrong. The standard idiom would be to open the source
>>> file with a locale using the user's default encoding, and to open
>>> the destination file with a locale supporting UTF-8, and then
>>> copy. Something like:
>
>>> std::ifstream source( sourceFilename.c_str() ) ;
>>> std::ofstream dest( destFilename.c_str() ) ;
>>> dest.imbue( std::locale( std::locale(),
>>> "en_US.utf-8",
>>> std::locale::ctype ) ) ;
>>> dest << source.rdbuf() ;
>
>> I am missing something here I think. What is the encoding of data
>> read from 'source' and what is the encoding that 'dest' expects to
>> see?
>
> Externally, source expects to see the encoding specified by the
> default locale on the system. Dest expects to see UTF-8.
>
> Internally, both expect to see the default encoding.

Default in the sense "defined by the default locale" or something else?

> In fact, to be
> really useful, such a program should probably use wifstream and
> wofstream, since these normally guarantee a non-lossy code
> translation. But as long as one of the streams is using the
> standard encoding externally, it shouldn't make a difference.
>
>> My practical observations tell me that whatever you read from
>> 'source' will keep its encoding and 'dest' would expect its input
>> to be in UTF-8 to begin with.
>
> Dest is an ostream -- it doesn't handle input, only output.

Well, maybe my terminology is wrong, but what do you call "xyz" in the next
line

cout << "xyz";

if not an input to cout?

> Internally, both streams expect to use the default encoding.
>
>> However, in order for the above to work 'source' and 'dest' must
>> agree on the same internal character set.
>
> They normally do: whatever is default for the platform.

Same question: what is the default?

> A better implementation would use wiostream, since there is normally
> only one encoding for wchar_t.

Not on Solaris AFAIK. The wchar_t encoding is locale dependent there. Which
incidentally means that a wide stream using a .utf-8 locale will most
certainly assume a different wchar_t encoding than a wide stream using an
EUC one.

>> The above _may_ work if wide streams were used and then only if in
>> the user's locale the encoding of wchar_t was the same as in
>> the .utf-8 one. The only way to fix that is to use a special codecvt
>> like those mentioned by P.J. Plauger in his post.
>
> The intent of the standard streams, and the use of imbued locales,
> is that the locale specify the external code set, and that the
> program use a constant internal code set. This is almost
> guaranteed in the case of wide character streams.

It is most definitely not guaranteed. There is even a special macro in C99
__STDC_ISO_10646__ that is supposed to tell when it is. Not yet implemented
on most compilers unfortunately.
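
A tiny sketch of checking for it at compile time:

#if defined(__STDC_ISO_10646__)
    // wchar_t values are ISO 10646 (Unicode) code points in every locale
#else
    // no such guarantee; the wchar_t encoding is implementation defined
#endif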

> In the case of
> narrow streams, it will normally work as long as one of the streams
> uses the same encoding as is used internally. This is less
> certain, however, since many applications will probably read the
> streams in "C" locale, in order to have the actual external
> encoding internally (and to avoid any risk of information loss).
> The standard seems a little ambiguous about this, however, since on
> one hand, it will imbue all new streams to use the current global
> locale (and thus, translate the code to some common codeset), but
> will also, by default, use the global locale for such things as
> determining what letters are upper case.

Now I am totally confused. All implementations I ever worked on didn't
convert anything for narrow streams. Does it mean that they were essentially
using the "C" locale? Is such behavior a deficiency in all libraries I have
worked with?
Assuming it is, what should this special internal codeset be?

> The safest solution
> anytime you have to deal with several different codesets is
> doubtlessly to do everything in wchar_t.

Apparently I do not understand something very fundamental about C++
iostreams so I don't know if this could be true there. What I know for sure
is that when using plain C/Posix on Solaris wchar_t doesn't solve anything
(unless you always force your users to work in .utf-8 locales).

--
Eugene

ka...@gabi-soft.fr

May 14, 2004, 2:39:27 PM

"Eugene Gershnik" <gers...@nospam.hotmail.com> wrote in message
news:<H--cnQBTSus...@speakeasy.net>...

> ka...@gabi-soft.fr wrote:
> > "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> > news:<c8-dnaGd87w...@speakeasy.net>...

> > [...]
> >>> That the C++ library should not attempt to furnish behaviors that
> >>> the local platform doesn't furnish directly and in a compatible
> >>> form.

> >> No. What I am saying is that the additions should not cause
> >> problems. For example a non-standard UTF-8 codecvt facet is a good
> >> thing that doesn't impose any problems on anyone. A special set of
> >> UTF-8 locales that don't map to platform locales (when there are
> >> platform locales) is a bad thing. It creates a problem ('how to
> >> map?') where there weren't any.

> > I don't see the contradiction. There is no problem creating special
> > locales by replacing the existing codecvt in an existing locale.
> > This is the standard way of handling foreign codesets in C++.
> > (Standard in the sense of: supported by the C++ standard.)

> But such a locale wouldn't have a name, would it?

I don't think so. All system provided locales have a name; no user
defined locales have a name. I'm not sure what happens if, when mixing
facets from several different system provided locales, you end up with a
locale which is identical to another system specific locale.
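
A short sketch of what name() reports in each case (returning "*" for an
unnamed locale is common practice, but strictly the result is just
unspecified):

#include <iostream>
#include <locale>

int main()
{
    std::locale named("C");
    std::cout << named.name() << '\n';      // "C"

    std::locale combined(named, new std::numpunct<char>);
    std::cout << combined.name() << '\n';   // unspecified, typically "*"
    return 0;
}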

> >>>> Here is the scenario I meet quite often. Suppose you have a text
> >>>> file in the user's default encoding which is _not_ UTF-8 (say EUC
> >>>> or Shift-JIS). You need to read and save it in another file
> >>>> ('network') encoded in UTF-8. I may be dead wrong but I don't
> >>>> think you can generally avoid iconv() in this case.

> >>> You're dead wrong. The standard idiom would be to open the source
> >>> file with a locale using the user's default encoding, and to open
> >>> the destination file with a locale supporting UTF-8, and then
> >>> copy. Something like:

> >>> std::ifstream source( sourceFilename.c_str() ) ;
> >>> std::ofstream dest( destFilename.c_str() ) ;
> >>> dest.imbue( std::locale( std::locale(),
> >>> "en_US.utf-8",
> >>> std::locale::ctype ) ) ;
> >>> dest << source.rdbuf() ;

> >> I am missing something here I think. What is the encoding of data
> >> read from 'source' and what is the encoding that 'dest' expects to
> >> see?

> > Externally, source expects to see the encoding specified by the
> > default locale on the system. Dest expects to see UTF-8.

> > Internally, both expect to see the default encoding.

> Default in the sense "defined by the default locale" or something else?

Implementation defined. According to §22.2.1.5/3,

The implementations required in the Table 51 (22.1.1.1.1), namely
codecvt<wchar_t,char,mbstate_t> and codecvt<char,char,mbstate_t>,
convert the implementation-defined native character set.
codecvt<char,char,mbstate_t> implements a degenerate conversion; it
does not convert at all. codecvt<wchar_t,char,mbstate_t> converts
between the native character sets for tiny and wide
characters. [...]

I'm not quite sure what this is supposed to mean if you have more than
one locale (and thus, more than one instantiation of codecvt for each
pair of types); maybe the intent IS that all conversions char<->char are
"degenerate", but all that is mentionned is conversions involving the
"implementation defined native character sets".

And of course, this only concerns the standard instantiations; an
implementation is free to provide others.

In fact, a locale defines two code sets (at least), a wide character
code set, and a narrow one. Presumably, the intent reflected in the
Unix naming convention is that the wide character code set always be the
same -- at least, all of the names I can see for the encoding refer to 8
bit encodings, and I don't know how you would specify a locale in which
both the narrow and the wide character encodings differed from the
default.

(To tell the truth, I don't quite understand the intent behind a lot of
this. In order to be able to redefine my own code translation, at least
for multi-byte characters, I need to be able to define my own state_t.
But then, I can't use any of the standard streams. The whole thing
seems like an essay in futility; more an effort to see just how far you
can push templates, rather than an attempt to produce something useful.)

> > In fact, to be really useful, such a program should probably use
> > wifstream and wofstream, since these normally guarantee a non-lossy
> > code translation. But as long as one of the streams is using the
> > standard encoding externally, it shouldn't make a difference.

> >> My practical observations tell me that whatever you read from
> >> 'source' will keep its encoding and 'dest' would expect its input
> >> to be in UTF-8 to begin with.

> > Dest is an ostream -- it doesn't handle input, only output.

> Well, maybe my terminology is wrong, but what do you call "xyz" in the
> next line

> cout << "xyz";

> if not an input to cout?

I guess you could. I usually speak of it as output, because that is
what I'm doing with it, but of course, anytime you have a source/sink
situation, one man's output is the next man's input.

In this case, dest expects its input to be in the native encoding, and
its output to be a translation of that into the external encoding
determined by the imbued facet. What the internal encoding (the native
character set) is, is implementation defined.

> > Internally, both streams expect to use the default encoding.

> >> However, in order for the above to work 'source' and 'dest' must
> >> agree on the same internal character set.

> > They normally do: whatever is default for the platform.

> Same question: what is the default?

Same answer: implementation defined.

> > A better implementation would use wiostream, since there is normally
> > only one encoding for wchar_t.

> Not on Solaris AFAIK. The wchar_t encoding is locale dependent there.

> Which incidentally means that a wide stream using a .utf-8 locale will
> most certainly assume a different wchar_t encoding than a wide stream
> using an EUC one.

As far as I know, EUC is dead. But I could be wrong.

With regards to Solaris or Sun CC, I could find no specification
whatsoever as to what the native character set is for wchar_t. The
question is probably academic anyway: any attempt to use std::wcout with
Sun CC results in a core dump, and with g++ a compile-time error
('wcout' undeclared in namespace 'std'). I sure do envy the tools you
have under Windows.

> >> The above _may_ work if wide streams were used and then only if in
> >> the user's locale the encoding of wchar_t was the same as in the .utf-8
> >> one. The only way to fix that is to use a special codecvt like
> >> those mentioned by P.J. Plauger in his post.

> > The intent of the standard streams, and the use of imbued locales,
> > is that the locale specify the external code set, and that the
> > program use a constant internal code set. This is almost guaranteed
> > in the case of wide character streams.

> It is most definitely not guaranteed. There is even a special macro
> in C99 __STDC_ISO_10646__ that is supposed to tell when it is. Not
> yet implemented on most compilers unfortunately.

I didn't say you were guaranteed ISO 10646. I said that the character
encoding didn't change with the locale. As I understand it, the
*intent* of wchar_t is that 1) it is not multibyte, and 2) it doesn't
change with the locale. Of course, neither is guaranteed, and wchar_t
doesn't need to be any bigger than a char.

In the case of narrow streams, it isn't at all clear what the intent is
with respect to character encoding.

> > In the case of narrow streams, it will normally work as long as one
> > of the streams uses the same encoding as is used internally. This
> > is less certain, however, since many applications will probably read
> > the streams in "C" locale, in order to have the actual external
> > encoding internally (and to avoid any risk of information loss).
> > The standard seems a little ambiguous about this, however, since on
> > one hand, it will imbue all new streams to use the current global
> > locale (and thus, translate the code to some common codeset), but
> > will also, by default, use the global locale for such things as
> > determining what letters are upper case.

> Now I am totally confused. All implementations I ever worked on
> didn't convert anything for narrow streams.

Did you imbue them with a locale which should have done code
translation? (Whether such a locale is present is obviously
implementation defined. The "C" locale IS forbidden from any code
translation. And perhaps it is the intent that all codecvt<char,char>
are supposed to use a degenerate code translation.)

> Does it mean that they were essentially using the "C" locale? Is such
> behavior a deficiency in all libraries I have worked with? Assuming
> it is what should be this special internal codeset?

First, the default global locale is the "C" locale, and the
default locale for a stream is the global locale at the moment you
created the stream. (This means, for example, that unless you
specifically imbue, cout, cin and their wide stream counterparts will
use the "C" locale.) Also, until the standard, the possibility of
imbuing a stream didn't exist (and there weren't wide character
streams); many compilers are not up to date. (Locales don't seem to
work very well in g++ 3.3.1 or Sun CC 5.1, for example. And you can't
use wide character streams in either.)
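
To illustrate the construction-time rule (assuming a "de_DE" locale is
actually installed -- the name, like everything else here, is
implementation defined, and the constructor throws if it is unknown):

#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // cout was constructed before main() ran, so it is still imbued
    // with the "C" locale unless somebody imbues it explicitly.
    std::cout << std::cout.getloc().name() << '\n';    // prints "C"

    // A stream picks up the global locale in effect when it is
    // constructed, so this one gets "de_DE".
    std::locale::global(std::locale("de_DE"));
    std::ofstream file("demo.txt");
    std::cout << file.getloc().name() << '\n';         // prints "de_DE"
}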

> > The safest solution anytime you have to deal with several different
> > codesets is doubtlessly to do everything in wchar_t.

> Apparently I do not understand something very fundamental about C++
> iostreams so I don't know if this could be true there. What I know
> for sure is that when using plain C/Posix on Solaris wchar_t doesn't
> solve anything (unless you always force your users to work in .utf-8
> locales).

You don't force the users to do anything with regards to locales. If
you know the encoding of a file, you imbue the locale yourself. In
theory, anyway: neither Sun CC nor g++ actually supports locales or
wide characters in any sensible way.

If you had a standards-conforming compiler and library (Comeau with the
Dinkumware library, for example), you should be able to code translate
files by simply imbuing the correct locales in the files. Supposing, of
course, that 1) the correct locales have been installed on the system,
and 2) the library knows how to recognize them -- the standard only
guarantees the presence of two locales: "C" and "", and they may be
identical.
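
Something along these lines, in other words -- a sketch only, since the
locale names below are pure assumptions, and everything hinges on the
library shipping real codecvt facets for them:

#include <fstream>
#include <locale>

int main()
{
    std::wifstream in;
    in.imbue(std::locale("en_US.ISO8859-1"));   // external bytes: Latin-1
    in.open("latin1.txt");

    std::wofstream out;
    out.imbue(std::locale("en_US.UTF-8"));      // external bytes: UTF-8
    out.open("utf8.txt");

    // The wide characters carry the text between the two encodings:
    // decoded by one codecvt on the way in, re-encoded by the other
    // on the way out.
    wchar_t c;
    while (in.get(c))
        out.put(c);
}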

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


llewelly

unread,
May 15, 2004, 3:13:58 PM5/15/04
to
ka...@gabi-soft.fr writes:
[snip]

> With regards to Solaris or Sun CC, I could find no specification
> whatsoever as to what the native character set is for wchar_t. The
> question is probably academic anyway: any attempt to use std::wcout with
> Sun CC results in a core dump, and with g++ a compile-time error
> ('wcout' undeclared in namespace 'std'). I sure do envy the tools you
> have under Windows.
[snip]

Then you can envy the tools we have under linux and freeBSD, too. :-)

#include<iostream>
#include<ostream>

int main()
{
std::wcout << L"hello world!" << std::endl;
}

compiles and works as expected with g++ 3.2.2, 3.3.3, and 3.4.0 (I'm
nearly certain it worked with 3.0 too, but I don't have that one
to test.)

It *doesn't* work using g++ 3.2.2 (yes, a version that supports wcout
fine under linux) on solaris 2.8. I don't know why. The trouble
seems to be libstdc++-v3's reliance on functions such as
vswprintf(), which aren't in solaris libc (or, AFAIK, in ISO C90.)

Probably, you should file a bug report; the maintainers mostly have
linux boxen, and need reminding of solaris issues... :-(

Eugene Gershnik

unread,
May 17, 2004, 4:46:55 PM5/17/04
to
>> ka...@gabi-soft.fr wrote:
>>> "Eugene Gershnik" <gers...@hotmail.com>
>
>>> [...]
>>>>> That the C++ library should not attempt to furnish behaviors that
>>>>> the local platform doesn't furnish directly and in a compatible
>>>>> form.
>
>>>> A special set of
>>>> UTF-8 locales that don't map to platform locales (when there are
>>>> platform locales) is a bad thing. It creates a problem ('how to
>>>> map?') where there weren't any.
>
>>> I don't see the contradiction. There is no problem creating special
>>> locales by replacing the existing codecvt in an existing locale.
>>> This is the standard way of handling foreign codesets in C++.
>>> (Standard in the sense of: supported by the C++ standard.)
>
>> But such a locale wouldn't have a name, would it?
>
> I don't think so. All system provided locales have a name; no user
> defined locales have a name. I'm not sure what happens if, when
> mixing facets from several different system provided locales, you end
> up with a locale which is identical to another system specific locale.

An interesting question actually. Ignoring this issue for a second, the above
means that a locale name is somewhat special. If a locale has a name I can
be sure it came from the pre-defined set. As long as I stay within this set
_and_ there is a one-to-one mapping between C++ and system locale names, my
code can use both facilities naturally. If a user selects a system locale
name in the UI I know what C++ locale to use. If a function receives a C++
locale argument it can resort to platform-specific tricks to do something the
standard library does not allow. In such code I can easily treat nameless
locales as an error and disallow composing locales from arbitrary facets.
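
For instance (a sketch; it relies only on the rule that an unnamed
locale's name() must return "*"):

#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::locale named("C");
    std::locale composed(named, new std::numpunct<char>);

    std::cout << named.name() << '\n';     // prints "C"
    std::cout << composed.name() << '\n';  // prints "*": composing loses the name

    if (composed.name() == "*") {
        // reject it here, before any platform-specific code sees it
    }
}
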
An alternative to the above is for the library to provide an extension that
allows mapping of a facet to a system locale component. For example a
codecvt could have an additional member that gives the internal and external
Windows code pages. The downside is that all facets used by the application
will have to support this extension.

Ok, now I think I am less confused. Evidently the VC library decided to make
all char<->char conversions degenerate by default, and there is no single
implementation-defined internal charset that all streams share. This is so
natural on Windows that I never even thought that there may be any other way
to do it. To put it in a different way, I don't want a stream to perform any
conversions unless I specifically ask it to do so.
Going back to the code example, the above means that it will fail on any
platform that has chosen to perform degenerate conversions. Thus, if I want
consistent semantics unaffected by my library choices I can only rely on
some non-standard codecvt implementation.

[...]

> In fact, a locale defines two code sets (at least), a wide character
> code set, and a narrow one. Presumably, the intent reflected in the
> Unix naming convention is that the wide character code set always be
> the same -- at least, all of the names I can see for the encoding
> refer to 8 bit encodings, and I don't know how you would specify a
> locale in which both the narrow and the wide character encodings
> differed from the default.
>
> (To tell the truth, I don't quite understand the intent behind a lot
> of this. In order to be able to redefine my own code translation, at
> least for multi-byte characters, I need to be able to define my own
> state_t. But then, I can't use any of the standard streams. The
> whole thing seems like an essay in futility; more an effort to see
> just how far you can push templates, rather than an attempt to
> produce something useful.)

Sadly I have come to the same conclusion. As another twist, I even have no
idea how to initialize mbstate_t. IIRC some compilers don't like

mbstate_t state = 0;

and some don't like

mbstate_t state = {0};

The only thing that seems to always work is

mbstate_t state;
memset(&state, 0, sizeof(mbstate_t));
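
In C++ there is also a fourth spelling that sidesteps the
scalar-versus-aggregate question, since an empty pair of parentheses
zero-initializes a POD either way:

mbstate_t state = mbstate_t();   // zero-initialized, whatever mbstate_t is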

[...]

>>> A better implementation would use wiostream, since there is normally
>>> only one encoding for wchar_t.
>
>> Not on Solaris AFAIK. The wchar_t encoding is locale dependent
>> there. Which incidentally means that a wide stream using .utf-8
>> locale will most certainly assume different wchar_t encoding than a
>> wide stream using EUC one.
>
> As far as I know, EUC is dead. But I could be wrong.

Well the localization standards at my workplace say I have to support it.

> With regards to Solaris or Sun CC, I could find no specification
> whatsoever as to what the native character set is for wchar_t.

And the only way to find this out dynamically for narrow chars,
nl_langinfo(CODESET), doesn't provide information about wchar_ts either.

> The
> question is probably academic anyway: any attempt to use std::wcout
> with Sun CC results in a core dump, and with g++ a compile-time error
> ('wcout' undeclared in namespace 'std'). I sure do envy the tools you
> have under Windows.

It has been some time since I worked on Solaris but I believe you can make
it work with STLPort. I may be wrong though.

[...]


>
>>> The intent of the standard streams, and the use of imbued locales,
>>> is that the locale specify the external code set, and that the
>>> program use a constant internal code set. This is almost guaranteed
>>> in the case of wide character streams.
>
>> It is most definitely not guaranteed. There is even a special macro
>> in C99 __STDC_ISO_10646__ that is supposed to tell when it is. Not
>> yet implemented on most compilers unfortunately.
>
> I didn't say you were guaranteed ISO 10646. I said that the character
> encoding didn't change with the locale.

Sure, what I wanted to say is that when this macro is defined you have the
guaranteed charset for wchar_t.

> As I understand it, the
> *intent* of wchar_t is that 1) it is not multibyte, and 2) it doesn't
> change with the locale. Of course, neither is guaranteed, and wchar_t
> doesn't need to be any bigger than a char.

Windows and AFAIK AIX violate (1) while Solaris violates both (1) and (2).
The end result is that a portable project has to define its own wide
character type like utf32_t with all the headaches that result from that.

[...]


>
>> Now I am totally confused. All implementations I ever worked on
>> didn't convert anything for narrow streams.
>
> Did you imbue them with a locale which should have done code
> translation? (Whether such a locale is present is obviously
> implementation defined. The "C" locale IS forbidden from any code
> translation. And perhaps it is the intent that all codecvt<char,char>
> are supposed to use a degenerate code translation.)
>
>> Does it mean that they were essentially using the "C" locale? Is
>> such behavior a deficiency in all libraries I have worked with?
>> Assuming it is what should be this special internal codeset?
>
> First, the default global locale is the "C" locale, and the
> default locale for a stream is the global locale at the moment you
> created the stream. (This means, for example, that unless you
> specifically imbue, cout, cin and their wide stream counterparts will
> use the "C" locale.)

Yes of course. I just tried a variation of your example on VC 7.1. As long
as normal named locales are used the library doesn't even invoke the
codecvt.

[...]

>>> The safest solution anytime you have to deal with several different
>>> codesets is doubtlessly to do everything in wchar_t.
>
>> Apparently I do not understand something very fundamental about C++
>> iostreams so I don't know if this could be true there. What I know
>> for sure is that when using plain C/Posix on Solaris wchar_t doesn't
>> solve anything (unless you always force your users to work in .utf-8
>> locales).
>
> You don't force the users to do anything with regards to locales. If
> you know the encoding of a file, you imbue the locale yourself. In
> theory, anyway: neither Sun CC nor g++ actually supports locales or
> wide characters in any sensible way.
>
> If you had a standards conforming compiler and library (Comeau with
> the Dinkumware library, for example), you should be able to code
> translate files by simply imbuing the correct locales in the files.

Unless the library decides to implement "always degenerate" conversions as
Dinkumware library apparently does.

--
Eugene

ka...@gabi-soft.fr

unread,
May 18, 2004, 5:54:19 PM5/18/04
to
"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:<opWdnQEs-MN...@speakeasy.net>...

> ka...@gabi-soft.fr wrote:
> > "Eugene Gershnik" <gers...@nospam.hotmail.com>
> >> ka...@gabi-soft.fr wrote:
> >>> "Eugene Gershnik" <gers...@hotmail.com>

> >>> [...]
> >>>>> That the C++ library should not attempt to furnish behaviors
> >>>>> that the local platform doesn't furnish directly and in a
> >>>>> compatible form.

> >>>> A special set of UTF-8 locales that don't map to platform locales
> >>>> (when there are platform locales) is a bad thing. It creates a
> >>>> problem ('how to map?') where there weren't any.

> >>> I don't see the contradiction. There is no problem creating
> >>> special locales by replacing the existing codecvt in an existing
> >>> locale. This is the standard way of handling foreign codesets in
> >>> C++. (Standard in the sense of: supported by the C++ standard.)

> >> But such a locale wouldn't have a name, would it?

> > I don't think so. All system provided locales have a name; no user
> > defined locales have a name. I'm not sure what happens if, when
> > mixing facets from several different system provided locales, you
> > end up with a locale which is identical to another system specific
> > locale.

> An interesting question actually.

I wish that there were a few less "interesting questions" where locales
were concerned.

> Ignoring this issue for a second
> the above means that a locale name is somewhat special. If a locale
> has a name I can be sure it came from the pre-defined set. As long as
> I stay within this set _and_ there is one-to-one mapping between C++
> and system locale names my code can use both facilities naturally. If
> a user selects system locale name in UI I know what C++ locale to use.
> If a function receives a C++ locale argument it can resort to platform
> specific tricks to do something standard library does not allow. In
> such code I can easily treat nameless locales as an error and disallow
> composing locales from arbitrary facets. An alternative to above is
> for the library to provide an extension that allows mapping of a facet
> to a system locale component. For example a codecvt could have an
> additional member that gives an internal and external Windows code
> pages. The downside is that all facets used by the application will
> have to support this extension.

It's also interesting here to come back to the original problem: reading
a UTF-8 encoded (or any other non-standard) file under Windows. I will
start from the premise that the global locale (the one set by
"std::locale::global( std::locale( "" ) )") will always correspond to a
named locale, AND that, for the reasons you have given, it is
preferable to stick with named locales (the standard Windows locales)
as much as possible. Given this, I would still argue that the "correct"
solution (or the "standard" solution) would be to acquire a
corresponding codecvt facet -- Dinkumware has them, there are probably
other sources as well, and if worse comes to worst, it shouldn't be that
difficult to write one yourself (although I don't think you can do so
portably, as far as I can tell, since you have to somehow figure out how
to keep your state in a non-documented state_t). I would then create
the w[io]fstream (I think you're right that they should be wide
streams), open the file, and then use something like:

file.imbue( std::locale( file.getloc(), &special_codecvt ) ) ;

It's true that after that, the imbued locale probably doesn't have a
name, and possibly doesn't fit in with the rest of the system. But the
imbued locale is particular to this w[io]fstream -- it isn't present
anywhere else. I don't really see how that could cause a problem. (I
do agree that you are on shakier grounds if you set such a locale as the
global locale.)
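
Spelled out a little more -- special_codecvt standing in, as above, for
whatever facet you have acquired, and allocated with new so that the
locale can manage its lifetime:

#include <fstream>
#include <locale>
#include <string>

// special_codecvt: the UTF-8 facet acquired elsewhere (hypothetical),
// derived from std::codecvt<wchar_t, char, std::mbstate_t>.
void read_utf8_file()
{
    std::wifstream file;
    // new, not the address of an automatic: the locale takes ownership.
    file.imbue(std::locale(file.getloc(), new special_codecvt));
    file.open("data.utf8");    // imbue before open(), so the filebuf
                               // never starts converting with the old facet
    std::wstring line;
    while (std::getline(file, line)) {
        // 'line' holds the decoded wide characters
    }
}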

The "default" char<->char is required to be degenerate. You should be
able to replace it with one (derived from codecvt or codecvt_byname)
that isn't. Otherwise, why bother with all the complexity?
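
As an illustration, here is a deliberately non-degenerate char<->char
facet -- a toy which upper-cases ASCII on output. The only interesting
part is that do_always_noconv() returns false, so a conforming filebuf
shouldn't be able to skip it; whether a given library then honors it is
exactly the question:

#include <locale>

// A toy, non-degenerate char<->char facet: it upper-cases ASCII
// letters on the way out. do_in is inherited and left degenerate.
class upcase_codecvt : public std::codecvt<char, char, std::mbstate_t>
{
protected:
    virtual bool do_always_noconv() const throw()
    {
        return false;          // advertise that conversion is required
    }
    virtual result do_out(std::mbstate_t&,
                          const char* from, const char* from_end,
                          const char*& from_next,
                          char* to, char* to_end, char*& to_next) const
    {
        while (from != from_end && to != to_end) {
            char c = *from++;
            *to++ = ('a' <= c && c <= 'z') ? char(c - 'a' + 'A') : c;
        }
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }
};

Imbued into an ofstream's locale before the file is opened, a library
which really routes narrow output through codecvt will upper-case the
file; one which hard-wires the degenerate path will not.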

The real questions are: what locales/codecvt's (in addition to the
default) come with your compiler, and what ones can you easily add?

> This is so natural on Windows that I never even thought that there may
> be any other way to do it. To put it in a different way I don't want
> a stream to perform any conversions unless I specifically ask it to do
> so. Going back to the code example the above means that it will fail
> on any platform that has chosen to perform degenerate conversions.
> Thus, if I want consistent semantics unaffected by my library choices
> I can only rely on some non-standard codecvt implementation.

If you want any specific locale behavior, you have to rely on
implementation defined behavior. The standard requires the presence of
the "C" locale, and specifies it pretty tightly -- islower( 'à' ) must
return false, for example. The standard also requires that the
constructor to locale take an empty string as an argument, to create a
implementation defined locale. Under Unix, at least, this is
traditionally understood as the locale specified by the LC_xxx
environment variables, but the standard certainly doesn't require this.
In fact, it doesn't require the presence of any locale except "C".

The entire locale mechanism should only be understood as a standard
syntax for accessing implementation defined behavior. And even there,
the standard is only partial -- the program syntax has been
standardized, but things like how the locales are named are still
implementation defined.

> [...]

> > In fact, a locale defines two code sets (at least), a wide character
> > code set, and a narrow one. Presumably, the intent reflected in the
> > Unix naming convention is that the wide character code set always be
> > the same -- at least, all of the names I can see for the encoding
> > refer to 8 bit encodings, and I don't know how you would specify a
> > locale in which both the narrow and the wide character encodings
> > differed from the default.

> > (To tell the truth, I don't quite understand the intent behind a lot
> > of this. In order to be able to redefine my own code translation,
> > at least for multi-byte characters, I need to be able to define my
> > own state_t. But then, I can't use any of the standard streams. The
> > whole thing seems like an essay in futility; more an effort to see
> > just how far you can push templates, rather than an attempt to
> > produce something useful.)

> Sadly I have come to the same conclusion. As another twist I even have
> no idea about how to initialize mbstate_t. IIRC some compilers don't
> like

> mbstate_t state = 0;

> and some don't like

> mbstate_t state = {0};

> The only thing that seems to always work is

> mbstate_t state;
> memset(&state, 0, sizeof(mbstate_t));

Which of course might fail if mbstate_t has a user defined
constructor:-). (I always thought that that was the intent. How else
is the user supposed to know how to initialize it?)

Perhaps the safest solution would be to declare a static mbstate_t, and
use the copy constructor from that. Although I can't find anywhere in
the standard where mbstate_t is required to have a default constructor,
I can hardly imagine that this wasn't the intent.
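
Concretely, relying only on zero-initialization of objects with static
storage duration:

static mbstate_t initial_state;      // zero-initialized, being static
mbstate_t state = initial_state;     // copy the pristine state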

> [...]

> >>> A better implementation would use wiostream, since there is
> >>> normally only one encoding for wchar_t.

> >> Not on Solaris AFAIK. The wchar_t encoding is locale dependent
> >> there. Which incidentally means that a wide stream using . utf-8
> >> locale will most certainly assume different wchar_t encoding than a
> >> wide stream using EUC one.

> > As far as I know, EUC is dead. But I could be wrong.

> Well the localization standards at my workplace say I have to support
> it.

Externally, or internally? File encodings never die. But I don't think
that EUC is relevant for internal encoding.

With the Solaris based compilers I have access to, wchar_t is a 32 bit
type. I would automatically suppose ISO 10646. But given the little
amount of support for using it, it is hard to say exactly -- I suspect
that it is mostly whatever you like.

> > With regards to Solaris or Sun CC, I could find no specification
> > whatsoever as to what the native character set is for wchar_t.

> And the only way to find this out dynamically for narrow chars,
> nl_langinfo(CODESET), doesn't provide information about wchar_ts
> either.

And it doesn't tell the truth about the narrow chars, either. At least
not on my machine -- it reports 646, but I'm using an 8859-1 locale (and
the various functions in <ctype.h> return information for this locale).

Of course, the CODESET parameter isn't documented in the man page
either:-). So maybe 646 is their way of abbreviating ISO 10646, and it
really is trying to tell me about the wide char codeset. But I doubt
it.

> > The question is probably academic anyway: any attempt to use
> > std::wcout with Sun CC results in a core dump, and with g++ a
> > compile-time error ('wcout' undeclared in namespace 'std'). I sure
> > do envy the tools you have under Windows.

> It has been some time since I worked on Solaris but I believe you can
> make it work with STLPort. I may be wrong though.

Perhaps. We have to link with third party libraries, and mixing library
implementations is definitely a no-no, so we are stuck with Sun's
(actually Rogue Wave's). Maybe I should give it a try with g++,
although for the moment, the official line is to use the native
compilers, preferably with the oldest versions we can still get support
for:-). (We use g++ 2.95.2 on the Linux machines.)

> [...]

> >>> The intent of the standard streams, and the use of imbued locales,
> >>> is that the locale specify the external code set, and that the
> >>> program use a constant internal code set. This is almost
> >>> guaranteed in the case of wide character streams.

> >> It is most definitely not guaranteed. There is even a special
> >> macro in C99 __STDC_ISO_10646__ that is supposed to tell when it
> >> is. Not yet implemented on most compilers unfortunately.

> > I didn't say you were guaranteed ISO 10646. I said that the
> > character encoding didn't change with the locale.

> Sure, what I wanted to say is that when this macro is defined you have
> the guaranteed charset for wchar_t.

OK. But it isn't defined with Sun CC or with gcc on my machine. I
don't know whether this is because they don't use ISO 10646, or because
they are not C99. Or both.

> > As I understand it, the *intent* of wchar_t is that 1) it is not
> > multibyte, and 2) it doesn't change with the locale. Of course,
> > neither is guaranteed, and wchar_t doesn't need to be any bigger
> > than a char.

> Windows and AFAIK AIX violate (1) while Solaris violates both (1) and
> (2).

Windows got bitten by a changing standard. I suspect that this will be
the case on other machines as well (AIX, etc.). Under Solaris, nothing
changes in the few locales I have installed on my machine, but that
doesn't tell me much.

> The end result is that a portable project has to define its own wide
> character type like utf32_t with all the headaches that result from
> that.

There is a proposal before the C committee for a char16_t and a
char32_t. For UTF-16 and UTF-32 respectively. So maybe, if we wait
long enough:-).

> [...]

> >> Now I am totally confused. All implementations I ever worked on
> >> didn't convert anything for narrow streams.

> > Did you imbue them with a locale which should have done code
> > translation? (Whether such a locale is present is obviously
> > implementation defined. The "C" locale IS forbidden from any code
> > translation. And perhaps it is the intent that all
> > codecvt<char,char> are supposed to use a degenerate code
> > translation.)

> >> Does it mean that they were essentially using the "C" locale? Is
> >> such behavior a deficiency in all libraries I have worked with?
> >> Assuming it is what should be this special internal codeset?

> > First, the default global locale is the "C" locale, and the
> > default locale for a stream is the global locale at the moment you
> > created the stream. (This means, for example, that unless you
> > specifically imbue, cout, cin and their wide stream counterparts
> > will use the "C" locale.)

> Yes of course. I just tried a variation of your example on VC 7.1. As
> long as normal named locales are used the library doesn't even invoke
> the codecvt.

Not even to ask if it IS a normal named locale? By calling
always_noconv() on it? I believe that it is the intent that the library
call always_noconv(), and skip the code translation phase if it returns
true.
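
The check is cheap to make by hand, so it is easy to see what any given
library thinks of a particular locale:

#include <iostream>
#include <locale>

int main()
{
    typedef std::codecvt<char, char, std::mbstate_t> cvt;
    std::locale loc("");   // the implementation-defined native locale
    std::cout << std::boolalpha
              << std::use_facet<cvt>(loc).always_noconv() << '\n';
}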

> [...]

> >>> The safest solution anytime you have to deal with several
> >>> different codesets is doubtlessly to do everything in wchar_t.

> >> Apparently I do not understand something very fundamental about C++
> >> iostreams so I don't know if this could be true there. What I know
> >> for sure is that when using plain C/Posix on Solaris wchar_t
> >> doesn't solve anything (unless you always force your users to work
> >> in .utf-8 locales).

> > You don't force the users to do anything with regards to locales.
> > If you know the encoding of a file, You imbue the locale yourself.
> > In theory; neither Sun CC nor g++ actually support locales or wide
> > characters in any sensible way.

> > If you had a standards conforming compiler and library (Comeau with
> > the Dinkumware library, for example), you should be able to code
> > translate files by simply imbuing the correct locales in the files.

> Unless the library decides to implement "always degenerate"
> conversions as Dinkumware library apparently does.

Is it just a case that all of the codecvt's delivered with the compiler
are always degenerate, or is it a case where the library actually has
code which wouldn't support a non-degenerate form?

It was always my belief that an implementation was required to support
the latter. The only text I have found in the standard, however, seems
a bit ambiguous for the case of narrow character streams.

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


P.J. Plauger

unread,
May 18, 2004, 6:00:25 PM5/18/04
to
"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:opWdnQEs-MN...@speakeasy.net...

> > First, the default global locale is the "C" locale, and the
> > default locale for a stream is the global locale at the moment you
> > created the stream. (This means, for example, that unless you
> > specifically imbue, cout, cin and their wide stream counterparts will
> > use the "C" locale.)
>
> Yes of course. I just tried a variation of your example on VC 7.1. As long
> as normal named locales are used the library doesn't even invoke the
> codecvt.

It's not quite that simple, but the use of codecvt facets is indeed
optimized away whenever possible.

> [...]


> > If you had a standards conforming compiler and library (Comeau with
> > the Dinkumware library, for example), you should be able to code
> > translate files by simply imbuing the correct locales in the files.
>
> Unless the library decides to implement "always degenerate" conversions as
> Dinkumware library apparently does.

It doesn't. It has way more code-conversion support than other C++
libraries. With our CoreX library it has even more.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Eugene Gershnik

unread,
May 19, 2004, 10:16:15 PM5/19/04
to
P.J. Plauger wrote:
> "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> news:opWdnQEs-MN...@speakeasy.net...
>> ka...@gabi-soft.fr wrote:
>>> First, the default global locale is the "C" locale, and
>>> the default locale for a stream is the global locale at the
>>> moment you created the stream. (This means, for example, that
>>> unless you specifically imbue, cout, cin and their wide stream
>>> counterparts will use the "C" locale.)

>> Yes of course. I just tried a variation of your example on VC 7.1.
>> As long as normal named locales are used the library doesn't even
>> invoke the codecvt.
>
> It's not quite that simple, but the use of codecvt facets is indeed
> optimized away whenever possible.

ka...@gabi-soft.fr wrote:
> Not even to ask if it IS a normal named locale? By calling
> always_noconv() on it? I believe that it is the intent that the
> library call always_noconv(), and skip the code translation phase if
> it returns true.

It does ask. From a casual look at the sources it appears that the
conversions are optimized away if the pointer to the codecvt stored in a
basic_filebuf is 0. Whether it is 0 is determined by calling the facet's
always_noconv(). I have found only two implementations of either
always_noconv() or do_always_noconv() in the library sources. One for
codecvt<char, char, mbstate_t> that returns true and another for
codecvt<wchar_t, char, mbstate_t> that returns false. So unless I missed
something (which very well could be the case) it appears that narrow streams
are always degenerate. (If anyone wants to check this, the relevant files
appear to be <fstream> and <xlocale>.)

>>> If you had a standards conforming compiler and library (Comeau
>>> with the Dinkumware library, for example), you should be able to
>>> code translate files by simply imbuing the correct locales in
>>> the files.
>> Unless the library decides to implement "always degenerate"
>> conversions as Dinkumware library apparently does.
>
> It doesn't.

Well, then what am I missing in the description above? Is there a way to
imbue a named locale in a _narrow_ stream and have a codecvt that is not
'degenerate'?

> It has way more code-conversion support than other C++
> libraries.

Yes, but all of them are between wchar_t and char, aren't they?

--
Eugene

Eugene Gershnik

unread,
May 20, 2004, 5:59:16 AM5/20/04
to
ka...@gabi-soft.fr wrote:
> "Eugene Gershnik" <gers...@hotmail.com> wrote in message

>>> I don't think so. All system provided locales have a name; no
>>> user defined locales have a name. I'm not sure what happens if,
>>> when mixing facets from several different system provided
>>> locales, you end up with a locale which is identical to another
>>> system specific locale.
>
>> An interesting question actually.
>
> I wish that there were a few less "interesting questions" where
> locales were concerned.

Me too :-)

[...]

Well I think this is the best way to solve the problem.

[...]

>> Ok now I think I am less confused. Evidently VC library decided to
>> make all char<->char conversions degenerate by default and there is
>> no single implementation defined internal charset that all streams
>> share.
>
> The "default" char<->char is required to be degenerate. You should be
> able to replace it with one (derived from codecvt or codecvt_byname)
> that isn't. Otherwise, why bother with all the complexity?

If I put my own codecvt into a streambuf it will use it (see my reply to
P.J. Plauger).

> The real questions are: what locales/codecvt's (in addition to the
> default) come with your compiler, and what ones can you easily add?

My understanding is that VC library supplies 1 + many codecvt facets. The 1
is for any char<->char conversions and it is degenerate. The 'many' are for
wchar_t<->char conversions and they use OS primitives to do their job. These
facets are constructed dynamically so you effectively have almost as many as
there are installed codepages on your system. Almost because some codepages
like utf-8 are not supported.

>> Thus, if I want consistent
>> semantics unaffected by my library choices I can only rely on some
>> non-standard codecvt implementation.
>
> If you want any specific locale behavior, you have to rely on
> implementation defined behavior. The standard requires the presence
> of the "C" locale, and specifies it pretty tightly -- islower( 'à' )
> must return false, for example. The standard also requires that the
> constructor to locale take an empty string as an argument, to create a
> implementation defined locale. Under Unix, at least, this is
> traditionally understood as the locale specified by the LC_xxx
> environment variables, but the standard certainly doesn't require
> this.

Posix does for "XSI conformant systems", whatever that means. See
http://www.opengroup.org/onlinepubs/007908799/xsh/setlocale.html
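
The difference is easy to observe; what locale("") ends up naming is,
again, implementation defined:

#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::cout << std::locale("C").name() << '\n';  // always "C"
    std::cout << std::locale("").name() << '\n';   // whatever the
                                                   // environment selects
}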

> In fact, it doesn't require the presence of any locale except "C".
>
> The entire locale mechanism should only be understood as a standard
> syntax for accessing implementation defined behavior. And even there,
> the standard is only partial -- the program syntax has been
> standardized, but things like how the locales are named are still
> implementation defined.

This is a very good point.
BTW I think this is the only example of the standard trying to handle a big
optional platform facility. If so, I hope this will serve as an example of
how _not_ to do it with threads or dynamic loading.

[...]

>> As another twist I even
>> have no idea about how to initialize mbstate_t. IIRC some
>> compilers don't like
>>
>> mbstate_t state = 0;
>>
>> and some don't like
>>
>> mbstate_t state = {0};
>>
>> The only thing that seems to always work is
>>
>> mbstate_t state;
>> memset(&state, 0, sizeof(mbstate_t));
>
> Which of course might fail if mbstate_t has a user defined
> constructor:-). (I always thought that that was the intent. How else
> is the user supposed to know how to initialize it?)
> Perhaps the safest solution would be to declare a static mbstate_t,
> and use the copy constructor from that. Although I can't find
> anywhere in the standard where mbstate_t is required to have a default
> constructor, I can hardly imagine that this wasn't the intent.

Well it is shared with the C library so it is probably a POD. FWIW the Posix
definition from
http://www.opengroup.org/onlinepubs/009695399/basedefs/wchar.h.html is

mbstate_t An object type other than an array type that can hold the
conversion state information necessary to convert between sequences of
(possibly multi-byte) characters and wide characters. [XSI] If a codeset is
being used such that an mbstate_t needs to preserve more than 2 levels of
reserved state, the results are unspecified.

[...]

>>>>> A better implementation would use wiostream, since there is
>>>>> normally only one encoding for wchar_t.
>>>>>
>>>>> Not on Solaris AFAIK. The wchar_t encoding is locale dependent
>>>>> there. Which incidentally means that a wide stream using .utf-8
>>>>> locale will most certainly assume different wchar_t encoding
>>>>> than a wide stream using EUC one.

>>>> As far as I know, EUC is dead. But I could be wrong.

>>> Well the localization standards at my workplace say I have to
>>> support it.

> Externally, or internally? File encodings never die. But I don't
> think that EUC is relevant for internal encoding.

Externally of course. However, sometimes there is no choice but to preserve
the external encoding internally.

>
> With the Solaris based compilers I have access to, wchar_t is a 32 bit
> type. I would automatically suppose ISO 10646. But given the little
> amount of support for using it, it is hard to say exactly -- I suspect
> that it is mostly whatever you like.

It is always 32-bit but it is ISO 10646 only for a .utf-8 locale. There is
precious little information on the web about this issue but googling
reveals things like this:
http://mail.nl.linux.org/linux-utf8/2001-09/msg00076.html

>>> With regards to Solaris or Sun CC, I could find no specification
>>> whatsoever as to what the native character set is for wchar_t.
>
>> And the only way to find this out dynamically for narrow chars,
>> nl_langinfo(CODESET), doesn't provide information about wchar_ts
>> either.
>
> And it doesn't tell the truth about the narrow chars, either. At
> least not on my machine -- it reports 646, but I'm using an 8859-1
> locale (and the various functions in <ctype.h> return information for
> this locale).
>
> Of course, the CODESET parameter isn't documented in the man page
> either:-).

I think it is. Googling reveals
http://docs.sun.com/db/doc/816-0218/6m6nirqkr?a=view

> So maybe 646 is their way of abbreviating ISO 10646, and
> it really is trying to tell me about the wide char codeset. But I
> doubt
> it.

It appears that this is a 'feature' of this OS. 646 means 8859-1 :-)

[...]

>> The end result is that a portable project has to define its own
>> wide character type like utf32_t with all the headaches that
>> result from that.
>
> There is a proposal before the C committee for a char16_t and a
> char32_t. For UTF-16 and UTF-32 respectively. So maybe, if we wait
> long enough:-).

But as far as P.J. Plauger told me here
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&selm=DUV%25b.2559%24C65.1533%40nwrddc01.gnilink.net&rnum=11
(look at the very end) these won't be required to hold UTF. Yet another
loophole for vendors to exploit ;-)

[...]

>> I just tried a variation of your example on VC 7.1.
>> As long as normal named locales are used the library doesn't even
>> invoke the codecvt.
>
> Not even to ask if it IS a normal named locale? By calling
> always_noconv() on it? I believe that it is the intent that the
> library call always_noconv(), and skip the code translation phase if
> it returns true.

It does ask. More details are in my answer to P.J. Plauger.

[...]

>>> If you had a standards conforming compiler and library (Comeau
>>> with the Dinkumware library, for example), you should be able to
>>> code translate files by simply imbuing the correct locales in
>>> the files.
>
>> Unless the library decides to implement "always degenerate"
>> conversions as Dinkumware library apparently does.
>
> Is it just a case that all of the codecvt's delivered with the
> compiler are always degenerate, or is it a case where the library
> actually has code which wouldn't support a non-degenerate form?
>
> It was always my belief that an implementation was required to
> support the latter. The only text I have found in the standard,
> however, seems a bit ambiguous for the case of narrow character streams.

My understanding is that the library doesn't supply a non-degenerate narrow
codecvt but can certainly use one if it is provided by the user.

--
Eugene

Ben Hutchings

unread,
May 20, 2004, 6:53:59 AM5/20/04
to
ka...@gabi-soft.fr wrote:
> "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> news:<opWdnQEs-MN...@speakeasy.net>...
> > ka...@gabi-soft.fr wrote:
<snip>

> > > (To tell the truth, I don't quite understand the intent behind a lot
> > > of this. In order to be able to redefine my own code translation,
> > > at least for multi-byte characters, I need to be able to define my
> > > own state_t. But then, I can't use any of the standard streams. The
> > > whole thing seems like an essay in futility; more an effort to see
> > > just how far you can push templates, rather than an attempt to
> > > produce something useful.)
>
> > Sadly I have come to the same conclusion. As another twist I even have
> > no idea about how to initialize mbstate_t.
<snip>

> > The only thing that seems to always work is
>
> > mbstate_t state;
> > memset(&state, 0, sizeof(mbstate_t));
>
> Which of course might fail if mbstate_t has a user defined
> constructor:-). (I always thought that that was the intent. How else
> is the user supposed to know how to initialize it?)

mbstate_t is part of the C library so it must be POD. C99 says it
must be of "an object type other than an array type" and that various
functions require it to be "initialized to zero" which I assume means
the same as "zero-initialized" in C++.

> Perhaps the safest solution would be to declare a static mbstate_t, and
> use the copy constructor from that.

That sounds right to me.

> Although I can't find anywhere in the standard where mbstate_t is
> required to have a default constructor, I can hardly imagine that
> this wasn't the intent.

It must do since it's POD.

<snip>


> > And the only way to find this out dynamically for narrow chars,
> > nl_langinfo(CODESET), doesn't provide information about wchar_ts
> > either.
>
> And it doesn't tell the truth about the narrow chars, either. At least
> not on my machine -- it reports 646, but I'm using an 8859-1 locale (and
> the various functions in <ctype.h> return information for this locale).
>
> Of course, the CODESET parameter isn't documented in the man page
> either:-). So maybe 646 is their way of abbreviating ISO 10646, and it
> really is trying to tell me about the wide char codeset. But I doubt
> it.

<snip>

nl_langinfo(CODESET) is specified in POSIX.1
<http://www.opengroup.org/onlinepubs/009695399/basedefs/langinfo.h.html>
and "646" is indirectly documented as representing ISO 646 under
Solaris <http://docs.sun.com/db/doc/806-6642/6jfipqu80?a=view>. (Of
course ISO 646 defines several character sets; hopefully they mean ISO
646:1986 IRV, which is identical to ASCII.)

So I'd say this is a bug.
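
For reference, the query under discussion (POSIX, not ISO C++; the
output -- "646", "ISO8859-1", or whatever -- is exactly the
platform-specific part):

#include <clocale>
#include <cstdio>
#include <langinfo.h>    // POSIX

int main()
{
    std::setlocale(LC_ALL, "");    // adopt the user's locale first
    std::printf("codeset: %s\n", nl_langinfo(CODESET));
}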

P.J. Plauger

unread,
May 20, 2004, 10:32:16 AM5/20/04
to
"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:48qdnftYetc...@speakeasy.net...

They're always degenerate if you use the default codecvt facets, yes.
That's required by the C++ Standard, IIRC.

> >>> If you had a standards conforming compiler and library (Comeau
> >>> with the Dinkumware library, for example), you should be able to
> >>> code translate files by simply imbuing the correct locales in
> >>> the files.
> >> Unless the library decides to implement "always degenerate"
> >> conversions as Dinkumware library apparently does.
> >
> > It doesn't.
>
> Well, then what am I missing in the description above? Is there a way to
> imbue a named locale in a _narrow_ stream and have a codecvt that is not
> 'degenerate'?

Yes. Just do it.

> > It has way more code-conversion support than other C++
> > libraries.
>
> Yes but all of them are between wchar_t and char aren't they?

No. Look at our on-line CoreX manual. All codecvt facets are
templatized on the "wide" character type. Granted, most make
sense only for wide characters that are at least 16 bits, but
not all. See, for example, template class codecvt_ebcdic (in
include/codecvt/ebcdic). It converts between ASCII (ISO 8859-1)
and EBCDIC using one-to-one mapping tables. Works fine char
to char. And you can, of course, replace the tables with
whatever mapping your heart desires.

CoreX is a *very* flexible kit of tools for doing code conversions.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

P.J. Plauger

unread,
May 20, 2004, 12:36:22 PM5/20/04
to
"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:6MKdnWdKwbV...@speakeasy.net...

> >> Ok now I think I am less confused. Evidently VC library decided to
> >> make all char<->char conversions degenerate by default and there is
> >> no single implementation defined internal charset that all streams
> >> share.
> >
> > The "default" char<->char is required to be degenerate. You should be
> > able to replace it with one (derived from codecvt or codecvt_byname)
> > that isn't. Otherwise, why bother with all the complexity?
>
> If I put my own codecvt into a streambuf it will use it (see my reply to
> P.J. Plauger).

Correct. It even has a good chance of doing what you want, unlike
with other C++ libraries.

> > The real questions are: what locales/codecvt's (in addition to the
> > default) come with your compiler, and what ones can you easily add?
>
> My understanding is that VC library supplies 1 + many codecvt facets. The 1
> is for any char<->char conversions and it is degenerate. The 'many' are for
> wchar_t<->char conversions and they use OS primitives to do their job. These
> facets are constructed dynamically so you effectively have almost as many as
> there are installed codepages on your system. Almost because some codepages
> like utf-8 are not supported.

Correct again. This is another feature unique to the Dinkumware
implementation of the Standard C++ library. Wherever possible,
we piggyback on the existing C locales and their effect on
mbtowc/wctomb. So even though we supply only two generic
codecvt facets with our library, the wchar_t/char one taps into
the broad assortment of conversions supported by Windows.

> >> The end result is that a portable project has to define its own
> >> wide character type like utf32_t with all the headaches that
> >> result from that.
> >
> > There is a proposal before the C committee for a char16_t and a
> > char32_t. For UTF-16 and UTF-32 respectively. So maybe, if we wait
> > long enough:-).
>
> But as far as P.J. Plauger told me here
> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&selm=DUV%25b.2559%24C65.1533%40nwrddc01.gnilink.net&rnum=11
> (look at the very end) these won't be required to hold UTF. Yet another
> loophole for vendors to exploit ;-)

Or, to put it another way, it's yet another bit of latitude
extended to compilers for, say, embedded systems so they
can have efficient code generation and small libraries yet
still claim conformance.

If you use our CoreX library with the char16_t and char32_t
character types, you can choose from a broad assortment of
encodings, regardless of what the compiler vendor supplies.
This assortment does include UTF-16 as a "wide character"
type, even though it breaks the fundamental rules of Standard
C. So you can convert between UTF-8 externally and UTF-16
internally on any system, no matter what it chooses for the
representation of wchar_t or a wide character encoding.

> >> I just tried a variation of your example on VC 7.1.
> >> As long as normal named locales are used the library doesn't even
> >> invoke the codecvt.

Doesn't have to. See above.

> > Not even to ask if it IS a normal named locale? By calling
> > always_noconv() on it? I believe that it is the intent that the
> > library call always_noconv(), and skip the code translation phase if
> > it returns true.
>
> It does ask. More details are in my answer to P.J. Plauger.

And more details are in my earlier reply. You've overlooked a *lot*
of functionality that we supply.

> >>> If you had a standards conforming compiler and library (Comeau
> >>> with the Dinkumware library, for example), you should be able to
> >>> code translate files by simply imbuing the correct locales in
> >>> the files.
> >
> >> Unless the library decides to implement "always degenerate"
> >> conversions as Dinkumware library apparently does.
> >
> > Is it just a case that all of the codecvt's delivered with the
> > compiler are always degenerate, or is it a case where the library
> > actually has code which wouldn't support a non-degenerate form?
> >
> > It was always my belief that an implementation was required to
> > support the latter. The only text I have found in the standard,
> > however, seems a bit ambiguous for the case of narrow character streams.
>
> My understanding is that the library doesn't supply a non-degenerate narrow
> codecvt but can certainly use one if it is provided by the user.

Correct. That's what the C++ Standard calls for.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

llewelly

unread,
May 20, 2004, 1:02:07 PM5/20/04
to
"Eugene Gershnik" <gers...@hotmail.com> writes:
[snip]

> BTW I think this is the only example of the standard trying to handle a big
> optional platform facility.

Arguably, hosted vs. freestanding implementations is another.

> If so I hope this will serve as example of how
> _not_ to do it with threads or dynamic loading.

[snip]

llewelly

unread,
May 20, 2004, 1:03:01 PM5/20/04
to
Ben Hutchings <do-not-s...@bwsint.com> writes:

> ka...@gabi-soft.fr wrote:
> > "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> > news:<opWdnQEs-MN...@speakeasy.net>...
> > > ka...@gabi-soft.fr wrote:
> <snip>
> > > > (To tell the truth, I don't quite understand the intent behind a lot
> > > > of this. In order to be able to redefine my own code translation,
> > > > at least for multi-byte characters, I need to be able to define my
> > > > own state_t. But then, I can't use any of the standard streams. The
> > > > whole thing seems like an essay in futility; more an effort to see
> > > > just how far you can push templates, rather than an attempt to
> > > > produce something useful.)
> >
> > > Sadly I have come to the same conclusion. As another twist I even have
> > > no idea about how to initialize mbstate_t.
> <snip>
> > > The only thing that seems to always work is
> >
> > > mbstate_t state;
> > > memset(&state, 0, sizeof(mbstate_t));

[snip]

I don't know if this is portable, but I think the standard requires

mbstate_t state= {0};

to zero-initialize mbstate_t whether mbstate_t is builtin or POD. I
get this from 8.5/13, which says:

# If T is a scalar type, then a declaration of the form
# T x = { a };
# is equivalent to
# T x = a;

Certainly gcc 2.95 and up compile:

typedef int mbstate_t;

int main()
{
mbstate_t state= {0};
}

and give the expected behavior.

Eugene Gershnik

unread,
May 21, 2004, 6:01:32 AM5/21/04
to
P.J. Plauger wrote:
> "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> news:48qdnftYetc...@speakeasy.net...
>
>> P.J. Plauger wrote:
>>> "Eugene Gershnik" <gers...@hotmail.com> wrote in message
>>> news:opWdnQEs-MN...@speakeasy.net...
>>>> Yes of course. I just tried a variation of your example on VC 7.1.
>>>> As long as normal named locales are used the library doesn't even
>>>> invoke the codecvt.
>>>
>>> It's not quite that simple, but the use of codecvt facets is indeed
>>> optimized away whenever possible.
>>
>> ka...@gabi-soft.fr wrote:
>>> Not even to ask if it IS a normal named locale? By calling
>>> always_noconv() on it? I believe that it is the intent that the
>>> library call always_noconv(), and skip the code translation phase if
>>> it returns true.
>>
>> It does ask.
[...]

>> So unless I
>> missed something (which very well could be the case) it appears that
>> narrow streams are always degenerate. (If anyone wants to check this
>> the relevant files appear to be <fstream> and <xlocale>)
>
> They're always degenerate if you use the default codecvt facets, yes.
> That's required by the C++ Standard, IIRC.

So you do think that §22.2.1.5/3 implies that the default codecvt<char,char>
is degenerate? To quote James Kanze from earlier in this thread

<quote>


I'm not quite sure what this is supposed to mean if you have more than
one locale (and thus, more than one instantiation of codecvt for each
pair of types); maybe the intent IS that all conversions char<->char are
"degenerate", but all that is mentionned is conversions involving the
"implementation defined native character sets".

</quote>


>>>>> If you had a standards conforming compiler and library (Comeau
>>>>> with the Dinkumware library, for example), you should be able to
>>>>> code translate files by simply imbuing the correct locales in
>>>>> the files.
>>>> Unless the library decides to implement "always degenerate"
>>>> conversions as Dinkumware library apparently does.
>>>
>>> It doesn't.
>>
>> Well, then what am I missing in the description above? Is there a
>> way to imbue a named locale in a _narrow_ stream and have a codecvt
>> that is not 'degenerate'?
>
> Yes. Just do it.

I fail to understand what you are saying here. Just a few sentences above
you agreed that the codecvt<char,char> supplied in a named locale is a no-op.
If so, imbuing a *named* locale in a *narrow* stream will put a no-op codecvt
there.

--
Eugene

Eugene Gershnik

unread,
May 21, 2004, 6:02:09 AM5/21/04
to
P.J. Plauger wrote:
> "Eugene Gershnik" wrote
>> ka...@gabi-soft.fr wrote:
>>> "Eugene Gershnik" wrote

>>>> The end result is that a portable project has to define its own
>>>> wide character type like utf32_t with all the headaches that
>>>> result from that.

>>> There is a proposal before the C committee for a char16_t and a
>>> char32_t. For UTF-16 and UTF-32 respectively. So maybe, if we
>>> wait long enough:-).
>>
>> But as far as P.J. Plauger told me here
>> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&selm=DUV%25b.2559%24C65.1533%40nwrddc01.gnilink.net&rnum=11
>> (look at the very end) these won't be required to hold UTF. Yet
>> another loophole for vendors to exploit ;-)
>
> Or, to put it another way, it's yet another bit of latitude
> extended to compilers for, say, embedded systems so they
> can have efficient code generation and small libraries yet
> still claim conformance.

I find a curious similarity between this argument and the idea of inheriting
all classes from Object so that any object can be put into a container. Both
share the same problems IMHO.
How about introducing levels of functionality, so an implementation could
legally say "I do not support char16_t" and still be conformant?

> If you use our CoreX library with the char16_t and char32_t
> character types, you can choose from a broad assortment of
> encodings, regardless of what the compiler vendor supplies.
> This assortment does include UTF-16 as a "wide character"
> type, even though it breaks the fundamental rules of Standard
> C. So you can convert between UTF-8 externally and UTF-16
> internally on any system, no matter what it chooses for the
> representation of wchar_t or a wide character encoding.

Yes I can use a 3rd party library for that. Although in this case I'd prefer
some specialized one like ICU that calls things what they are and allows me
to do my job rather than spend time thinking how to mold my job into
something compatible with iostreams.

>>> Not even to ask if it IS a normal named locale? By calling
>>> always_noconv() on it? I believe that it is the intent that the
>>> library call always_noconv(), and skip the code translation phase
>>> if it returns true.
>>
>> It does ask. More details are in my answer to P.J. Plauger.
>
> And more details are in my earlier reply. You've overlooked a *lot*
> of functionality that we supply.

I couldn't find any details in that message. Could you tell me what exactly
I have overlooked with regards to the topic of this sub-thread?

--
Eugene

ka...@gabi-soft.fr

unread,
May 21, 2004, 5:48:09 PM5/21/04
to
llewelly <llewe...@xmission.dot.com> wrote in message
news:<86pt8za...@Zorthluthik.local.bar>...

> "Eugene Gershnik" <gers...@hotmail.com> writes:
> [snip]

> > BTW I think this is an only example of the standard trying to handle
> > a big optional platform facility.

> Arguably, hosted vs. freestanding implementations is another.

That's a special case. Arguably, the support for files, in general, is
one, since [io]fstream::open is allowed to fail, regardless of the
parameters:-). More explicitly, the C standard allows time() to fail,
although I'm not sure that one would consider this a "big" facility.

IMHO the standard pretends to support only two versions, freestanding
and hosted. In fact, it supports many optional features. The result
is that, for the most part, you can only detect the presence or absence
of optional features at runtime. Not a good thing. Personally, I'd
like to see a standard interface to dynamic linking. And to sockets.
And for a GUI. And threads. But before getting bogged down in the
details of any of these, I really think that the standard has to address
the question of optional packages. If my platform doesn't support a
GUI, I don't want to have to wait until I've linked everything and tried
to execute to find out.

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


ka...@gabi-soft.fr

unread,
May 21, 2004, 5:52:25 PM5/21/04
to
"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:<6MKdnWdKwbV...@speakeasy.net>...

> ka...@gabi-soft.fr wrote:
> > "Eugene Gershnik" <gers...@hotmail.com> wrote in message
[...]

> > The real questions are: what locales/codecvt's (in addition to the
> > default) come with your compiler, and what ones can you easily add?

> My understanding is that VC library supplies 1 + many codecvt
> facets. The 1 is for any char<->char conversions and it is
> degenerate.

That is the minimum required by the standard.

> The 'many' are for wchar_t<->char conversions and they use
> OS primitives to do their job. These facets are constructed
> dynamically so you effectively have almost as many as there are
> installed codepages on your system. Almost because some codepages like
> utf-8 are not supported.

Is UTF-8 a codepage? Or is it simply an encoding for which there is no
codepage? (What is a codepage, exactly? I remember it from MS-DOS, but
there, it was really, in the end, a mapping of bytes to display bit
patterns. If this is still the basis, then having a codepage for a
multibyte encoding is a contradiction in terms.)

> >> Thus, if I want consistent semantics unaffected by my library
> >> choices I can only rely on some non-standard codecvt
> >> implementation.

> > If you want any specific locale behavior, you have to rely on
> > implementation defined behavior. The standard requires the
> > presence of the "C" locale, and specifies it pretty tightly --
> > islower( 'à' ) must return false, for example. The standard also
> > requires that the constructor to locale take an empty string as an
> > argument, to create a implementation defined locale. Under Unix,
> > at least, this is traditionally understood as the locale specified
> > by the LC_xxx environment variables, but the standard certainly
> > doesn't require this.

> Posix does for "XSI conformant systems" whatever that means. See
> http://www.opengroup.org/onlinepubs/007908799/xsh/setlocale.html

I meant the C standard. I sort of thought that this "typical Unix
behavior" was actually specified by Posix, but was too lazy to look it
up.
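
To illustrate the portable core of the mechanism -- a minimal sketch;
which locale "" actually names is implementation defined, and any other
name may simply throw:

#include <iostream>
#include <locale>
#include <stdexcept>

int main()
{
    try {
        // "" names the implementation defined native locale; under
        // Unix this traditionally follows the LC_xxx variables
        std::locale native("");
        std::cout << "native locale: " << native.name() << '\n';
    }
    catch (std::runtime_error& e) {
        // "C" is the only name guaranteed to be accepted
        std::cout << "not supported: " << e.what() << '\n';
    }
    return 0;
}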

> > In fact, it doesn't require the presence of any locale except "C".

> > The entire locale mechanism should only be understood as a standard
> > syntax for accessing implementation defined behavior. And even
> > there, the standard is only partial -- the program syntax has been
> > standardized, but things like how the locales are named are still
> > implementation defined.

> This is a very good point.

> BTW I think this is the only example of the standard trying to handle a
> big optional platform facility. If so I hope this will serve as an
> example of how _not_ to do it with threads or dynamic loading.

It's not the only example. The C standard officially allows time() to
fail if the functionality is not supported, and of course, there is
nothing which requires any given string passed to filebuf::open to
succeed, so if your system doesn't support files, you know what to
do:-).

I agree that locale is a pretty good model of how not to handle this
sort of thing.

> [...]

> >> As another twist I even have no idea about how to initialize
> >> mbstate_t. IIRC some compilers don't like

> >> mbstate_t state = 0;

> >> and some don't like

> >> mbstate_t state = {0};

> >> The only thing that seems to always work is

> >> mbstate_t state;
> >> memset(&state, 0, sizeof(mbstate_t));

> > Which of course might fail if mbstate_t has a user defined
> > constructor:-). (I always thought that that was the intent. How
> > else is the user supposed to know how to initialize it?) Perhaps
> > the safest solution would be to declare a static mbstate_t, and use
> > the copy constructor from that. Although I can't find anywhere in
> > the standard where mbstate_t is required to have a default
> > constructor, I can hardly imagine that this wasn't the intent.

> Well it is shared with C library so it is probably a POD.

Yes. I missed that point.

> mbstate_t An object type other than an array type that can hold the
> conversion state information necessary to convert between sequences of
> (possibly multi-byte) characters and wide characters. [XSI] If a
> codeset is being used such that an mbstate_t needs to preserve more
> than 2 levels of reserved state, the results are unspecified.

That's just a copy of what the C standard says. (With the exception of
the XSI addition, which doesn't really add anything.)

> [...]

> >>>>> A better implementation would use wiostream, since there is
> >>>>> normally only one encoding for wchar_t.

> >>>>> Not on Solaris AFAIK. The wchar_t encoding is locale
> >>>>> dependent there. Which incidentally means that a wide stream
> >>>>> using a .utf-8 locale will most certainly assume different
> >>>>> wchar_t encoding than a wide stream using EUC one.

> >>>> As far as I know, EUC is dead. But I could be wrong.
>
> >>> Well the localization standards at my workplace say I have to
> >>> support it.
>
> > Externally, or internally. File encodings never die. But I don't
> > think that EUC is relevant for internal encoding.
>
> Externally of course. However, sometimes there is no choice but to preserve
> external encoding internally.
>

> > type. I would automatically suppose ISO 10646. But given the little
> > amount of support for using it, it is hard to say exactly -- I suspect
> > that it is mostly whatever you like.

> It is always 32-bit but it is ISO 10646 only for a .utf-8 locale. There is
> precious little information on the web about this issue but googling
> reveals things like this:
> http://mail.nl.linux.org/linux-utf8/2001-09/msg00076.html

> >>> With regards to Solaris or Sun CC, I could find no specification
> >>> whatsoever as to what the native character set is for wchar_t.

> >> And the only way to find this out dynamically for narrow chars,
> >> nl_langinfo(CODESET), doesn't provide information about wchar_ts
> >> either.

> > And it doesn't tell the truth about the narrow chars, either. At
> > least not on my machine -- it reports 646, but I'm using an 8859-1
> > locale (and the various functions in <ctype.h> return information
> > for this locale).

> > Of course, the CODESET parameter isn't documented in the man page
> > either:-).

> I think it is. Googling reveals
> http://docs.sun.com/db/doc/816-0218/6m6nirqkr?a=view

I'm doing "man langinfo" on a Sparc under Solaris 2.8, and I'm not
seeing it. Which is surprising -- it's in the header file.

> > So maybe 646 is their way of abbreviating ISO 10646, and it really
> > is trying to tell me about the wide char codeset. But I doubt it.

> It appears that this is a 'feature' of this OS. 646 means 8859-1 :-)

> [...]

> >> The end result is that a portable project has to define its own
> >> wide character type like utf32_t with all the headaches that
> >> result from that.

> > There is a proposal before the C committee for a char16_t and a
> > char32_t. For UTF-16 and UTF-32 respectively. So maybe, if we
> > wait long enough:-).

> But as far as P.J. Plauger told me here
> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&selm=DUV%25b.2559%24C65.1533%40nwrddc01.gnilink.net&rnum=11
> (look at the very end) these won't be required to hold UTF. Yet
> another loophole for vendors to exploit ;-)

I know. But at least you will be able to check it at compile time.
Supposing, of course, that the proposal is adopted, and that vendors pay
some attention to it -- currently, support for C99 is running at about
the same level as support of C++98.

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

P.J. Plauger

unread,
May 21, 2004, 5:53:50 PM5/21/04
to
"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:_4WdnS-MBY7...@speakeasy.net...

If you mean the sentence:

: codecvt<char,char,mbstate_t> implements a degenerate conversion;
: it does not convert at all.

then I would say that, reading between the lines, yes, you could
say that this implies that the conversion is degenerate.

> To quote James Kanze from earlier in this thread
>
> <quote>
> I'm not quite sure what this is supposed to mean if you have more than
> one locale (and thus, more than one instantiation of codecvt for each
> pair of types); maybe the intent IS that all conversions char<->char are
> "degenerate", but all that is mentionned is conversions involving the
> "implementation defined native character sets".
> </quote>

All it says is that the default char/char conversion is one-to-one.
Nothing prohibits you from imbuing your own codecvt facet that
does something else.
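
For concreteness, a facet of that sort might look like this -- a
minimal sketch only; the uppercasing conversion is an invented example,
and whether a given library actually drives such a facet is exactly
what is at issue in this thread:

#include <cctype>
#include <fstream>
#include <locale>

class upper_cvt : public std::codecvt<char, char, std::mbstate_t>
{
protected:
    // report that a conversion is really needed, so the library
    // cannot skip the translation phase
    virtual bool do_always_noconv() const throw()
    { return false; }

    virtual result do_out(std::mbstate_t&,
                          const char* from, const char* from_end,
                          const char*& from_next,
                          char* to, char* to_end, char*& to_next) const
    {
        while (from != from_end && to != to_end)
            *to++ = std::toupper(static_cast<unsigned char>(*from++));
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }
};

int main()
{
    std::ofstream out("demo.txt");
    // some libraries want the imbue before any I/O takes place
    out.imbue(std::locale(out.getloc(), new upper_cvt));
    out << "hello";   // written as "HELLO" if the facet is honored
}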

> >>>>> If you had a standards conforming compiler and library (Comeau
> >>>>> with the Dinkumware library, for example), you should be able to
> >>>>> code translate files by simply imbuing the correct locales in
> >>>>> the files.
> >>>> Unless the library decides to implement "always degenerate"
> >>>> conversions as Dinkumware library apparently does.
> >>>
> >>> It doesn't.
> >>
> >> Well, then what am I missing in the description above? Is there a
> >> way to imbue a named locale in a _narrow_ stream and have a codecvt
> >> that is not 'degenerate'?
> >
> > Yes. Just do it.
>
> I fail to understand what you are saying here. Just a few sentences above
> you agreed that codecvt<char,char> supplied in a named locale is a no-op.
> If so embedding a *named* locale in a *narrow* stream will put a no-op
> codecvt there.

Sorry, I missed the "named" part. Nothing prevents a named locale
from having a nontrivial char/char conversion. I just don't happen
to know of any that do. We have, still in house, an extended
localedef utility that lets you specify code conversions as part
of a named locale -- it's a feature left out of the Posix localedef.
So we *could* contrive a named locale that would control this
facet.

But there's certainly nothing to prevent you, right now, from imbuing
your own codecvt<char, char, mbstate_t> and doing as you please.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

P.J. Plauger

unread,
May 21, 2004, 6:52:48 PM5/21/04
to
"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:ipSdnfJjzPa...@speakeasy.net...

> P.J. Plauger wrote:
> > "Eugene Gershnik" wrote
> >> ka...@gabi-soft.fr wrote:
> >>> "Eugene Gershnik" wrote
> >>>> The end result is that a portable project has to define its own
> >>>> wide character type like utf32_t with all the headaches that
> >>>> result from that.
>
> >>> There is a proposal before the C committee for a char16_t and a
> >>> char32_t. For UTF-16 and UTF-32 respectively. So maybe, if we
> >>> wait long enough:-).
> >>
> >> But as far as P.J. Plauger told me here
> >> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&selm=DUV%25b.2559%24C65.1533%40nwrddc01.gnilink.net&rnum=11
> >> (look at the very end) these won't be required to hold UTF. Yet
> >> another loophole for vendors to exploit ;-)
> >
> > Or, to put it another way, it's yet another bit of latitude
> > extended to compilers for, say, embedded systems so they
> > can have efficient code generation and small libraries yet
> > still claim conformance.
>
> I find a curious similarity between this argument and an idea to inherit
> all classes from Object so that any object can be put into a container.
> Both share the same problems IMHO.

It's not an "argument", it's one of the deliberate principles that
the C committee adopted over 20 years ago, and continues to honor.
By using the pejorative phrase "yet another loophole for vendors
to exploit" you reveal a cynical worldview that implies:

1) The C Standard is full of loopholes -- they must be sloppy.

2) Vendors will take advantage of these loopholes -- they have
nefarious motives for doing so.

I'm simply giving you another viewpoint -- which happens to be
the viewpoint held by the C committee when they added this
particular "loophole".

Interestingly enough, your analogy is quite apt. The object
oriented "idea" is quite a valid worldview that has proven to
be successful in a variety of environments over the decades.
It's not the only worldview, by any means. And you can accuse
it of having "problems" not shared by other approaches. But
those approaches have their own problems not always shared
by OOP languages.

Open your mind to the possibility that there may be more than
one valid way to look at all sorts of things.

> How about introducing levels of functionality so an implementation could
> legally say "I do not support char16_t" and still be conformant?

You've got it. char16_t is part of a Technical Report that is
non-normative. You can claim conformance to the C Standard
without implementing any of the TRs that have been approved or
are now in the works. We added char16_t and char32_t with
TR 19769 as a way to give the UTF-16 crowd what they needed
to write reasonably portable code that uses their favorite
character encoding, regardless of the size or encoding of
wchar_t on any given implementation. We even defined a macro
that advertises whether their favorite character encoding is
actually used. What they *wanted* was to make UTF-16 the *one
and only, mandatory* character encoding for C and C++. We
chose to give them what they needed instead of what they
wanted, which doubtless will not satisfy many UTF-16 enthusiasts.
But that's the way the C committee tries to do things.
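
For reference, the whole of what a program can portably ask -- a sketch
assuming an implementation that ships the TR's <uchar.h>, which nothing
requires of a merely conforming compiler:

#include <stdio.h>
#include <uchar.h>   /* TR 19769 header -- optional */

int main(void)
{
#if defined(__STDC_UTF_16__)
    /* char16_t values are promised to be UTF-16 code units */
    puts("char16_t is UTF-16 here");
#else
    /* the type exists, but its encoding is implementation defined */
    puts("char16_t encoding is unspecified");
#endif
    return 0;
}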

> > If you use our CoreX library with the char16_t and char32_t
> > character types, you can choose from a broad assortment of
> > encodings, regardless of what the compiler vendor supplies.
> > This assortment does include UTF-16 as a "wide character"
> > type, even though it breaks the fundamental rules of Standard
> > C. So you can convert between UTF-8 externally and UTF-16
> > internally on any system, no matter what it chooses for the
> > representation of wchar_t or a wide character encoding.
>
> Yes I can use a 3rd party library for that. Although in this case I'd
> prefer some specialized one like ICU that calls things what they are and
> allows me to do my job rather than spend time thinking how to mold my job
> into something compatible with iostreams.

Okay, but this discussion has been all about using codecvt facets
with iostreams to do this particular job. If you're hoping to
find everything you want included for free with existing locale
machinery, you're in for a disappointment.

> >>> Not even to ask if it IS a normal named locale? By calling
> >>> always_noconv() on it? I believe that it is the intent that the
> >>> library call always_noconv(), and skip the code translation phase
> >>> if it returns true.
> >>
> >> It does ask. More details are in my answer to P.J. Plauger.
> >
> > And more details are in my earlier reply. You've overlooked a *lot*
> > of functionality that we supply.
>
> I couldn't find any details in that message. Could you tell me what
> exactly I have overlooked with regard to the topic of this sub-thread?

You poked a bit at our code and told the world that it was missing
all sorts of functionality. What you didn't see, and what I've
outlined in other postings to this thread, is:

1) Where you've seen "degenerate" mappings, they're required by
the C++ Standard.

2) Our code does let you install custom codecvt facets and honors
them. (Other libraries either stub out this facility or implement
it with so many bugs that only the simplest codecvt facets have a
chance of working properly.)

3) Our standard package also automatically takes advantage of
the changing behavior of mbtowc/wctomb to leverage whatever
code conversions are supplied by the vendor of the underlying
C library. In the case of Windows, that's a nontrivial assortment
(see the sketch below).

4) The C++ Standard intentionally leaves locale support open
ended, just as the C Standard does. We developed our CoreX
library in the spirit envisioned by the C++ committee, as a
natural next step after the Standard C++ library for those who
want richer support for code conversions (among other things).

So you didn't see everything that's there and you complained
about the omission of things that are not required to be
there.
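
To make point 3 concrete, here are the C primitives in question -- a
minimal sketch; the byte sequence is just an illustration and only
converts in a locale whose codeset matches it:

#include <clocale>
#include <cstdlib>
#include <iostream>

int main()
{
    // adopt the user's native locale; from here on mbtowc() uses
    // whatever multibyte encoding the C library associates with it
    std::setlocale(LC_ALL, "");

    const char mb[] = "\xC3\xA9";   // e-acute, in UTF-8
    wchar_t wc;
    int n = std::mbtowc(&wc, mb, sizeof mb - 1);
    if (n > 0)
        std::cout << "consumed " << n << " byte(s)\n";
    else
        std::cout << "not a valid multibyte character here\n";
    return 0;
}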

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Ben Hutchings

unread,
May 22, 2004, 5:35:39 AM5/22/04
to
ka...@gabi-soft.fr wrote:
> "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> news:<6MKdnWdKwbV...@speakeasy.net>...
<snip>

>> The 'many' are for wchar_t<->char conversions and they use
>> OS primitives to do their job. These facets are constructed
>> dynamically so you effectively have almost as many as there are
>> installed codepages on your system. Almost because some codepages like
>> utf-8 are not supported.
>
> Is UTF-8 a codepage?
> Or is it simply an encoding for which there is no codepage?

It is not supported by the core Win32 character conversion functions
but it does have a code page ID (CP_UTF8 or 65001). Some support for
it is provided by a library included with IE, which may or may not be
part of Windows.

> (What is a codepage, exactly? I remember it from MS-DOS, but
> there, it was really, in the end, a mapping of bytes to display bit
> patterns. If this is still the basis, then having a codepage for a
> multibyte encoding is a contradiction in terms.)

Originally it was that, but some versions of MS-DOS supported
multibyte code pages, e.g. Shift_JIS was invented for use in the
Japanese version. So a "code page" is really an encoding. The Win32
system of code pages is similar to that used in MS-DOS and only allows
1 or 2 bytes per character with no shift states. This is why UTF-8 is
not properly supported.

P.J. Plauger

unread,
May 22, 2004, 5:38:12 AM5/22/04
to
<ka...@gabi-soft.fr> wrote in message
news:d6652001.04052...@posting.google.com...

> "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> news:<6MKdnWdKwbV...@speakeasy.net>...
> > ka...@gabi-soft.fr wrote:
> > > "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> [...]
> > > The real questions are: what locales/codecvt's (in addition to the
> > > default) come with your compiler, and what ones can you easily add?
>
> > My understanding is that VC library supplies 1 + many codecvt
> > facets. The 1 is for any char<->char conversions and it is
> > degenerate.
>
> That is the minimum required by the standard.
>
> > The 'many' are for wchar_t<->char conversions and they use
> > OS primitives to do their job. These facets are constructed
> > dynamically so you effectively have almost as many as there are
> > installed codepages on your system. Almost because some codepages like
> > utf-8 are not supported.
>
> Is UTF-8 a codepage? Or is it simply an encoding for which there is no
> codepage? (What is a codepage, exactly? I remember it from MS-DOS, but
> there, it was really, in the end, a mapping of bytes to display bit
> patterns. If this is still the basis, then having a codepage for a
> multibyte encoding is a contradiction in terms.)

Best I can tell, a codepage is roughly analogous to one of the ISO
8859-x encodings. It assigns glyphs to the 256 possible single-byte
encodings. Usually, if not always, the first 128 positions are the
good old ASCII, aka ISO 8859-1 (maybe ISO 646) common subset.
But each of the encodings also corresponds to *some* character in
the UCS-2/UTF-16 subset of Unicode/ISO 10646. So it does make sense
to specify a multibyte-to-wide-character encoding for each code
page that maps its particular 256-element subset to the more or
less universal 16-bit wide-character codes. If you then convert
*that* wide-character code to UTF-8, you have a two-stage process
that translates each code page to a multibyte encoding.
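
A sketch of that two-stage process in Win32 terms -- assuming Windows,
with code page 1252 purely as an example:

#include <windows.h>
#include <string>
#include <vector>

std::string cp1252_to_utf8(const std::string& s)
{
    // stage 1: code page bytes -> UTF-16 (first call sizes the buffer;
    // -1 means "include the terminating null")
    int wn = MultiByteToWideChar(1252, 0, s.c_str(), -1, 0, 0);
    std::vector<wchar_t> w(wn);
    MultiByteToWideChar(1252, 0, s.c_str(), -1, &w[0], wn);

    // stage 2: UTF-16 -> UTF-8 bytes
    int un = WideCharToMultiByte(CP_UTF8, 0, &w[0], -1, 0, 0, 0, 0);
    std::vector<char> u(un);
    WideCharToMultiByte(CP_UTF8, 0, &w[0], -1, &u[0], un, 0, 0);
    return std::string(&u[0]);
}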

We provide all the TinkerToys needed to do this sort of thing
in our CoreX library, but there's such endemic confusion about
everything connected with multibyte and wide character encodings
that it's hard to explain to people what's going on.

> > >> The end result is that a portable project has to define its own
> > >> wide character type like utf32_t with all the headaches that
> > >> result from that.
>
> > > There is a proposal before the C committee for a char16_t and a
> > > char32_t. For UTF-16 and UTF-32 respectively. So maybe, if we
> > > wait long enough:-).
>
> > But as far as P.J. Plauger told me here
> >
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&selm=DUV%25b.2559%24C65.1533%40nwrddc01.gnilink.net&rnum=11
> > (look at the very end) these won't be required to hold UTF. Yet
> > another loophole for vendors to exploit ;-)
>
> I know. But at least you will be able to check it at compile time.
> Supposing, of course, that the proposal is adopted, and that vendors pay
> some attention to it -- currently, support for C99 is running at about
> the same level as support of C++98.

It's been approved by ISO, if my understanding is correct, as a
non-normative addendum. It doesn't require anything particularly
C99-ish, and it's small, so you may see some implementations of
it before terribly long. Dinkumware has had a version working
in-house for some time -- we're currently staging the libraries
for *all* the TRs approved or in the works for both C and C++.
And as I've mentioned before, the codecvt facets in our CoreX
library aren't tied to wchar_t, or to fstream I/O for that
matter, so you can use them with any C++ library available
today.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

James Kanze

unread,
May 22, 2004, 9:41:28 AM5/22/04
to
llewelly <llewe...@xmission.dot.com> writes:

|> Ben Hutchings <do-not-s...@bwsint.com> writes:

|> mbstate_t state= {0};

That sounds about right to me.

From what we've seen so far: mbstate_t is compatible with C, so it must
be a POD, thus an aggregate, so the braces initialization is legal.
It is also legal in C90, so it isn't some new innovation; I find it
thus rather surprising that a compiler would get it wrong.

While I'm at it, I might add that formally, the solution with memset is
NOT legal, contains undefined behavior, and might core dump.
Practically, I can't imagine it ever causing a problem, though.

|> Certainly gcc 2.95 and up compile:

|> typedef int mbstate_t;

|> int main()
|> {
|> mbstate_t state= {0};
|> }

|> and give the expected behavior.

I just looked it up. In K&R 1 (copyright 1978): "When an initializer
applies to a scalar (a pointer or an object of arithmetic type), it
consists of a single expression, perhaps in braces." So it really isn't
a new feature:-).
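
Putting the forms discussed in this thread side by side -- a sketch;
which of them compiles depends entirely on how the implementation
chooses to define mbstate_t:

#include <cstring>
#include <cwchar>

int main()
{
    std::mbstate_t a = std::mbstate_t();   // value construction; zeroes a POD
    std::mbstate_t b = { 0 };              // the aggregate form above --
                                           // rejected by some compilers
    std::mbstate_t c;
    std::memset(&c, 0, sizeof c);          // the byte-wise zeroing

    (void)a; (void)b; (void)c;
    return 0;
}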

--
James Kanze


Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

9 place Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34

James Kanze

unread,
May 16, 2004, 7:43:56 PM5/16/04
to
llewelly <llewe...@xmission.dot.com> writes:

|> ka...@gabi-soft.fr writes:
|> [snip]
|> > With regards to Solaris or Sun CC, I could find no specification
|> > whatsoever as to what the native character set is for wchar_t.
|> > The question is probably academic anyway: any attempt to use
|> > std::wcout with Sun CC results in a core dump, and with g++ a
|> > compile-time error ('wcout' undeclared in namespace 'std'). I
|> > sure do envy the tools you have under Windows.
|> [snip]

|> Then you can envy the tools we have under linux and freeBSD, too.
|> :-)

Well, one would expect g++ to have a slight edge there (although
curiously, emacs seems to run better on both Solaris and Windows than
under Linux -- can't figure out why).

|> #include<iostream>
|> #include<ostream>

|> int main()
|> {
|> std::wcout << L"hello world!" << std::endl;
|> }

|> compiles and works as expected with g++ 3.2.2, 3.3.3, and 3.4.0 (I'm
|> nearly certain it worked with 3.0 too, but I don't have that one
|> to test.)

|> It *doesn't* work using g++ 3.2.2 (yes, a version that supports wcout
|> fine under linux) on solaris 2.8. I don't know why. The trouble
|> seems to be libstdc++-v3's reliance on functions such as
|> wvsprintf(), which aren't in solaris libc (or, AFAIK, in ISO C90.)

|> Probably, you should file a bug report; the maintainers mostly have
|> linux boxen, and need reminding of solaris issues... :-(

Well, I don't have the very latest version of g++ installed, and I know
that this is one area they are actively working on. In 3.0, setting the
global locale was a no-op; even the C setlocale had stopped working. And
I did file a bug report. For the rest, since I know that it is a part of
the library in active evolution, I'm not going to bother them unless I'm
sure it doesn't work in the latest version, and that they think it
should. (On the other hand, as you say -- if it works on Linux, maybe
they do think it is working.)

Still, this worked in VC++ 5.0. And that was how many years ago? Like it
or not, the library implementation that comes with VC++ is still about
the best around, and given the amount of standards innovation in the
library, this more than makes up for any weaknesses in the compiler.

--
James Kanze


Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

9 place Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34

Eugene Gershnik

unread,
May 23, 2004, 7:39:53 AM5/23/04
to
{Please be careful that this thread keeps focused on technical issues.
-mod}

P.J. Plauger wrote:
> "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> news:ipSdnfJjzPa...@speakeasy.net...
>

>>> Or, to put it another way, it's yet another bit of latitude
>>> extended to compilers for, say, embedded systems so they
>>> can have efficient code generation and small libraries yet
>>> still claim conformance.
>>
>> I find a curious similarity between this argument and an idea to
>> inherit all classes from Object so that any object can be put
>> into a container. Both share the same problems IMHO.
>
> It's not an "argument", it's one of the deliberate principles that
> the C committee adopted over 20 years ago, and continues to honor.
> By using the pejorative phrase "yet another loophole for vendors
> to exploit" you reveal a cynical worldview that implies:
>
> 1) The C Standard is full of loopholes -- they must be sloppy.
>
> 2) Vendors will take advantage of these loopholes -- they have
> nefarious motives for doing so.

The parts after '--' are things I never said or implied. I don't believe
the C or C++ committees are sloppy or that vendors have any nefarious
motives with regard to loopholes. What I do believe with respect to any
standard, not just C or C++, is that

1) Vendors may not fully understand the standard or rationale for various
requirements. This is especially true for specialized areas where a
particular implementor may not have much experience.
2) When presented with a feature that may require some effort to
implement, vendors may prefer to take an 'easy way' even though this
isn't the best way for their particular platform.

Nothing in this is 'nefarious'.

> I'm simply giving you another viewpoint -- which happens to be
> the viewpoint held by the C committee when they added this
> particular "loophole".
>
> Interestingly enough, your analogy is quite apt. The object
> oriented "idea" is quite a valid worldview that has proven to
> be successful in a variety of environments over the decades.
> It's not the only worldview, by any means. And you can accuse
> it of having "problems" not shared by other approaches. But
> those approaches have their own problems not always shared
> by OOP languages.

Precisely. So to decide which approach is better one needs to examine the
problem at hand. In this particular case, what does the 'virtual function'
-- the existence of char16_t and char32_t -- give a user if its
implementation can do pretty much anything?

> Open your mind to the possibility that there may be more than
> one valid way to look at all sorts of things.

I'd rather skip discussion of my mind in favor of more technical issues. ;-)

>> How about introducing levels of functionality so an implementation
>> could legally say "I do not support char16_t" and still be conformant?
>
> You've got it. char16_t is part of a Technical Report that is
> non-normative. You can claim conformance to the C Standard
> without implementing any of the TRs that have been approved or
> are now in the works.

But this will change in some future version of the standard, right? My
understanding (possibly incorrect) is that TRs are to be eventually included
in a standard.

> We added char16_t and char32_t with
> TR 19769 as a way to give the UTF-16 crowd what they needed
> to write reasonably portable code that uses their favorite
> character encoding, regardless of the size or encoding of
> wchar_t on any given implementation.

I am not sure if I am a member of this crowd but how can I write portable
code if, say, char16_t may be pretty much anything? Use my own typedef? I can
do it right now with uint16_t or a custom 16-bit wide struct. What exactly
does this type give me as a user?

> We even defined a macro
> that advertises whether their favorite character encoding is
> actually used. What they *wanted* was to make UTF-16 the *one
> and only, mandatory* character encoding for C and C++.

Now this is something *I* don't want.

>>> If you use our CoreX library with the char16_t and char32_t
>>> character types, you can choose from a broad assortment of
>>> encodings, regardless of what the compiler vendor supplies.
>>> This assortment does include UTF-16 as a "wide character"
>>> type, even though it breaks the fundamental rules of Standard
>>> C. So you can convert between UTF-8 externally and UTF-16
>>> internally on any system, no matter what it chooses for the
>>> representation of wchar_t or a wide character encoding.
>>
>> Yes I can use a 3rd party library for that. Although in this case I'd
>> prefer some specialized one like ICU that calls things what they are
>> and allows me to do my job rather then spend time thinking how
>> to mold my job into something compatible with iostreams.
>
> Okay, but this discussion has been all about using codecvt facets
> with iostreams to do this particular job. If you're hoping to
> find everything you want included for free with existing locale
> machinery, you're in for a disappointment.

Precisely. That's why I only care about what is provided by the standard
library, which works (or will eventually work) on any platform. If I need
to shop for an add-on library to do something, then nothing constrains me
to the standard library way of doing things.

>>> And more details are in my earlier reply. You've overlooked a
>>> *lot* of functionality that we supply.
>>
>> I couldn't find any details in that message. Could you tell me what
>> exactly I have overlooked with regard to the topic of this sub-thread?
>
> You poked a bit at our code and told the world that it was missing
> all sorts of functionality.

I think you are misreading my posts and the whole thread. There was a
precise question asked about the nature of the conversions performed by
*the default set of locales* provided by the library, which I answered,
and you agreed with every point. I didn't say anything about what was
missing, since there wasn't a definition of 'missing' in this discussion.
In addition, it was established early on that anything can be done with a
custom codecvt facet, so this was no longer at issue.

> What you didn't see, and what I've
> outlined in other postings to this thread, is:
>
> 1) Where you've seen "degenerate" mappings, they're required by
> the C++ Standard.

Whether this was required was one of the questions raised, and I am glad
you gave your opinion on that. BTW the word "degenerate" didn't imply
anything negative but only the fact that the conversions are no-ops.

> 2) Our code does let you install custom codecvt facets and honors
> them. (Other libraries either stub out this facility or implement
> it with so many bugs that only the simplest codecvt facets have a
> chance of working properly.)

This fact wasn't even contested.

> 3) Our standard package also automatically takes advantage of
> the changing behavior of mbtowc/wctomb to leverage whatever
> code conversions are supplied by the vendor of the underlying
> C library. In the case of Windows, that's a nontrivial assortment.
>
> 4) The C++ Standard intentionally leaves locale support open
> ended, just as the C Standard does. We developed our CoreX
> library in the spirit envisioned by the C++ committee, as a
> natural next step after the Standard C++ library for those who
> want richer support for code conversions (among other things).

The last two points are IMHO completely irrelevant to what was discussed.

> So you didn't see everything that's there and you complained
> about the omission of things that are not required to be
> there.

I don't know why you think I complained about anything. I gave an answer
to a particular question and that's it. There were no complaints, nor any
attempt to discuss the wide range of functionality your library supplies.

--
Eugene

Eugene Gershnik

unread,
May 23, 2004, 7:40:15 AM5/23/04
to
Ben Hutchings wrote:
> ka...@gabi-soft.fr wrote:
> > "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> > news:<6MKdnWdKwbV...@speakeasy.net>...
> <snip>
> >> The 'many' are for wchar_t<->char conversions and they use
> >> OS primitives to do their job. These facets are constructed
> >> dynamically so you effectively have almost as many as there are
> >> installed codepages on your system. Almost because some codepages
> >> like utf-8 are not supported.
> >
> > Is UTF-8 a codepage?
> > Or is it simply an encoding for which there is no codepage?
>
> It is not supported by the core Win32 character conversion functions
> but it does have a code page ID (CP_UTF8 or 65001).

It _is_ supported by the core conversion functions
(MultiByteToWideChar/WideCharToMultiByte). See
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_17si.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp
for details.
There are some restrictions on what you can do with these codepages but they
_are_ codepages.
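
For completeness, the call itself -- a sketch; the main restriction is
on the flags argument, which essentially has to be 0 for CP_UTF8:

#include <windows.h>
#include <string>
#include <vector>

std::wstring utf8_to_utf16(const std::string& s)
{
    // first call reports the needed size in wchar_ts, including the
    // terminator because of the -1 length argument
    int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, 0, 0);
    std::vector<wchar_t> buf(n);
    MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, &buf[0], n);
    return std::wstring(&buf[0]);
}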

--
Eugene

Eugene Gershnik

unread,
May 23, 2004, 7:42:12 AM5/23/04
to
ka...@gabi-soft.fr wrote:
> Is UTF-8 a codepage? Or is it simply an encoding for which there is no
> codepage?

It is a codepage though there are some restrictions on its use.

> (What is a codepage, exactly? I remember it from MS-DOS, but
> there, it was really, in the end, a mapping of bytes to display bit
> patterns. If this is still the basis, then having a codepage for a
> multibyte encoding is a contradiction in terms.)

Here is the Microsoft definition (taken from
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp)

<quote>
A code page is an ordered set of characters in which a numeric index (also
known as a code point value) is associated with each character. The first
128 characters of each code page are functionally the same and include all
characters needed to type English text. The upper 128 characters of OEM and
ANSI code pages contain characters used in a language or group of languages.
</quote>

Practically, a codepage is a table that maps its code-point values to and
from UTF-16, which Windows uses internally. Physically they are stored in
all those *.nls files in the system32 directory.

>>> Of course, the CODESET parameter isn't documented in the man page
>>> either:-).
>
>> I think it is. Googling reveals
>> http://docs.sun.com/db/doc/816-0218/6m6nirqkr?a=view
>
> I'm doing "man langinfo" on a Sparc under Solaris 2.8, and I'm not
> seeing it. Which is surprising -- it's in the header file.

Same here but I long ago stopped using command line man in favor of googling
for "man something solaris".

--
Eugene

Sergiy Kanilo

unread,
May 23, 2004, 7:49:41 AM5/23/04
to

"llewelly" <llewe...@xmission.dot.com> wrote in message
news:86ekpfa...@Zorthluthik.local.bar...

> > > > Sadly I have come to the same conclusion. As another twist I even
> > > > have no idea about how to initialize mbstate_t.
> > <snip>
> > > > The only thing that seems to always work is
> > >
> > > > mbstate_t state;
> > > > memset(&state, 0, sizeof(mbstate_t));
> [snip]
>
> I don't know if this is portable, but I think the standard requires
>
> mbstate_t state= {0};
>
> to zero-initialize mbstate_t whether mbstate_t is builtin or POD.

If mbstate_t is an enum, or the first field of the mbstate_t structure is
an enum, it will not compile. Instead, you can try

mbstate_t state = {};

Cheers,
Serge

P.J. Plauger

unread,
May 23, 2004, 9:49:13 PM5/23/04
to
"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:CYqdnZ2s-OG...@speakeasy.net...

> 1) Vendors may not fully understand the standard or rationale for various
> requirements. This is especially true for specialized areas where a
> particular implementor may not have much experience.
> 2) When presented with a feature that may require some effort to
> implement, vendors may prefer to take an 'easy way' even though this
> isn't the best way for their particular platform.
>
> Nothing in this is 'nefarious'.

Perhaps not, but you're still selling them short. In some markets,
the *customers* don't care about certain aspects of Standard C.
Vendors know this and want the latitude to streamline their
compilers/libraries in the unimportant areas so as to better
serve their customers. That's the motive behind making certain
things possibly trivial to implement in the C Standard. It is
an aspect of both the standard and the motivations of vendors
that is often overlooked, or sold short.

> > Interestingly enough, your analogy is quite apt. The object
> > oriented "idea" is quite a valid worldview that has proven to
> > be successful in a variety of environments over the decades.
> > It's not the only worldview, by any means. And you can accuse
> > if of having "problems" not shared by other approaches. But
> > those approaches have their own problems not always shared
> > by OOP languages.
>
> Precisely. So to decide which approach is better one needs to examine the
> problem at hand. In this particular case, what does the 'virtual function'
> -- the existence of char16_t and char32_t -- give a user if its
> implementation can do pretty much anything?

If you read the actual TR, you will find that it calls for these
two new character types, plus the minimal set of library
conversion functions to go with them, plus literals. Vendors
who see a commercial advantage in supporting UTF-16 can
supply this machinery in a form that does so, without altering
any past uses of wchar_t.

Even if the compiler doesn't choose the character encoding you'd
like, you still have these types to do with as you please for
your alternate character encoding.
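
For the record, the TR's conversion functions look like this in use --
a sketch assuming <uchar.h> is available, which no implementation is
required to provide:

#include <stdio.h>
#include <string.h>
#include <uchar.h>   /* TR 19769, optional */

int main(void)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);

    char16_t c16;
    size_t r = mbrtoc16(&c16, "A", 1, &st);   /* multibyte -> char16_t */
    if (r == (size_t)-1 || r == (size_t)-2)
        puts("conversion failed or incomplete");
    else
        printf("consumed %u byte(s)\n", (unsigned)r);
    return 0;
}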

But if you want a guarantee that you can manipulate UTF-16 encoded
text with compiler and library support on any Standard C implementation,
you're not likely to get it. The C Standard doesn't even promise to
support ASCII on all conforming implementations (and not all do).

> >> How about introducing levels of functionality so an implementation
> >> could legally say "I do not support char16_t" and still be conformant?
> >
> > You've got it. char16_t is part of a Technical Report that is
> > non-normative. You can claim conformance to the C Standard
> > without implementing any of the TRs that have been approved or
> > are now in the works.
>
> But this will change in some future version of the standard, right? My
> understanding (possibly incorrect) is that TRs are to be eventually
> included in a standard.

They *may* be. Or they may give the community enough experience to know
what we really want to do in some future standard.

> > We added char16_t and char32_t with
> > TR 19769 as a way to give the UTF-16 crowd what they needed
> > to write reasonably portable code that uses their favorite
> > character encoding, regardless of the size or encoding of
> > wchar_t on any given implementation.
>
> I am not sure if I am a member of this crowd but how can I write portable
> code if, say, char16_t may be pretty much anything? Use my own typedef? I
> can do it right now with uint16_t or a custom 16-bit wide struct. What
> exactly does this type give me as a user?

See above. You can certainly write portable code even without this
TR. But I'm guessing that your idea of portable really means,
"in the style I want, with the extra features I want, and with
guarantees that every conforming compiler must give me the support
I want." You're not likely to get that.

> > We even defined a macro
> > that advertises whether their favorite character encoding is
> > actually used. What they *wanted* was to make UTF-16 the *one
> > and only, mandatory* character encoding for C and C++.
>
> Now this is something *I* don't want.

Then don't use it.

> >>> If you use our CoreX library with the char16_t and char32_t
> >>> character types, you can choose from a broad assortment of
> >>> encodings, regardless of what the compiler vendor supplies.
> >>> This assortment does include UTF-16 as a "wide character"
> >>> type, even though it breaks the fundamental rules of Standard
> >>> C. So you can convert between UTF-8 externally and UTF-16
> >>> internally on any system, no matter what it chooses for the
> >>> representation of wchar_t or a wide character encoding.
> >>
> >> Yes I can use a 3rd party library for that. Although in this case I'd
> >> prefer some specialized one like ICU that calls things what they are
> >> and allows me to do my job rather than spend time thinking how
> >> to mold my job into something compatible with iostreams.
> >
> > Okay, but this discussion has been all about using codecvt facets
> > with iostreams to do this particular job. If you're hoping to
> > find everything you want included for free with existing locale
> > machinery, you're in for a disappointment.
>
> Precisely. That's why I only care about what is provided by the standard
> library, which works (or will eventually work) on any platform. If I need
> to shop for an add-on library to do something, then nothing constrains me
> to the standard library way of doing things.

You're still wise to write portable C/C++ as much as possible, and
encapsulate any nonstandard stuff so your choices don't permeate
the code.

[remaining quibbles omitted]

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Eugene Gershnik

unread,
May 24, 2004, 7:58:06 AM5/24/04
to
P.J. Plauger wrote:

> In some markets,
> the *customers* don't care about certain aspects of Standard C.
> Vendors know this and want the latitude to streamline their
> compilers/libraries in the unimportant areas so as to better
> serve their customers.

In all the markets I have ever worked in, most users don't care about
standard C or C++ at all. The people who care are a small minority that
has to write lots of portable code. By the time they discover that they
need something, the problems are set in stone by existing implementations
and there is nothing they can do. I guess very few users of any kind
complain to vendors (statistics at my workplace say that only 10% of
unhappy customers ever complain). On most platforms there isn't any
serious competition between vendors, so unhappy customers cannot even
show their displeasure by going away. Given the above, I don't think
vendors' opinions about what better serves their customers are
necessarily accurate.

[...]

> But if you want a guarantee that you can manipulate UTF-16 encoded
> text with compiler and library support on and Standard C
> implementation, you're not likely to get it.

Fair enough. The original discussion was precisely about whether the above
would ever be possible.

--
Eugene

P.J. Plauger

unread,
May 24, 2004, 12:10:43 PM5/24/04
to
"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:7NednRPZYOW...@speakeasy.net...

> P.J. Plauger wrote:
>
> > In some markets,
> > the *customers* don't care about certain aspects of Standard C.
> > Vendors know this and want the latitude to streamline their
> > compilers/libraries in the unimportant areas so as to better
> > serve their customers.
>
> In all the markets I have ever worked in, most users don't care about
> standard C or C++ at all. The people who care are a small minority that
> has to write lots of portable code. By the time they discover that they
> need something, the problems are set in stone by existing implementations
> and there is nothing they can do. I guess very few users of any kind
> complain to vendors (statistics at my workplace say that only 10% of
> unhappy customers ever complain). On most platforms there isn't any
> serious competition between vendors, so unhappy customers cannot even
> show their displeasure by going away. Given the above, I don't think
> vendors' opinions about what better serves their customers are
> necessarily accurate.

Noted. I now have a fairly accurate notion of your opinion of software
tools vendors. I'm outta here.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Ben Hutchings

unread,
May 24, 2004, 6:20:13 PM5/24/04
to
Eugene Gershnik wrote:
> Ben Hutchings wrote:
> > ka...@gabi-soft.fr wrote:
> > > "Eugene Gershnik" <gers...@hotmail.com> wrote in message
> > > news:<6MKdnWdKwbV...@speakeasy.net>...
> > <snip>
> > >> The 'many' are for wchar_t<->char conversions and they use
> > >> OS primitives to do their job. These facets are constructed
> > >> dynamically so you effectively have almost as many as there are
> > >> installed codepages on your system. Almost because some codepages
> > >> like utf-8 are not supported.
> > >
> > > Is UTF-8 a codepage?
> > > Or is it simply an encoding for which there is no codepage?
> >
> > It is not supported by the core Win32 character conversion functions
> > but it does have a code page ID (CP_UTF8 or 65001).
>
> It _is_ supported by the core conversion functions
> (MultiByteToWideChar/WideCharToMultiByte). See
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_17si.asp
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp
> for details.
>
> There are some restrictions on what you can do with these codepages but they
> _are_ codepages.

Yes, you're right; I think I misremembered "not usefully supported" as
"unsupported". Unfortunately there are versions of Windows (Win98 and
WinMe) where UTF-8 is supported by those two yet not by GetCPInfo().
I would consider that also a core conversion function, since it's
rather useful for deciding how large an output buffer to create. It's
also needed for implementation of the standard MB_CUR_MAX macro.
(Hey, I'm back on topic.)
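
The check in question -- a sketch; by the above, expect the failing
branch for CP_UTF8 on Win98/WinMe even though the conversion functions
accept that code page:

#include <windows.h>
#include <iostream>

int main()
{
    CPINFO info;
    if (GetCPInfo(CP_UTF8, &info))
        std::cout << "max bytes per character: "
                  << info.MaxCharSize << '\n';
    else
        std::cout << "GetCPInfo() failed for CP_UTF8\n";
    return 0;
}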
