#include <fstream>
std::wofstream str("afilename.txt");
str << L"a wide string";
I have an implementation which outputs an ANSI string to the file instead of
a wide string and believe it is a bug in the implementation, but want to
check here and see what other people think.
Also, if anyone wants to conjecture: why does std::wofstream take a const
char * for the filename instead of a const wchar_t * ? Surely countries
whose characters embrace most of a Unicode character set ( Japan etc. )
can't be happy about that limitation of having to create all their file
names from the first 128 characters in their set when using std::wofstream.
> I could not make out from my .pdf copy of the C++ standard whether
> std::wofstream is guaranteed to be outputting wide characters to a
> file when using the operator << on a wide string,
There is no guarantee about the default encoding used for wide character
streams. The default encoding may eg. use UTF-8 which happens to look
like ASCII if you only use ASCII characters. You might want to look at
the result of writing a non-ASCII character.
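A minimal sketch of that experiment (the filename "check.txt" is made
up): write one non-ASCII wide character, then dump the raw bytes of the
file. A multi-byte sequence such as C3 A9 would indicate UTF-8; a single
byte E9 would indicate a Latin-1-style one-to-one mapping; and under
some default facets the conversion may simply fail, which is just as
informative.

#include <fstream>
#include <iostream>

int main()
{
    {
        std::wofstream out("check.txt");
        out << L'\xE9';    // U+00E9, LATIN SMALL LETTER E WITH ACUTE
    }
    std::ifstream in("check.txt", std::ios::binary);
    char c;
    while (in.get(c))      // dump each byte of the file in hex
        std::cout << std::hex << (static_cast<unsigned>(c) & 0xFF) << ' ';
    std::cout << '\n';
    return 0;
}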
I think that the standard effectively requires that a wide character
sequence written to a file with the default code conversion facet and
read in from a file again results in the same string but I'm not sure
about even that much. I would consider it a bad implementation if it
does something different, however.
If you are unhappy with the used encoding, eg. because you expected
something like UTF-16 (which is broken beyond all repair, of course),
you can install your own code conversion facet, or use one other than
the default if there are multiple 'std::codecvt' facets shipping with
your standard library. The purpose of the code conversion facet is to
convert sequences of internal characters into sequences of bytes and
vice versa. The bytes are taken by the file stream and written to the
corresponding file, or are obtained from the file by the file stream.
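A minimal sketch of installing such a facet, assuming a codecvt-derived
class MyCodecvt (the name is hypothetical; its do_in()/do_out()
overrides are elided). The facet is installed by imbuing a locale that
contains it, and this must happen before the file is opened:

#include <fstream>
#include <locale>

class MyCodecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
    // override do_out(), do_in(), do_length(), ... with your encoding
};

int main()
{
    std::wofstream out;
    std::locale loc(std::locale(), new MyCodecvt); // locale owns the facet
    out.imbue(loc);                                // before open()
    out.open("afilename.txt");
    out << L"a wide string";
    return 0;
}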
> #include <fstream>
> std::wofstream str("afilename.txt");
> str << L"a wide string";
>
> I have an implementation which outputs an ANSI string to the file
> instead of a wide string and believe it is a bug in the
> implementation, but want to check here and see what other people
> think.
It is hard to tell ASCII from UTF-8 when you only use ASCII characters,
because ASCII characters are encoded identically in UTF-8. That is, the
implementation may actually write a wide character encoding and you
just failed to realize it. Of course, the implementation may indeed be
in error... The thing is: characters, whether wide or not, are written
to disk in some form of encoding. This encoding can be a trivial
mapping, as is often the case for narrow characters, but it is in no
way required to be. Actually, wide characters (ie. those with more than
one internal byte) are always encoded in some form, because they are
broken down to individual bytes. Of course, the bytes representing the
character can be the same as those used internally, as is eg. the case
for UCS-4. Still, it is, at least logically, an encoding.
Encoding can be, however, rather non-trivial. UTF-8 is one of these more
complex encodings, where characters are encoded in different widths
depending on their Unicode value: the length of a sequence is encoded in
unary in the leading one-bits of the first byte (a two-byte character
starts with 110, a three-byte character with 1110, and so on, while a
single-byte character starts with a zero bit). The character's bits are
then collected over the sequence of bytes. Other encodings may use a
shift state, which is basically some form of escape introducing a
different meaning for the characters that follow. This is actually
another option which you may have encountered: the default state would
be to use ASCII, and when eg. an Arabic character is used, an escape is
issued first.
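A hand-rolled illustration of the UTF-8 scheme just described (not a
library facility; real code would also reject invalid code points):

#include <cstdio>

// Encode one code point; returns the number of bytes written to buf.
int utf8_encode(unsigned long cp, unsigned char buf[4])
{
    if (cp < 0x80) {                 // 0xxxxxxx
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main()
{
    unsigned char buf[4];
    int n = utf8_encode(0x65E5, buf);  // a Japanese ideograph, U+65E5
    for (int i = 0; i < n; ++i)
        std::printf("%02X ", buf[i]);  // prints: E6 97 A5
    std::printf("\n");
    return 0;
}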
> Also, if anyone wants to conjecture: why does std::wofstream take a
> const char * for the filename instead of a const wchar_t * ? Surely
> countries whose characters embrace most of a Unicode character set
> ( Japan etc. ) can't be happy about that limitation of having to
> create all their file names from the first 128 characters in their
> set when using std::wofstream.
This was discussed in the past already. The basic reason is that
operating systems often support only a rather restricted set of
characters, and requiring support for other characters would require a
definition of how these are mapped. This, however, is hardly the
purpose of the C++ standard. Whether these arguments are considered
sufficient is subjective, of course.
--
<mailto:dietma...@yahoo.com> <http://www.dietmar-kuehl.de/> Phaidros
eaSE - Easy Software Engineering: <http://www.phaidros.com/>
[...snip...]
> If you are unhappy with the used encoding, eg. because you expected
> something like UTF-16 (which is broken beyond all repair, of course),
What is "broken beyond all repair" regarding UTF-16? Don't you mean UCS-2 (and the
UCS-2-oriented 16-bit wchar_t) as being broken beyond all repair in that one still must resort to
shift-states like multibyte character-sets do but that UCS-2 has no such shift-states as it was
intended to be a fixed-width encoding as a reaction to MBCSes? (And ASCII & ISO8859 are also broken
beyond all repair in that neither UCS-2 nor ASCII nor the ISO8859-series can represent all
characters known to humankind, nor have the shift-states that UTF-8 & UTF-16 have.)
I place UTF-8, UTF-16, and UTF-32 in the "works fine" category. I place ASCII, ISO8859, UCS-2,
(and to a lesser degree) UCS-4 in the "deprecated" or "limited applicability" categories.
[...snip...]
> Encoding can be, however, rather non-trivial. UTF-8 is one of these more
> complex encodings, where characters are encoded in different widths
> depending on their Unicode value: the length of a sequence is encoded in
> unary in the leading one-bits of the first byte (a two-byte character
> starts with 110, a three-byte character with 1110, and so on, while a
> single-byte character starts with a zero bit). The character's bits are
> then collected over the sequence of bytes. Other encodings may use a
> shift state, which is basically some form of escape introducing a
> different meaning for the characters that follow. This is actually
> another option which you may have encountered: the default state would
> be to use ASCII, and when eg. an Arabic character is used, an escape is
> issued first.
UTF-16 works analogously to UTF-8. When a character outside of the UCS-2
space is used, it is encoded as a pair of 16-bit units (a surrogate
pair), analogous to the multi-byte sequences you describe above for
UTF-8.
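A sketch of that mechanism: code points beyond U+FFFF are split into a
high/low surrogate pair of 16-bit units.

#include <cstdio>

// Encode one code point as UTF-16; returns the number of 16-bit units.
int utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (cp < 0x10000) {         // BMP character: a single 16-bit unit
        out[0] = (unsigned short)cp;
        return 1;
    }
    cp -= 0x10000;              // 20 bits remain
    out[0] = (unsigned short)(0xD800 | (cp >> 10));   // high surrogate
    out[1] = (unsigned short)(0xDC00 | (cp & 0x3FF)); // low surrogate
    return 2;
}

int main()
{
    unsigned short u[2];
    int n = utf16_encode(0x1D11E, u);  // U+1D11E MUSICAL SYMBOL G CLEF
    for (int i = 0; i < n; ++i)
        std::printf("%04X ", u[i]);    // prints: D834 DD1E
    std::printf("\n");
    return 0;
}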
> Dietmar Kuehl wrote:
>>If you are unhappy with the used encoding, eg. because you expected
>>something like UTF-16 (which is broken beyond all repair, of course),
> What is "broken beyond all repair" regarding UTF-16?
People use UTF-16 for internal processing, which is a bad idea because it
is a multi-width encoding. It is reasonably OK when externalizing the
characters, but then, why use UTF-16 rather than UTF-8? Both are
multi-width encodings (well, UTF-16 can result in smaller files, but many
operating systems don't have an interface for writing 16-bit entities,
while most operating systems have interfaces for writing 8-bit entities).
I'd say this discussion is moving swiftly off-topic for
comp.lang.c++.moderated, however...
--
<mailto:dietma...@yahoo.com> <http://www.dietmar-kuehl.de/> Phaidros
eaSE - Easy Software Engineering: <http://www.phaidros.com/>
|> Edward Diener wrote:
|> > I could not make out from my .pdf copy of the C++ standard
|> > whether std::wofstream is guaranteed to be outputting wide
|> > characters to a file when using the operator << on a wide
|> > string,
|> There is no guarantee about the default encoding used for wide
|> character streams. The default encoding may eg. use UTF-8 which
|> happens to look like ASCII if you only use ASCII characters. You
|> might want to look at the result of writing a non-ASCII character.
The default encoding may be straight ASCII, or ISO 8859-1, or
something similar, as well. There is definitely no requirement that
all of the characters in wchar_t can be represented in the external
encoding.
The default encoding is the one you get with the "C" locale. Not that
that tells us anything.
|> I think that the standard effectively requires that a wide
|> character sequence written to a file with the default code
|> conversion facet and read in from a file again results in the same
|> string but I'm not sure about even that much. I would consider it
|> a bad implementation if it does something different, however.
I'm pretty sure that this is the intent.
I think that there is some disagreement as to the required behavior if
you change the locale in the middle of outputting. Changing it before
the first character is output should work, however.
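A minimal sketch of that safe pattern, assuming the platform accepts a
named locale such as "en_US.UTF-8" (locale names are platform
dependent, and std::locale throws std::runtime_error for an unknown
name):

#include <fstream>
#include <locale>

int main()
{
    std::wofstream out("afilename.txt");
    out.imbue(std::locale("en_US.UTF-8")); // before the first character
    out << L"a wide string";
    return 0;
}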
--
James Kanze mailto:ka...@gabi-soft.de
Do you need real expertise in C++? in Java? in OO design?
I am available, see my CV at www.gabi-soft.de
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)69 63198627
|> I could not make out from my .pdf copy of the C++ standard whether
|> std::wofstream is guaranteed to be outputting wide characters to a
|> file when using the operator << on a wide string, ie.
There is no such thing in C++ as outputting wide characters. In
27.8.1, "A File provides an external source/sink stream whose
underlying character type is char (byte)."
If the file actually does contain wide characters, then it must be
treated as a sequence of multibyte characters.
|> #include <fstream>
|> std::wofstream str("afilename.txt");
|> str << L"a wide string";
|> I have an implementation which outputs an ANSI string to the file
|> instead of a wide string and believe it is a bug in the
|> implementation, but want to check here and see what other people
|> think.
The standard is quite open about this. Wide character strings are
required to obtain the codecvt facet from the imbued locale to do the
conversion between wide and narrow characters. But there is nothing
in the standard which specifies what the default conversion (the
codecvt in the default locale) does, nor what conversions should be
supported.
All of which means that any useful answer would be extremely
implementation dependent, and that the best solution is to ask your
vendor.
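Short of asking the vendor, the default facet can at least be made to
describe itself. A sketch: encoding() returns -1 for a state-dependent
encoding, 0 for a variable-width one, and otherwise the fixed number of
external chars per internal char; max_length() gives the worst case.

#include <cwchar>
#include <iostream>
#include <locale>

int main()
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Cvt;
    const Cvt& cvt = std::use_facet<Cvt>(std::locale()); // default locale
    std::cout << "encoding():   " << cvt.encoding()   << '\n'
              << "max_length(): " << cvt.max_length() << '\n';
    return 0;
}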
|> Also, if anyone wants to conjecture: why does std::wofstream take
|> a const char * for the filename instead of a const wchar_t * ?
Because no one knows what the semantics should be for wchar_t const*.
|> Surely countries whose characters embrace most of a Unicode
|> character set ( Japan etc. ) can't be happy about that limitation
|> of having to create all their file names from the first 128
|> characters in their set when using std::wofstream.
Or when using std::ofstream, for that matter. The character set of
the filename is independent of the character sets used in the file.
--
James Kanze mailto:ka...@gabi-soft.de
Do you need real expertise in C++? in Java? in OO design?
I am available, see my CV at www.gabi-soft.de
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)69 63198627
If there's any doubt about this, consider std::ofstream<double>.
Obscure, perhaps, but legal. Should file names for this type be arrays
of doubles?
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
|> James Kanze wrote:
|> > "Edward Diener" <eldi...@earthlink.net> writes:
|> > |> Surely countries whose characters embrace most of a Unicode
|> > |> character set ( Japan etc. ) can't be happy about that
|> > |> limitation of having to create all their file names from the
|> > |> first 128 characters in their set when using std::wofstream.
|> > Or when using std::ofstream, for that matter. The character set
|> > of the filename is independent of the character sets used in the
|> > file.
|> If there's any doubt about this, consider std::ofstream<double>.
|> Obscure, perhaps, but legal. Should file names for this type be
|> arrays of doubles?
Not really relevant to this thread, but I think that despite the
template in <iostream>, std::ofstream<double> isn't legal. An
implementor is not required to furnish a general instance of
char_traits, and the user cannot provide a specialization for double,
so there will be no std::char_traits<double> for the default
parameter.
Even if the implementor does provide a general instance, the standard
doesn't really specify its semantics. What should int_type be, for
example, for the instantiation of char_traits<double>?
In many ways, the fact that the iostreams are templates is
misleading. In practice, it is very difficult, if not impossible, to
use anything but char and wchar_t. And the instantiations for char
and for wchar_t have subtly different semantics, due to the fact that
codecvt<char,char,mbstate_t> is required to be a no-op (whereas
codecvt<wchar_t,char,mbstate_t> cannot be).
--
James Kanze mailto:ka...@gabi-soft.de
Do you need real expertise in C++? in Java? in OO design?
I am available, see my CV at www.gabi-soft.de
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)69 63198627
> |> > |> character set ( Japan etc. ) can't be happy about that
> |> > |> limitation of having to create all their file names from the
> |> > |> first 128 characters in their set when using std::wofstream.
>
> |> > Or when using std::ofstream, for that matter. The character set
> |> > of the filename is independent of the character sets used in the
> |> > file.
>
> |> If there's any doubt about this, consider std::ofstream<double>.
> |> Obscure, perhaps, but legal. Should file names for this type be
> |> arrays of doubles?
>
> Not really relevant to this thread, but I think that despite the
> template in <iostream>, std::ofstream<double> isn't legal.
The point, of course, is that the type of the character in the file name
has no inherent connection with the type of the data object that the
stream traffics in.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
[...snip...]
> Not really relevant to this thread, but I think that despite the
> template in <iostream>, std::ofstream<double> isn't legal. An
> implementor is not required to furnish a general instance of
> char_traits, and the user cannot provide a specialization for double,
> so there will be no std::char_traits<double> for the default
> parameter.
>
> Even if the implementor does provide a general instance, the standard
> doesn't really specify its semantics. What should int_type be, for
> example, for the instantiation of char_traits<double>?
>
Fine. Pretend I had said: "IF AN IMPLEMENTOR SUPPLIES A SUITABLE
TEMPLATE, should file names for this type be arrays of doubles?" Whether
the standard mandates this is irrelevant. Certainly there are many
things that are unclear about what such a thing should do, but one of
the things that is quite clear is that file names for such a stream are
not required to be written as arrays of doubles.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
Still it seems logical that in order to support filenames for languages
which need at least a 16-bit encoding, the std::fstream constructors and
open functions should be overloaded to support const wchar_t * filenames
and not just const char * filenames. This would theoretically allow a
Japanese programmer on a Japanese version of Windows, let's say, to use
this functionality to create files whose filename was a string of wide
characters. I realize that there is no mention in the C++ standard of the
wchar_t type being equivalent to some size of Unicode character, but it is
difficult for me to believe that the C++ standards committee is not
interested in possibly accommodating such character encodings in the
concept of a filename, and therefore the nationalities and languages which
use them.
There are some other places in the C++ standard library where it is just
assumed that char strings are the only type which matter ( exception
"error" strings as an example ), once again making it hard for people
whose language encoding is much closer to wchar_t strings to program
using C++ effectively.
|> James Kanze wrote:
|> > Pete Becker <peteb...@acm.org> writes:
|> > |> James Kanze wrote:
|> > |> > "Edward Diener" <eldi...@earthlink.net> writes:
Pete, I said that my comment was NOT relevant to this thread. I
totally agree with your premise -- I don't even have the slightest
idea what an implementation would do with filenames for my
std::ofstream<MyPODClass,MyTraits>, for example (which is legal, and
potentially useful), if the type of the filenames was MyPODClass
const*.
I just felt it worth pointing out that the fact that iostreams are
templates doesn't buy us as much as one might think. It is possible
to use other instances than char or wchar_t, but it is a lot of work,
and you really have to know what you are doing. Nothing like the
templated vector, for example.
--
James Kanze mailto:ka...@gabi-soft.de
Do you need real expertise in C++? in Java? in OO design?
I am available, see my CV at www.gabi-soft.de
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)69 63198627
Well, 'std::ofstream<double>' is illegal anyway, because 'std::ofstream'
is not a class template: 'std::ofstream' is a typedef. However,
'std::basic_ofstream<double, traits>' is definitely legal, assuming that
'traits' provides suitable character traits for character type 'double'.
> What should int_type be, for example, for the instantiation of
> char_traits<double>?
Well, 'long double' would be a reasonable choice...
> In many ways, the fact that the iostreams are templates is
> misleading. In practice, it is very difficult, if not impossible, to
> use anything but char and wchar_t.
Difficult: yes, impossible: no.
> And the instantiations for char
> and for wchar_t have subtly different semantics, due to the fact that
> codecvt<char,char,mbstate_t> is required to be a no-op (whereas
> codecvt<wchar_t,char,mbstate_t> cannot be).
This statement is plain wrong: the base class 'std::codecvt<char, char,
mbstate_t>' is required to implement a degenerate encoding, but the
implementation cannot in any form rely on the fact that this facet
indeed does a degenerate encoding, because a user may derive from this
class and provide a non-degenerate encoding instead. Basically, the only
thing which is special about 'std::codecvt<char, char, mbstate_t>' is
that an encoding which happens to be identical to the internal
representation is possible. Of course, this can also be the case for
'std::codecvt<wchar_t, wchar_t, mbstate_t>' or, in general, for all code
conversion facets where the internal and external character types are
identical.
--
<mailto:dietma...@yahoo.com> <http://www.dietmar-kuehl.de/> Phaidros
eaSE - Easy Software Engineering: <http://www.phaidros.com/>
|> > The point, of course, is that the type of the character in the
|> > file name has no inherent connection with the type of the data
|> > object that the stream traffics in.
|> Still it seems logical that in order to support filenames for
|> languages which need at least a 16-bit encoding, the std::fstream
|> constructors and open functions should be overloaded to support
|> const wchar_t * filenames and not just const char * filenames.
The point is being considered by the committee. There is some argument
that the basic templates (basic_filebuf, etc.) should be extended to
take either a char const* or a wchar_t const* for the filename. (This
means, of course, that you could open a narrow character file with a
wide character filename, or vice versa.)
The problem is that while a large number of people seem to favor the
idea in general, there is no consensus with regards to the semantics.
It's nice to say we need a feature, but in order to standardize it, we
should at least have some idea what it should mean -- even if the
standard says it is implementation defined, we need to know what we have
a right to expect from a quality implementation in certain specific
contexts.
|> This would theoretically allow a Japanese programmer on a Japanese
|> version of Windows, let's say, to use this functionality to create
|> files whose filename was a string of wide characters. I realize that
|> there is no mention in the C++ standard of the wchar_t type being
|> equivalent to some size of Unicode character, but it is difficult for
|> me to believe that the C++ standards committee is not interested in
|> possibly accommodating such character encodings in the concept of a
|> filename and therefore the nationalities and languages which use them.
The case of Windows is easiest, I think, because it is one of the rare
systems which supports wide character filenames natively. What should a
Unix system do when passed a wide character filename (given that UTF-8
filenames are legal, but not very common, whereas on my machine, there
are filenames using accented characters from 8859-1, which are not
compatible with UTF-8)?
|> There are some other places in the C++ standard library where it is
|> just assumed that char strings are the only type which matter (
|> exception "error" strings as an example ), once again making it hard
|> for people whose language encoding is much closer to wchar_t strings
|> to program using C++ effectively.
In the case of exceptions, I would expect the string to be some sort of
a message key, maybe understandable to the developers, but certainly
not to a normal user. Any message actually displayed would be obtained
from a messages facet anyway. So I don't see this as a problem.
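A sketch of that message-catalog idea, using the std::messages facet;
the catalog name "myapp", set 1, and id 42 are all hypothetical, and
which catalogs exist (if any) is entirely implementation defined.

#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::locale loc("");   // the user's preferred locale
    const std::messages<wchar_t>& msgs =
        std::use_facet<std::messages<wchar_t> >(loc);
    std::wstring text = L"file not found";    // fallback message key
    std::messages_base::catalog cat = msgs.open("myapp", loc);
    if (cat >= 0) {                           // translated text, if present
        text = msgs.get(cat, 1, 42, text);
        msgs.close(cat);
    }
    std::wcout << text << L'\n';
    return 0;
}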
However, I do think that you are right in a way. Someone should
probably make a list of all places where narrow character strings are
supported, but not wide character strings. Then we could discuss each
one, as to whether wide character support is useful, and what semantics
it should have.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)69 63198627
I think this is a good idea as it should be made as flexible as
possible.
>
> The problem is that while a large number of people seem to favor the
> idea in general, there is no consensus with regards to the semantics.
> It's nice to say we need a feature, but in order to standardize it, we
> should at least have some idea what it should mean -- even if the
> standard says it is implementation defined, we need to know what we
> have a right to expect from a quality implementation in certain
> specific contexts.
I understand that, but given that the C++ standard makes no attempt to tie
wchar_t to any particular character encoding, is it really necessary to
understand how an implementation will use wchar_t filenames ?
>
> |> This would theoretically allow a Japanese programmer on a Japanese
> |> version of Windows, let's say, to use this functionality to create
> |> files whose filename was a string of wide characters. I realize
> |> that there is no mention in the C++ standard of the wchar_t type
> |> being equivalent to some size of Unicode character, but it is
> |> difficult for me to believe that the C++ standards committee is
> |> not interested in possibly accommodating such character encodings
> |> in the concept of a filename and therefore the nationalities and
> |> languages which use them.
>
> The case of Windows is easiest, I think, because it is one of the rare
> systems which supports wide character filenames natively. What should
> a Unix system do when passed a wide character filename (given that
> UTF-8 filenames are legal, but not very common, whereas on my machine,
> there are filenames using accented characters from 8859-1, which are
> not compatible with UTF-8)?
One could just as easily posit an operating system which only
understands wide character filenames. What should such a system do with
narrow character filenames ? I believe there are code conversions
between narrow and wide character defined by one of the locale facets. I
do understand that data will not usually be lost when converting from
char to wchar_t but may be when converting the other way. Still, in the
case you specified, the C++ standard could either specify the use of the
code conversion facet, run-time failure because wide character filenames
are not supported in that implementation, or leave it as
implementation-defined which action is taken.
>
> |> There are some other places in the C++ standard library where it
> |> is just assumed that char strings are the only type which matter
> |> ( exception "error" strings as an example ), once again making it
> |> hard for people whose language encoding is much closer to wchar_t
> |> strings to program using C++ effectively.
>
> In the case of exceptions, I would expect the string to be some sort
> of a message key, maybe understandable to the developers, but
> certainly not to a normal user. Any message actually displayed would
> be obtained from a messages facet anyway. So I don't see this as a
> problem.
Yes, Peter Dimov said the same thing in an online discussion I had with
him about this issue. Nonetheless, in actual programming I would expect
a human readable error message to most normally come from any exception
system, rather than a numerically encoded message.
>
> However, I do think that you are right in a way. Someone should
> probably make a list of all places where narrow character strings are
> supported, but not wide character strings. Then we could discuss each
> one, as to whether wide character support is useful, and what
> semantics it should have.
I definitely encourage this. I know it is hard work for the committee to
decide on such issues, but I would like C++ to be considered an effective
programming language in large parts of the world which do not normally
use a narrow character encoding for visual names, messages, and displays.
While I do not view it as a great imposition that, let us say, a Japanese
or Chinese programmer must use the ANSI C++ characters for writing their
programs, I do see it as an imposition on such nationalities when they
have to be so constrained in dealing with their operating system or with
their users. I think C++ needs to be a viable language in which to
program for these people, else they will choose other languages which
have been created from the ground up to support wide character encodings
via the size of their character type.
I would also like the committee, in their decisions in this area, to
look toward the future, if it becomes necessary to expand the native
character types from the current "char" and "wchar_t". Of course, class
and function templates often provide such possible expandability without
a great deal of re-engineering.
Yes, absolutely. Suppose I write data to a file named L"abc" and I then
try to read data from a file named "abc". Am I reading the data that I
just wrote?
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
|> James Kanze <ka...@alex.gabi-soft.de> wrote in message:
|> > Not really relevant to this thread, but I think that despite the
|> > template in <iostream>, std::ofstream<double> isn't legal.
|> Well, 'std::ofstream<double>' is illegal anyway, because
|> 'std::ofstream' is not a class template: 'std::ofstream' is a
|> typedef.
Oops.
|> However, 'std::basic_ofstream<double, traits>' is definitely
|> legal, assuming that 'traits' provides suitable character traits
|> for character type 'double'.
That is correct up to a point. It is std::basic_ofstream<double>
which is not legal, because of the (possible) absence of
std::char_traits<double>, and the impossibility for the user to supply
it. However, even std::basic_ofstream<double,traits> can only be used
under limited conditions; the user must ensure that it is imbued
(either explicitly, or by modifying the global locale) with a locale
which has the necessary codecvt.
And some of the functionality of ostream will still not be usable.
Numeric IO, for example, relies on num_put and num_get, which rely on
std::numpunct. And the user cannot provide a specialization of this
class for double.
|> > What should int_type be, for example, for the instantiation of
|> > char_traits<double>?
|> Well, 'long double' would be a reasonable choice...
On many platforms, long double and double are identical.
Why not "struct IntType { double value; bool is_eof; }" ?
Obviously, this class will require some overloaded operators (at least
== and !=), and the traits class will need some special functions as
well.
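A sketch of what that might look like; the member names, and the traits
functions shown, are invented for the illustration. The bool carries
"eof", so every double value remains a valid character value:

struct IntType {
    double value;
    bool   is_eof;
};

inline bool operator==(const IntType& a, const IntType& b)
{
    // two eof values compare equal regardless of the stored double
    return a.is_eof == b.is_eof && (a.is_eof || a.value == b.value);
}

inline bool operator!=(const IntType& a, const IntType& b)
{
    return !(a == b);
}

// The matching pieces of the traits class might then be:
struct DoubleTraitsSketch {
    static IntType eof()                 { IntType r = { 0.0, true }; return r; }
    static IntType to_int_type(double c) { IntType r = { c, false };  return r; }
    static double  to_char_type(const IntType& i) { return i.value; }
    static bool    eq_int_type(const IntType& a, const IntType& b) { return a == b; }
    static IntType not_eof(const IntType& i) { return i.is_eof ? to_int_type(0.0) : i; }
};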
|> > In many ways, the fact that the iostreams are templates is
|> > misleading. In practice, it is very difficult, if not
|> > impossible, to use anything but char and wchar_t.
|> Difficult: yes, impossible: no.
In practice, I think I'd say impossible for any other built-in types.
Using a user defined character type allows the user to provide all of
the necessary specializations. But it is a lot more work than one
would first think.
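A sketch of the shape of that work, with an invented character type
Char4. The char_traits specialization below is legal because Char4 is
user defined; note that it is only the first step -- an iostream over
Char4 would additionally need a codecvt<Char4, char, mbstate_t> facet,
and numeric output would need numpunct et al.:

#include <cstring>   // memcpy, memmove
#include <cwchar>    // mbstate_t
#include <ios>       // streamoff, streampos
#include <string>    // the char_traits primary template

struct Char4 { unsigned long code; };   // a user-defined "character"

namespace std {
template<> struct char_traits<Char4> {
    typedef Char4         char_type;
    typedef unsigned long int_type;   // weak spot: eof() collides with 0xFFFFFFFF
    typedef streamoff     off_type;
    typedef streampos     pos_type;
    typedef mbstate_t     state_type;

    static void assign(char_type& d, const char_type& s) { d = s; }
    static bool eq(const char_type& a, const char_type& b) { return a.code == b.code; }
    static bool lt(const char_type& a, const char_type& b) { return a.code <  b.code; }

    static int compare(const char_type* a, const char_type* b, size_t n) {
        for (size_t i = 0; i != n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return  1;
        }
        return 0;
    }
    static size_t length(const char_type* s) {
        size_t n = 0;
        while (s[n].code != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, size_t n, const char_type& c) {
        for (size_t i = 0; i != n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;
    }
    static char_type* move(char_type* d, const char_type* s, size_t n) {
        return static_cast<char_type*>(memmove(d, s, n * sizeof(char_type)));
    }
    static char_type* copy(char_type* d, const char_type* s, size_t n) {
        return static_cast<char_type*>(memcpy(d, s, n * sizeof(char_type)));
    }
    static char_type* assign(char_type* d, size_t n, char_type c) {
        for (size_t i = 0; i != n; ++i) d[i] = c;
        return d;
    }
    static int_type  to_int_type(const char_type& c) { return c.code; }
    static char_type to_char_type(const int_type& i) { char_type c = { i }; return c; }
    static bool      eq_int_type(const int_type& a, const int_type& b) { return a == b; }
    static int_type  eof() { return static_cast<int_type>(-1); }
    static int_type  not_eof(const int_type& i) { return i == eof() ? 0 : i; }
};
}

int main()
{
    std::basic_string<Char4> s;    // usable with just the traits...
    Char4 c = { 0x65E5 };
    s.push_back(c);                // ...streams would need the facets too
    return 0;
}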
|> > And the instantiations for char and for wchar_t have subtly
|> > different semantics, due to the fact that
|> > codecvt<char,char,mbstate_t> is required to be a no-op (whereas
|> > codecvt<wchar_t,char,mbstate_t> cannot be).
|> This statement is plain wrong: the base class 'std::codecvt<char,
|> char, mbstate_t>' is required to implement a degenerate encoding,
|> but the implementation cannot in any form rely on the fact that
|> this facet indeed does a degenerate encoding, because a user may
|> derive from this class and provide a non-degenerate encoding
|> instead.
This is part of what I don't really understand too well. Are you
saying that when I do "use_facet<codecvt<char,char,mbstate_t> >(...)",
the facet I actually get may not have the same semantics as
codecvt<char,char,mbstate_t>? (This seems more reasonable, or at
least more useful, than what I thought. But as I say, I am having
real difficulty in understanding these parts of the standard.)
|> Basically, the only thing which is special about
|> 'std::codecvt<char, char, mbstate_t>' is that an encoding which
|> happens to be identical to the internal representation is
|> possible. Of course, this can also be the case for
|> 'std::codecvt<wchar_t, wchar_t, mbstate_t>' or, in general, for
|> all code conversion facets where the internal and external
|> character types are identical.
The standard doesn't forbid other codecvt from implementing the no-op
conversion -- in such cases, I presume that the semantics would be the
same as a static_cast from the source to the destination type.
Whether any of this is actually useful is another question:-).
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)69 63198627
It depends on the OS. If the OS supports wide character filenames which are
distinct from narrow character filenames, you are not reading the data you
just wrote, else you are. Or, at least, that is what my assumption would be.
But yes, you have made your point that the implementation must at least be
clarified.
A similar situation occurs between a filename of "afile" and a filename of
"AFILE". In C++ these are two different names, in Windows they are the same
file name, in Linux they are two different file names.
My point is simply that C++ should treat wide character file names in this
same OS dependent way. In C++ L"abc" and "abc" are two different names but
if an OS treats them as the same file, it is an implementation issue.
> > that I just wrote?
>
> A similar situation occurs between a filename of "afile" and a
> filename of "AFILE". In C++ these are two different names, in Windows
> they are the same file name, in Linux they are two different file
> names.
Similar but much simpler. And well understood by most programmers.
>
> My point is simply that C++ should treat wide character file names in
> this same OS dependent way. In C++ L"abc" and "abc" are two different
> names but if an OS treats them as the same file, it is an
> implementation issue.
>
The OS doesn't know anything about L"abc". That's something you write in
your C or C++ source code to tell the compiler to generate a wide
character string literal in an implementation-specific manner (which may
be locale dependent). Which means that the wide character string that
you get can vary from compiler to compiler, even on the same OS. And, of
course, if the OS doesn't support wide character file names then you
need to decide how to map this wide character string to a byte string
for the OS. And again you've got a locale dependency.
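A sketch of that mapping problem using the C library's wcstombs, whose
result depends on the current locale: the same wide name can map to
different byte strings (or fail to map at all) under different locales.

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main()
{
    const wchar_t* wide_name = L"abc";
    char narrow[256];
    std::setlocale(LC_ALL, "");   // conversion is now locale dependent
    std::size_t n = std::wcstombs(narrow, wide_name, sizeof narrow - 1);
    if (n != (std::size_t)-1) {   // conversion can fail outright
        narrow[n] = '\0';
        std::FILE* f = std::fopen(narrow, "w");  // the OS sees only bytes
        if (f) std::fclose(f);
    }
    return 0;
}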
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
A specious argument. What becomes well understood by programmers is often
what becomes standardized and used over time.
>
> >
> > My point is simply that C++ should treat wide character file names in
> > this same OS dependent way. In C++ L"abc" and "abc" are two different
> > names but if an OS treats them as the same file, it is an
> > implementation issue.
> >
>
> The OS doesn't know anything about L"abc". That's something you write in
> your C or C++ source code to tell the compiler to generate a wide
> character string literal in an implementation-specific manner (which may
> be locale dependent).
Total agreement.
> Which means that the wide character string that
> you get can vary from compiler to compiler, even on the same OS.
It can, but in most cases I do not believe it will for a given locale. If
I use VC++ or C++ Builder to open a narrow character file name, I would
expect both to do essentially the same thing. I don't see that the result
with a wide character file name for a given locale should be very
different.
> And, of
> course, if the OS doesn't support wide character file names then you
> need to decide how to map this wide character string to a byte string
> for the OS. And again you've got a locale dependency.
Agreed. I think the use has to be largely implementation defined with some
possible guidelines given by the C++ standard. In actual use, let us say on
Windows which does specifically support wide character file names, I would
expect every compiler to treat it the same, as an eventual call to
CreateFileW with the name and the appropriate open parameters. That the name
itself may also be dependent on the locale I grant, but C++ already has many
locale specific functions, as it should.
I am not saying that the issues regarding wide character file names,
locales, or operating systems are easy to address and decide upon for the
C++ standard. I am just saying that they shouldn't be ignored because some
difficult decisions have to be made and a consensus must be created. I think
ignoring them and saying that the C++ standard library will not support them
because there is no common consensus on what they mean in current usage is
much worse than making some decision to support them in the appropriate
places in the library even if the final decision is not appropriate for all
OSs and computer systems which have C++ compilers.
I could be wrong, but the reality to me seems to be that wide characters,
however they are defined, will become more prevalent in the future of
computing and that the 8-bit character will become less important, if not
eventually obsolete. I don't expect to shoehorn cultures which have a rich
set of characters into a more limited set because we want to say that
ideographs are the past and alphabets the future. I expect the opposite.
Signs and symbols will need to become richer to reflect a richer reality.
This is just my philosophical bent, and the C++ committee may feel
otherwise.
They're not being ignored.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
> > that I just wrote?
I don't think that there is any doubt that the standard should simply
say "implementation defined". But we still should have some idea of
what to expect from a quality implementation before standardizing it.
I don't think that the problem is insolvable. Just that to date, no one
has really applied themselves to it. And that most of the people
requesting the feature don't seem to realize the problem.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)69 63198627
|> The OS doesn't know anything about L"abc". That's something you
|> write in your C or C++ source code to tell the compiler to
|> generate a wide character string literal in an
|> implementation-specific manner (which may be locale dependent).
There is something I don't understand here. What do you mean by saying
that the wide character string literal may be locale dependent? How
can the compiler know about the locale (which is a runtime feature)?
Or do you mean that it may depend on the locale settings when the
compiler is executing? But I thought that internally, the compiler
was more or less required to use the locale "C" -- I certainly don't
write 3,14159 in my C++ programs just because I have
LC_NUMERIC="fr_FR" in my environment.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)69 63198627
I should probably add that the reason I went to the C99 standard is that
the C++ standard says only that a wide string literal produces an array
that "is initialized with the given characters." This is, of course,
meaningless.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
In 6.4.5/5 the C99 standard (I don't have C89 handy) says that wide
character literals are used to initialize elements of a static array as
follows:
for wide string literals, the array elements have type
wchar_t, and are initialized with the sequence of wide
characters corresponding to the multibyte character
sequence, as defined by the mbstowcs function with an
implementation-defined current locale.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)