A very puzzling problem: cout vs. wcout, fstream vs. wfstream

tomgee

Jun 1, 2006, 7:21:53 AM

Here is a string I need to write to the console or a file
"Start date or time can't be later than ending date or time!"

Below is the demo code (compiled with VC2005):

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main(int argc, char *argv[])
{
    // In the original source file the apostrophe is U+2019, not ASCII 0x27
    // (see the hex view of the string further down).
    string s("Start date or time can't be later than ending date or time!");
    wstring ws(L"Start date or time can't be later than ending date or time!");

    ofstream f("f.txt");
    wofstream wf("wf.txt");

    f << s << endl;
    wf << ws << endl;

    cout << s << endl;
    wcout << ws << endl;

    system("PAUSE");
    return EXIT_SUCCESS;
}

To my great surprise, it works OK when using cout and fstream: it shows
the whole string.
However, when it comes to wcout and wofstream, both cut the string and
stop when they come to "'", so the output is "Start date or time
can".

After looking at the hex view of this string, I found that the code for
"'" is in fact Unicode 0x2019, not the ASCII code 0x27.

Could this be the problem? How can it be explained? Thanks.

A hex view of the string is attached.

0008b357:53 00 74 00 61 00 72 00 74 00 20 00 64 00 61 00  S.t.a.r.t. .d.a.
0008b367:74 00 65 00 20 00 6f 00 72 00 20 00 74 00 69 00  t.e. .o.r. .t.i.
0008b377:6d 00 65 00 20 00 63 00 61 00 6e 00 19 20 74 00  m.e. .c.a.n.. t.
0008b387:20 00 62 00 65 00 20 00 6c 00 61 00 74 00 65 00   .b.e. .l.a.t.e.
0008b397:72 00 20 00 74 00 68 00 61 00 6e 00 20 00 65 00  r. .t.h.a.n. .e.
0008b3a7:6e 00 64 00 69 00 6e 00 67 00 20 00 64 00 61 00  n.d.i.n.g. .d.a.
0008b3b7:74 00 65 00 20 00 6f 00 72 00 20 00 74 00 69 00  t.e. .o.r. .t.i.
0008b3c7:6d 00 65 00 21 00                                m.e.!.


kanze

Jun 3, 2006, 7:52:22 AM

tomgee wrote:
> Here is a string I need to write to the console or a file
> "Start date or time can't be later than ending date or time!"

> To my great surprise, it works OK when using cout and fstream:
> it shows the whole string. However, when it comes to wcout and
> wofstream, both cut the string and stop when they come to "'",
> so the output is "Start date or time can".

> After looking at the hex view of this string, I found that the
> code for "'" is in fact Unicode 0x2019, not the ASCII code
> 0x27.

Whose interpretation depends on the imbued locale. In locale
"C" (the default), I think that only characters in the basic
execution character set are supported.

I'm not too familiar with the Windows environment, but under
Unix, practically the first lines in main in any program which
uses text which might not be in English will be:
    std::locale::global( std::locale( "" ) ) ;
    setlocale( LC_ALL, "" ) ;
In theory, that should work, but the interactions between C/C++
locales and the OS's view of things aren't always that clear.
(Set your window's font to Zapf Dingbats, for example, and the
output is going to look very weird, even in locale "C". And
there's nothing that the library or the language can do about it.)

Having said that, I think it is a questionable decision to map
characters in the basic character set to anything but the first
128 entries in Unicode. Clearly legal ("The values of the
execution character sets are implementation defined", §2.2/3),
but rather surprising. This is an awkward point in general;
logically, the mapping should correspond to the locale which is
active at the time. Except that the mapping must be fixed at
compile time, and the locale isn't fixed until runtime. In this
case, however, I think that Windows is pretty much uniquely
Unicode for wchar_t, where 0x2019 is a RIGHT SINGLE QUOTATION
MARK, and the ASCII character in the basic source character set
is an APOSTROPHE, which is a different character, and should map
to 0x0027.

It's an interesting point, however, that the standard doesn't
name the characters in the basic source character set, but only
gives a graphical representation. Presumably mapping the
character which looks like an 'A' to 0x0391 would be legal as
well. But as I said, this is an awkward problem as well: the
graphic representation in the standard has serifs; if the Greek
fonts on my machine display with serifs, and the Latin fonts
without, mapping it to 0x0391 would arguably be better.

I think we all know and agree what is actually meant and
desired. Or rather, I thought we did -- your experience suggests
that there are some people who interpret it differently.
Although historically... C comes from the world of ASCII. I
would suppose that when the standard uses a graphic
representation for a character, what it means is the character
in ASCII which has the graphic representation closest to the
representation in the standard -- that all characters in the
basic source character set are present in ASCII. In this case,
APOSTROPHE and LATIN CAPITAL LETTER A are characters in ASCII,
RIGHT SINGLE QUOTATION MARK and GREEK CAPITAL LETTER ALPHA are
not. Given the historical background, I find it very difficult
to imagine that either of these alternative mappings conforms
to the spirit of the standard. (Note too that in this regard,
changing ASCII to EBCDIC, above, doesn't change anything. And
reading K&R1, it seems to me that any ambiguity concerning the
actual encoding of the characters is to allow EBCDIC, or other
historical 7 or 8 bit code sets. And not to allow interpreting
characters in the basic source character set to be characters
not present in ASCII or EBCDIC.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Michiel...@tomtom.com

Jun 3, 2006, 10:04:52 AM

tomgee wrote:
> #include <fstream>
> #include <string>
>
> using namespace std;
>
> int main(int argc, char *argv[])
> {
> string s("Start date or time can't be later than ending date or
> time!");
> wstring ws(L"Start date or time can't be later than ending date
> or time!");

> cout << s <<endl;
> wcout << ws<<endl;

You can't mix those; you have to choose. In addition, AFAIK on many
systems wcout is more or less broken. Even if wchar_t supports Unicode,
wcout may not.

HTH,
Michiel Salters

James Kanze

Jun 3, 2006, 8:40:15 PM

Michiel...@tomtom.com wrote:
> tomgee wrote:
>> #include <fstream>
>> #include <string>

>> using namespace std;

>> int main(int argc, char *argv[])
>> {
>> string s("Start date or time can't be later than ending date or
>> time!");
>> wstring ws(L"Start date or time can't be later than ending date
>> or time!");

>> cout << s <<endl;
>> wcout << ws<<endl;

> You can't mix those; you have to choose.

I think he just put the two cases in one program for the
example. I don't really think his actual application output the
same text twice, once as char, and then as wchar_t.

> In addition, AFAIK on
> many systems wcout is more or less broken. Even if wchar_t
> supports Unicode, wcout may not.

What wcout supports depends on the locale it is imbued with. I
*think* that that must be locale "C". I also think that one
could argue that locale "C" should only support the basic narrow
character set.

I mentioned in an earlier posting that you should take pains to
set up the global locale correctly when starting a program which
uses internationalized text. I forgot that in C++, since each
stream has its own copy of the locale, copied when the stream
was created, you also have to imbue the standard streams with
the new global locale.
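
A minimal sketch of that startup sequence, assuming the user's preferred locale (the one named "") is the one wanted. The helper name use_user_locale is made up for illustration and does not come from this thread. It sets the global locale and then imbues the standard wide streams, which otherwise keep the locale they were created with. Whether the resulting locale can actually represent a given character, such as U+2019, still depends on the platform.

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>

// Hypothetical helper (not from the thread): leave the default "C" locale
// and propagate the new locale to the standard wide streams.
inline std::locale use_user_locale()
{
    std::locale loc;                    // copy of the current global locale
    try {
        loc = std::locale("");          // user-preferred locale from the environment
    } catch (const std::runtime_error&) {
        // The environment names a locale that is not available;
        // keep the current global locale instead.
    }
    std::locale::global(loc);           // used by streams created from now on
    std::wcout.imbue(loc);              // existing streams must be imbued explicitly
    std::wcerr.imbue(loc);
    std::wcin.imbue(loc);
    return loc;
}
```

The call would normally go at the very top of main, before any output is written to the standard streams.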

--
James Kanze kanze...@neuf.fr

Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

9 place Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34

Allan W

Jun 5, 2006, 7:33:43 PM

tomgee wrote:
> Here is a string I need to write to the console or a file
> "Start date or time can't be later than ending date or time!"
>
> here below is the demo code(compiled with VC2005)

I tried it with VC2003, and didn't have this problem.

> To my great surprise, it works OK when using cout and fstream: it shows
> the whole string.
> However, when it comes to wcout and wofstream, both cut the string and
> stop when they come to "'", so the output is "Start date or time
> can".
>
> After looking at the hex view of this string, I found that the code for
> "'" is in fact Unicode 0x2019, not the ASCII code 0x27.

I specifically checked for this... didn't happen for me.

> could this be the problem? how to explain it. Thanks.

First, I don't see how that could cause the problem.

> hex view of the string attached.

Second, may I suggest doing the same hex view of the source code?
Maybe you thought you used an apostrophe in both strings, but really
one of them was different...?
I suppose that some new "feature" in VC2005 screwed this up, but
somehow I doubt it. It's going to take a few weeks before I get a
chance to test this with VC2005... but check your source code,
and make sure it really is identical.
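
One way to make that check without a hex editor is to scan the string at run time. This is only a sketch; the function name non_ascii_positions is invented for illustration. It reports the index and numeric value of every character above 0x7F, so a U+2019 hiding where an apostrophe was expected shows up immediately.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: list every character outside 7-bit ASCII
// as a pair of (index in the string, numeric value of the character).
inline std::vector<std::pair<std::size_t, unsigned long>>
non_ascii_positions(const std::wstring& text)
{
    std::vector<std::pair<std::size_t, unsigned long>> hits;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned long value = static_cast<unsigned long>(text[i]);
        if (value > 0x7F) {
            hits.emplace_back(i, value);
        }
    }
    return hits;
}
```

An empty result means the string is plain ASCII, in which case the truncation would need another explanation.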

tomgee

Jun 6, 2006, 6:06:42 PM

Michiel...@tomtom.com wrote:

> You can't mix those; you have to choose. In addition, AFAIK on many
> systems wcout is more or less broken. Even if wchar_t supports Unicode,
> wcout may not.

This piece of code is just a demo. However, you can in fact mix those;
I did that once because someone had put both ANSI and Unicode text in
one binary file, and I had to parse it.

Here are two excellent links on mixing both worlds:
http://www.codeproject.com/string/cppstringguide1.asp
http://www.codeproject.com/string/cppstringguide2.asp

tomgee

Jun 6, 2006, 6:07:57 PM

Allan W wrote:

>
> I tried it with VC2003, and didn't have this problem.
>

Please check in a hex view whether the "'" is 0x2019, as I showed in the
original post.

If not, please change it to that with a hex editor.

I copied the string here, but it seems it was changed to 0x27 somewhere
en route.

tomgee

Jun 6, 2006, 6:07:07 PM

James Kanze wrote:

> What wcout supports depends on the locale it is imbued with. I
> *think* that that must be locale "C". I also think that one
> could argue that locale "C" should only support the basic narrow
> character set.
>
> I mentioned in an earlier posting that you should take pains to
> set up the global locale correctly when starting a program which
> uses internationalized text. I forgot that in C++, since each
> stream has its own copy of the locale, copied when the stream
> was created, you also have to imbue the standard streams with
> the new global locale.

Thanks, this explanation, as well as the elaboration in your last reply,
casts some light. I'll try locales soon.

Honestly, I never paid attention to locales before, even though I am
Chinese myself.
I always just accepted the default. To me, locales were nothing
more than date and currency formats. It seems it's time to learn about them.

One more question: how come the stream stops at the character
0x2019, instead of going on with messy characters, which would seem more
reasonable to me?

Francis Glassborow

Jun 7, 2006, 5:07:24 PM

In article <1149603440.1...@g10g2000cwb.googlegroups.com>,
tomgee <RockO...@gmail.com> writes

>
>Allan W wrote:
>
>>
>> I tried it with VC2003, and didn't have this problem.
>>
>please hexview the string if the "'" is 0x2019, as I showed in the
>original post.

I wonder if the fact that 0x20 is a space in 8-bit ASCII has anything to
do with it.

--
Francis Glassborow ACCU
Author of 'You Can Do It!' and "You Can Program in C++"
see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects

kanze

Jun 7, 2006, 5:08:13 PM

tomgee wrote:
> James Kanze wrote:

> > What wcout supports depends on the locale it is imbued with.
> > I *think* that that must be locale "C". I also think that
> > one could argue that locale "C" should only support the
> > basic narrow character set.

> > I mentioned in an earlier posting that you should take
> > pains to set up the global locale correctly when starting a
> > program which uses internationalized text. I forgot that in
> > C++, since each stream has its own copy of the locale,
> > copied when the stream was created, you also have to imbue
> > the standard streams with the new global locale.

> Thanks, this explanation, as well as the elaboration in your
> last reply, casts some light. I'll try locales soon.

> Honestly, I never paid attention to locales before, even
> though I am Chinese myself. I always just accepted the
> default. To me, locales were nothing more than date and
> currency formats. It seems it's time to learn about them.

Locales affect a lot of things. Character encoding gets mixed
up with them (in sometimes unpleasant ways -- whether I'm using
ISO 8859-1 or UTF-8 is really independent of whether I'm working
in French or German), because of course, locale dependent
functions like isalpha depend on the character encoding.

The default locale, if you do nothing, is locale "C". This is
also the simplest locale, and usually doesn't support anything
beyond basic ASCII -- on my machine, isalpha( 'é' ) returns
false in locale "C". This is rarely useful, although it is
certainly the simplest, safest and probably least surprising
choice. It also works if your code is only used in English
speaking environments.

> One more question: how come the stream stops at the
> character 0x2019, instead of going on with messy characters,
> which would seem more reasonable to me?

It's hard to be sure, but if the conversion functions return an
error, filebuf::overflow will return traits::eof(), which will
cause the basic_ostream function doing the output to set
failbit. And once failbit has been set, all further operations
on the stream are no-ops until it has been cleared.
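
The mechanism can be reproduced without depending on any particular locale. The sketch below is an illustration, not the library's actual file buffer: AsciiOnlyBuf is a made-up stream buffer whose overflow returns traits_type::eof() for any character above 0x7F, standing in for a failed code conversion. Which error flag gets set (failbit or badbit) may differ between implementations; either one makes the stream refuse further output until clear() is called.

```cpp
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical stand-in for a file buffer whose code conversion fails:
// it accepts 7-bit ASCII and reports an error for anything else.
class AsciiOnlyBuf : public std::wstreambuf {
public:
    const std::wstring& accepted() const { return accepted_; }

protected:
    // No put area is set up, so every character arrives here.
    int_type overflow(int_type ch) override
    {
        if (traits_type::eq_int_type(ch, traits_type::eof())) {
            return traits_type::not_eof(ch);   // nothing to write
        }
        const wchar_t c = traits_type::to_char_type(ch);
        if (static_cast<unsigned long>(c) > 0x7F) {
            return traits_type::eof();         // simulated conversion error
        }
        accepted_ += c;
        return ch;
    }

private:
    std::wstring accepted_;
};

// Writes text containing U+2019, then shows that output stays suppressed
// until the error state is cleared. Returns what the buffer accepted.
inline std::wstring demo_truncated_output()
{
    AsciiOnlyBuf buf;
    std::wostream out(&buf);
    out << L"can\u2019t be later";   // stops at U+2019; the stream enters an error state
    out << L" [ignored]";            // no-op: the stream is no longer good()
    out.clear();                     // reset the error flags
    out << L" [resumed]";            // output works again
    return buf.accepted();
}
```

The text before the offending character gets through, the rest of that write and everything after it is dropped, and output resumes only after clear(). That matches the "Start date or time can" truncation reported at the top of the thread.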

--
James Kanze GABI Software

Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

kanze

Jun 8, 2006, 6:50:00 PM

Francis Glassborow wrote:
> In article <1149603440.1...@g10g2000cwb.googlegroups.com>,
> tomgee <RockO...@gmail.com> writes

> >Allan W wrote:

> >> I tried it with VC2003, and didn't have this problem.

> >please hexview the string if the "'" is 0x2019, as I showed
> >in the original post.

> I wonder if the fact that 0x20 is a space in 8-bit ASCII has
> anything to do with it.

Well, it's not the first 0x20 (space) in his string. I thought
for a moment it might be related to the EOF character under DOS,
which I think Windows still respects, but this is 0x1A, rather
than 0x19. In the end, I suspect that it is simply that he is
outputting in the "C" locale, that the system supposes that "C"
is either pure ASCII, or maybe ISO 8859-1, that it cannot
translate the character into this codeset, and so sets the
failbit. But I'm just guessing, really. (IMHO: ASCII would be
in some way the most "correct", but I suspect that ISO 8859-1
would probably be more useful, and less surprising for the naïve
user.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Francis Glassborow

Jun 8, 2006, 7:16:47 PM

In article <1149673212....@j55g2000cwa.googlegroups.com>, kanze
<ka...@gabi-soft.fr> writes

>The default locale, if you do nothing, is locale "C". This is
>also the simplest locale, and usually doesn't support anything
>beyond basic ASCII -- on my machine, isalpha( 'é' ) returns
>false in locale "C". This is rarely useful, although it is
>certainly the simplest, safest and probably least surprising
>choice. It also works if your code is only used in English
>speaking environments.

No, it doesn't, at least not in British English environments. As an
example, the correct spelling of naive is naïve (and to write that in my
newsreader I had to copy/paste it from elsewhere, or else dig into the
documentation to discover how to type it directly from my keyboard).
That is just one example of a correct English word that cannot be
readily typed on a UK keyboard :-)

The symbols used for writing English words are substantially greater in
number than the 26 letters of the English alphabet + standard
punctuation symbols.

I believe that US English also sometimes uses extra symbols but I think
that it is always correct there to force spellings using only the 26
letters of the alphabet.

--
Francis Glassborow ACCU
Author of 'You Can Do It!' and "You Can Program in C++"
see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects

kanze

Jun 9, 2006, 6:14:07 PM

Francis Glassborow wrote:
> In article <1149673212....@j55g2000cwa.googlegroups.com>, kanze
> <ka...@gabi-soft.fr> writes

> >The default locale, if you do nothing, is locale "C". This is also
> >the simplest locale, and usually doesn't support anything beyond
> >basic ASCII -- on my machine, isalpha( 'é' ) returns false in locale
> >"C". This is rarely useful, although it is certainly the simplest,
> >safest and probably least surprising choice. It also works if your
> >code is only used in English speaking environments.

> No, it doesn't, at least not in British English environments. As an
> example, the correct spelling of naive is naïve (and to write that in
> my newsreader I had to copy/paste it from elsewhere, or else dig into
> the documentation to discover how to type it directly from my
> keyboard). That is just one example of a correct English word that
> cannot be readily typed on a UK keyboard :-)

Well, it's certainly A correct spelling; I'm not sure it's the only one.
(The online dictionary I use gives naive as the preferred spelling, with
naïve as an alternative. But it's an American dictionary.) Ditto cases
like encyclopedia (which started with an ae ligature when I was a kid).

And of course, even in a purely English speaking environment, you might
want to output the name of some Czech or Pole -- or a French wine. In
the case of the Czech or the Pole, even ISO 8859-1 won't help.

(BTW: I have no trouble getting ï on the US keyboard on my Sparc. In
either vim or emacs. The same thing works under Linux or Windows:
although my Linux machine has a German keyboard and my Windows machine a
French one, I normally install the US drivers and use them by default --
there's no way I'm going to type C++ on a keyboard which doesn't have {,
}, [, ], \ or | :-), even when the comments are in French.)

> The symbols used for writing English words are substantially greater
> in number than the 26 letters of the English alphabet + standard
> punctuation symbols.

> I believe that US English also sometimes uses extra symbols but I
> think that it is always correct there to force spellings using only
> the 26 letters of the alphabet.

Well, it's always correct, even in British English, in the same sense
that it is "correct" to omit accents on capital letters in French: the
rules are more or less officially bent to accommodate what is technically
possible; for typewritten text, what a normal, mechanical typewriter
can support. In the case of French, at least, the bending is only a
tolerance, and only considered acceptable when conditioned by such
mechanical limitations.

It's also true that you can't generate anything acceptable for a
variable width font with just US ASCII. The use of a single " character,
instead of different forms for opening and closing quotes, is also a
tolerance. More precisely stated, perhaps, straight ASCII works where
the only text IO is typewriter like text in English. You're correct in
pointing out that in general, our expectations have increased, and
except for simple things like log and configuration files, we generally
expect better quality output than that which can be simply achieved with
just ASCII characters. Even in American English.

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

tomgee

Jun 22, 2006, 7:02:46 AM

This seems reasonable.

I imbued wcout/wfstream with a locale other than the default,
say, Chinese.
They can now output the whole string, which seems to confirm your
analysis.
<CODE>

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main(int argc, char *argv[])
{
    string s("Start date or time can’t be later than ending date or time!");
    wstring ws(L"Start date or time can’t be later than ending date or time!");

    ofstream f("f.txt");
    wofstream wf("wf.txt");

    // f.imbue(locale("chinese"));    // this line has no impact on ofstream
    f << s << endl;

    // wf.imbue(locale("chinese"));   // this line lets wofstream output the whole string
    wf << ws << endl;

    // cout.imbue(locale("chinese")); // this line has no impact on cout
    cout << s << endl;

    wcout.imbue(locale("chinese"));   // this line lets wcout output the whole string
    wcout << ws << endl;

    system("PAUSE");
    return EXIT_SUCCESS;
}

<OUTPUT>
Start date or time can’t be later than ending date or time!
Start date or time canˇt be later than ending date or time!

One thing still puzzles me: it's reasonable for wcout/wfstream to
present the string according to the locale, but how come cout/fstream
are able to present the string as it is in the source code, no matter
what locale is set?

Thanks.
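
One likely part of the answer, offered as a sketch and not as a statement about VC2005 internals: a narrow file stream converts through the locale's codecvt<char, char, mbstate_t> facet, which the standard defines as a no-op, so the bytes of the narrow literal are written unchanged whatever locale is imbued. A wide stream goes through codecvt<wchar_t, char, mbstate_t>, which does real work and can fail. The two helper names below are invented for illustration; the query they make, always_noconv(), is a standard member of the facet.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical helpers: ask a locale whether the conversion facet used by
// file streams is a pure pass-through for narrow and for wide characters.
inline bool narrow_is_passthrough(const std::locale& loc)
{
    typedef std::codecvt<char, char, std::mbstate_t> Facet;
    return std::use_facet<Facet>(loc).always_noconv();
}

inline bool wide_is_passthrough(const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    return std::use_facet<Facet>(loc).always_noconv();
}
```

With std::locale::classic(), the narrow query returns true, and on the implementations I am aware of the wide query returns false. That would explain why imbuing a different locale changes what wcout and wofstream can write but leaves cout and ofstream untouched.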

kanze wrote:

> > One more question: how come the stream stops at the
> > character 0x2019, instead of going on with messy characters,
> > which would seem more reasonable to me?
>
> It's hard to be sure, but if the conversion functions return an
> error, filebuf::overflow will return traits::eof(), which will
> cause the basic_ostream function doing the output to set
> failbit. And once failbit has been set, all further operations
> on the stream are no-ops until it has been cleared.