Here is a string I need to write to the console or to a file: "Start date or time can't be later than ending date or time!"
Below is the demo code (compiled with VC2005):
#include <cstdlib>   // system, EXIT_SUCCESS
#include <fstream>
#include <iostream>  // cout, wcout
#include <string>
using namespace std;

int main(int argc, char *argv[])
{
    // The same text as a narrow and as a wide string.
    string s("Start date or time can't be later than ending date or time!");
    wstring ws(L"Start date or time can't be later than ending date or time!");

    ofstream f("f.txt");
    wofstream wf("wf.txt");

    // Write both strings to files...
    f << s << endl;
    wf << ws << endl;

    // ...and to the console.
    cout << s << endl;
    wcout << ws << endl;

    system("PAUSE");
    return EXIT_SUCCESS;
}
To my great surprise, it works OK with cout and ofstream: the whole string is shown. With wcout and wofstream, however, the output is cut off at the "'", so the result is "Start date or time can".
After looking at a hex view of this string, I found that the code for "'" is in fact Unicode 0x2019, not the ASCII code 0x27.
Could this be the problem? How can it be explained? Thanks.
tomgee wrote:
> Here is a string I need to write to the console or a file
> "Start date or time can't be later than ending date or time!"
> To my great surprise, it works OK when using cout and fstream:
> it shows the whole string. However when it comes to wcout and
> wofstream, both cut the string and stop when they come to "'",
> so the output is "Start date or time can".
> After looking at the hex view of this string, I found that the
> code for "'" is in fact UNICODE 0x2019, not the ascii code
> 0x27.
Whose interpretation depends on the imbued locale. In locale "C" (the default), I think that only characters in the basic execution character set are supported.
I'm not too familiar with the Windows environment, but under Unix, practically the first lines in main in any program which uses text which might not be in English will be:
    std::locale::global( std::locale( "" ) ) ;
    setlocale( LC_ALL, "" ) ;
In theory, that should work, but the interactions between C/C++ locales and the OS's view of things aren't always that clear. (Set your window's font to Zapf Dingbats, for example, and the output is going to look very weird, even in locale "C". And there's nothing that the library or the language can do about it.)
Having said that, I think it is a questionable decision to map characters in the basic character set to anything but the first 128 entries in Unicode. Clearly legal ("The values of the execution character sets are implementation defined", §2.2/3), but rather surprising. This is an awkward point in general; logically, the mapping should correspond to the locale which is active at the time. Except that the mapping must be fixed at compile time, and the locale isn't fixed until runtime. In this case, however, I think that Windows is pretty much uniquely Unicode for wchar_t, where 0x2019 is a RIGHT SINGLE QUOTATION MARK, and the ASCII character in the basic source character set is an APOSTROPHE, which is a different character, and should map to 0x0027.
It's an interesting point, however, that the standard doesn't name the characters in the basic source character set, but only gives a graphical representation. Presumably mapping the character which looks like an 'A' to 0x0391 would be legal as well. But as I said, this is an awkward problem as well: the graphic representation in the standard has serifs; if the Greek fonts on my machine display with serifs, and the Latin fonts without, mapping it to 0x0391 would arguably be better.
I think we all know and agree what is actually meant and desired. Or rather, I thought we did -- your experience suggests that there are some people who interpret it differently. Although historically... C comes from the world of ASCII. I would suppose that when the standard uses a graphic representation for a character, what it means is the character in ASCII which has the graphic representation closest to the representation in the standard -- that all characters in the basic source character set are present in ASCII. In this case, APOSTROPHE and LATIN CAPITAL LETTER A are characters in ASCII, RIGHT SINGLE QUOTATION MARK and GREEK CAPITAL LETTER ALPHA are not. Given the historical background, I find it very difficult to imagine that either of these alternative mappings conforms to the spirit of the standard. (Note too that in this regard, changing ASCII to EBCDIC, above, doesn't change anything. And reading K&R1, it seems to me that any ambiguity concerning the actual encoding of the characters is there to allow EBCDIC, or other historical 7 or 8 bit code sets, and not to allow interpreting characters in the basic source character set as characters not present in ASCII or EBCDIC.)
-- James Kanze GABI Software Conseils en informatique orientée objet/ Beratung in objektorientierter Datenverarbeitung 9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
> int main(int argc, char *argv[])
> {
> string s("Start date or time can't be later than ending date or
> time!");
> wstring ws(L"Start date or time can't be later than ending date
> or time!");
> cout << s <<endl;
> wcout << ws<<endl;
You can't mix those; you have to choose. In addition, AFAIK on many systems wcout is more or less broken. Even if wchar_t supports Unicode, wcout may not.
Michiel.Salt...@tomtom.com wrote:
> tomgee wrote:
>> #include <fstream>
>> #include <string>
>> using namespace std;
>> int main(int argc, char *argv[])
>> {
>> string s("Start date or time can't be later than ending date or
>> time!");
>> wstring ws(L"Start date or time can't be later than ending date
>> or time!");
>> cout << s <<endl;
>> wcout << ws<<endl;
> You can't mix those; you have to choose.
I think he just put the two cases in one program for the example. I don't really think his actual application outputs the same text twice, once as char, and then as wchar_t.
> In addition, AFAIK on
> many systems wcout is more or less broken. Even if wchar_t
> supports Unicode, wcout may not.
What wcout supports depends on the locale it is imbued with. I *think* that that must be locale "C". I also think that one could argue that locale "C" should only support the basic narrow character set.
I mentioned in an earlier posting that you should take pains to set up the global locale correctly when starting a program which uses internationalized text. I forgot that in C++, since each stream has its own copy of the locale, copied when the stream was created, you also have to imbue the standard streams with the new global locale.
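A minimal sketch of the start-up sequence described above: set the global locale from the environment, then imbue the already existing standard streams with it. The helper name use_environment_locale is hypothetical, and falling back to the classic "C" locale when the environment names a locale the implementation does not know is an assumption of this sketch, not something stated in the thread.

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>

// Hypothetical helper: call it once, at the very start of main.
std::locale use_environment_locale()
{
    std::locale loc = std::locale::classic();   // locale "C" as a fallback
    try {
        loc = std::locale("");                  // the user's preferred locale
    } catch (const std::runtime_error&) {
        // The environment names a locale this implementation does not
        // support; stay with "C" (an assumption made for this sketch).
    }

    std::locale::global(loc);   // picked up by streams created from now on

    // The standard streams already exist and hold their own copy of the
    // old locale, so each one has to be imbued explicitly.
    std::cin.imbue(loc);
    std::cout.imbue(loc);
    std::cerr.imbue(loc);
    std::wcin.imbue(loc);
    std::wcout.imbue(loc);
    std::wcerr.imbue(loc);
    return loc;
}
```

File streams constructed after this call start out with the new global locale; a stream constructed earlier would need its own imbue.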
-- James Kanze kanze.ja...@neuf.fr Conseils en informatique orientée objet/ Beratung in objektorientierter Datenverarbeitung 9 place Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34
tomgee wrote:
> Here is a string I need to write to the console or a file
> "Start date or time can't be later than ending date or time!"
> here below is the demo code(compiled with VC2005)
I tried it with VC2003, and didn't have this problem.
> To my great surprise, it works OK when using cout and fstream: it shows
> the whole string.
> However when it comes to wcout and wofstream, both cut the string and
> stop when they come to "'", so the output is "Start date or time
> can".
> After looking at the hex view of this string, I found that the code for
> "'" is in fact UNICODE 0x2019, not the ascii code 0x27.
I specifically checked for this... didn't happen for me.
> could this be the problem? how to explain it. Thanks.
First, I don't see how that could cause the problem.
> hex view of the string attached.
Second, may I suggest doing the same hex view of the source code? Maybe you thought you used an apostrophe in both strings, but really one of them was different...? I suppose it's possible that some new "feature" in VC2005 screwed this up, but somehow I doubt it. It's going to take a few weeks before I get a chance to test this with VC2005... but check your source code, and make sure it really is identical.
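One way to do that check from inside the program, rather than in a hex editor, is to print the numeric value of each wchar_t in the string. This is only a sketch; the function name hex_code_units is hypothetical.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format every wchar_t of the text as a
// four-digit (or longer) uppercase hex value, separated by spaces.
std::string hex_code_units(const std::wstring& text)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (i != 0) {
            out << ' ';
        }
        // setw only applies to the next insertion, so set it each time.
        out << std::setw(4) << static_cast<unsigned long>(text[i]);
    }
    return out.str();
}
```

With a real apostrophe, hex_code_units(L"n't") gives "006E 0027 0074"; with the typographic quote written as a universal character name, hex_code_units(L"n\u2019t") gives "006E 2019 0074". Because the result is a narrow string, it can be written to cout even when wcout is failing.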
Michiel.Salt...@tomtom.com wrote:
> You can't mix those; you have to choose. In addition, AFAIK on many
> systems wcout is more or less broken. Even if wchar_t supports Unicode,
> wcout may not.
This piece of code is just a demo. In fact, however, you can mix them. I did that once because someone had put both ANSI and Unicode text in one binary file, and I had to parse it.
James Kanze wrote:
> What wcout supports depends on the locale it is imbued with. I
> *think* that that must be locale "C". I also think that one
> could argue that locale "C" should only support the basic narrow
> character set.
> I mentioned in an earlier posting that you should take pains to
> set up the global locale correctly when starting a program which
> uses internationalized text. I forgot that in C++, since each
> stream has its own copy of the locale, copied when the stream
> was created, you also have to imbue the standard streams with
> the new global locale.
Thanks, this explanation, as well as the elaboration in your last reply, sheds some light. I'll try locales soon.
Honestly, I never paid attention to locales before, even though I am Chinese myself. I always just accepted the default. To me, locales were nothing more than date and currency formats. It seems it's time to learn about them.
One more question: how come the stream stops at the character 0x2019, instead of going on with garbled characters, which would seem more reasonable to me?
tomgee wrote:
> James Kanze wrote:
> > What wcout supports depends on the locale it is imbued with.
> > I *think* that that must be locale "C". I also think that
> > one could argue that locale "C" should only support the
> > basic narrow character set.
> > I mentioned in an earlier posting that you should take
> > pains to set up the global locale correctly when starting a
> > program which uses internationalized text. I forgot that in
> > C++, since each stream has its own copy of the locale,
> > copied when the stream was created, you also have to imbue
> > the standard streams with the new global locale.
> Thanks, this explanation as well as the elaboration in your
> last reply cast some light. I'll try locale soon.
> honestly, I never paid attention to locales before, even
> though I am a Chinese myself. I had thought -- I always accept
> the default. Locales to me are nothing more than date and
> currency formats. Seems it's time to learn it.
Locales affect a lot of things. Character encoding gets mixed up with them (in sometimes unpleasant ways -- whether I'm using ISO 8859-1 or UTF-8 is really independent of whether I'm working in French or German), because of course, locale dependent functions like isalpha depend on the character encoding.
The default locale, if you do nothing, is locale "C". This is also the simplest locale, and usually doesn't support anything beyond basic ASCII -- on my machine, isalpha( 'é' ) returns false in locale "C". This is rarely useful, although it is certainly the simplest, safest and probably least surprising choice. It also works if your code is only used in English speaking environments.
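A small sketch of the isalpha behaviour described above, using the two-argument std::isalpha from <locale> with the classic "C" locale. The helper name is_alpha_in_c_locale is hypothetical, and the byte 0xE9 assumes that 'é' is encoded as ISO 8859-1.

```cpp
#include <locale>

// Hypothetical helper: classify a narrow character with the classic
// "C" locale, whatever the current global locale happens to be.
bool is_alpha_in_c_locale(char ch)
{
    return std::isalpha(ch, std::locale::classic());
}
```

On the implementations I would expect people to meet, is_alpha_in_c_locale('e') is true while is_alpha_in_c_locale('\xE9') is false, which matches the observation above; a locale that knows about ISO 8859-1 would classify the second character as alphabetic.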
> One more question is: how come the stream stops at the
> character 0x2019, instead of going on with messy characters,
> which seems more reasonable to me?
It's hard to be sure, but if the conversion functions return an error, filebuf::overflow will return traits::eof(), which will cause the basic_ostream function doing the output to set failbit. And once failbit has been set, all further operations on the stream are no-ops until it has been cleared.
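The mechanism can be imitated without a real file. The sketch below uses a stream buffer whose overflow returns eof() for anything outside ASCII, as a stand-in for a filebuf whose conversion fails; it is not the VC2005 implementation. AsciiOnlyBuf, WriteResult and write_twice are hypothetical names. Whether a given library sets failbit or badbit at this point may differ, so the sketch only looks at fail(), which reports either.

```cpp
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical buffer: accepts ASCII, reports an error for the rest.
class AsciiOnlyBuf : public std::wstreambuf {
public:
    const std::string& written() const { return bytes_; }

protected:
    // No put area is set up, so every character arrives here.
    int_type overflow(int_type ch = traits_type::eof()) override
    {
        if (traits_type::eq_int_type(ch, traits_type::eof())) {
            return traits_type::not_eof(ch);     // nothing to write
        }
        if (static_cast<unsigned long>(ch) > 0x7FUL) {
            return traits_type::eof();           // "conversion" failed
        }
        bytes_.push_back(static_cast<char>(ch));
        return ch;
    }

private:
    std::string bytes_;
};

struct WriteResult {
    std::string written;   // what actually reached the buffer
    bool failed;           // stream state after the first insertion
};

// Insert two strings in a row and report what got through.
WriteResult write_twice(const std::wstring& first, const std::wstring& second)
{
    AsciiOnlyBuf buf;
    std::wostream out(&buf);
    out << first;
    const bool failed = out.fail();
    out << second;         // a no-op if the first insertion failed
    return WriteResult{buf.written(), failed};
}
```

write_twice(L"can\u2019t be", L" later") leaves only "can" in the buffer, with failed set: the output stops at the offending character and the second insertion writes nothing. Calling out.clear() before the second insertion would let it through again.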
-- James Kanze
Francis Glassborow wrote:
> In article <1149603440.104807.114...@g10g2000cwb.googlegroups.com>,
> tomgee <RockOnE...@gmail.com> writes
> >Allan W wrote:
> >> I tried it with VC2003, and didn't have this problem.
> >please hexview the string if the "'" is 0x2019, as I showed
> >in the original post.
> I wonder if the fact that 0x20 is a space in 8-bit ASCII has
> anything to do with it.
Well, it's not the first 0x20 (space) in his string. I thought for a moment it might be related to the EOF character under DOS, which I think Windows still respects, but this is 0x1A, rather than 0x19. In the end, I suspect that it is simply that he is outputting in the "C" locale, that the system supposes that "C" is either pure ASCII, or maybe ISO 8859-1, that it cannot translate the character into this codeset, and so sets the failbit. But I'm just guessing, really. (IMHO: ASCII would be in some way the most "correct", but I suspect that ISO 8859-1 would probably be more useful, and less surprising for the naïve user.)
-- James Kanze
In article <1149673212.669241.62...@j55g2000cwa.googlegroups.com>, kanze <ka...@gabi-soft.fr> writes
>The default locale, if you do nothing, is locale "C". This is
>also the simplest locale, and usually doesn't support anything
>beyond basic ASCII -- on my machine, isalpha( 'é' ) returns
>false in locale "C". This is rarely useful, although it is
>certainly the simplest, safest and probably least surprising
>choice. It also works if your code is only used in English
>speaking environments.
No, it doesn't, at least not in British English environments. As an example, the correct spelling of naive is naïve (and to write that in my newsreader I had to copy/paste it from elsewhere, or dig into the documentation to discover how to type it directly from my keyboard). That is just one example of a correct English word that cannot be readily typed on a UK keyboard :-)
The symbols used for writing English words are substantially greater in number than the 26 letters of the English alphabet + standard punctuation symbols.
I believe that US English also sometimes uses extra symbols but I think that it is always correct there to force spellings using only the 26 letters of the alphabet.
Francis Glassborow wrote:
> In article <1149673212.669241.62...@j55g2000cwa.googlegroups.com>, kanze
> <ka...@gabi-soft.fr> writes
> >The default locale, if you do nothing, is locale "C". This is also
> >the simplest locale, and usually doesn't support anything beyond
> >basic ASCII -- on my machine, isalpha( 'é' ) returns false in locale
> >"C". This is rarely useful, although it is certainly the simplest,
> >safest and probably least surprising choice. It also works if your
> >code is only used in English speaking environments.
> No, it doesn't, at least not in British English environments. As an
> example the correct spelling of naive is naïve [and to write that in
> my newsreader I had to copy/paste it from elsewhere [or dug into the
> documentation to discover how to type that direct from my keyboard].
> That is just one example of a correct English word that cannot be
> readily typed on a UK keyboard:-)
Well, it's certainly A correct spelling; I'm not sure it's the only one. (The online dictionary I use gives naive as the preferred spelling, with naïve as an alternative. But it's an American dictionary.) Ditto cases like encyclopedia (which started with an ae ligature when I was a kid).
And of course, even in a purely English speaking environment, you might want to output the name of some Czech or Pole -- or a French wine. In the case of the Czech or the Pole, even ISO 8859-1 won't help.
(BTW: I have no trouble getting ï on the US keyboard on my Sparc. In either vim or emacs. The same thing works under Linux or Windows: although my Linux machine has a German keyboard and my Windows machine a French one, I normally install the US drivers and use them by default -- there's no way I'm going to type C++ on a keyboard which doesn't have {, }, [, ], \ or | :-), even when the comments are in French.)
> The symbols used for writing English words are substantially greater
> in number than the 26 letters of the English alphabet + standard
> punctuation symbols.
> I believe that US English also sometimes uses extra symbols but I
> think that it is always correct there to force spellings using only
> the 26 letters of the alphabet.
Well, it's always correct, even in British English, in the same sense that it is "correct" to omit accents on capital letters in French: the rules are more or less officially bent to accommodate what is technically possible; for typewritten text, what a normal mechanical typewriter can support. In the case of French, at least, the bending is only a tolerance, and only considered acceptable when conditioned by such mechanical limitations.
It's also true that you can't generate anything acceptable for a variable width font with just US ASCII. The use of a single " character, instead of different forms for opening and closing quotes, is also a tolerance. More precisely stated, perhaps, straight ASCII works where the only text IO is typewriter-like text in English. You're correct in pointing out that in general, our expectations have increased, and except for simple things like log and configuration files, we generally expect better quality output than that which can be simply achieved with just ASCII characters. Even in American English.
-- James Kanze
I imbued wcout/wofstream with a locale other than the default, say, Chinese. They can now output the whole string, which seems to confirm your analysis.
<CODE>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main(int argc, char *argv[])
{
    string s("Start date or time can’t be later than ending date or time!");
    wstring ws(L"Start date or time can’t be later than ending date or time!");

    ofstream f("f.txt");
    wofstream wf("wf.txt");

    //f.imbue(locale("chinese")); // this line has no impact on ofstream
    f << s << endl;

    // wf.imbue(locale("chinese")); // this line enables wofstream a successful output
    wf << ws << endl;

    //cout.imbue(locale("chinese")); // this line has no impact on cout
    cout << s << endl;
    wcout.imbue(locale("chinese")); // this line enables wcout a successful output
    wcout << ws << endl;

    system("PAUSE");
    return EXIT_SUCCESS;
}
<OUTPUT>
Start date or time can’t be later than ending date or time!
Start date or time canˇt be later than ending date or time!
One thing still puzzles me: while it's reasonable for wcout/wofstream to present the string according to the locale, how come cout/ofstream are able to present the string exactly as it is in the source code, no matter what locale is set?
> > One more question is: how come the stream stops at the
> > character 0x2019, instead of going on with messy characters,
> > which seems more reasonable to me?
> It's hard to be sure, but if the conversion functions return an
> error, filebuf::overflow will return traits::eof(), which will
> cause the basic_ostream function doing the output to set
> failbit. And once failbit has been set, all further operations
> on the stream are no-ops until it has been cleared.