
wcout, wprintf() only print English


Ioannis Vranos

Feb 23, 2008, 5:13:14 AM
Has anyone actually managed to print non-English text by using wcout,
wprintf and the rest of the standard wide-character functions?

Ioannis Vranos

Feb 23, 2008, 5:15:20 AM
Ioannis Vranos wrote:
> Has anyone actually managed to print non-English text by using wcout or
> wprintf and the rest of standard, wide character functions?


For example:

[john@localhost src]$ cat main.cc
#include <iostream>

int main()
{
    using namespace std;

    wcout << L"Δοκιμαστικό μήνυμα\n";
}

[john@localhost src]$ ./foobar-cpp
??????????? ??????

[john@localhost src]$

Rolf Magnus

Feb 23, 2008, 5:33:06 AM
Ioannis Vranos wrote:

> Ioannis Vranos wrote:
>> Has anyone actually managed to print non-English text by using wcout or
>> wprintf and the rest of standard, wide character functions?
>
>
> For example:
>
> [john@localhost src]$ cat main.cc
> #include <iostream>
>
> int main()
> {
> using namespace std;
>
> wcout<< L"Δοκιμαστικό μήνυμα\n";

Are you sure that you stored your source file in the same encoding the
compiler expects as source character set?

Ioannis Vranos

Feb 23, 2008, 6:04:24 AM
Rolf Magnus wrote:
> Ioannis Vranos wrote:
>
>> Ioannis Vranos wrote:
>>> Has anyone actually managed to print non-English text by using wcout or
>>> wprintf and the rest of standard, wide character functions?
>>
>> For example:
>>
>> [john@localhost src]$ cat main.cc
>> #include <iostream>
>>
>> int main()
>> {
>> using namespace std;
>>
>> wcout<< L"Δοκιμαστικό μήνυμα\n";

>
> Are you sure that you stored your source file in the same encoding the
> compiler expects as source character set?
>
>> }
>>
>> [john@localhost src]$ ./foobar-cpp
>> ??????????? ??????
>>
>> [john@localhost src]$


Well, I created the file with the Anjuta editor, with the message being a
Greek one. The Greek message also appears the same when I display the
source file in the console.

I suppose it is saved as UTF-8.


Also the code

#include <iostream>
#include <string>

int main()
{
    using namespace std;

    wstring s;

    wcin >> s;

    wcout << s << endl;
}


displays nothing when I enter greek text.


Should I mess with locales?

Ioannis Vranos

Feb 23, 2008, 6:06:26 AM


This happens both with g++ under Linux and with VC++ 2008 Express under
Windows; the latter saved the source file as Unicode after it detected
non-English text.

Ioannis Vranos

Feb 23, 2008, 6:11:09 AM
Made more precise:

Ioannis Vranos wrote:
>>>> For example:
>>>>
>>>> [john@localhost src]$ cat main.cc
>>>> #include <iostream>
>>>>
>>>> int main()
>>>> {
>>>> using namespace std;
>>>>

>>>> wcout<< L"Δοκιμαστικό μήνυμα\n";


>>>
>>> Are you sure that you stored your source file in the same encoding the
>>> compiler expects as source character set?
>>>
>>>> }
>>>>
>>>> [john@localhost src]$ ./foobar-cpp
>>>> ??????????? ??????
>>>>
>>>> [john@localhost src]$
>>
>>
>> Well I created the file with anjuta editor with the message being a
>> Greek one. The Greek message also appears the same when I display the
>> source file in the console.
>>
>> I suppose it is saved as UTF8.
>>
>>
>> Also the code
>>
>> #include <iostream>
>> #include <string>
>>
>> int main()
>> {
>> using namespace std;
>>
>> wstring s;
>> wcin>> s;
>>
>> wcout<< s<< endl;
>> }
>>
>>

displays the Greek text when I enter it, but outputs nothing. With
English text, the text is displayed both when entered and when output.


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

[john@localhost src]$ ./foobar-cpp
Test
Test
[john@localhost src]$

Jeff Schwab

Feb 23, 2008, 7:15:09 AM

Hmmm... I work almost entirely in English, so this error message is new
to me:

$ make
g++ -ansi -pedantic -Wall main.cc -o main
main.cc: In function 'int main()':
main.cc:4: error: converting to execution character set: Invalid or
incomplete multibyte or wide character
make: *** [main] Error 1

Boris

Feb 23, 2008, 7:49:37 AM
On Sat, 23 Feb 2008 13:11:09 +0200, Ioannis Vranos
<ivr...@nospam.no.spamfreemail.gr> wrote:

> [...]
>>> Also the code
> [...] displays the Greek text when I enter it, but outputs nothing. With
> English text, the text is displayed both when entered and when output.

I no longer remember the details, but the problem has something to do
with codecvt: your wide characters are automatically converted to narrow
characters by wcout. This is something you might not want (and even if you
want it, the conversion might not work automatically the way you expect :).

Try writing to a wstringstream and converting to UTF-8 explicitly (storing
the result e.g. in a string). If your console supports UTF-8 you can print to
cout (otherwise print to a file so you can test the output in an editor).

HTH,
Boris
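
A minimal sketch of the explicit conversion suggested above, assuming each wchar_t holds one complete Unicode code point (true where wchar_t is 32 bits wide, as with GCC on Linux; UTF-16 surrogate pairs, as used on Windows, are not handled). The names append_utf8 and to_utf8 are illustrative helpers, not standard library functions.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 and append the bytes to out.
void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a wide string to UTF-8, treating every wchar_t as a code point
// (assumption: no surrogate pairs in the input).
std::string to_utf8(const std::wstring& wide)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < wide.size(); ++i)
        append_utf8(out, static_cast<unsigned long>(wide[i]));
    return out;
}
```

The result is an ordinary std::string, so std::cout << to_utf8(ws) writes the bytes unchanged to a UTF-8 terminal or to a file.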

Ioannis Vranos

Feb 23, 2008, 8:10:54 AM
Jeff Schwab wrote:
> Ioannis Vranos wrote:
>> Ioannis Vranos wrote:
>>> Has anyone actually managed to print non-English text by using wcout
>>> or wprintf and the rest of standard, wide character functions?
>>
>>
>> For example:
>>
>> [john@localhost src]$ cat main.cc
>> #include <iostream>
>>
>> int main()
>> {
>> using namespace std;
>>
>> wcout<< L"Äïêéìáóôéêü ìÞíõìá\n";

>> }
>>
>> [john@localhost src]$ ./foobar-cpp
>> ??????????? ??????
>>
>> [john@localhost src]$
>
> Hmmm... I work almost entirely in English, so this error message is new
> to me:
>
> $ make
> g++ -ansi -pedantic -Wall main.cc -o main
> main.cc: In function 'int main()':
> main.cc:4: error: converting to execution character set: Invalid or
> incomplete multibyte or wide character
> make: *** [main] Error 1


I tried the same:

[john@localhost src]$ g++ -ansi -pedantic-errors -Wall main.cc -o
foobar-cpp

[john@localhost src]$


Perhaps when you copied and pasted the Greek text, you copied garbage (that
is, your newsgroup reader was not displaying the message in the correct
character set).


So I am reposting the code in this message, which is encoded as Unicode (UTF-8):

Jeff Schwab

Feb 23, 2008, 8:59:36 AM

Thanks, you were correct.

Here's what I thought was "supposed" to be the portable solution:

#include <iostream>
#include <locale>

int main() {
    std::wcout.imbue(std::locale("el_GR.UTF-8"));
    std::wcout << L"Δοκιμαστικό μήνυμα\n";
}

However, my system still shows question marks for this. For whatever
it's worth, here's the (probably incorrect) way that appears to work on
my system:

#include <iostream>
#include <locale>

int main() {
    std::cout.imbue(std::locale(""));
    std::cout << "Δοκιμαστικό μήνυμα\n";
}

Ioannis Vranos

Feb 23, 2008, 11:07:12 AM
Jeff Schwab wrote:
>
>> So, I repost the code in this message which is encoded to Unicode
>> (UTF-8):
>>
>>
>> #include <iostream>
>>
>> int main()
>> {
>> using namespace std;
>>
>> wcout<< L"Δοκιμαστικό μήνυμα\n";
>> }
>
> Thanks, you were correct.
>
> Here's what I thought was "supposed" to be the portable solution:
>
> #include <iostream>
> #include <locale>
>
> int main() {
> std::wcout.imbue(std::locale("el_GR.UTF-8"));
> std::wcout << L"Δοκιμαστικό μήνυμα\n";
> }
>
> However, my system still shows question marks for this. For whatever
> it's worth, here's the (probably incorrect) way that appears to work on
> my system:
>
> #include <iostream>
> #include <locale>
>
> int main() {
> std::cout.imbue(std::locale(""));
> std::cout << "Δοκιμαστικό μήνυμα\n";
> }


"Strangely", the same happens on my Linux box with "gcc version 4.1.2
20070626".

cout prints Greek when the string literal has no L prefix.

The same with wcout prints an empty line.

The same with wcout and L notation prints question marks.


This made me think to use plain cout, and it also works:


#include <iostream>

int main()
{
    std::cout << "Δοκιμαστικό μήνυμα\n";
}

also prints the Greek message.


Seeing this I am assuming char is implemented as unsigned char and this
is working because Greek is provided in the extended ASCII character set
(values 128-255) supported by my system (I have set the regional
settings under GNOME etc). However why does this also work for you?


The code


#include <iostream>
#include <limits>

int main()
{
    using namespace std;

    cout << static_cast<int>( numeric_limits<char>::max() ) << endl;
}

produces in my system:

[john@localhost src]$ ./foobar-cpp
127

[john@localhost src]$


so I am wrong, char is implemented as signed char, and no extended ASCII
takes place.


Strange.
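
Whether plain char is signed changes only how a byte is read as a number, not which bits are stored, so it does not by itself decide whether non-ASCII text survives. A small sketch, with byte_value and plain_char_is_signed as illustrative names, of how to inspect the unsigned value of a byte regardless of that choice:

```cpp
#include <limits>

// Return the value of c as an unsigned byte (0..255 where CHAR_BIT is 8),
// whatever the signedness of plain char.
unsigned int byte_value(char c)
{
    return static_cast<unsigned char>(c);
}

// Report the implementation's choice for plain char.
bool plain_char_is_signed()
{
    return std::numeric_limits<char>::is_signed;
}
```

Assuming the source file and the terminal both use UTF-8, each Greek letter is stored as two such bytes, and cout simply copies them through.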

Ioannis Vranos

Feb 23, 2008, 11:19:57 AM
Based on the MSDN example:


// basic_ios_imbue.cpp
// compile with: /EHsc
#include <iostream>
#include <locale>

int main( )
{
using namespace std;

cout.imbue( locale( "french_france" ) );
double x = 1234567.123456;
cout << x << endl;
}


which doesn't work with my GCC, this works:

#include <iostream>
#include <locale>

int main()
{
    using namespace std;

    cout.imbue( locale( "greek" ) );

    cout << "Δοκιμαστικό\n";
}


This also works:

#include <iostream>
#include <locale>

int main()
{
    using namespace std;

    cout.imbue( locale( "en_US" ) );

    cout << "Δοκιμαστικό\n";
}


Crazy stuff.
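
Locale names other than "C" and "" are implementation-defined, which is one plausible reason a name taken from MSDN is rejected elsewhere. The std::locale constructor throws std::runtime_error for a name it does not recognise, so a name can be probed before it is used. A minimal sketch, with locale_exists as a hypothetical helper name:

```cpp
#include <locale>
#include <stdexcept>

// Return true if the implementation accepts name as a locale name.
bool locale_exists(const char* name)
{
    try {
        std::locale probe(name);   // throws std::runtime_error if unknown
        (void)probe;
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling locale_exists("greek") or locale_exists("french_france") on a given system shows which spellings that system knows; the answers are expected to differ between platforms.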

Ioannis Vranos

Feb 23, 2008, 11:38:41 AM
It looks like GCC has things the wrong way round: cout, cin and string
behave the way wcout, wcin and wstring should, and vice versa! Bug?

#include <iostream>

int main()
{
using namespace std;

wstring ws;

wcin>> ws;

cout<< ws.size()<< endl;
}



[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

0
[john@localhost src]$

#include <iostream>

int main()
{
using namespace std;

string s;

cin>> s;

cout<< s.size()<< endl;
}


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

22
[john@localhost src]$


#include <iostream>

int main()
{
using namespace std;

string s;

cin>> s;

cout<< s<< endl;
}


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

Δοκιμαστικό
[john@localhost src]$

#include <iostream>

int main()
{
using namespace std;

wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

[john@localhost src]$

#include <iostream>

int main()
{
    using namespace std;

    cout << "Δοκιμαστικό-11\n";

    wcout << "Δοκιμαστικό-22\n";

    cout << L"Δοκιμαστικό-33\n";

    wcout << L"Δοκιμαστικό-44\n";
}


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό-11
-22
0x80488c8���������-44
[john@localhost src]$

Conclusion: It appears GCC has the wide character stuff messed up, or I
am missing important knowledge.
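
The size of 22 reported above is consistent with UTF-8 input: the word has 11 letters and each Greek letter takes two bytes in UTF-8, so a std::string counts 22 char elements. A small sketch, assuming the input really is valid UTF-8, that counts code points rather than bytes (count_code_points is an illustrative name):

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

The hexadecimal value printed by the third statement is most likely a pointer: narrow streams of that era have no operator<< for const wchar_t*, so the const void* overload is chosen and the address of the literal is printed.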

Jeff Schwab

Feb 23, 2008, 11:47:52 AM
Ioannis Vranos wrote:
> It looks like GCC has the opposite stuff, cout, cin, string work as
> wcout, wcin, wstring and vice versa! Bug?
...

> Conclusion: It appears GCC has the wide character stuff messed up, or I
> am missing important knowledge.

You and me both. I would be very surprised if this were a GCC bug (I'm
using 4.2.4 pre-release), but I'm guessing somebody here knows a lot
more about this than we do, and is willing to enlighten us. :)

Alf P. Steinbach

Feb 23, 2008, 1:31:09 PM
* Jeff Schwab:

As has been remarked else-thread, by Rolf Magnus, one issue, relevant
for literal strings, is the compiler's translation (or lack of
translation) of the source code text's character set to the execution
character set.

And as has also been remarked else-thread, by Boris, one issue, relevant
for i/o, is that the wide character streams convert to and from narrow
characters. wcout converts to narrow characters, and wcin converts from
narrow characters. They're not wide character streams, they're wide
character converters.

Assuming no issue with translation from source code character set to
execution character set, if you use only the narrow character streams
you avoid most translation. There's still translation of newlines and
possibly other characters (e.g. Ctrl Z in Windows). Thus, using UTF-8
source code and UTF-8 execution environment character set, and (mostly)
non-translating narrow character streams, everything should work swimmingly.

Another reason to avoid the wide character streams is that they're not
supported by the MingW Windows port of g++.

At least, not in the version I have.

And as I understand it, UTF-8 is the usual encoding in the *nix world.

For an interactive Windows program, you can set the console's narrow
character stream translation (to/from UCS2, which is what a console
window uses internally) temporarily to UTF-8 via Windows' console API
functions.


Disclaimer: I've never tried this for greek text + UTF-8 encoding,
because I've not had to deal with that particular issue.

Cheers, & hth.,

- Alf

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
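
The wide-to-narrow conversion described in this post is done by a std::codecvt facet taken from the stream's locale. A minimal sketch that calls that facet directly, so the result of the conversion can be inspected without a terminal in the way; narrow_with is a hypothetical helper name, and what happens to non-ASCII characters depends entirely on the locale passed in:

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Convert wide to a narrow string using the codecvt facet of loc.
// Returns true only if every character was converted.
bool narrow_with(const std::locale& loc, const std::wstring& wide,
                 std::string& narrow)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    std::mbstate_t state = std::mbstate_t();
    // max_length() is the most bytes one wide character can need here.
    narrow.assign(wide.size() * cvt.max_length() + 1, '\0');

    const wchar_t* from_next = 0;
    char* to_next = 0;
    const std::codecvt_base::result r = cvt.out(
        state,
        wide.data(), wide.data() + wide.size(), from_next,
        &narrow[0], &narrow[0] + narrow.size(), to_next);

    // Keep only the bytes that were actually produced.
    narrow.resize(to_next - &narrow[0]);
    return r == std::codecvt_base::ok;
}
```

With std::locale::classic() only characters the "C" locale can represent are expected to convert, which would be consistent with the question marks and empty output reported in this thread.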

Jeff Schwab

Feb 23, 2008, 3:14:48 PM
Alf P. Steinbach wrote:
> * Jeff Schwab:
>> Ioannis Vranos wrote:
>>> It looks like GCC has the opposite stuff, cout, cin, string work as
>>> wcout, wcin, wstring and vice versa! Bug?
>> ...
>>> Conclusion: It appears GCC has the wide character stuff messed up, or
>>> I am missing important knowledge.
>>
>> You and me both. I would be very surprised if this were a GCC bug
>> (I'm using 4.2.4 pre-release), but I'm guessing somebody here knows a
>> lot more about this than we do, and is willing to enlighten us. :)
>
> As has been remarked else-thread, by Rolf Magnus, one issue, relevant
> for literal strings, is the compiler's translation (or lack of
> translation) of the source code text's character set to the execution
> character set.

A good point. I know my source is in UTF-8. I don't know what
influences the execution character set, or how to tweak it.


> And as has also been remarked else-thread, by Boris, one issue, relevant
> for i/o, is that the wide character streams convert to and from narrow
> characters. wcout converts to narrow characters, and wcin converts from
> narrow characters. They're not wide character streams, they're wide
> character converters.

Clear as mud. :)

Ioannis Vranos

Feb 23, 2008, 4:19:57 PM
Alf P. Steinbach wrote:
> * Jeff Schwab:
>> Ioannis Vranos wrote:
>>> It looks like GCC has the opposite stuff, cout, cin, string work as
>>> wcout, wcin, wstring and vice versa! Bug?
>> ...
>>> Conclusion: It appears GCC has the wide character stuff messed up, or
>>> I am missing important knowledge.
>>
>> You and me both. I would be very surprised if this were a GCC bug
>> (I'm using 4.2.4 pre-release), but I'm guessing somebody here knows a
>> lot more about this than we do, and is willing to enlighten us. :)
>
> As has been remarked else-thread, by Rolf Magnus, one issue, relevant
> for literal strings, is the compiler's translation (or lack of
> translation) of the source code text's character set to the execution
> character set.


There is no such issue here: cout prints the Greek literal correctly and
wcout does not. Also, cin and string read and store Greek text correctly,
while wcin and wstring look like they do not work for Greek text input.


> And as has also been remarked else-thread, by Boris, one issue, relevant
> for i/o, is that the wide character streams convert to and from narrow
> characters. wcout converts to narrow characters, and wcin converts from
> narrow characters. They're not wide character streams, they're wide
> character converters.

I am not sure I understand this.

Isn't L"some text" a wide character string literal? Don't wcout, wcin
and wstring provide operator<< and operator>> overloads for wide
characters and wide character strings?


> Assuming no issue with translation from source code character set to
> execution character set, if you use only the narrow character streams
> you avoid most translation.


What do you mean by "narrow character" streams? char streams right?


> There's still translation of newlines and
> possibly other characters (e.g. Ctrl Z in Windows). Thus, using UTF-8
> source code and UTF-8 execution environment character set, and (mostly)
> non-translating narrow character streams, everything should work
> swimmingly.
>
> Another reason to avoid the wide character streams is that they're not
> supported by the MingW Windows port of g++.


This is irrelevant. MINGW's problems are MINGW problems, I am using GCC
under Linux (Scientific Linux 5.1 which is essentially Red Hat
Enterprise Linux 5.1 source code recompiled, like CentOS - give them a try).

Also I have MS Visual C++ 2008 Express installed.


> At least, not in the version I have.
>
> And as I understand it UTF-8 is the usual in the *nix world.
>
> For an interactive Windows program, you can set the console's narrow
> character stream translation (to/from UCS2, which is what a console
> window uses internally) temporarily to UTF-8 via Windows' console API
> functions.
>
>
> Disclaimer: I've never tried this for greek text + UTF-8 encoding,
> because I've not had to deal with that particular issue.


Can you pinpoint where our code is wrong? Essentially the following:
#include <iostream>
#include <string>

int main()
{
    using namespace std;

    wcout << "Give wide character input: ";

    wstring ws;

    wcin >> ws;

    wcout << "You gave: " << ws << endl;
}


It produces:

[john@localhost src]$ ./foobar-cpp
Give wide character input: Δοκιμαστικό
You gave:
[john@localhost src]$

while the code:

#include <iostream>
#include <string>

int main()
{
    using namespace std;

    cout << "Give wide character input: ";

    string s;

    cin >> s;

    cout << "You gave: " << s << endl;
}


produces:

[john@localhost src]$ ./foobar-cpp
Give wide character input: Δοκιμαστικό
You gave: Δοκιμαστικό
[john@localhost src]$

Ioannis Vranos

Feb 23, 2008, 5:27:42 PM
I posted the following to c.l.c., and I think it is useful to post it
here too:


[The current message encoding is set to Unicode (UTF-8) because it
contains Greek]


The following code does not work as expected:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
    char *p = setlocale( LC_ALL, "Greek" );

    wchar_t input[50];

    if (!p)
        printf("NULL returned!\n");

    fgetws(input, 50, stdin);

    wprintf(L"%s\n", input);

    return 0;
}


Under Linux:


[john@localhost src]$ ./foobar-cpp
Test
T

[john@localhost src]$


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

[john@localhost src]$


Under MS Visual C++ 2008 Express:

Test
Test

Press any key to continue . . .


Δοκιμαστικό
??????ε????

Press any key to continue . . .


Am I missing something?

James Kanze

Feb 23, 2008, 6:03:29 PM
On Feb 23, 11:33 am, Rolf Magnus <ramag...@t-online.de> wrote:
> Ioannis Vranos wrote:
> > Ioannis Vranos wrote:
> >> Has anyone actually managed to print non-English text by
> >> using wcout or wprintf and the rest of standard, wide
> >> character functions?

> > For example:

> > [john@localhost src]$ cat main.cc
> > #include <iostream>

> > int main()
> > {
> > using namespace std;

> > wcout<< L"Δοκιμαστικό μήνυμα\n";

> Are you sure that you stored your source file in the same
> encoding the compiler expects as source character set?

Are you sure the compiler even allows anything but US ASCII as
input? The standard makes most of this implementation defined.
(Logically, if you think about it. I wouldn't expect any of my
files to compile without being transcoded on a machine which
uses EBCDIC.)

Before going any further, we have to know 1) how the Greek
characters are encoded (probably UTF-8, since that's what my
editor is configured for, and I'm seeing them correctly), and
2) which compiler he's using, with which options, and what the
compiler documentation says about input file encodings. Most likely,
he'll have to ask in a group for his compiler what it accepts,
and how to make it accept what he's got.

--
James Kanze (GABI Software) email:james...@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

James Kanze

Feb 23, 2008, 6:11:46 PM

> > #include <iostream>

> Thanks, you were correct.

> #include <iostream>
> #include <locale>

> #include <iostream>
> #include <locale>

You're still not telling us a lot of important information.
What is the actual encoding used in the source file, and what
are the bytes actually output? (FWIW: I think g++, and most
other compilers, just pass the bytes through transparently in a
narrow character string. Which means that your second code will
output whatever your editor put in the source file. If you're
using the same encoding everywhere, it will seem to work.)

Note that there isn't really any portable solution, because so
much depends on things the C++ compiler has no control over.
Run the same code in two different xterm, and it can output two
different things, completely; just specify a different font
(option -fn) with a different encoding for one of the xterm.
(And of course, it's pretty much par for the course to see one
thing when you cat to the screen, and something else when you
output the same file to the printer.)

Ioannis Vranos

Feb 23, 2008, 6:20:17 PM
James Kanze wrote:
>
> You're still not telling us a lot of important information.
> What is the actual encoding used in the source file, and what
> are the bytes actually output. (FWIW: I think g++, and most
> other compilers, just pass the bytes through transparently in a
> narrow character string. Which means that your second code will
> output whatever your editor put in the source file. If you're
> using the same encoding everywhere, it will seem to work.)
>
> Note that there isn't really any portable solution, because so
> much depends on things the C++ compiler has no control over.
> Run the same code in two different xterm, and it can output two
> different things, completely; just specify a different font
> (option -fn) with a different encoding for one of the xterm.
> (And of course, it's pretty much par for the course to see one
> thing when you cat to the screen, and something else when you
> output the same file to the printer.)


I posted a C95 question about this in c.l.c. (C95 being a subset of
C++03) and got working C95 code. My last message there:


> Ben Bacarisse wrote:
>
> You need "%ls". This is very important with wprintf since without it
> %s denotes a multi-byte character sequence. printf("%ls\n" input)
> should also work. You need the w version if you want the multi-byte
> conversion of %s or if the format has to be a wchar_t pointer.


Perhaps you may help me understand better. We have the usual char
encoding which is implementation defined (usually ASCII).

wchar_t is wide character encoding, which is the "largest character set
supported by the system", so I suppose Unicode under Linux and Windows.

What exactly is a multi-byte character?

I have to say that I am talking about C95 here, not C99.


>
>> return 0;
>> }
>>
>>
>> Under Linux:
>>
>>
>> [john@localhost src]$ ./foobar-cpp
>> Test
>> T
>> [john@localhost src]$
>>
>>
>> [john@localhost src]$ ./foobar-cpp
>> Δοκιμαστικό
>> �
>> [john@localhost src]$
>

> The above may not be the only problem. In cases like this, you need to
> say what encoding your terminal is using.


You are partly correct on this. My terminal encoding was UTF-8 and I
added Greek (ISO-8859-7). Under the latter, the following code works OK:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
char *p= setlocale( LC_ALL, "Greek" );

wprintf(L"Δοκιμαστικό\n");

return 0;
}

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
[john@localhost src]$


Also the original, fixed according to your suggestion:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
    char *p = setlocale( LC_ALL, "Greek" );

    wchar_t input[50];

    if (!p)
        printf("NULL returned!\n");

    fgetws(input, 50, stdin);

    wprintf(L"%ls", input);

    return 0;
}

works OK too:

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

Δοκιμαστικό
[john@localhost src]$


It works OK under Terminal UTF-8 default encoding too. So "%ls" is what
was really needed.


BTW, how can we define UTF-8 as the locale?


Thanks a lot.
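
The difference between the two conversions can be shown without a terminal. In the standard wide printf family, %ls takes a const wchar_t* argument while plain %s takes a const char* multibyte string, so passing a wchar_t array to %s is an error. That is presumably why the Linux run printed only "T": with a 32-bit little-endian wchar_t, the bytes of L'T' are 'T' followed by zero bytes. A minimal sketch, with format_wide as an illustrative name, that formats into a wide buffer using std::swprintf:

```cpp
#include <cwchar>
#include <string>

// Format a wide string with the %ls conversion into a wide buffer.
// Returns an empty string if the text does not fit or formatting fails.
std::wstring format_wide(const wchar_t* text)
{
    wchar_t buffer[64];
    const int written = std::swprintf(buffer, 64, L"[%ls]", text);
    if (written < 0)
        return std::wstring();
    return std::wstring(buffer, written);
}
```

Visual C++ has traditionally treated %s in the wide functions as a wide string, which would explain why the original code echoed "Test" correctly there; %ls is the spelling that is expected to work on both.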

James Kanze

Feb 23, 2008, 6:21:02 PM
On Feb 23, 5:07 pm, Ioannis Vranos <ivra...@nospam.no.spamfreemail.gr>
wrote:
> Jeff Schwab wrote:

[...]


> > However, my system still shows question marks for this. For
> > whatever it's worth, here's the (probably incorrect) way
> > that appears to work on my system:

> > #include <iostream>
> > #include <locale>

> > int main() {
> > std::cout.imbue(std::locale(""));
> > std::cout << "Δοκιμαστικό μήνυμα\n";
> > }

> "Strangely" these also happen to my Linux box with "gcc
> version 4.1.2 20070626".

> cout prints Greek without the L notation to the string
> literal.

> The same with wcout prints an empty line.

I don't think the problem is so much wcout, as the wide
character literal. The compiler is obliged to interpret the
contents of the literal in some way, and I would guess that it's
not doing this in a way that conforms to the input you've given it.

What does the compiler documentation say about how it processes
characters outside of the basic character set? What happens if
you replace your characters with their UCN, e.g.:

std::wcout << L"\u0394\u03BF..." ;

?

> The same with wcout and L notation prints question marks.

> This made me think to use plain cout, and it also works:

> #include <iostream>

> int main()
> {
> std::cout << "Δοκιμαστικό μήνυμα\n";
> }

> also prints the Greek message.

> Seeing this I am assuming char is implemented as unsigned char
> and this is working because Greek is provided in the extended
> ASCII character set (values 128-255) supported by my system (I
> have set the regional settings under GNOME etc). However why
> does this also work for you?

Most likely, the compiler is just generating code which copies
the characters bit patterns, without ever looking at their
numeric values. So the signedness of char is irrelevant
(here---in other places, it can cause problems).

> The code

> #include <iostream>
> #include <limits>

> int main()
> {
> using namespace std;
> cout<< static_cast<int>( numeric_limits<char>::max() )<< endl;
> }

> produces in my system:

> [john@localhost src]$ ./foobar-cpp
> 127

In other words, plain char is signed. (It usually is, for some
reason.)

> [john@localhost src]$

> so I am wrong, char is implemented as signed char, and no
> extended ASCII takes place.

There's no such thing as "extended ASCII":-). Still, I
regularly used ISO 8859-15 in plain chars, on machines where
plain char is signed. If I look at the numeric value of the char, it's
wrong, but the bits are right, and they get copied through
correctly.

I just have to be careful when I use functions which expect an
int in the range [0...UCHAR_MAX]. (Those in the <cctype>
header, for example.)
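
The usual way to take that care is to convert the char to unsigned char before handing it to a <cctype> function, since a negative value other than EOF is outside the range those functions accept. A minimal sketch, with is_space_byte as an illustrative wrapper name:

```cpp
#include <cctype>

// Safe wrapper: std::isspace requires a value representable as
// unsigned char (or EOF), so convert before calling.
bool is_space_byte(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```

Without the cast, a byte such as 0xCE from a Greek UTF-8 sequence would be passed as a negative int on implementations where plain char is signed.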

Jeff Schwab

Feb 23, 2008, 6:24:56 PM

UTF-8

> and what
> are the bytes actually output.

0x3f eleven times (UTF-8 question mark '?'), followed by one 0x20
(literal space ' '), followed by six more 0x3f.

??????????? ??????

> (FWIW: I think g++, and most
> other compilers, just pass the bytes through transparently in a
> narrow character string. Which means that your second code will
> output whatever your editor put in the source file. If you're
> using the same encoding everywhere, it will seem to work.)

That is probably what is happening.


> Note that there isn't really any portable solution, because so
> much depends on things the C++ compiler has no control over.
> Run the same code in two different xterm, and it can output two
> different things, completely; just specify a different font
> (option -fn) with a different encoding for one of the xterm.
> (And of course, it's pretty much par for the course to see one
> thing when you cat to the screen, and something else when you
> output the same file to the printer.)

Thanks. Well, that's not very satisfying. :-/

James Kanze

Feb 23, 2008, 6:24:25 PM

It wouldn't surprise me if g++ (or any other compiler) had some
bugs in this. It's far from trivial. But for the moment,
nothing you've shown seems particularly surprising to me. (In
fact, I'm sure that there is one bug in g++. Most of what is
involved here is implementation defined, and the standard says
that a conforming implementation must document its choices. I
haven't found any such documentation for g++.)

Ioannis Vranos

Feb 23, 2008, 6:26:18 PM
Reply I posted in c.l.c.:


Ioannis Vranos wrote:
>
> It works OK under Terminal UTF-8 default encoding too. So "%ls" is what
> was really needed.


Actually the code:

#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
char *p= setlocale( LC_ALL, "Greek" );

wprintf(L"Δοκιμαστικό\n");

return 0;
}

works only when I set the Terminal encoding to Greek (ISO-8859-7).

James Kanze

Feb 23, 2008, 6:29:59 PM
On Feb 23, 7:31 pm, "Alf P. Steinbach" <al...@start.no> wrote:
> * Jeff Schwab:

[...]


> Assuming no issue with translation from source code character
> set to execution character set, if you use only the narrow
> character streams you avoid most translation.

In practice, at least with g++. In theory, you *should*
encounter problems in a quality implementation, because the
compiler is supposed to define what it does for input outside
the basic character set. Which may or may not include handling
Greek characters correctly.

> There's still translation of newlines and possibly other
> characters (e.g. Ctrl Z in Windows). Thus, using UTF-8 source
> code and UTF-8 execution environment character set, and
> (mostly) non-translating narrow character streams, everything
> should work swimmingly.

> Another reason to avoid the wide character streams is that
> they're not supported by the MingW Windows port of g++.

> At least, not in the version I have.

I'm not sure what the current status is, but for a very long
time, g++ couldn't handle any locales except "C" and "POSIX".

> And as I understand it UTF-8 is the usual in the *nix world.

Not at all. Most Unix programmers think it should be, however,
so maybe in a couple of decades... (Actually, things are moving
fairly quickly in this direction.)

Ioannis Vranos

Feb 23, 2008, 6:32:18 PM
How can we convert the following C-subset C++ code:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
    char *p = setlocale( LC_ALL, "Greek" );

    wchar_t input[50];

    if (!p)
        printf("NULL returned!\n");

    fgetws(input, 50, stdin);

    wprintf(L"%ls", input);

    return 0;
}

that works, to use the newest and greatest C++ facilities? :-)
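
One possible shape for the iostream version is sketched below, written against stream references so that it can be exercised with in-memory streams. The names echo_line and echo_to_string are illustrative. For real console use it is an assumption, not something established in this thread, that a locale must be selected before the first I/O operation, for example with std::locale::global(std::locale("")), and that std::wcin and std::wcout are then passed in.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Read one line of wide text from in and write it back to out.
// Returns false if no line could be read.
bool echo_line(std::wistream& in, std::wostream& out)
{
    std::wstring line;
    if (!std::getline(in, line))
        return false;
    out << line << L'\n';
    return true;
}

// In-memory demonstration: no terminal or locale conversion involved.
std::wstring echo_to_string(const std::wstring& input)
{
    std::wistringstream in(input);
    std::wostringstream out;
    echo_line(in, out);
    return out.str();
}
```

Keeping the logic independent of std::wcin and std::wcout separates the question of whether the program handles wide text from the question of whether the terminal and locale convert it correctly.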

Jeff Schwab

Feb 23, 2008, 6:41:26 PM
James Kanze wrote:
> On Feb 23, 5:47 pm, Jeff Schwab <j...@schwabcenter.com> wrote:
>> Ioannis Vranos wrote:
>>> It looks like GCC has the opposite stuff, cout, cin, string
>>> work as wcout, wcin, wstring and vice versa! Bug?
>> ...
>>> Conclusion: It appears GCC has the wide character stuff
>>> messed up, or I am missing important knowledge.
>
>> You and me both. I would be very surprised if this were a GCC
>> bug (I'm using 4.2.4 pre-release), but I'm guessing somebody
>> here knows a lot more about this than we do, and is willing to
>> enlighten us. :)
>
> It wouldn't surprise me if g++ (or any other compiler) had some
> bugs in this. It's far from trivial. But for the moment,

I wouldn't, either. I'd just be surprised if a "hello unicode" example
uncovered any that I would recognize.


> nothing you've shown seems particularly surprising to me. (In
> fact, I'm sure that there is one bug in g++. Most of what is
> involved here is implementation defined, and the standard says
> that a conforming implementation must document its choices. I
> haven't found any such documentation for g++.)

http://gcc.gnu.org/onlinedocs/gcc-4.2.3/gcc/Characters-implementation.html

http://gcc.gnu.org/onlinedocs/gcc-4.2.3/cpp/Implementation_002ddefined-behavior.html

http://gcc.gnu.org/onlinedocs/libstdc++/22_locale/howto.html#1

"Currently, CPP requires its input to be ASCII or UTF-8. The execution
character set may be controlled by the user, with the -fexec-charset and
-fwide-exec-charset options."

What specifically needs to be documented?

James Kanze

unread,
Feb 23, 2008, 6:41:36 PM2/23/08
to
On Feb 23, 10:19 pm, Ioannis Vranos

<ivra...@nospam.no.spamfreemail.gr> wrote:
> Alf P. Steinbach wrote:
> > * Jeff Schwab:

[...]


> > As has been remarked else-thread, by Rolf Magnus, one issue,
> > relevant for literal strings, is the compiler's translation
> > (or lack of translation) of the source code text's character
> > set to the execution character set.

> There isn't such issue here, cout prints Greek literal
> correctly and wcout not.

That's just because cout and narrow string literals are passing
your bytes through literally. Neither is doing anything with
them.

> Also cin and string read and store Greek text correctly while
> wcin and wstring look like they do not work for Greek text
> input.

Using which locale? For input in what encoding?

> > And as has also been remarked else-thread, by Boris, one
> > issue, relevant for i/o, is that the wide character streams
> > convert to and from narrow characters. wcout converts to
> > narrow characters, and wcin converts from narrow characters.
> > They're not wide character streams, they're wide character
> > converters.

> I am not sure I understand this.

> Isn't L"some text" a wide character string literal?

According to the language. But the characters between the "..."
are still encoded in some narrow character encoding, which the
compiler has to translate into some wide character encoding.

Which narrow character encoding, and which wide character
encoding, is anybody's guess. The standard says that it's
"implementation defined", which means that the implementation
has to document its choices. Good luck finding such
documentation (for just about any compiler).

> Don't wcout, wcin and wstring provide operator<< and
> operator>> overloads for wide characters and wide character
> strings?

Yes, but all I/O is actually byte oriented, so they do code
translations on the fly, according to the imbued locale.
(The last time I checked, in g++, you could imbue any locale
installed on the system, and it would still act as if it were in
locale "C". But that was a very, very long time ago.)
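A rough sketch of that translation step, for anyone who wants to poke at it in isolation. The helper name narrow_with is made up for illustration; it calls the locale's codecvt<wchar_t, char, mbstate_t> facet directly, which is the facet a basic_filebuf<wchar_t> consults when it writes. What the facet does with non-ASCII input depends entirely on the locale passed in, so only the plain ASCII case is assumed to be predictable here.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Converts a wide string to bytes with the codecvt facet of `loc`.
// Returns an empty string if the facet reports anything other than success.
std::string narrow_with(const std::locale& loc, const std::wstring& wide)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& facet = std::use_facet<Facet>(loc);

    std::mbstate_t state = std::mbstate_t();
    // Room for the worst case the facet announces, plus one spare byte.
    std::string bytes(wide.size() * facet.max_length() + 1, '\0');

    const wchar_t* from_next = 0;
    char* to_next = 0;
    std::codecvt_base::result r = facet.out(
        state, wide.data(), wide.data() + wide.size(), from_next,
        &bytes[0], &bytes[0] + bytes.size(), to_next);

    if (r != std::codecvt_base::ok)
        return std::string();
    bytes.resize(static_cast<std::string::size_type>(to_next - &bytes[0]));
    return bytes;
}
```

Calling narrow_with(std::locale::classic(), L"abc") should hand back "abc"; trying it with a Greek string and different installed locales is a quick way to see which ones actually convert.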

> > Assuming no issue with translation from source code
> > character set to execution character set, if you use only
> > the narrow character streams you avoid most translation.

> What do you mean by "narrow character" streams? char streams
> right?

Yes.

He should have added that to be sure there's no code
translation, you have to imbue the "C" locale.

> > There's still translation of newlines and possibly other
> > characters (e.g. Ctrl Z in Windows). Thus, using UTF-8
> > source code and UTF-8 execution environment character set,
> > and (mostly) non-translating narrow character streams,
> > everything should work swimmingly.

> > Another reason to avoid the wide character streams is that
> > they're not supported by the MingW Windows port of g++.

> This is irrelevant. MINGW's problems are MINGW problems, I am
> using GCC under Linux (Scientific Linux 5.1 which is
> essentially Red Hat Enterprise Linux 5.1 source code
> recompiled, like CentOS - give them a try).

> Also I have MS Visual C++ 2008 Express installed.

Under Linux ! :-)

> > At least, not in the version I have.

> > And as I understand it UTF-8 is the usual in the *nix world.

> > For an interactive Windows program, you can set the
> > console's narrow character stream translation (to/from UCS2,
> > which is what a console window uses internally) temporarily
> > to UTF-8 via Windows' console API functions.

> > Disclaimer: I've never tried this for greek text + UTF-8
> > encoding, because I've not had to deal with that particular
> > issue.

> Can you pinpoint where our code is wrong? Essentially the following:
> #include <iostream>
> #include <string>

> int main()
> {
> using namespace std;

> wcout<< "Give wide character input: ";
> wstring ws;
> wcin>> ws;
> wcout<< "You gave: "<< ws << endl;
> }

> It produces:

> [john@localhost src]$ ./foobar-cpp
> Give wide character input: Δοκιμαστικό
> You gave:
> [john@localhost src]$

To start with, you didn't imbue a locale which supports
characters outside of the basic character set.

> while the code:
>
> #include <iostream>
> #include <string>
>
> int main()
> {
> using namespace std;
> cout<< "Give wide character input: ";

> string s;
> cin>> s;
> cout<< "You gave: "<< s << endl;
> }

> produces:

> [john@localhost src]$ ./foobar-cpp
> Give wide character input: Δοκιμαστικό
> You gave: Δοκιμαστικό
> [john@localhost src]$

Formally, the code has undefined behavior:-). Practically,
you're just shuffling bytes, so it "seems" to work.
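A small sketch of what "shuffling bytes" means in practice, assuming the terminal delivers UTF-8 (the function name round_trip is invented for this example). A narrow stream hands back exactly the bytes it was given, and size() counts bytes rather than characters.

```cpp
#include <sstream>
#include <string>

// Reads one whitespace-delimited token through a narrow stream,
// the same way `cin >> s` would, and returns it untouched.
std::string round_trip(const std::string& bytes)
{
    std::istringstream in(bytes);
    std::string s;
    in >> s;
    return s;
}
```

With the two UTF-8 bytes of a capital delta ("\xCE\x94") as input, the result compares equal to the input and its size() is 2, even though a reader sees one character.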

Ioannis Vranos

unread,
Feb 23, 2008, 7:00:19 PM2/23/08
to
James Kanze wrote:
>
>> Also I have MS Visual C++ 2008 Express installed.
>
> Under Linux ! :-)


Linux::VMWare::Windows::VC++2008 Express.

Ioannis Vranos

unread,
Feb 23, 2008, 8:51:07 PM2/23/08
to
Ioannis Vranos wrote:
> How can we convert the C subset C++ code:
>
>
> #include <wchar.h>
> #include <locale.h>
> #include <stdio.h>
> #include <stddef.h>
>
> int main()
> {
> char *p= setlocale( LC_ALL, "greek" );

>
> wchar_t input[50];
>
> if (!p)
> printf("NULL returned!\n");
>
> fgetws(input, 50, stdin);
>
> wprintf(L"%ls", input);
>
> return 0;
> }
>
> that works, to use the newest and greatest C++ facilities? :-)


The next best thing after this, is to use the C-subset setlocale with
wcin, wcout, wstring and stuff, and it works indeed:


#include <iostream>
#include <clocale>
#include <string>

int main()
{
using namespace std;

char *p= setlocale( LC_ALL, "greek" );

if (!p)
cerr<< "NULL returned!\n";

wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
Δοκιμαστικό
[john@localhost src]$

The following works out of the box too:

#include <iostream>
#include <clocale>

int main()
{
using namespace std;

char *p= setlocale( LC_ALL, "greek" );

wcout<< L"Δοκιμαστικό\n";
}


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
[john@localhost src]$

Now how can we move from the setlocale() to the newer C++ facilities?

Rolf Magnus

unread,
Feb 24, 2008, 6:46:46 AM2/24/08
to
James Kanze wrote:

> On Feb 23, 11:33 am, Rolf Magnus <ramag...@t-online.de> wrote:
>> Ioannis Vranos wrote:
>> > Ioannis Vranos wrote:
>> >> Has anyone actually managed to print non-English text by
>> >> using wcout or wprintf and the rest of standard, wide
>> >> character functions?
>
>> > For example:
>
>> > [john@localhost src]$ cat main.cc
>> > #include <iostream>
>
>> > int main()
>> > {
>> > using namespace std;
>
>> > wcout<< L"Δοκιμαστικό μήνυμα\n";
>
>> Are you sure that you stored your source file in the same
>> encoding the compiler expects as source character set?
>
> Are you sure the compiler even allows anything but US ASCII as
> input?

I don't know, but if it doesn't, the file was not stored in the encoding
that the compiler expected ;-)
The OP could use the \u notation to specify his wide characters.

> Before going any further, we have to know 1) how the Greek
> characters are encoded. (Probably UTF-8, since that's what my
> editor is configured for, and I'm seeing them correctly.)

Content-Type: text/plain; charset=ISO-8859-7; format=flowed

But the encoding used in the posting need not be the same as the encoding in
the original source file.

> And which compiler he's using, which options, and what the compiler
> documentation says about input file encodings. Most likely,
> he'll have to ask in a group for his compiler what it accepts,
> and how to make it accept what he's got.

Indeed.

James Kanze

unread,
Feb 24, 2008, 8:50:03 AM2/24/08
to
On Feb 24, 12:46 pm, Rolf Magnus <ramag...@t-online.de> wrote:
> James Kanze wrote:
> > On Feb 23, 11:33 am, Rolf Magnus <ramag...@t-online.de> wrote:
> >> Ioannis Vranos wrote:
> >> > Ioannis Vranos wrote:
> >> >> Has anyone actually managed to print non-English text by
> >> >> using wcout or wprintf and the rest of standard, wide
> >> >> character functions?

> >> > For example:

> >> > [john@localhost src]$ cat main.cc
> >> > #include <iostream>

> >> > int main()
> >> > {
> >> > using namespace std;
> >> > wcout<< L"Δοκιμαστικό μήνυμα\n";

> >> Are you sure that you stored your source file in the same
> >> encoding the compiler expects as source character set?

> > Are you sure the compiler even allows anything but US ASCII as
> > input?

> I don't know, but if it doesn't, the file was not stored in
> the encoding that the compiler expected ;-)

Yep.

I guess my real point is that you've got to read the compiler
documentation, to find out what it supports. Supposing you can
find it.

> The OP could use the \u notation to specify his wide characters.

In theory. Can you really imagine maintaining code in which
strings like his are all written using UCN's?

It is, of course, the only halfway portable approach. But IMHO,
it means that he'll need some sort of pre-processor which
converts his characters to UCN's. (It shouldn't be that hard to
write---something like ten or twenty lines of C++. But of
course, in order to write it, you have to know the encoding
you're using.)
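For what it's worth, a sketch of the core of such a converter, under the assumption that the source text is well-formed UTF-8. The function name to_ucn is hypothetical, and a real tool would also have to restrict itself to string literals and validate its input.

```cpp
#include <cstdio>
#include <string>

// Replaces every non-ASCII code point in a UTF-8 string with a \uXXXX or
// \UXXXXXXXX universal character name. Assumes well-formed UTF-8 input.
std::string to_ucn(const std::string& utf8)
{
    std::string out;
    for (std::string::size_type i = 0; i < utf8.size(); ) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {              // plain ASCII: copy unchanged
            out += utf8[i++];
            continue;
        }
        // Number of trailing bytes, then the payload bits of the lead byte.
        int trailing = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
        unsigned long cp = lead & (0x3F >> trailing);
        ++i;
        for (int k = 0; k < trailing && i < utf8.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);

        char buf[16];
        std::snprintf(buf, sizeof buf,
                      cp > 0xFFFF ? "\\U%08lX" : "\\u%04lX", cp);
        out += buf;
    }
    return out;
}
```

Feeding it the bytes 'a', 0xCE, 0x94 yields the text a\u0394, which any conforming compiler should accept inside a wide string literal regardless of the source file encoding.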

> > Before going any further, we have to know 1) how the Greek
> > characters are encoded. (Probably UTF-8, since that's what my
> > editor is configured for, and I'm seeing them correctly.)

> Content-Type: text/plain; charset=ISO-8859-7; format=flowed

> But the encoding used in the posting need not be the same as
> the encoding in the original source file.

And the encoding used in the posting need not be the encoding
which I get when I copy/paste it in my environment:-). I'd
completely forgotten about that aspect. Especially, as I'm
using Google to read news, and have configured my browser to
tell the server that UTF-8 is the preferred encoding. I
wouldn't be surprised if Google were translating it (since it
sends many postings in the same HTML page, and so has to ensure
that they are all in the same encoding), but even if it weren't,
the fonts I'm using here are UTF-8, so the browser will convert
to UTF-8 to display, and probably for copy/paste as well.

> > And which compiler he's using, which options, and what the
> > compiler documentation says about input file encodings.
> > Most likely, he'll have to ask in a group for his compiler
> > what it accepts, and how to make it accept what he's got.

> Indeed.

Yes. No matter how you look at it, the problem is NOT trivial.

James Kanze

unread,
Feb 24, 2008, 8:51:32 AM2/24/08
to
On Feb 24, 1:00 am, Ioannis Vranos <ivra...@nospam.no.spamfreemail.gr>
wrote:
> James Kanze wrote:

> > Under Linux ! :-)

Thanks. I'll give it a try myself. (Of course, the executables
it generates will also require VMWare to run, but it will allow
at least verifying that my code compiles with VC++ before trying
to port it to Windows.)

Boris

unread,
Feb 24, 2008, 9:19:37 AM2/24/08
to
On Sat, 23 Feb 2008 23:19:57 +0200, Ioannis Vranos
<ivr...@nospam.no.spamfreemail.gr> wrote:

> Alf P. Steinbach wrote:
> [...]


>> And as has also been remarked else-thread, by Boris, one issue,
>> relevant for i/o, is that the wide character streams convert to and
>> from narrow characters. wcout converts to narrow characters, and wcin
>> converts from narrow characters. They're not wide character streams,
>> they're wide character converters.
>
> I am not sure I understand this.
>
> Isn't L"some text" a wide character string literal? Don't wcout, wcin
> and wstring provide operator<< and operator>> overloads for wide
> characters and wide character strings?

wcout and wcin represent external devices. When you read from or write to
external devices the facet codecvt is used. The C++ standard says there
are only two: codecvt<char, char, mbstate_t> which doesn't do anything and
codecvt<wchar_t, char, mbstate_t> which converts from wchar_t to char. As
you see there is an implicit conversion to char even if you actually use
wchar_t in your program. You don't know either how the conversion of
codecvt<wchar_t, char, mbstate_t> works (there is no guarantee that it's
UTF-16 to UTF-8 for example). Either you convert to UTF-8 explicitly and
write to cout or you define or use a codecvt from a third-party library
(like http://www.boost.org/libs/serialization/doc/codecvt.html).

Boris

> [...]
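A minimal sketch of the "convert to UTF-8 explicitly and write to cout" option. It assumes each wchar_t holds one Unicode code point (true where wchar_t is UTF-32, as with glibc; it would not handle UTF-16 surrogate pairs), and the function name to_utf8 is made up for the example.

```cpp
#include <string>

// Encodes a wide string as UTF-8, one code point per wchar_t.
std::string to_utf8(const std::wstring& wide)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < wide.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(wide[i]);
        if (cp < 0x80) {                       // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result can then go through the narrow stream, e.g. std::cout << to_utf8(ws), which sidesteps whatever codecvt<wchar_t, char, mbstate_t> happens to do on the platform.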

Ioannis Vranos

unread,
Feb 24, 2008, 9:46:26 AM2/24/08
to


Instead of messing with these details, perhaps we should accept that the
C subset setlocale() function defined in <clocale> is simpler (and thus
better)?


The following code works:


#include <iostream>
#include <clocale>
#include <string>

#include <cstdlib>


int main()
{
using namespace std;

if (!setlocale( LC_ALL, "greek" ))
{
cerr<< "NULL returned!\n";

return EXIT_FAILURE;
}

wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}

Ioannis Vranos

unread,
Feb 24, 2008, 9:52:52 AM2/24/08
to
I filed a bug in GCC Bugzilla:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35353


Can anyone explain to me how I should apply the suggested solution
"sync_with_stdio (false)", and the fstream suggested solution to the
following failing code?

#include <iostream>
#include <locale>
#include <string>

int main()
{
using namespace std;

wcout.imbue(locale("greek"));

wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}

James Kanze

unread,
Feb 24, 2008, 10:05:07 AM2/24/08
to
On Feb 24, 12:20 am, Ioannis Vranos

<ivra...@nospam.no.spamfreemail.gr> wrote:
> James Kanze wrote:

> > You're still not telling us a lot of important information.
> > What is the actual encoding used in the source file, and what
> > are the bytes actually output. (FWIW: I think g++, and most
> > other compilers, just pass the bytes through transparently in a
> > narrow character string. Which means that your second code will
> > output whatever your editor put in the source file. If you're
> > using the same encoding everywhere, it will seem to work.)

> > Note that there isn't really any portable solution, because so
> > much depends on things the C++ compiler has no control over.
> > Run the same code in two different xterm, and it can output two
> > different things, completely; just specify a different font
> > (option -fn) with a different encoding for one of the xterm.
> > (And of course, it's pretty much par for the course to see one
> > thing when you cat to the screen, and something else when you
> > output the same file to the printer.)

> I posted a C95 question in c.l.c., about this (which is a subset of
> C++03) and I got a C95 working code. My last message there:

> > Ben Bacarisse wrote:

> > You need "%ls". This is very important with wprintf since without it
> > %s denotes a multi-byte character sequence. printf("%ls\n", input)
> > should also work. You need the w version if you want the multi-byte
> > conversion of %s or if the format has to be a wchar_t pointer.

I'd forgotten about that aspect. It's been many, many years
since I last used printf et al. But yes, you'll definitely need
a modifier in any printf specifier.
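A short sketch of the %ls conversion, using swprintf so the result can be inspected instead of printed. The helper name format_wide and the 64-element buffer are arbitrary choices for illustration.

```cpp
#include <cwchar>
#include <string>

// Formats a wide string with the %ls conversion into a std::wstring.
// Returns an empty string if swprintf reports an error (including overflow).
std::wstring format_wide(const wchar_t* text)
{
    wchar_t buf[64];
    int n = std::swprintf(buf, sizeof buf / sizeof buf[0], L"[%ls]", text);
    if (n < 0)
        return std::wstring();
    return std::wstring(buf, static_cast<std::wstring::size_type>(n));
}
```

With %ls the argument is taken as a wchar_t string; a bare %s in a wide format string would instead expect a char string and convert it as a multibyte sequence.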

> Perhaps you may help me understand better.

Well, the main thing you have to understand is that there are
many different players in game, and that each is doing more or
less what it wants, without considering what the others are
doing.

> We have the usual char encoding which is implementation
> defined (usually ASCII).

The "usual char encoding" for what? One of the problems is
that different tools have different ideas as to what the "usual
char encoding" should be.

Unless you have to deal with mainframes (where EBCDIC still
rules), you can probably count on whatever encoding is being
used for narrow characters to understand ASCII as a subset
(although I'm not at all sure that this is true for the Asian
languages).

> wchar_t is wide character encoding, which is the "largest
> character set supported by the system", so I suppose Unicode
> under Linux and Windows.

wchar_t is implementation defined, and can be just about
anything. On the systems I know, it's UTF-16 for Windows and
AIX, UTF-32 (I think) under Linux, and some pre-Unicode 32 bit
encoding under Solaris. Except that all it really is is a 16 or
32 bit integral type. (On the usual systems. The standard
doesn't make any requirements, and an implementation which
typedef's it to char is conformant.) How the implementation
interprets it (the encoding) may depend on the locale (and I
think recent versions of Solaris have locales which interpret it
as UTF-32, rather than the pre-Unicode encoding).
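Since the width varies by platform, a quick way to see what a given implementation provides is to ask it. This is only a sketch (both function names are invented), and it says nothing about which encoding the implementation actually stores in a wchar_t.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on the current implementation.
std::size_t wchar_bits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if a single wchar_t can represent every Unicode code point
// numerically (U+0000..U+10FFFF); false for 16-bit wchar_t.
bool wchar_holds_any_code_point()
{
    return WCHAR_MAX >= 0x10FFFF;
}
```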

> What exactly is a multi-byte character?

A character which requires several bytes for its encoding.
Very, very succinctly (Haralambous takes about 60 pages to cover
the issues, so I've obviously got to leave something out):

A character is represented by one or more code points.
Probably, all of the characters we're concerned with here can be
represented by a single code point in Unicode, but that's not
always true. And even characters that can be represented by a
single code point (e.g. an o with a circumflex accent) may be
represented by more than one code point (e.g. latin small letter
O, followed by combining accent circumflex), and will be
represented thusly in some canonical representations. A code
point is a numeric value, e.g. 0x0065 (Latin small letter E, in
Unicode) or 0x0394 (Greek capital letter Delta, in Unicode).
Which leaves open how the numeric value is represented. Unicode
code points require at least 21 bits in order to be represented
numerically, but in fact, Unicode defines a certain number of
"transformation formats", specifying how the code points are to
be formatted. The most frequent are UTF-32 (with 32 bits per
element, and one element per code point, always), UTF-16 (BE or
LE), with 16 bits per element, and one or two elements per code
point (but if all you're concerned with is the Latin and the
Greek alphabets, you can consider that it is always one element
per code point as well), and UTF-8, with 8 bit elements, and one
to four elements per code point.
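The element counts above can be written down directly. A small sketch, with function names chosen for this example, valid for code points in the Unicode range 0 to 0x10FFFF:

```cpp
// Number of 8-bit elements UTF-8 needs for a code point.
int utf8_length(unsigned long cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Number of 16-bit elements UTF-16 needs: two (a surrogate pair)
// for anything above U+FFFF, otherwise one.
int utf16_length(unsigned long cp)
{
    return cp < 0x10000 ? 1 : 2;
}
```

For the Greek capital delta (0x0394) this gives two UTF-8 bytes and one UTF-16 element, which is why Greek text doubles in byte count under UTF-8 but not in 16-bit units.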

In all cases of Unicode where there can be more than one element
per code point, the encoding format is defined in such a way
that you can always tell from a single element whether it is a
complete code point, the first element of a multiple element
code point, or a following element of a multiple element code
point. Thus, in UTF-8, byte values 0-0x7F are single element
code points (corresponding in fact to US ASCII), byte values
0x80-0xBF can only be a trailing byte in a multibyte code point,
0xC2-0xF4 can only be the first byte of a multibyte code point,
and values 0xC0, 0xC1, 0xF5-0xFF never occur. (The original UTF-8
encoding format was capable of handling numeric values up to
0x7FFFFFFF, using first bytes up to 0xFD; Unicode now stops at
0x10FFFF, so no first byte above 0xF4 is needed.)

The important point, of course, being that a single code point
may require more than one byte.
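A sketch of that self-synchronizing property. It assumes the current UTF-8 definition (RFC 3629), under which first bytes run from 0xC2 to 0xF4; the enum and function names are made up for illustration.

```cpp
#include <cstddef>
#include <string>

enum Utf8ByteKind { single_byte, lead_byte, trailing_byte, invalid_byte };

// Classifies one byte purely from its value, with no surrounding context.
Utf8ByteKind classify(unsigned char b)
{
    if (b < 0x80) return single_byte;     // 0x00-0x7F: ASCII
    if (b < 0xC0) return trailing_byte;   // 0x80-0xBF
    if (b < 0xC2) return invalid_byte;    // 0xC0, 0xC1: never valid
    if (b < 0xF5) return lead_byte;       // 0xC2-0xF4
    return invalid_byte;                  // 0xF5-0xFF
}

// Backs up from index i (which must be < s.size()) to the first byte
// of the code point that contains it.
std::size_t resync(const std::string& s, std::size_t i)
{
    while (i > 0 && classify(static_cast<unsigned char>(s[i])) == trailing_byte)
        --i;
    return i;
}
```

Landing on the second byte of a two-byte character and calling resync moves the index back by one, which is exactly what older multibyte encodings could not guarantee.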

Historically, earlier encodings didn't make such a rigorous
distinction between characters and code points, and tended to
define code points directly in terms of the encoding format,
rather than as a numeric value. Also, most of them didn't have
the characteristic that you could tell immediately from the
value of a byte whether it was a first byte or not; in general,
if you just indexed into a string at any arbitrary byte index,
you had no way of "resynchronizing", i.e. finding the nearest
character boundary. Some of the earlier encodings also depended
on stream state, using codes for shift in and shift out to
specify that the numeric values which followed (until the next
shift in or shift out code) were e.g. in the Greek alphabet,
rather than in the Latin one. (Some early data transmission
codes were only five bits, using shift in and shift out to
change from letters to digits/punctuation and vice versa---and
only supporting one case of letters.)

> I have to say that I am talking about C95 here, not C99.

> >> return 0;
> >> }

> >> Under Linux:

> >> [john@localhost src]$ ./foobar-cpp
> >> Test
> >> T
> >> [john@localhost src]$

> >> [john@localhost src]$ ./foobar-cpp
> >> Δοκιμαστικό
> >> �
> >> [john@localhost src]$

> > The above may not be the only problem. In cases like this,
> > you need to say what encoding your terminal is using.

> You are somewhat correct on this. My terminal encoding was
> UTF-8 and I added Greek(ISO-8859-7).

In general: all a program written in C++ can do is output bytes,
which have some numeric value. We suppose a particular
encoding, etc. in the program, but there's no guarantee that
whoever later reads those bytes supposes the same thing, and
there's not much C++ can do about it.

(Since you're under Linux, try starting an xterm with a font
using UTF-8, set the locales correctly for it, and create a file
with Greek characters in the name. Then start a different xterm
with a font using ISO-8859-7, set the locales for that, and do
an ls on the directory where you created the file. As you can
see, even without any C++, there can be problems. And there's
nothing C++ can do about it.)

[...]


> BTW, how can we define UTF-8 as the locale?

It depends on the implementation, but the Unix conventions
prescribe something along the lines of
<language>[_<country>][.<encoding>], where the language is the 2
letter language code, as per ISO 639-1 (in lower case), the
country is the 2 letter country code, as per ISO 3166, in upper
case, and the encoding is something or other. With the optional
parts defaulting to some system defined value if they're not
specified. For historical reasons, most implementations also
support additional names, like "Greek". And of course,
depending on the machine, any given locale may or may not be
installed---typically, if you do an ls of either
/usr/share/locale or /usr/lib/locale, you'll get a list of
supported locales for the machine in question. (On the version
of Linux I'm running here, UTF-8 is the default, and I can't see
it in the locale names. IIRC from the Solaris machine at work,
however, the UTF-8 locales end in .utf8. Also note that there
may be some additional files in this directory.)
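To make the naming convention concrete, here is a sketch that splits a name of that shape into its parts. The struct and function names are invented for the example, and it deliberately ignores the @modifier suffix that some systems append.

```cpp
#include <string>

struct LocaleName {
    std::string language;  // e.g. "el"
    std::string country;   // e.g. "GR", empty if absent
    std::string encoding;  // e.g. "utf8", empty if absent
};

// Splits a name of the form <language>[_<country>][.<encoding>].
LocaleName parse_locale_name(const std::string& name)
{
    LocaleName result;
    std::string::size_type dot = name.find('.');
    std::string head = name.substr(0, dot);          // everything before '.'
    if (dot != std::string::npos)
        result.encoding = name.substr(dot + 1);

    std::string::size_type underscore = head.find('_');
    result.language = head.substr(0, underscore);
    if (underscore != std::string::npos)
        result.country = head.substr(underscore + 1);
    return result;
}
```

Given "el_GR.utf8" it yields language "el", country "GR" and encoding "utf8"; given plain "el_GR" the encoding comes back empty, meaning the system default applies.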

Boris

unread,
Feb 24, 2008, 10:41:13 AM2/24/08
to
On Sun, 24 Feb 2008 16:46:26 +0200, Ioannis Vranos
<ivr...@nospam.no.spamfreemail.gr> wrote:

> [...]Instead of messing with these details, perhaps we should accept

> that the C subset setlocale() function defined in <clocale> is simpler
> (and thus better)?
>
>
> The following code works:
>
>
> #include <iostream>
> #include <clocale>
> #include <string>
> #include <cstdlib>
>
>
> int main()
> {
> using namespace std;
>
> if (!setlocale( LC_ALL, "greek" ))
> {
> cerr<< "NULL returned!\n";
>
> return EXIT_FAILURE;
> }
>
>
> wstring ws;
>
> wcin>> ws;
>
> wcout<< ws<< endl;
> }

If the locale name "greek" means an eight-bit character set is used you
don't need to use wstring, wcin and wcout at all? What character set do
you actually plan to use in your program?

Boris

Ioannis Vranos

unread,
Feb 24, 2008, 11:15:29 AM2/24/08
to


I am only experimenting with the wide character support. Is there any
possibility that the "greek" locale is only an 8-bit one? And if it is, how
can it also display English (Latin characters)? By using something like
extended ASCII (0-127 Latin characters and 128-255 Greek characters)?


"locale -a" also displays the following greek encodings:


el_GR
el_GR.iso88597
el_GR.utf8

The following also works:


#include <iostream>
#include <clocale>
#include <string>
#include <cstdlib>

int main()
{
using namespace std;

if (!setlocale( LC_ALL, "el_GR.utf8" ))


{
cerr<< "NULL returned!\n";

return EXIT_FAILURE;
}

wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

Δοκιμαστικό
[john@localhost src]$


but the following doesn't:

#include <iostream>
#include <locale>
#include <string>
#include <cstdlib>


int main()
{
using namespace std;

wcout.imbue(locale("el_GR.utf8"));

wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

[john@localhost src]$

Ioannis Vranos

unread,
Feb 24, 2008, 11:32:35 AM2/24/08
to
I managed to make it work under the newer C++ facilities, after having a
look in ISO C++ example code:


The following programs work:


#include <iostream>
#include <locale>
#include <string>

int main()
{
using namespace std;

locale::global(locale("greek"));

wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

Δοκιμαστικό
[john@localhost src]$


along with el_GR, el_GR.iso88597, el_GR.utf8. For example the following
code also works:


#include <iostream>
#include <locale>
#include <string>


int main()
{
using namespace std;

locale::global(locale("el_GR.utf8"));

wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

Δοκιμαστικό
[john@localhost src]$

Jeff Schwab

unread,
Feb 24, 2008, 11:36:07 AM2/24/08
to
Ioannis Vranos wrote:
> I filed a bug in GCC Bugzilla:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35353

Ahh, sanity from the Gnu. :)


> Can anyone explain to me how I should apply the suggested solution
> "sync_with_stdio (false)", and the fstream suggested solution to the
> following failing code?

fstream has nothing to do with it. The Gnu developer just thought you
might have used sync_with_stdio before with file streams. What he means
is that before you do any I/O, you should make a call like this:

std::ios_base::sync_with_stdio(false);

The following program, as far as I can tell, works fine on my system.

#include <iostream>
#include <locale>

int main()
{
std::ios_base::sync_with_stdio(false);

std::wcout.imbue(std::locale("el_GR.UTF-8"));

std::wcout<< L"Δοκιμαστικό μήνυμα\n";
std::wcout.flush();
}

Ioannis Vranos

unread,
Feb 24, 2008, 11:55:55 AM2/24/08
to


Yes it works OK here too. How does desynchronizing with stdio make the
program work?

Ioannis Vranos

unread,
Feb 24, 2008, 11:58:20 AM2/24/08
to
Correction:


Ioannis Vranos wrote:
> I managed to make it work under the newer C++ facilities, after having a
> look in ISO C++ example code:
>
>
> The following programs work:
>
>
> #include <iostream>
> #include <locale>
> #include <string>
>
>
> int main()
> {
> using namespace std;
>
> locale::global(locale("greek"));
>
> wstring ws;
>
> wcin>> ws;
>
> wcout<< ws<< endl;
> }


Although the above works, I think the correct version is the following:


#include <iostream>
#include <locale>
#include <string>


int main()
{
using namespace std;

locale::global(locale("greek"));

wcin.imbue(locale());
wcout.imbue(locale());

wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}

Jeff Schwab

unread,
Feb 24, 2008, 12:30:59 PM2/24/08
to

No idea. From your other post, the code appears to work without the
sync call.

sync_with_stdio(false) means that the C++ I/O streams no longer have to
coordinate their actions with the C I/O streams: stdin, stdout, and
stderr. The Gnu developer seemed to imply that the C I/O streams had
some fundamental limitation that prevented them from supporting multi-byte
text. His solution was to "untie" wcout from stdout.
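One practical wrinkle when trying this: std::locale's constructor throws std::runtime_error if the named locale is not installed, so it may be worth guarding the imbue call. A sketch, with the helper name imbue_wcout invented for the example:

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>

// Tries to imbue wcout with the named locale. Returns false, leaving
// wcout untouched, if the locale cannot be constructed on this system.
bool imbue_wcout(const char* name)
{
    try {
        std::wcout.imbue(std::locale(name));
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

The intended order, following the advice above, would be to call std::ios_base::sync_with_stdio(false) before any I/O, then imbue_wcout("el_GR.UTF-8"), and only then write to wcout. Whether that locale name exists is system dependent; "locale -a" lists what is installed.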

Boris

unread,
Feb 24, 2008, 1:28:03 PM2/24/08
to
On Sun, 24 Feb 2008 18:15:29 +0200, Ioannis Vranos
<ivr...@nospam.no.spamfreemail.gr> wrote:

> [...]I am only experimenting with the wide character support. Is there
> any possibility that the "greek" locale is only an 8-bit one? And if it is,
> how can it also display English (Latin characters)? By using something
> like extended ASCII (0-127 Latin characters and 128-255 Greek
> characters)?

You might want to use this character set:
http://en.wikipedia.org/wiki/ISO/IEC_8859-7

Boris

Ioannis Vranos

unread,
Feb 24, 2008, 4:44:40 PM2/24/08
to
Jeff Schwab wrote:
>
>>>
>>> The following program, as far as I can tell, works fine on my system.
>>>
>>> #include <iostream>
>>> #include <locale>
>>>
>>> int main()
>>> {
>>> std::ios_base::sync_with_stdio(false);
>>>
>>> std::wcout.imbue(std::locale("el_GR.UTF-8"));
>>>
>>> std::wcout<< L"Δοκιμαστικό μήνυμα\n";
>>> std::wcout.flush();
>>> }
>>
>>
>> Yes it works OK here too. How does desynchronizing with stdio make
>> the program work?
>
> No idea. From your other post, the code appears to work without the
> sync call.
>
> sync_with_stdio(false) means that the C++ I/O streams no longer have to
> coordinate their actions with the C I/O streams: stdin, stdout, and
> stderr. The Gnu developer seemed to imply that the C I/O streams had
> some fundamental limitation that prevented them from supporting multi-byte
> text. His solution was to "untie" wcout from stdout.


The bottom line of this, as far as I can understand, is that my & your
GCC has a bug for standard stream imbue() specialisations when
locale::global() is not specified while the C++ standard I/O is
synchronized with the C subset standard I/O.

When we desync them, the standard stream imbue() locale specialisations
work normally.

Ioannis Vranos

unread,
Feb 24, 2008, 4:52:49 PM2/24/08
to

Actually GNU do not consider it as a bug, but as an
implementation-defined behaviour. As he said:

"Not a bug, given our implementation-defined behavior: the various cin /
wcin, streams are by default synced with stdio (per the standard
requirements) and thus not converting".

Jeff Schwab

unread,
Feb 24, 2008, 5:20:18 PM2/24/08
to

That's enough for me, but I rarely need multi-byte characters. Please
post back to c.l.c++ if you come across any other interesting issues
related to languages other than English.

kasthurira...@gmail.com

unread,
Feb 24, 2008, 11:32:12 PM2/24/08
to
On Feb 25, 2:44 am, Ioannis Vranos <ivra...@nospam.no.spamfreemail.gr>
wrote:
> work normally.

May I suggest using wcin.imbue(locale("greek")) as well as
wcout.imbue(locale("greek"))? It looks like your program worked when
the global locale was set, but the program has a locale imbued only to
wcout and not to wcin. Setting the global locale sets both wcin and
wcout to that locale. According to "Standard C++ IOStreams and Locales"
by Langer/Kreft, different locales can be imbued to different streams
and still all the streams can be used without crossing each other.
Currently I cannot try this, since our environment has only an English
locale.

Moreover, sync_with_stdio has something to do with performance, again
according to Langer/Kreft.
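The per-stream part of this is easy to check without any Greek locale installed. A sketch (the helper name imbued_name is made up) showing that a stream reports whatever locale was imbued into it, independently of other streams:

```cpp
#include <locale>
#include <sstream>
#include <string>

// Imbues a fresh wide stream with `loc` and returns the name of the
// locale the stream then reports via getloc().
std::string imbued_name(const std::locale& loc)
{
    std::wostringstream out;
    out.imbue(loc);
    return out.getloc().name();
}
```

Passing std::locale::classic() gives back "C"; passing std::locale("el_GR.utf8") on a machine where that locale exists should give back that name, and each stream keeps its own copy.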

James Kanze

unread,
Feb 25, 2008, 3:19:06 AM2/25/08
to
On Feb 24, 10:52 pm, Ioannis Vranos
<ivra...@nospam.no.spamfreemail.gr> wrote:

[...]


> Actually GNU do not consider it as a bug, but as an
> implementation-defined behaviour. As he said:

> "Not a bug, given our implementation-defined behavior: the
> various cin / wcin, streams are by default synced with stdio
> (per the standard requirements) and thus not converting".

That's a weird statement. wcin inputs wide characters as bytes,
so unless wchar_t is the same size as char, it has to convert
somehow. And I don't see any relationship between the
conversion (which is required to depend on the imbued locale)
and stdio (which is required by the C standard to convert wide
character I/O according to the global locale).

James Kanze

unread,
Feb 25, 2008, 3:25:34 AM2/25/08
to
On Feb 25, 5:32 am, kasthurirangan.bal...@gmail.com wrote:

[...]


> Moreover sync_with_stdio has something to do with performance,
> again referring Langer/Kreft.

sync_with_stdio basically means that mixing stdio and iostream
on the same stream (with cin==stdin, etc.) will work. Its
actual effect is that anytime you do I/O on an iostream, the
streambuf will first call fflush() on stdout, and vice versa.
Or behave as if that were the case.

If the implementation of stdio knows about the implementation of
iostream, and vice versa, this is relatively easy to handle,
without any significant loss of performance. If the two are
independent (usually the case, and definitely the case with g++,
where stdio is normally provided by the platform), stdio isn't
going to sync with the iostream, so iostream basically has to
call fflush before each operation (which shouldn't cost much if
there hasn't actually been any output), and sync after every
operation (which can be very, very expensive in terms of time).

Of course, none of this has anything to do with how the iostream
transcodes its input or output.
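
A minimal sketch of what the synchronisation buys, assuming nothing beyond the standard library (mixed_output and disable_sync are hypothetical helper names):

```cpp
#include <cassert>
#include <cstdio>
#include <iostream>

// Hypothetical helper: interleaves stdio and iostream output on stdout.
// While the streams are synchronised (the default), the three lines
// appear in program order.
void mixed_output(const char* tag)
{
    assert(tag != 0);
    std::printf("[%s] stdio line 1\n", tag);
    std::cout << '[' << tag << "] iostream line 2\n";
    std::printf("[%s] stdio line 3\n", tag);
}

// Hypothetical helper: switches synchronisation off and returns the
// previous state. To have a well-defined effect it should be called
// before any I/O is done on the standard streams.
bool disable_sync()
{
    return std::ios_base::sync_with_stdio(false);
}
```

After disable_sync(), iostream may buffer independently of stdio, so the order of mixed output is no longer guaranteed. A program that uses only iostream should not be able to observe the difference, apart from speed.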

Ioannis Vranos

Feb 25, 2008, 6:47:39 AM
kasthurira...@gmail.com wrote:
>
> may i suggest usage of wcin.imbue(locale("greek")) as well as
> wcout.imbue(locale("greek")).Looks like your program worked when
> global locale was set. But the program has locale imbued only to wcout
> and not to wcin. Setting the global locale sets both wcin and wcout to
> that locale. Referring C++ standard I/O streams & locales by Langer/
> Kreft, different locales can be imbued to different streams and still
> all the streams can be used without crossing each other. Currently i
> am handicapped to try this since our environment has only english
> locale.
>
> Moreover sync_with_stdio has something to do with performance, again
> referring Langer/Kreft.


If you mean this:

#include <iostream>
#include <locale>
#include <string>


int main()
{
    using namespace std;

    // locale::global(locale("greek"));

    wcin.imbue(locale("greek"));
    wcout.imbue(locale("greek"));

    wstring ws;

    wcin >> ws;

    wcout << ws << endl;
}


It doesn't work:


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό

[john@localhost src]$

If I uncomment the commented line, it works:

#include <iostream>
#include <locale>
#include <string>


int main()
{
    using namespace std;

    locale::global(locale("greek"));

    wcin.imbue(locale("greek"));
    wcout.imbue(locale("greek"));

    wstring ws;

    wcin >> ws;

    wcout << ws << endl;
}

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
Δοκιμαστικό
[john@localhost src]$


Ioannis Vranos

Feb 25, 2008, 6:53:23 AM

This also works:

#include <iostream>
#include <locale>
#include <string>


int main()
{
    using namespace std;

    locale::global(locale("en_US"));

    wcin.imbue(locale("greek"));
    wcout.imbue(locale("greek"));

    wstring ws;

    wcin >> ws;

    wcout << ws << endl;
}

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
Δοκιμαστικό
[john@localhost src]$


In summary, the imbued locales take effect only when we either call
locale::global() or call std::ios_base::sync_with_stdio(false). I have
the feeling this is a bug and not implementation-defined behaviour.
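
The two workarounds can be combined in one setup function. This is a minimal sketch, assuming the g++ behaviour observed above and using a hypothetical helper name, setup_wide_io. Either step alone was enough in the tests; the sketch applies both.

```cpp
#include <cassert>
#include <iostream>
#include <locale>
#include <stdexcept>

// Hypothetical helper: prepares wcin and wcout for the named locale.
// Returns false if the platform does not know the locale name.
bool setup_wide_io(const char* name)
{
    assert(name != 0);
    try {
        const std::locale loc(name);            // throws if the name is unknown
        std::ios_base::sync_with_stdio(false);  // call before any I/O is done
        std::locale::global(loc);               // also sets the C library locale
        std::wcin.imbue(loc);
        std::wcout.imbue(loc);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

A program would call setup_wide_io("greek") as the first statement in main() and fall back to some default if it returns false.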

kasthurira...@gmail.com

Feb 25, 2008, 7:22:52 AM
On Feb 25, 4:53 pm, Ioannis Vranos <ivra...@nospam.no.spamfreemail.gr>
wrote:
> Ioannis Vranos wrote:
> this is a bug and not an implementation-defined behaviour.

Could you please try without the global locale?

I am not sure how you arrived at the conclusion that
std::ios_base::sync_with_stdio(false) is needed. I will post an update
if I come across something similar.

Thanks,
Balaji.

kasthurira...@gmail.com

Feb 25, 2008, 7:31:41 AM
> > > Δοκιμαστικό

>
> > > [john@localhost src]$
>
> > > If I uncomment the commented line, it works:
>
> > > #include <iostream>
> > > #include <locale>
> > > #include <string>
>
> > > int main()
> > > {
> > >     using namespace std;
>
> > >     locale::global(locale("greek"));
>
> > >     wcin.imbue(locale("greek"));
> > >     wcout.imbue(locale("greek"));
>
> > >     wstring ws;
>
> > >     wcin>> ws;
>
> > >     wcout<< ws<< endl;
> > > }
>
> > > [john@localhost src]$ ./foobar-cpp
> > > Δοκιμαστικό
> > > Δοκιμαστικό

> > > [john@localhost src]$
>
> > This also works:
>
> > #include <iostream>
> > #include <locale>
> > #include <string>
>
> > int main()
> > {
> >      using namespace std;
>
> >      locale::global(locale("en_US"));
>
> >      wcin.imbue(locale("greek"));
> >      wcout.imbue(locale("greek"));
>
> >      wstring ws;
>
> >      wcin>> ws;
>
> >      wcout<< ws<< endl;
>
> > }
>
> > [john@localhost src]$ ./foobar-cpp
> > Δοκιμαστικό
> > Δοκιμαστικό

> > [john@localhost src]$
>
> > In summary, the locale specialisations work only either when we use the
> > locale::global() statement, or we use the
> > std::ios_base::sync_with_stdio(false); statement. I have the feeling
> > this is a bug and not an implementation-defined behaviour.
>
> could you pls try without global locale!!!

Your other attempts, which had already covered this, did not show up in
the post, at least in my browser, or maybe I did not look properly.
Hence my suggestion.

It looks weird. By the way, I guess we need to initialize the locale
before use (one thing to try is locale("C"), though this should be the
default). I do not know whether the standard specifies it that way or
whether it is up to the implementation.

Thanks,
Balaji.


>
> i am not sure how the conclusion was arrived  - use
> std::ios_base::sync_with_stdio(false). Shall update if i come across
> something similar.
>
> Thanks,

> Balaji.

Gerhard Fiedler

Feb 25, 2008, 8:08:25 AM
On 2008-02-24 10:51:32, James Kanze wrote:

>>>> Also I have MS Visual C++ 2008 Express installed.
>
>>> Under Linux ! :-)
>
>> Linux::VMWare::Windows::VC++2008 Express.
>
> Thanks. I'll give it a try myself. (Of course, the executables it
> generates will also require VMWare to run, but it will allow at least
> verifying that my code compiles with VC++ before trying to port it to
> Windows.)

How will the generated executables require VMware to run? If you generate
executables in such an environment, they'll run on any of the Windows
platforms supported by your code and compilation, whether under VMware or
not.

Gerhard

James Kanze

Feb 26, 2008, 3:59:44 AM
On Feb 25, 2:08 pm, Gerhard Fiedler <geli...@gmail.com> wrote:
> On 2008-02-24 10:51:32, James Kanze wrote:

> >>>> Also I have MS Visual C++ 2008 Express installed.

> >>> Under Linux ! :-)

> >> Linux::VMWare::Windows::VC++2008 Express.

> > Thanks. I'll give it a try myself. (Of course, the
> > executables it generates will also require VMWare to run,
> > but it will allow at least verifying that my code compiles
> > with VC++ before trying to port it to Windows.)

> How will the generated executables require VMware to run?

How will they not? VC++ will certainly still produce a Windows
executable, not a Linux executable.

> If you generate executables in such an environment, they'll
> run on any of the Windows platforms supported by your code and
> compilation, whether under VMware or not.

If I had a Windows platform handy, I'd run VC++ on it. If I
have to run it under VMware, it's because I don't have a Windows
platform available otherwise.

James Kanze

Feb 26, 2008, 4:04:06 AM
On Feb 25, 1:22 pm, kasthurirangan.bal...@gmail.com wrote:
> On Feb 25, 4:53 pm, Ioannis Vranos <ivra...@nospam.no.spamfreemail.gr>
> wrote:

[...]


> i am not sure how the conclusion was arrived - use
> std::ios_base::sync_with_stdio(false).

What does sync_with_stdio have to do with the locales? With the
exception of possible performance issues, and when buffers are
getting flushed (which is unspecified anyway), sync_with_stdio
should have no impact as long as all I/O actually is done with
iostream. If this is not the case, there's a problem with the
implementation.

Gerhard Fiedler

Feb 26, 2008, 6:43:33 AM
On 2008-02-26 05:59:44, James Kanze wrote:

>>> (Of course, the executables it generates will also require VMWare to
>>> run, but it will allow at least verifying that my code compiles with
>>> VC++ before trying to port it to Windows.)
>
>> How will the generated executables require VMware to run?
>
> How will they not? VC++ will certainly still produce a Windows
> executable, not a Linux executable.

Exactly. And that exe requires Windows, not VMware.

> If I had a Windows platform handy, I'd run VC++ on it. If I
> have to run it under VMware, it's because I don't have a Windows
> platform available otherwise.

Right. But that's not a requirement of the generated executable. You can
install Windows in a dual-boot config or send the exe to someone else with
a Windows system or use a different virtualizer or Windows emulator, and
the exe will run just fine without VMware. It doesn't even know it was
compiled under VMware.

Gerhard

Frank Birbacher

May 15, 2008, 5:34:04 AM
Hi!

Ioannis Vranos schrieb:


> In summary, the locale specialisations work only either when we use the
> locale::global() statement, or we use the
> std::ios_base::sync_with_stdio(false); statement. I have the feeling
> this is a bug and not an implementation-defined behaviour.

You are using gcc on Linux. With msvc8 on Windows it seems to be the
other way round: when I set a UTF-8 converting locale on wcout only, it
works. However, if I set it on wcout AND globally, the output is
garbled. Maybe it tries to do the conversion twice.

Note: I tested the UTF-8 output in an Eclipse console which is set to
UTF-8 encoding.

From my current experience there is no usable wcout implementation that
reliably makes use of the conversion facet. Maybe I will just implement
a std wide (wchar_t) stream that outputs to a narrow stream and uses the
utf8 encoding. I could use that to simply wrap cout and replace wcout.
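
A minimal sketch of such a wrapper, assuming each wchar_t holds one Unicode code point (true for a 32-bit wchar_t as on Linux; with a 16-bit wchar_t only the BMP works, since surrogates are rejected rather than combined). The names utf8_wbuf and encode_utf8 are hypothetical:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical class: a wide stream buffer that encodes each wchar_t as
// UTF-8 and forwards the bytes to a narrow stream buffer.
class utf8_wbuf : public std::wstreambuf
{
public:
    explicit utf8_wbuf(std::streambuf* sink) : sink_(sink)
    {
        assert(sink != 0);
    }

protected:
    // No put area is set, so every character written arrives here.
    virtual int_type overflow(int_type c)
    {
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c);

        const unsigned long cp =
            static_cast<unsigned long>(traits_type::to_char_type(c));
        char buf[4];
        std::streamsize n = 0;

        if (cp < 0x80UL) {
            buf[n++] = static_cast<char>(cp);
        } else if (cp < 0x800UL) {
            buf[n++] = static_cast<char>(0xC0 | (cp >> 6));
            buf[n++] = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000UL) {
            if (cp >= 0xD800UL && cp <= 0xDFFFUL)
                return traits_type::eof();  // lone surrogate, not encodable
            buf[n++] = static_cast<char>(0xE0 | (cp >> 12));
            buf[n++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            buf[n++] = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFFUL) {
            buf[n++] = static_cast<char>(0xF0 | (cp >> 18));
            buf[n++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            buf[n++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            buf[n++] = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            return traits_type::eof();  // outside the Unicode range
        }
        return sink_->sputn(buf, n) == n ? c : traits_type::eof();
    }

    // Flushing the wide stream flushes the narrow sink.
    virtual int sync() { return sink_->pubsync(); }

private:
    std::streambuf* sink_;
};

// Hypothetical helper: runs a wide string through the buffer and
// returns the resulting UTF-8 bytes.
std::string encode_utf8(const std::wstring& ws)
{
    std::ostringstream narrow;
    utf8_wbuf buf(narrow.rdbuf());
    std::wostream wide(&buf);
    wide << ws;
    wide.flush();
    return narrow.str();
}
```

To stand in for wcout, construct utf8_wbuf buf(std::cout.rdbuf()) and std::wostream out(&buf), then write to out. This bypasses the codecvt facet entirely, so the result does not depend on the global or imbued locale.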

Frank
