Unicode test

alexo

Apr 5, 2017, 8:36:42 AM
Hello all,
I'm trying to write a test program to deal with Unicode characters on
Windows. I've found that the way to insert it in a character type is to
use wchar_t (but this is not a portable solution) and use the character
format code L"\0xxx".

this is the code I wrote after googling a bit (but not much :)

/* code follows */

#include <iostream>
#include <locale.h>

using namespace std;

int main()
{
    char* locale = setlocale(LC_ALL, "");

    // A single wide character takes single quotes; L"..." is a string literal.
    wchar_t uc = L'\u0333';

    wcout << uc << endl;

    return 0;
}

It outputs nothing

alexo

Apr 5, 2017, 8:38:43 AM
On 05/04/2017 14:36, alexo wrote:

the message was sent too early.
Thank you for your help.

alessandro


Alf P. Steinbach

Apr 5, 2017, 6:28:12 PM
On 06-Apr-17 12:07 AM, Stefan Ram wrote:
> alexo <alessandr...@libero.it> writes:
>> I'm trying to write a test program to deal with Unicode characters on
>> Windows. I've found that the way to insert it in a character type is to
>> use wchar_t (but this is not a portable solution) and use the character
>> format code L"\0xxx".
>
> #include <stdio.h>
> #include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
> #include <wchar.h>
> #include <Windows.h>
>
> int main()
> { //chcp 65001
> UINT oldcp = GetConsoleOutputCP();
> if (!SetConsoleOutputCP(CP_UTF8))
> { fprintf(stderr, "chcp failed\n");
> return EXIT_FAILURE; }
>
> unsigned char utf8data[] =
> { 'H',
> 0xc3, 0xa4,
> 'l',
> 0xC2, 0xA2, // "cent"
> 0xE2, 0x82, 0xAC, // "euro"
> 0xF0, 0x90, 0x8D, 0x88, // U+10348
> 'l',
> 0xc3, 0xb6,
> '!',
> '\n',
> 0x00 };
>
> DWORD size=(DWORD)(sizeof(utf8data) / sizeof(*utf8data)) - 1;
>
> DWORD written = 0;
> BOOL success = WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8data, size, &written, 0 );
> if (!success)
> {
> fprintf(stderr, "WriteFile failed\n");
> return EXIT_FAILURE;
> }
>
>
> SetConsoleOutputCP(oldcp);
> return EXIT_SUCCESS;
> }
>
> You also need a Windows version that is not too old.
>

This is unfortunately ungood advice. The UTF-8 codepage generally
doesn't work for console i/o, down at the API level.

Using the Windows API, as is done above, you can use the direct console
i/o functions.

Using the C++ wide streams (portable) you can configure them via `_setmode`.


Cheers!,

- Alf

Alf P. Steinbach

Apr 5, 2017, 11:03:42 PM
On 06-Apr-17 3:08 AM, Stefan Ram wrote:
> "Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:
>> This is unfortunately ungood advice. The UTF-8 codepage generally
>> doesn't work for console i/o, down at the API level.
>
> If »generally doesn't« means, »not under every combination
> of circumstance«, you might be right.

Generally means mostly, and not at all for input.


> If »generally doesn't« means, »never«, I'd like to report
> that I saw the expected characters. For example, the second
> character that appeared in the console was an »ä«.
>
> ,= ./ .: / ,++#, : .///. :///, :///, ,++#, --.: -=
> =% ,# =;:/ #, /$M+- /M+/== :...= :...= #, =;;; ;%
> =@++%# =//H+ #, -@ H .XH//- = = = = #, ;$,,/X =%
> =% ,# -H==%% #, .H;@,. ;#;=.. = = = = #, %/ -# -/
> =+ ,M .X++/H: M, :H:, =X++: ++++= ++++= M, ,%++X= -;
>
> H ä l ¢ Euro [] [] l ö !
>
> (U+10348 lays outside of the BMP, and was not rendered correctly
> here [it was rendered looking similar to »[][]«]).

That might be an UTF-16 surrogate pair, treated as two UCS-2 characters
by the console.
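
To make the surrogate-pair guess concrete, here is a minimal sketch of how UTF-16 splits a code point above U+FFFF into two 16-bit units. The names `SurrogatePair` and `to_surrogates` are made up for this illustration; only the arithmetic is the standard UTF-16 rule.

```cpp
#include <cstdint>

// Hypothetical helper type: the two 16-bit code units of a surrogate pair.
struct SurrogatePair
{
    std::uint16_t high;   // lead surrogate, 0xD800..0xDBFF
    std::uint16_t low;    // trail surrogate, 0xDC00..0xDFFF
};

// Splits a code point in the range U+10000..U+10FFFF into a UTF-16
// surrogate pair. The caller is assumed to pass a value in that range.
inline SurrogatePair to_surrogates( char32_t code_point )
{
    const std::uint32_t offset = static_cast<std::uint32_t>( code_point ) - 0x10000u;
    return SurrogatePair{
        static_cast<std::uint16_t>( 0xD800u + (offset >> 10) ),     // top 10 bits
        static_cast<std::uint16_t>( 0xDC00u + (offset & 0x3FFu) )   // low 10 bits
    };
}
```

For U+10348 this yields the units 0xD800 and 0xDF48. A console that treats each unit as a separate UCS-2 character would plausibly show two placeholder boxes, which matches the »[][]« reported above.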

Sorry I don't have the energy or time right now (or earlier) to give a
good discussion. It's messy. I'll try to find time & energy tomorrow.


Cheers!,

- Alf

alexo

Apr 6, 2017, 8:43:29 AM
On 06/04/2017 00:16, Stefan Ram wrote:
> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>> You also need a Windows version that is not too old.
>
> And you need to set a console font in the console,
> no, wait: a /Unicode/ font!, such as: Lucida Console
> or Consolas.
>

I have tried my code on Windows 10 in a cmd prompt with the Consolas
font. My code produces no output.

If I use cout instead of wcout, what I get is the decimal numeric value
of the character I would like to print.


alexo

Apr 6, 2017, 9:16:10 AM
My machine displays only certain characters, namely the cent symbol, the
euro sign, the ä and ö, but even with consolas font face the 0xF0, 0x90,
0x8D, 0x88 (character U+10348) - that in my case is shown as 2 distinct
characters - is not displayed at all.

Häl¢€𐍈lö! should be the output.

I gather that I have to insert the hexadecimal code of every single byte
of the UTF-8 character, am I right?
Could you explain to me (I'm not very familiar with hex and key codes)
how I should use, for example, the U+0333 character?

Is it really so tricky to print a Unicode character?

thank you

Alf P. Steinbach

Apr 6, 2017, 12:20:39 PM
U+0333 (COMBINING DOUBLE LOW LINE) is a combining character that applies
a double underline to the preceding character. When there is no preceding
character for it to attach to, not even a space, you should not expect
any visible effect.
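
As a rough illustration of what "combining" means in code, the sketch below tests whether a code point falls in the Unicode block "Combining Diacritical Marks" (U+0300..U+036F), which is where U+0333 lives. The function name is invented for this example, and the check covers only this one block; a complete test for combining characters would need the Unicode character database.

```cpp
// Returns true when code_point lies in the Unicode block
// "Combining Diacritical Marks" (U+0300..U+036F). This is a deliberately
// narrow check, not a general "is this a combining character" test.
inline bool is_combining_diacritical( char32_t code_point )
{
    return code_point >= 0x0300 && code_point <= 0x036F;
}
```

A program could use such a check to make sure a base character (a letter, or at least a space) is written before the mark, as in `L"a\u0333"`.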

However, if instead you used e.g. the Euro sign, which Windows Write
informs me is U+20AC, then you should only expect that character to
display if it is in the active codepage (character set) for the console
window. Because the function of wcout is to convert from wide characters
(effectively Unicode, in Windows encoded as UTF-16) to the narrow
character encoding expected by the external environment.

In Unix-land this is not a big problem, because the external environment
in Unix-land typically expects, and produces, UTF-8 encoded text. So all
you have to do there is the `setlocale(LC_ALL, "")` code that you have.
Then the wide streams work, in Unix-land.

There are two main ways to cajole wcout & friends to, instead of their
standard conversion behavior, do direct console i/o in Windows:

• use a Microsoft extension called `_setmode`. It's supported also by
MinGW g++, or
• replace the wide streams' text buffers with custom buffers that use
Windows' direct console i/o functions.

Here's an example of the first, easier solution:

// Source encoding: UTF-8 w/BOM.
#include <iostream>
#include <string>
#include <locale.h>

#include <io.h>     // _setmode, _fileno
#include <fcntl.h>  // _O_WTEXT

using namespace std;

void init_streams()
{
    // The bare minimum for this program. More generally one should
    // check whether a stream is connected to the console, and if not,
    // set UTF-8 mode instead of wide text mode.
    _setmode( _fileno( stdin), _O_WTEXT );
    _setmode( _fileno( stdout), _O_WTEXT );
}

auto main()
    -> int
{
    //setlocale( LC_ALL, "" );
    init_streams();

    auto const& s = L"Every 日本国 кошка likes Norwegian blåbærsyltetøy!";
    wcout << s << endl;

    wcout << endl;
    wcout << L"What’s your name? ";
    wstring name;
    getline( wcin, name );
    wcout << "Pleased to meet you, " << name << "!" << endl;
}


I put the stream configuration in a function so you can more easily see
that it can be moved all the way to its own translation unit, which then
would leave the main program as 100% portable.
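
A minimal sketch of what such a separate translation unit might look like, assuming the Microsoft runtime names `_isatty` and `_O_U8TEXT` are available alongside `_setmode` (older MinGW headers may need extra defines to expose them). The file layout and the fallback to `setlocale` on other systems are assumptions for illustration, not part of the program above.

```cpp
#include <clocale>

#ifdef _WIN32
#   include <stdio.h>   // _fileno, stdin, stdout
#   include <io.h>      // _setmode, _isatty
#   include <fcntl.h>   // _O_WTEXT, _O_U8TEXT
#endif

// Hypothetical contents of a separate file, e.g. init_streams.cpp.
void init_streams()
{
#ifdef _WIN32
    // Console: wide text mode. Redirected to a file or pipe: UTF-8.
    const int in_fd  = _fileno( stdin );
    const int out_fd = _fileno( stdout );
    _setmode( in_fd,  _isatty( in_fd )  ? _O_WTEXT : _O_U8TEXT );
    _setmode( out_fd, _isatty( out_fd ) ? _O_WTEXT : _O_U8TEXT );
#else
    // Elsewhere: adopt the user's locale, which is typically UTF-8 based.
    std::setlocale( LC_ALL, "" );
#endif
}
```

The main program then only needs a declaration `void init_streams();` and contains no Windows-specific headers itself.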

Well, except for the limitation of Windows console windows to the UCS-2
character set, the Basic Multilingual Plane of Unicode, which means e.g.
that some archaic Chinese ideographs just can't be handled.

I wrote about this once, some five years ago, here: <url:
https://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/>


Cheers & hth.,

- Alf

David Brown

Apr 6, 2017, 3:53:54 PM
On 06/04/17 18:20, Alf P. Steinbach wrote:

> In Unix-land this is not a big problem, because the external environment
> in Unix-land typically expects, and produces, UTF-8 encoded text. So all
> you have to do there is the `setlocale(LC_ALL, "")` code that you have.
> Then the wide streams work, in Unix-land.
>

In Unix land (well, my Linux system anyway), I get utf-8 encoded text
like this:

$ cat u.cpp
#include <iostream>

int main(void)
{
    std::cout << "Hello, world - ÅØÆ πr² ïç a\u0333b\n";
}

$ file u.cpp
u.cpp: C source, UTF-8 Unicode text

$ g++ u.cpp -Wall -Wextra

$ ./a.out
Hello, world - ÅØÆ πr² ïç a̳b


I just write the UTF-8 characters I want, mostly using the keyboard
directly (possibly with the compose key) rather than code points or a
character applet, and they come out fine.


Alf P. Steinbach

Apr 6, 2017, 4:00:04 PM
Yes, you can do things more easily in non-portable code. ;-)


Cheers!

- Alf


David Brown

Apr 6, 2017, 5:17:30 PM
In this case, that's certainly true.

Or you can do it more easily in portable code that is not C++:

$ cat u.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
print(u"Hello, world - ÅØÆ πr² ïç a\u0333b")

$ ./u.py
Hello, world - ÅØÆ πr² ïç a̳b

Works fine with Python 2 or 3. It certainly works on my Linux machine -
I expect it to work on Windows too.


Alf P. Steinbach

Apr 6, 2017, 7:25:54 PM
On 06-Apr-17 11:17 PM, David Brown wrote:
> Or you can do it more easily in portable code that is not C++:
>
> $cat u.py
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
> print(u"Hello, world - ÅØÆ πr² ïç a\u0333b")
>
> ./u.py
> Hello, world - ÅØÆ πr² ïç a̳b
>
> Works fine with Python 2 or 3. It certainly works on my Linux machine -
> I expect it to work on Windows too.

It seems I'm consigned to the rôle of breaking reasonable expectations
about Windows.

To be fair: it's not a problem with Windows, really, it's a problem that
I call (just in my discussions with myself) “pretend software”. It's
where most everybody pretends that some software is OK for its general
purpose, because it works for a number of special cases. And we've never
heard about it not working. At least, we don't remember it.


[H:\forums\clc++\unicode in windows console]
> python --version
Python 3.4.3

[H:\forums\clc++\unicode in windows console]
> python unicode.py
Traceback (most recent call last):
File "unicode.py", line 2, in <module>
print(u"Hello, world - Å\xd8Æ pr² ïç a\u0333b")
File "c:\Python34\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xd8' in
position 16: character maps to <undefined>

[H:\forums\clc++\unicode in windows console]
> _


The silly beast defaults to translating that Unicode text to the narrow
encoding that's chosen (e.g. via the `chcp` command) in the console
window, the console window's “active codepage”.

Which in the example above is 437, the original IBM PC encoding, with
lots of fancy characters but not the one noted in the error message.

So what to do?

Well, in earlier days CPython (which I suspect this is, it's just been
installed by something else) could be tweaked to do The Right Thing™,
but that possibility was removed, probably because of the mass psychosis
effect I described above: everybody thinks it works, because there are
cases (e.g. pure ASCII output) where it works. Alternatively it could be
that someone didn't like that it was so easy to prove that this was a
Python problem and not a Windows problem, and so made that much harder
to do. Anyway, with modern Python it's a PITA to fix it.

I wrote about it in 2015, here: <url:
https://alfps.wordpress.com/2015/05/12/non-crashing-python-3-x-output-in-windows/>.


Cheers!,

- Alf

David Brown

Apr 7, 2017, 3:18:03 AM
I took a quick test on my Windows 7 system. It struggled with some of
the characters - it seems the modification character \u0333 is beyond
Windows' abilities. And π is missing. With those removed, it worked.

Perhaps Windows console could print the Unicode characters, as long as
those characters happened to be in the normal Windows code page
(Latin-1, or something similar).


Alvin

Apr 7, 2017, 8:07:07 AM
Windows works fine, if you set the codepage to UTF-8 (at least with a
terminal with good UTF support like ConEmu):
chcp 65001

It's not like it would work on Linux, if you have a non-UTF configuration:

> LC_ALL=en_US python3 unicode.py

Traceback (most recent call last):
File "unicode.py", line 5, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u03c0' in
position 19: ordinal not in range(256)

Alf P. Steinbach

Apr 7, 2017, 8:19:56 AM
On 07-Apr-17 9:17 AM, David Brown wrote:
> On 07/04/17 01:25, Alf P. Steinbach wrote:
>> On 06-Apr-17 11:17 PM, David Brown wrote:
>>> Or you can do it more easily in portable code that is not C++:
>>>
>>> $cat u.py
>>> #!/usr/bin/python
>>> # -*- coding: utf-8 -*-
>>> print(u"Hello, world - ÅØÆ πr² ïç a\u0333b")
>>>
>>> ./u.py
>>> Hello, world - ÅØÆ πr² ïç a̳b
>>>
>>> Works fine with Python 2 or 3. It certainly works on my Linux machine -
>>> I expect it to work on Windows too.
>>
[snip]
>
> I took a quick test on my Windows 7 system. It struggled with some of
> the characters - it seems the modification character \u0333 is beyond
> Windows' abilities.

Yes, it is.


> And π is missing.

Whether you have π (pi) available with Python output depends on the
active codepage (the narrow character encoding) in the console, and the
reason that it depends on that, is Python's conversion from Unicode to
the active codepage, instead of using console i/o. I.e., to be less
imprecise, that the CPython implementation still does not support
Windows consoles but uses the standard narrow streams for console i/o.

Checking by simple

chcp CODEPAGENUMBER
echo π | more

… π appears to be there in codepages 437 (IBM PC, default in English
language installations of Windows) and 865 (Nordic), but appears to be
missing in codepages 850 (all-European) and 1252 (Windows ANSI Western,
an extension of ISO Latin-1), which some programmers use in the console.

But again, it would be impractical to keep changing the codepage to suit
the effective character set of the Python script. Instead CPython should
be fixed. I think that to get it fixed someone should argue that CPython
could be made even better (no mention of this being a fault).


> With those removed, it worked.

> Perhaps Windows console could print the Unicode characters, as long as
> those characters happened to be in the normal Windows code page
> (Latin-1, or something similar).

Windows consoles handle the Basic Multilingual Plane just fine; Windows
console windows are Unicode beasts through and through.

For example, you can use any BMP character in a typed in command, and
that command line is passed perfectly to the process.

Programs, such as the CPython implementation, and designs, such as the
design of the C++ i/o, fail to handle Unicode correctly in Windows.
Mainly because they are based on the antiquated and counter-productive
since 1981 old Unix unification of files, pipes and interactive i/o as
streams of single bytes. Even with UTF-8 and no support for interactive
features it's ungood, because UTF-8 error states are usually persistent.


Cheers & hth., :)

- Alf

alexo

Apr 7, 2017, 9:03:53 AM

I don't understand how 0xC2, 0xAC gives the cent symbol.
It is not a single character code, so how can this sequence be interpreted
as a single character?

Same doubts in

0xE2, 0x82, 0xAC
or
0xF0, 0x90, 0x8D, 0x88.

> unsigned char utf8data[] =
> {
> 'H',
> 0xc3, 0xa4,
> 'l',
> 0xC2, 0xA2, // "cent"
> 0xE2, 0x82, 0xAC, // "euro"
> 0xF0, 0x90, 0x8D, 0x88, // U+10348
> 'l',
> 0xc3, 0xb6,
> '!',
> '\n',
> 0x00
> }

Does this array contain 19 or 11 characters (counting 0x00 as '\0')?
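
Both numbers are right, depending on what is counted: the array holds 19 bytes that encode 11 characters (code points). A minimal sketch, assuming well-formed UTF-8, that counts code points by skipping continuation bytes (those of the form 10xxxxxx); the function name `count_code_points` is invented for this example.

```cpp
#include <cstddef>

// Counts UTF-8 code points in a byte buffer by counting every byte that is
// not a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes the buffer holds well-formed UTF-8.
inline std::size_t count_code_points( const unsigned char* data, std::size_t n_bytes )
{
    std::size_t count = 0;
    for( std::size_t i = 0; i < n_bytes; ++i )
    {
        if( (data[i] & 0xC0) != 0x80 ) { ++count; }
    }
    return count;
}

// The array from the quoted program, including the terminating 0x00.
const unsigned char utf8data[] =
{
    'H',
    0xc3, 0xa4,               // "ä"
    'l',
    0xC2, 0xA2,               // "cent"
    0xE2, 0x82, 0xAC,         // "euro"
    0xF0, 0x90, 0x8D, 0x88,   // U+10348
    'l',
    0xc3, 0xb6,               // "ö"
    '!',
    '\n',
    0x00
};
```

Here `sizeof(utf8data)` is 19, while `count_code_points( utf8data, sizeof(utf8data) )` is 11.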

Moreover, reading the posts in this thread, I still have not understood
how to use wcout and unicode codes without having to write console
settings like

UINT oldcp = GetConsoleOutputCP();

if (!SetConsoleOutputCP(CP_UTF8))
{
fprintf(stderr, "chcp failed\n");
return EXIT_FAILURE;
}

I know I seem stupid, but I thought there was a way to write something like


std::wcout << "H\u0333" << "ello" << endl;

thank you

David Brown

Apr 7, 2017, 9:37:45 AM
That is certainly true. But the difference is that the standard shells
and terminals on Linux have all been fine with utf-8 for a good many
years, and most systems will have a utf-8 locale even if they are only
used for plain ASCII characters normally. On Windows, however, you need
to go out of your way to get extra terminal software and have extra
settings (unless things have changed in later Windows).

Still, I will remember the possibility of something like ConEmu if I
find I need console utf-8 on Windows.


Alf P. Steinbach

Apr 7, 2017, 9:37:56 AM
On 07-Apr-17 3:03 PM, alexo wrote:
> More over, reading the posts in this thread, I still have not understood
> how to use wcout and unicode codes without having to write console
> settings like
>
> UINT oldcp = GetConsoleOutputCP();
>
> if (!SetConsoleOutputCP(CP_UTF8))
> {
> fprintf(stderr, "chcp failed\n");
> return EXIT_FAILURE;
> }
>
> I know I seem stupid, but I tought there were a way to write something like
>
>
> std::wcout << "H\u0333" << "ello" << endl;

Well, you could read my reply to your original posting.

That's a hint.


Cheers!,

- Alf

David Brown

Apr 7, 2017, 9:44:45 AM
That makes sense.

> Checking by simple
>
> chcp CODEPAGENUMBER
> echo π | more
>
> … π appears to be there in codepages 437 (IBM PC, default in English
> language installations of Windows) and 865 (Nordic), but appears to be
> missing in codepages 850 (all-European) and 1252 (Windows ANSI Western,
> an extension of ISO Latin-1), which some programmers use in the console.

My code page at the moment is 850 (latin-1) - I have Windows set up for
UK English, but with a Norwegian keyboard. I can show π in a terminal,
with a suitable font like Lucida Console, but the python script still
does not print it.

>
> But again, it would be impractical to keep changing the codepage to suit
> the effective character set of the Python script. Instead CPython should
> be fixed. I think that to get it fixed someone should argue that CPython
> could be made even better (no mention of this being a fault).
>

Yes - it would be nice if this simply worked cross-platform out of the
box on Windows. I suppose font support would be required, but it should
not be asking /too/ much of Python to make Unicode output work here on
Windows in the same way as on Linux.

>
>> With those removed, it worked.
>
>> Perhaps Windows console could print the Unicode characters, as long as
>> those characters happened to be in the normal Windows code page
>> (Latin-1, or something similar).
>
> Windows consoles handle the Basic Multilingual Plane just fine; Windows
> console windows are Unicode beasts through and through.
>
> For example, you can use any BMP character in a typed in command, and
> that command line is passed perfectly to the process.
>
> Programs, such as the CPython implementation, and designs, such as the
> design of the C++ i/o, fail to handle Unicode correctly in Windows.
> Mainly because they are based on the antiquated and counter-productive
> since 1981 old Unix unification of files, pipes and interactive i/o as
> streams of single bytes. Even with UTF-8 and no support for interactive
> features it's ungood, because UTF-8 error states are usually persistent.
>

Linux terminals can certainly be screwed up if you try and cat a binary
file. I don't know if it is only UTF-8 errors, or other problems. No
system is perfect, it seems.

>
> Cheers & hth., :)
>

Perhaps that is enough of this here - the Python stuff is off-topic for
c.l.c++ and I doubt if it is helping the OP. But thank you for your
explanations - I have learned a few new things here.


Alf P. Steinbach

Apr 7, 2017, 9:46:50 AM
On 07-Apr-17 3:03 PM, alexo wrote:
> I don't understand how 0xC2, 0xAC gives the cent symbol.
> It is not a single character code, so how can this sequence interpreted
> as a single character?

You need to provide the full code and example for that.

Cheers!,

- Alf

Scott Lurndal

Apr 7, 2017, 9:49:34 AM
Random terminal control escape sequences within the binary will screw up
xterm and gnome-terminal.

$ stty sane
$ tput reset

will restore normal operations. It may be necessary,
in some cases (^c of poorly written curses app, e.g.)
to use ^j to get a newline when typing stty sane.


http://invisible-island.net/xterm/ctlseqs/ctlseqs.html

Alvin

Apr 7, 2017, 10:40:56 AM
I just tried Python 3.6.1. It works without chcp. There is PEP 528:
https://www.python.org/dev/peps/pep-0528/

Ben Bacarisse

Apr 7, 2017, 2:52:33 PM
alexo <alessandr...@libero.it> writes:

> I don't understand how 0xC2, 0xAC gives the cent symbol.

I don't think it ever will! 0xC2 0xAC is the hex encoding of the UTF-8
encoding of the not sign. A cent symbol is UTF-8 encoded as 0xC2 0xA2.

You need to anchor the distinction between the character set (loosely,
the numbering of some collection of symbols) and the way in which those
numbers are encoded as bytes for transmission and/or printing. Unicode
is, to a first approximation, a numbering scheme for more than a hundred
thousand characters. The numbers specified by Unicode can then be
transmitted in a variety of ways with names like UCS-4, UTF-16 and UTF-8.
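
To illustrate the step from number to bytes, here is a minimal sketch of a UTF-8 encoder. The name `encode_utf8` is made up for this example, and the function assumes a valid code point (at most U+10FFFF, not a surrogate); it does no validation.

```cpp
#include <string>

// Encodes one Unicode code point as a UTF-8 byte sequence.
// Assumes cp is a valid code point; no validation is performed.
inline std::string encode_utf8( char32_t cp )
{
    std::string out;
    if( cp < 0x80 )                 // 1 byte:  0xxxxxxx
    {
        out += static_cast<char>( cp );
    }
    else if( cp < 0x800 )           // 2 bytes: 110xxxxx 10xxxxxx
    {
        out += static_cast<char>( 0xC0 | (cp >> 6) );
        out += static_cast<char>( 0x80 | (cp & 0x3F) );
    }
    else if( cp < 0x10000 )         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    {
        out += static_cast<char>( 0xE0 | (cp >> 12) );
        out += static_cast<char>( 0x80 | ((cp >> 6) & 0x3F) );
        out += static_cast<char>( 0x80 | (cp & 0x3F) );
    }
    else                            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    {
        out += static_cast<char>( 0xF0 | (cp >> 18) );
        out += static_cast<char>( 0x80 | ((cp >> 12) & 0x3F) );
        out += static_cast<char>( 0x80 | ((cp >> 6) & 0x3F) );
        out += static_cast<char>( 0x80 | (cp & 0x3F) );
    }
    return out;
}
```

With this, U+00A2 (cent) becomes 0xC2 0xA2, U+00AC (not sign) becomes 0xC2 0xAC, U+20AC (euro) becomes 0xE2 0x82 0xAC, U+10348 becomes 0xF0 0x90 0x8D 0x88, and the U+0333 asked about earlier in the thread becomes 0xCC 0xB3.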

An excellent resource for learning about UTF-8 is this page:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

<snip>
--
Ben.

alexo

Apr 8, 2017, 7:02:04 AM
this is the output of my g++ compiler (MinGW)

g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

And this is the message I get trying to compile your code:


main.cpp: In function 'void init_streams()':
main.cpp:15:32: error: '_O_WTEXT' was not declared in this scope
_setmode( _fileno( stdin), _O_WTEXT );
^
main.cpp: In function 'int main()':
main.cpp:24:21: error: converting to execution character set: Illegal
byte sequence
auto const& s = L"Every ??? ????? likes Norwegian blåbærsyltetøy!";
^
main.cpp:28:14: error: converting to execution character set: Illegal
byte sequence
wcout << L"What's your name? ";
^


I've removed all C++11 flavours and I've added

#include <cstdio>

to turn off a couple of errors about finding stdin and stdout.
The problem is that my compiler cannot find _O_WTEXT
and that it doesn't recognize the format L"..." string


Alvin

Apr 8, 2017, 8:01:15 AM
...\x86_64-w64-mingw32\include\fcntl.h:
/**
* This file has no copyright assigned and is placed in the Public Domain.
* This file is part of the mingw-w64 runtime package.
* No warranty is given; refer to the file DISCLAIMER.PD within this
package.
*/
...
#define _O_WTEXT 0x10000
...

> main.cpp: In function 'int main()':
> main.cpp:24:21: error: converting to execution character set: Illegal
> byte sequence
> auto const& s = L"Every ??? ????? likes Norwegian blåbærsyltetøy!";
> ^
> main.cpp:28:14: error: converting to execution character set: Illegal
> byte sequence
> wcout << L"What's your name? ";
> ^

That's the kind of error you get, if you didn't properly create the .cpp
as UTF-8.

> I've removed all C++11 flavours and I've added
>
> #include <cstdio>
>
> to turn off a couple of errors about finding stdin and stout.
> The problem is that my compiler cannot find _O_WTEXT
> and that it doesn't recognize the format L"..." string

The original code works with the MinGW 5.x and 6.x versions I have lying
around.

Alf P. Steinbach

Apr 8, 2017, 9:05:24 AM
This is an old compiler.

Well, 2 years old, and that can be a long time when we get a new
standard every second year.

Essentially the maintenance of MinGW g++ has been passed from the
original MinGW project (where I believe you downloaded that compiler) to
the MinGW-64 project.


> And this is the message I get trying to compile your code:
>
>
> main.cpp: In function 'void init_streams()':
> main.cpp:15:32: error: '_O_WTEXT' was not declared in this scope
> _setmode( _fileno( stdin), _O_WTEXT );

By inspection of the headers of that compiler's standard library, in
order to get a definition of `_O_WTEXT` with this compiler you need to
define `__MSVCRT_VERSION__` as equal or greater than `0x0800`.

Also, with `-std=c++11` option you need to explicitly tell it to not
define `__STRICT_ANSI__`, in order to get a definition of `_fileno`.

Which with this compiler's library is defined by the header that I
forgot to include, namely `<stdio.h>`.

It's weird that a compiler whose one and only purpose was to work in
Windows, doesn't. Anyway, the good news is that the newer g++ compilers
don't have these quirks. At least not the ones from MinGW-64.

Be that as it may, the following build command works for me, with g++
5.3.0-3 from the old MinGW project:

---------------------------------------------------------------------
[H:\forums\clc++\unicode in windows console]
> g++ --version
g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


[H:\forums\clc++\unicode in windows console]
> g++ _setmode.cpp -std=c++11 -D __MSVCRT_VERSION__=0x0800 -U__STRICT_ANSI__

[H:\forums\clc++\unicode in windows console]
> a
Every 日本国 кошка likes Norwegian blåbærsyltetøy!

What’s your name? Særskrevne Påske Nøtter
Pleased to meet you, Særskrevne Påske Nøtter!

[H:\forums\clc++\unicode in windows console]
> _
---------------------------------------------------------------------

To avoid having to write all that every time, you can define the parts
that you'd otherwise have to repeat, as an environment variable.

Or, make a script or alias for the g++ invocation.


> main.cpp: In function 'int main()':
> main.cpp:24:21: error: converting to execution character set: Illegal
> byte sequence
> auto const& s = L"Every ??? ????? likes Norwegian blåbærsyltetøy!";
> ^
> main.cpp:28:14: error: converting to execution character set: Illegal
> byte sequence
> wcout << L"What's your name? ";

You just need to save your .cpp file with the encoding that g++ expects.

By default that's UTF-8.

And better make that UTF-8 with BOM, so that Visual C++ will understand
that it's UTF-8 by default.


> I've removed all C++11 flavours and I've added
>
> #include <cstdio>
>
> to turn off a couple of errors about finding stdin and stout.

Sorry about that, I plain forgot to include that header. :(

By the way it should be `<stdio.h>`.

The `<cstdio>` header may not necessarily provide unqualified names,
e.g. with that header one may have to write `std::fopen` instead of just
`fopen` (`stdin` itself is a macro and is never qualified).


> The problem is that my compiler cannot find _O_WTEXT
> and that it doesn't recognize the format L"..." string


alexo

Apr 8, 2017, 10:28:44 AM
I'm sorry to have to agree with you. But I only knew of the main MinGW
project site.

> Well, 2 years old, and that can be a long time when we get a new
> standard every second year.
>
> Essentially the maintenance of MinGW g++ has been passed from the
> original MinGW project (where I believe you downloaded that compiler) to
> the MinGW-64 project.
>

Never heard of mingw-w64


>
> [H:\forums\clc++\unicode in windows console]
>> g++ _setmode.cpp -std=c++11 -D __MSVCRT_VERSION__=0x0800
> -U__STRICT_ANSI__
>

this is the output my actual (MinGW) g++ spits out:

g++: error: __MSVCRT_VERSION__=0x0800: No such file or directory
probably this is due to the compiler's old release number.

> [H:\forums\clc++\unicode in windows console]
>> a
> Every 日本国 кошка likes Norwegian blåbærsyltetøy!
>
> What’s your name? Særskrevne Påske Nøtter
> Pleased to meet you, Særskrevne Påske Nøtter!
>
> [H:\forums\clc++\unicode in windows console]
>> _
> ---------------------------------------------------------------------
>
> To avoid having to write all that every time, you can define the parts
> that you'd otherwise have to repeat, as an environment variable.
>
> Or, make a script or alias for the g++ invocation.

I'm not sure how that is done.
Anyway, the command history of the command prompt helps a lot.

>> main.cpp: In function 'int main()':
>> main.cpp:24:21: error: converting to execution character set: Illegal
>> byte sequence
>> auto const& s = L"Every ??? ????? likes Norwegian blåbærsyltetøy!";
>> ^
>> main.cpp:28:14: error: converting to execution character set: Illegal
>> byte sequence
>> wcout << L"What's your name? ";
>
> You just need to save your .cpp file with the encoding that g++ expects.
>
> By default that's UTF-8.
>
> And better make that UTF-8 with BOM, so that Visual C++ will understand
> that it's UTF-8 by default.

OK, encoded and saved using notepad++ text editor.
I didn't know of this encoding necessity.

Since UTF-8 is backward compatible with ASCII, I'll use the former as the
default encoding format.

An off-topic question: could you briefly tell me what the arrow means
in the following main declaration?


thank you

auto main() -> int
{
...
}




Alf P. Steinbach

Apr 8, 2017, 12:39:25 PM
On 08-Apr-17 4:28 PM, alexo wrote:
> [snip]
> Never heard of mingw-w64

That wasn't quite what I wrote.


>> [H:\forums\clc++\unicode in windows console]
>>> g++ _setmode.cpp -std=c++11 -D __MSVCRT_VERSION__=0x0800
>> -U__STRICT_ANSI__
>>
>
> this is the output my actual (MinGW) g++ spits out:
>
> g++: error: __MSVCRT_VERSION__=0x0800: No such file or directory
> probably this is due to the compiler's old release number.

You'd better copy and paste the command, and maybe edit the file name,
rather than typing it in.

Here it is again:

g++ your_file_name.cpp -std=c++11 -D __MSVCRT_VERSION__=0x0800 -U__STRICT_ANSI__


> [snip]
> An off-topic question: could you brefly tell me what does the arrow mean
> in the following main declaration?
>
> auto main() -> int

It means that `main` returns a function result of type `int`.
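
For readers new to the notation: this is the C++11 trailing return type syntax. The short sketch below (function names invented for illustration) shows that it declares the same kind of function as the classic form, plus one case where, in C++11, the trailing form is genuinely needed.

```cpp
#include <type_traits>

// Classic declaration syntax.
int classic_answer() { return 42; }

// C++11 trailing return type syntax: the same kind of function.
auto trailing_answer() -> int { return 42; }

// Both functions have exactly the same type, namely int().
static_assert(
    std::is_same< decltype( classic_answer ), decltype( trailing_answer ) >::value,
    "both declarations yield the function type int()"
    );

// In C++11 the trailing form is required when the return type depends on
// the parameters, since they are not yet in scope before the function name.
template< class A, class B >
auto sum( A a, B b ) -> decltype( a + b ) { return a + b; }
```

So for `main` the choice between `int main()` and `auto main() -> int` is purely a matter of style.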


Cheers, & hth.,

- Alf


Mr Flibble

Apr 8, 2017, 6:44:19 PM
On 08/04/2017 15:28, alexo wrote:

>
> An off-topic question: could you brefly tell me what does the arrow mean
> in the following main declaration?
>
>
> thank you
>
> auto main() -> int
> {
> ...
> }
>

It means that the person who wrote it has OCD.

/Flibble

Bonita Montero

Apr 9, 2017, 5:00:15 AM
Your formatting-style is disgusting.

alexo

Apr 9, 2017, 6:07:14 AM
On 08/04/2017 18:39, Alf P. Steinbach wrote:
>>> [H:\forums\clc++\unicode in windows console]
>>>> g++ _setmode.cpp -std=c++11 -D __MSVCRT_VERSION__=0x0800
>>> -U__STRICT_ANSI__
>>>
>>
>> this is the output my actual (MinGW) g++ spits out:
>>
>> g++: error: __MSVCRT_VERSION__=0x0800: No such file or directory
>> probably this is due to the compiler's old release number.
> You'd better copy and paste the command, and maybe edit the file name,
> rather than typing it in.
>
> Here it is again:
>
> g++ your_file_name.cpp -std=c++11 -D __MSVCRT_VERSION__=0x0800
> -U__STRICT_ANSI__
>

Now it worked!
Copy and paste worked fine. I suppose I mistyped something...


>> [snip]
>> An off-topic question: could you brefly tell me what does the arrow mean
>> in the following main declaration?
>>
>> auto main() -> int
>
> It means that `main` returns a function result of type `int`.

ok

alexo

Apr 9, 2017, 11:34:55 AM
On 08/04/2017 18:39, Alf P. Steinbach wrote:
> On 08-Apr-17 4:28 PM, alexo wrote:
>> [snip]
>> Never heard of mingw-w64
>
> That wasn't quite what I wrote.

www.mingw-w64.org/doku.php

I've found this project for MinGW in 64 bit flavour.
Is that what you referred to?


Alf P. Steinbach

Apr 9, 2017, 4:53:02 PM
Yep. You just have to be aware that there are lots of different builds,
by different persons. As I recall the Nuwen distro (which is very simple
and small) is built on MinGW-w64 but lacks support for specifying
execution character set, which currently is awkward for Windows
programming. Still it's nice, and it's maintained by STL over at
Microsoft (who is the guy who maintains the STL over at Microsoft, a
sort of extreme name coincidence), but on the Nuwen site he just calls
himself after the hacker prince in Vernor Vinge's novels.

Cheers!,

- Alf

Jorgen Grahn

Apr 12, 2017, 2:04:08 AM
On Fri, 2017-04-07, David Brown wrote:
> On 07/04/17 14:19, Alf P. Steinbach wrote:
...
>> Programs, such as the CPython implementation, and designs, such as the
>> design of the C++ i/o, fail to handle Unicode correctly in Windows.
>> Mainly because they are based on the antiquated and counter-productive
>> since 1981 old Unix unification of files, pipes and interactive i/o as
>> streams of single bytes. Even with UTF-8 and no support for interactive
>> features it's ungood, because UTF-8 error states are usually persistent.

That surprises me, because UTF-8 was designed so that recovering would
be easy. E.g. when you encounter an octet with MSB unset, you know
you've found an undamaged ASCII character.
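
A minimal sketch of that recovery property: after a damaged or truncated sequence, a decoder can skip forward over continuation bytes (bit pattern 10xxxxxx) until it reaches a byte that can start a character. The function name `resync` is hypothetical, and this shows only the skipping step, not a full decoder.

```cpp
#include <cstddef>
#include <string>

// Returns the index of the first byte at or after pos that can start a
// character, i.e. is not a continuation byte (10xxxxxx).
// Returns s.size() if there is no such byte.
inline std::size_t resync( const std::string& s, std::size_t pos )
{
    while( pos < s.size() && (static_cast<unsigned char>( s[pos] ) & 0xC0) == 0x80 )
    {
        ++pos;
    }
    return pos;
}
```

For example, if the lead byte 0xE2 of the euro sign is lost, the remaining bytes 0x82 0xAC are both continuation bytes, so `resync` lands on whatever character follows them.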

> Linux terminals can certainly be screwed up if you try and cat a binary
> file. I don't know if it is only UTF-8 errors, or other problems.

It's very easy to screw up a terminal without involving UTF-8. I doubt
if UTF-8 makes that worse.

> No system is perfect, it seems.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .