
MinGW g++ encoding `u8"literal"` with Windows ANSI Western, not UTF-8


Alf P. Steinbach

Jun 5, 2017, 7:27:00 PM
Argh.

Code:

#include <assert.h>
#include <iostream>
using namespace std;

auto main() -> int
{
    char const* const s = u8"Jæklig råtten translatør!";

    using Byte = unsigned char;
    for( int i = 0; i < 4; ++i ) { cout << +Byte( s[i] ) << ' '; }
    cout << endl;
    assert( s[0] == 'J' );
    assert( Byte( s[1] ) > 127 );
    assert( Byte( s[2] ) != 'k' );
}

Result:

[C:\my\dev\libraries\stdlib\explore_issues]
> where g++
C:\Program Files
(x86)\mingw-w64\i686-6.3.0-posix-dwarf-rt_v5-rev1\mingw32\bin\g++.exe

[C:\my\dev\libraries\stdlib\explore_issues]
> g++ "g++.u8-support.cpp" && a
74 230 107 108
Assertion failed!

Program: C:\my\dev\libraries\stdlib\explore_issues\a.exe
File: g++.u8-support.cpp, Line 14

Expression: Byte( s[2] ) != 'k'

This application has requested the Runtime to terminate it in an
unusual way.
Please contact the application's support team for more information.

[C:\my\dev\libraries\stdlib\explore_issues]
> _

These bytes are Windows ANSI Western (codepage 1252) encoding of the
literal.

Is there any good solution for reliably producing an UTF-8 encoded
narrow literal with g++?


Cheers!,

- Alf

Alf P. Steinbach

Jun 5, 2017, 7:31:47 PM
On 06-Jun-17 1:26 AM, Alf P. Steinbach wrote:
>
> These bytes [for u8"bæsj"] are Windows ANSI Western (codepage 1252) encoding of the
> literal.
>
> Is there any good solution for reliably producing an UTF-8 encoded
> narrow literal with g++?

I forgot to add, I'm so happy to ask this question here in clc++.

Over at Stack Overflow the g++ idiot fanboy group would immediately have
started downvoting the question and inserting noise such as comments
that MinGW g++ isn't representative of the gcc suite, that my compiler (here
much techno-babble about which compiler that is) doesn't do that, etc.

You guys are better. :)


Cheers!,

- Alf

Alf P. Steinbach

Jun 5, 2017, 8:20:53 PM
On 06-Jun-17 1:26 AM, Alf P. Steinbach wrote:
>
> These bytes are Windows ANSI Western (codepage 1252) encoding of the
> literal.

Because g++ doesn't do any validation of bytes in narrow literals, and I
inadvertently saved the source code as Windows ANSI.


> Is there any good solution for reliably producing an UTF-8 encoded
> narrow literal with g++?

No, not as far as I know, since it doesn't validate.

One has to either use the `-finput-charset` option, which AFAIK is
broken (requiring even the standard library headers to have the
specified encoding), or else use UTF-8 encoded source, which of course
is a good idea anyway, but is difficult to guarantee with lots of
editors that still prefer other encodings.


Cheers!,

- Alf

Manfred

Jun 5, 2017, 8:58:46 PM
If it is not a bug in mingw, could it be related with your source editor?
Have you tried with a UTF-8 compliant editor?
Otherwise, the hard way would be to escape the non-ascii chars..

( I couldn't refrain from trying - indeed gcc on Linux works fine:
$ g++ -std=c++14 g++.u8-support.cpp && ./a.out
74 195 166 107
)
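
For illustration, a minimal sketch of that escaping approach (untested on the
MinGW build in question): \u escapes inside a u8 literal are turned into UTF-8
code units no matter how the source file is saved, because only ASCII
characters appear in the source text itself.

#include <assert.h>

int main()
{
    // The same text as the original literal, spelled with universal-character-names:
    char const* const s = u8"J\u00E6klig r\u00E5tten translat\u00F8r!";
    using Byte = unsigned char;
    assert( Byte( s[1] ) == 0xC3 );   // first UTF-8 byte of U+00E6 'æ' (195)
    assert( Byte( s[2] ) == 0xA6 );   // second byte (166), matching the Linux output above
}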

Manfred

Jun 5, 2017, 9:13:58 PM
This is a problem only if the standard headers use non-ascii chars.

>
>
> Cheers!,
>
> - Alf

Christiano

Jun 5, 2017, 9:40:51 PM
1- Verify if your source is encoded with UTF-8 using an Hex editor (example: HxD )
Verify the BOM of the source file [2]

2- Try:

g++ -finput-charset=utf-8 -fexec-charset=utf-8 "g++.u8-support.cpp"

or

g++ -finput-charset=UTF-8 -fexec-charset=UTF-8 "g++.u8-support.cpp"

3- Windows doesn't have full support for UTF-8. The method encouraged by the MSDN blog[1] is:
- Store strings as UTF-8
- When you need to write to the Windows API (Unicode version*), convert UTF-8 --> UTF-16
- When you need to read from the Windows API (Unicode version*), convert UTF-16 --> UTF-8

#include <windows.h>
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void) {
    wchar_t ws[1024];
    char s[1024];
    _setmode(_fileno(stdout), _O_U16TEXT);
    _setmode(_fileno(stdin), _O_U16TEXT);

    wscanf(L"%ls", ws);                                 /* read UTF-16 from the console */
    WideCharToMultiByte(CP_UTF8, 0, ws, -1, s, 1024, NULL, NULL);  /* UTF-16 -> UTF-8 */
    /* ... work with the UTF-8 string s here ... */
    MultiByteToWideChar(CP_UTF8, 0, s, -1, ws, 1024);   /* UTF-8 -> UTF-16 */
    wprintf(L"%ls\n", ws);                              /* write UTF-16 to the console */

    return 0;
}

4- There is an undocumented hack to read UTF-16 (Win API Unicode version) from the console using standard functions.

#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void) {
    _setmode(_fileno(stdout), _O_U16TEXT);
    _setmode(_fileno(stdin), _O_U16TEXT);

    wchar_t s[1024];
    wscanf(L"%ls", s);      /* read a word as UTF-16 */
    wprintf(L"%ls\n", s);   /* echo it back as UTF-16 */
    return 0;
}
____________________________
* The Windows API has two versions: a code page version and a "Unicode" (= UTF-16) version

[1] https://blogs.msdn.microsoft.com/vcblog/2016/02/22/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx

Juha Nieminen

Jun 6, 2017, 2:44:43 AM
Alf P. Steinbach <alf.p.stein...@gmail.com> wrote:
> #include <assert.h>

Really?

> using namespace std;
> for( int i = 0; i < 4; ++i ) { cout << +Byte( s[i] ) << ' '; }
> cout << endl;

You wrote 21 characters in order to save 15 characters later in the code.
You thus saved -6 characters in your code. Your code did not become any
more readable or understandable as a result.

And, keeping with that theme, you love to write

> auto main() -> int

which is 8 characters longer than the traditional form, without introducing
any sort of additional readability or clarity.

Alf P. Steinbach

Jun 6, 2017, 3:08:58 AM
On 06-Jun-17 8:44 AM, Juha Nieminen wrote:
> Alf P. Steinbach <alf.p.stein...@gmail.com> wrote:
>> #include <assert.h>
>
> Really?

It's a standard header that defines a macro called `assert`.

For general information about the concept, see e.g.
<url: https://en.wikipedia.org/wiki/Assertion_(software_development)>.


Cheers & hth.,

- Alf

Öö Tiib

Jun 6, 2017, 4:33:54 AM
On Tuesday, 6 June 2017 04:40:51 UTC+3, Christiano wrote:
>
> 1- Verify if your source is encoded with UTF-8 using an Hex editor (example: HxD )
> Verify the BOM of the source file [2]

Software putting BOM to UTF-8 is doing it wrong.
According to the Unicode standard, the BOM for UTF-8 files is neither
required nor recommended:

| 2.6 Encoding Schemes
|
| ... Use of a BOM is neither required nor recommended for UTF-8, but may be
| encountered in contexts where UTF-8 data is converted from other encoding
| forms that use a BOM or where the BOM is used as a UTF-8 signature. See
| the “Byte Order Mark” subsection in Section 16.8, Specials, for more
| information.

Chris Vine

Jun 6, 2017, 4:58:55 AM
On Tue, 6 Jun 2017 02:20:42 +0200
"Alf P. Steinbach" <alf.p.stein...@gmail.com> wrote:
> On 06-Jun-17 1:26 AM, Alf P. Steinbach wrote:
> >
> > These bytes are Windows ANSI Western (codepage 1252) encoding of
> > the literal.
>
> Because g++ doesn't do any validation of bytes in narrow literals,
> and I inadvertently saved the source code as Windows ANSI.

That shouldn't matter provided you tell the compiler what encoding
the source file is saved in, with the -finput-charset option. According
to https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html , it seems to
me that your code should work if you specify Windows ANSI as the input
character set using that option. If you do that, the u8 specifier for
the string literal should force the compiler to use UTF-8 as the
execution character set for the string (that is, the encoding included
within the binary): if it fails to do so that is a bug in the compiler
it seems to me.

If there is a bug, specifying UTF-8 as the execution character set
with the -fexec-charset compiler flag should do the trick.

I cannot test this as I do not have mingw available.
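
A minimal sketch of that suggestion (hypothetical file name; the exact charset
spelling depends on the iconv library behind the compiler, with CP1252 or
WINDOWS-1252 being common):

// Save this file as Windows-1252 and compile with, for example:
//   g++ -finput-charset=CP1252 -fexec-charset=UTF-8 test.cpp
// If the options behave as documented, the u8 literal should come out as UTF-8 bytes.
#include <assert.h>

int main()
{
    char const* const s = u8"æ";
    using Byte = unsigned char;
    assert( Byte( s[0] ) == 0xC3 && Byte( s[1] ) == 0xA6 );   // UTF-8 for U+00E6
}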

Alf P. Steinbach

Jun 6, 2017, 10:31:00 AM
On 06-Jun-17 10:33 AM, Öö Tiib wrote:
> On Tuesday, 6 June 2017 04:40:51 UTC+3, Christiano wrote:
>>
>> 1- Verify if your source is encoded with UTF-8 using an Hex editor (example: HxD )
>> Verify the BOM of the source file [2]
>
> Software putting BOM to UTF-8 is doing it wrong.
> According to the Unicode standard, the BOM for UTF-8 files is neither
> required nor recommended:

In earlier years (I think pre 2015) you needed a BOM for UTF-8 to
identify the encoding for the Visual C++ compiler, even for a single
user code file translation unit.

Now MSVC can be informed of the encoding of the main file via an option,
but you still need that BOM to identify UTF-8 as such in included
headers, when they can be included from files with other encodings.
Unfortunately g++ doesn't do such encoding detection: AFAIK it's unable
to handle different source encodings in the same translation unit.
Earlier g++ was even unable to handle BOM in an UTF-8 file, which was a
huge problem: g++ couldn't handle it, while MSVC required it…

So, a BOM in UTF-8 files has certain clear advantages, over and above
being the Windows convention, and it has no problems except with some
old *nix tools, which at one time included the g++ compiler. The Unicode
standard's wording is unfortunate because many *nix fanboys read “not
recommended” as “recommended to abstain from”, so that the sorry lack of
support in many *nix tools, at one time, could be argued as being
positive standard-conformance rather than the negative low quality it
was. Microsoft is a founding member of the Unicode consortium and the
UTF-8 BOM convention is crucial for some of their APIs and tools,
including Visual C++, so the “recommended to abstain from”
interpretation is very unlikely to be the single intended meaning. At a
guess the wording was intentionally ambiguous, a political thing.

Ralf Goertz

Jun 6, 2017, 12:09:39 PM
Am Tue, 6 Jun 2017 16:30:47 +0200
schrieb "Alf P. Steinbach" <alf.p.stein...@gmail.com>:

> So, a BOM in UTF-8 files has certain clear advantages, over and above
> being the Windows convention, and it has no problems except with some
> old *nix tools, which at one time included the g++ compiler. The
> Unicode standard's wording is unfortunate because many *nix fanboys
> read “not recommended” as “recommended to abstain from”, so that the
> sorry lack of support in many *nix tools, at one time, could be argued
> as being positive standard-conformance rather than the negative low
> quality it was.

AFAIK the BOM in *nix land is frowned upon because it interferes with
file concatenation. Having two files a.h and a.cc with BOM one might
want to put the content of a.cc into a.h. Using »cat a.cc >> a.h« would
create problems because BOMs may only occur at the beginning of a file,
right?

At least that's what I have been using as an excuse to not use BOMs for
years despite the fact that in principle they seem to be a good idea.
:-)

Alf P. Steinbach

Jun 6, 2017, 12:30:24 PM
On 06-Jun-17 6:09 PM, Ralf Goertz wrote:
> Am Tue, 6 Jun 2017 16:30:47 +0200
> schrieb "Alf P. Steinbach" <alf.p.stein...@gmail.com>:
>
>> So, a BOM in UTF-8 files has certain clear advantages, over and above
>> being the Windows convention, and it has no problems except with some
>> old *nix tools, which at one time included the g++ compiler. The
>> Unicode standard's wording is unfortunate because many *nix fanboys
>> read “not recommended” as “recommended to abstain from”, so that the
>> sorry lack of support in many *nix tools, at one time, could be argued
>> as being positive standard-conformance rather than the negative low
>> quality it was.
>
> AFAIK the BOM in *nix land is frowned upon because it interferes with
> file concatenation. Having two files a.h and a.cc with BOM one might
> want to put the content of a.cc into a.h. Using »cat a.cc >> a.h« would
> create problems because BOMs may only occur at the beginning of a file,
> right?

The encoding of a BOM can occur anywhere. It is a zero-width
non-breaking invisible space. When you put a zero-width non-breaking
space at the start of an UTF-8 encoded sequence, then it indicates the
encoding, and when you put it at the start of an UTF-16 encoded sequence
then it additionally indicates the byte order, hence the acronym.
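
For what it's worth, a minimal sketch of detecting and skipping that marker in
its UTF-8 form (the bytes EF BB BF) at the start of a stream:

#include <istream>

bool skip_utf8_bom( std::istream& in )
{
    char bom[3] = {};
    in.read( bom, 3 );
    if( in.gcount() == 3 && bom[0] == '\xEF' && bom[1] == '\xBB' && bom[2] == '\xBF' )
    {
        return true;        // BOM found and consumed
    }
    in.clear();             // fewer than three bytes, or not a BOM:
    in.seekg( 0 );          // rewind so the caller sees the stream unchanged
    return false;
}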


> At least that's what I have been using as an excuse to not use BOMs for
> years despite the fact that in principle they seem to be a good idea.
> :-)
>

Cheers!,

- Alf

Scott Lurndal

Jun 6, 2017, 12:42:43 PM
Ralf Goertz <m...@myprovider.invalid> writes:
>Am Tue, 6 Jun 2017 16:30:47 +0200
>schrieb "Alf P. Steinbach" <alf.p.stein...@gmail.com>:
>
>> So, a BOM in UTF-8 files has certain clear advantages, over and above
>> being the Windows convention, and it has no problems except with some
>> old *nix tools, which at one time included the g++ compiler. The
>> Unicode standard's wording is unfortunate because many *nix fanboys
>> read “not recommended” as “recommended to abstain from”, so that the
>> sorry lack of support in many *nix tools, at one time, could be argued
>> as being positive standard-conformance rather than the negative low
>> quality it was.
>
>AFAIK the BOM in *nix land is frowned upon because it interferes with

The Byte-Order-Marker is frowned upon in *nix land because it is
completely unnecessary for UTF-8.

Öö Tiib

Jun 6, 2017, 12:59:33 PM
On Tuesday, 6 June 2017 17:31:00 UTC+3, Alf P. Steinbach wrote:
> On 06-Jun-17 10:33 AM, Öö Tiib wrote:
> > On Tuesday, 6 June 2017 04:40:51 UTC+3, Christiano wrote:
> >>
> >> 1- Verify if your source is encoded with UTF-8 using an Hex editor (example: HxD )
> >> Verify the BOM of the source file [2]
> >
> > Software putting BOM to UTF-8 is doing it wrong.
> > According to the Unicode standard, the BOM for UTF-8 files is neither
> > required nor recommended:
>
> In earlier years (I think pre 2015) you needed a BOM for UTF-8 to
> identify the encoding for the Visual C++ compiler, even for a single
> user code file translation unit.

Microsoft is fully capable of dropping their megalomania and adapting to
reality. In my experience far more capable than Apple, for example.
Yes, it has been a source of pain that the big guys can insist on doing it
wrong.

>
> Now MSVC can be informed of the encoding of the main file via an option,
> but you still need that BOM to identify UTF-8 as such in included
> headers, when they can be included from files with other encodings.

The second source of nonsense has always been the people who use the full
rainbow of possibilities together. Why use all available encodings in a single
project? Maybe also change the encoding of each file now and then?
Such people IMHO deserve that their tools, starting with the repo, slow down
and sometimes hiccup on that.

> Unfortunately g++ doesn't do such encoding detection: AFAIK it's unable
> to handle different source encodings in the same translation unit.
> Earlier g++ was even unable to handle BOM in an UTF-8 file, which was a
> huge problem: g++ couldn't handle it, while MSVC required it…

That is the third source of pain ... the people who have an ultra-narrow mind
and speak only ASCII.

>
> So, a BOM in UTF-8 files has certain clear advantages, over and above
> being the Windows convention, and it has no problems except with some
> old *nix tools, which at one time included the g++ compiler. The Unicode
> standard's wording is unfortunate because many *nix fanboys read “not
> recommended” as “recommended to abstain from”, so that the sorry lack of
> support in many *nix tools, at one time, could be argued as being
> positive standard-conformance rather than the negative low quality it
> was.

I did not mean that software handling UTF-8 should be incapable of
handling a BOM. I meant that software accepting text should support
UTF-8 without a BOM. UTF-8 is right now about 90% of internet text
content, so people won't buy excuses that something shows garbage
because there was no BOM. A tool that rejects everything but
UTF-8 would likely be more acceptable.

> Microsoft is a founding member of the Unicode consortium and the
> UTF-8 BOM convention is crucial for some of their APIs and tools,
> including Visual C++, so the “recommended to abstain from”
> interpretation is very unlikely to be the single intended meaning. At a
> guess the wording was intentionally ambiguous, a political thing.

Microsoft is actually most capable of adapting.

Ralf Goertz

Jun 7, 2017, 4:20:12 AM
Am Tue, 6 Jun 2017 18:30:13 +0200
schrieb "Alf P. Steinbach" <alf.p.stein...@gmail.com>:

> On 06-Jun-17 6:09 PM, Ralf Goertz wrote:
> >
> > AFAIK the BOM in *nix land is frowned upon because it interferes
> > with file concatenation. Having two files a.h and a.cc with BOM one
> > might want to put the content of a.cc into a.h. Using »cat a.cc >>
> > a.h« would create problems because BOMs may only occur at the
> > beginning of a file, right?
>
> The encoding of a BOM can occur anywhere. It is a zero-width
> non-breaking invisible space. When you put a zero-width non-breaking
> space at the start of an UTF-8 encoded sequence, then it indicates
> the encoding, and when you put it at the start of an UTF-16 encoded
> sequence then it additionally indicates the byte order, hence the
> acronym.

But that's even worse in my opinion. Who in his right mind would want to
mix encodings in a single file? And if encodings are not mixed then we
don't need BOMs in the middle of a file. It might be okay to have to
check for a BOM at the beginning, but when we read a line from the middle
we should not have to check every wchar_t to see whether it is a byte
order mark.

Consider a UTF-8 encoded file with a BOM, reading:

is ø the empty set symbol?
is ∅ the empty set symbol?

and this small program

#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main() {
    ios::sync_with_stdio(false);
    wstring s;
    locale loc("C.UTF-8");
    wcin.imbue(loc);
    while (getline(wcin,s)) {
        wcout<<s.size()<<endl;
    }
    return 0;
}


Due to the BOM we get this output:

27
26

although both lines are of the same length when read as wide strings.
The BOM at the beginning can easily be dealt with because it can/should
be expected. But to have to check every wide character just because a
BOM is allowed everywhere? It would be a nightmare.

Interestingly, the option to switch on the use of BOMs in vim is

:set bomb

That says it all. ;-)

Juha Nieminen

Jun 12, 2017, 3:55:40 AM
Alf P. Steinbach <alf.p.stein...@gmail.com> wrote:
> On 06-Jun-17 8:44 AM, Juha Nieminen wrote:
>> Alf P. Steinbach <alf.p.stein...@gmail.com> wrote:
>>> #include <assert.h>
>>
>> Really?
>
> It's a standard header that defines a macro called `assert`.

No, the standard header you are referring to is named cassert.

http://en.cppreference.com/w/cpp/header/cassert

Alf P. Steinbach

Jun 12, 2017, 4:08:29 AM
On 12-Jun-17 9:55 AM, Juha Nieminen wrote:
> Alf P. Steinbach <alf.p.stein...@gmail.com> wrote:
>> On 06-Jun-17 8:44 AM, Juha Nieminen wrote:
>>> Alf P. Steinbach <alf.p.stein...@gmail.com> wrote:
>>>> #include <assert.h>
>>>
>>> Really?
>>
>> It's a standard header that defines a macro called `assert`.
>
> No, the standard header you are referring to is named cassert.

You're wrong.

I suspect you don't understand how this works, so, short summary:

1. The C library header <xxx.h> defines some stuff S.

2. The C++ library header <cxxx> provides that stuff S in namespace std,
and possibly but not guaranteed in the global namespace. Some small
adjustments are sometimes made, some overloads are sometimes added.

3. The C++ library header <xxx.h> provides the stuff S from <cxxx>, in
the global namespace, and possibly but not guaranteed in namespace std.

As you can see there are logically three distinct headers, involving two
different languages. The two headers that appear to have the same name,
in points 1 and 3, belong to two different languages and are possibly
distinct files with different content.

I SUSPECT, from the otherwise nonsense comments, that you thought I was
referring to 1, when, if you had been fully familiar with this, you would
have understood 3, since that's the C++ header.

Note also that namespaces don't matter for a macro: macros don't respect
scopes.
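
A minimal illustration of that point:

#include <cassert>   // or, for this purpose equivalently: #include <assert.h>

int main()
{
    assert( 2 + 2 == 4 );   // assert is a macro, so there is no std::assert to qualify
}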

As I recall C++17 will foil this scheme a little by breaking the
symmetry of 2 and 3 for some new stuff, just to be consistent with its
plethora of other special you'd-never-guess-it! cases.


> http://en.cppreference.com/w/cpp/header/cassert

That presumably & hopefully backs up what I said, not what you said.

Juha Nieminen

Jun 12, 2017, 8:57:39 AM
Alf P. Steinbach <alf.p.stein...@gmail.com> wrote:
> I SUSPECT, from the otherwise nonsense comments, that you thought I was
> referring to 1, when, if you had been fully familiar this, you would
> have understood 3, since that's the C++ header.

Do you know where you can shove your arrogant attitude?

Alf P. Steinbach

Jun 12, 2017, 6:14:02 PM
No, it's a general problem. Consider UTF-16. ASCII text interpreted as
UTF-16 = a lot of gobbledygook, and possibly even invalid sequences.


Cheers!, & thanks for your suggestions else-thread,

- Alf

Manfred

Jun 13, 2017, 8:29:21 AM
You are right, that would be a bad combination. That said, I wouldn't
say -finput-charset is broken per se; one could imagine the option
handling <> and "" included headers differently, or depending on source
tree location, but even then it could not be 100% safe.

If one really wants to use different encodings for sources, I think they
should be kept disjoint, in separate compilation units.
In fact, if you need a non-US-ASCII source this is typically (always?)
due to localized strings, and good practice would be to define those in
dedicated sources, which could need no standard headers at all (e.g. by
defining them as plain const char*).
Moreover, this would practically be needed in the case of multiple
translations (it would end up as something similar to Windows
resource files)
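
A sketch of that practice, with hypothetical names: a dedicated source file per
language, saved as UTF-8, needing no standard headers at all.

// strings_no.cpp -- hypothetical translation unit holding the Norwegian strings;
// "extern" gives these const objects external linkage so other files can use them
// (the matching extern declarations would live in a shared header).
extern char const* const msg_greeting = u8"Hei på deg!";
extern char const* const msg_farewell = u8"Ha det bra!";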

Alf P. Steinbach

Jun 13, 2017, 8:48:22 AM
Consider that many people prefer to use national characters in identifiers.

C++ formally supports the common set of identifier characters in
Unicode, it's rather large. I don't personally do that, because as I see
it English is the /lingua franca/ (hah!) of programming, and I think
source code should be accessible regardless of one's nationality, and
unlike Visual C++, g++ doesn't support more than ASCII. *But* I remember
one French guy in this group who argued for the national language
identifiers on the grounds that their programmers felt it was more easy.

As it happens Visual C++ has no problem with mixed encodings in a single
translation unit, proving that there is no inherent technical
show-stopper problem – that's why I felt safe characterizing the g++
scheme as broken.


Cheers!,

- Alf

David Brown

Jun 13, 2017, 9:12:52 AM
Non-ASCII identifiers /could/ be a serious problem - their usage would
be a disaster for interoperability. Most people with English language
keyboards have enough trouble with English words like naïve and café,
because most of them use Windows and have a UK or US keyboard layout
without accents, dead keys, or a *nix compose key. If these turned up
in identifiers in someone else's code, they would be lost.

But would it be any worse than people writing identifiers in their own
language, just using ASCII-only identifiers? Is it worse for English
speakers to deal with:

enum kompassretninger { nor, øst, sør, vest };

or

enum kompassretninger { nor, ost, sor, vest };

?

I am curious if any studies have been done - perhaps with languages like
Python that have had support for non-ASCII identifiers for a long time.

Some kinds of additional letters would be nice, even when sticking to
English, such as π, µs, or kΩ - but they might be hard to read, and for
many people they would be hard to type. And without UTF-8 symbols and
flexible operators, we can't write things like

y = a₂·x² + a₁·x + a₀

or

if (A ⊆ ℝ) ...

(I admit I had to resort to a character map accessory to type that last
one...)


>
> As it happens Visual C++ has no problem with mixed encodings in a single
> translation unit, proving that there is no inherent technical
> show-stopper problem – that's why I felt safe characterizing the g++
> scheme as broken.
>

Yes, the gcc extended identifier support is currently incomplete (it's
complete in theory - you can write extended identifiers with UTF-8. But
it's broken in practice, because you have to write them by giving the
code points!).


Manfred

Jun 13, 2017, 9:24:54 AM
On 6/13/2017 2:48 PM, Alf P. Steinbach wrote:
> On 13-Jun-17 2:29 PM, Manfred wrote:
>>
>> If one really wants to use different encodings for sources, I think
>> they should be disjoint in different compilation units.
>> In fact if you need a non US-ascii source this is typically (always?)
>> due to localized strings, and those would be good practice to be
>> defined in dedicated sources, that could need no standard headers at
>> all (e.g. by defining them as plain const char*).
>> Moreover, this would be practically needed in case of multiple
>> translations (this would end up into something similar to Windows
>> resource files)
>
> Consider that many people prefer to use national characters in identifiers.
>
> C++ formally supports the common set of identifier characters in
> Unicode, it's rather large. I don't personally do that, because as I see
> it English is the /lingua franca/ (hah!) of programming, and I think
> source code should be accessible regardless of one's nationality, and
> unlike Visual C++, g++ doesn't support more than ASCII.
Correction: gcc (including g++) uses UTF-8 as default encoding...

> *But* I remember
> one French guy in this group who argued for the national language
> identifiers on the grounds that their programmers felt it was more easy.
...so French identifiers would be fine too, as long as they are valid
according to /language/ rules.
The problem is when you mix different encodings that are not compatible
with each other, not about ascii-only.

>
> As it happens Visual C++ has no problem with mixed encodings in a single
> translation unit, proving that there is no inherent technical
> show-stopper problem – that's why I felt safe characterizing the g++
> scheme as broken.
One difference is that Visual C++ is an IDE, which includes the editor.
gcc is merely a compiler, which means you have to use something else as
an editor, and this opens for trouble.
Anyway, msvc++ may be better suited for this task, but IMVHO I think
mixing encodings is not a very great idea. Besides, I /think/ (*) MSVC++
uses BOMs, which I /personally/ dislike, although I have seen others do
like them.

(* actually I have seen MSVC++ adding a BOM to UTF-8 XML, where I would
not want it - and IIRC it would be deprecated by the IETF too)

>
>
> Cheers!,
>
> - Alf

Manfred

Jun 13, 2017, 10:24:57 AM
On 6/13/2017 3:24 PM, Manfred wrote:
> On 6/13/2017 2:48 PM, Alf P. Steinbach wrote:
>> On 13-Jun-17 2:29 PM, Manfred wrote:

> Correction: gcc (including g++) uses UTF-8 as default encoding...
>
>> *But* I remember
>> one French guy in this group who argued for the national language
>> identifiers on the grounds that their programmers felt it was more easy.
> ...so French identifiers would be fine too as long as the they are valid
> according to /language/ rules.
I was wrong here: indeed gcc only allows for ascii identifiers (other
characters must be '\u' escaped in identifiers, as Bavid Brown correctly
pointed out)

David Brown

Jun 13, 2017, 2:34:58 PM
No, you were more correct than you thought. gcc can use a variety of
character sets for the source character set and the execution character
set. The default input character set is taken from the host's locale, or
UTF-8 if it cannot be determined (on Linux, UTF-8 is the norm), or it
can be overridden on the command line. The execution character set is
UTF-8 by default, but can be overridden on the command line.

However, gcc requires the character set for /identifiers/ to be ASCII -
so if you enable "extended identifiers", you have \uNNNN or \UNNNNNNNN
formats to make the UTF characters in the identifiers. Basically, that
means you need an extra layer of pre-processor (or a smart editor) to
use UTF characters in identifiers.
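
For example, something like the following is what that amounts to in practice
(a sketch; the identifier is meant to be "blåbær", and a raw UTF-8 spelling of
it in the source was not accepted at the time):

int bl\u00E5b\u00E6r = 7;   // the identifier "blåbær", written as code points

int main()
{
    return bl\u00E5b\u00E6r;
}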

But you can happily use UTF-8 characters in strings, character
constants, and comments.

mvh.,

David
(or Bavid, if you really insist)

Manfred

Jun 13, 2017, 3:37:51 PM
Yes, I was referring to identifiers in my last message - indeed an extra
layer or a smart editor could do.