
How to write wide char string literals?


Juha Nieminen

Jun 30, 2021, 3:57:30 AM
Character encoding was a problem in the 1960's, and it's still a problem
today, no matter how much computers advance. Sheesh.

Problem is, how to reliably write wide char string literals that contain
non-ascii characters?

Suppose you write for example this:

const wchar_t* str = L"???";

In the *source code* that string literal may be eg. UTF-8 encoded. However,
the compiler needs to convert it to wide chars.

Problem is, how does the compiler know which encoding is being used in
that 8-bit string literal in the source code, in order for it to convert
it properly to wide chars?

Some compilers may assume it's UTF-8 encoded source code. Others may
assume it's ISO-Latin-1 encoded (I'm looking at you, Visual Studio).
Obviously the end result will be garbage if the wrong assumption is made.

In most compilers (such as Visual Studio) you can specify which encoding
to assume for source files, but this has to be done at the project
settings level. I don't think there's any way to specify the encoding
in the source code itself.

What does the C++ standard say? Does it say that source code files are
always UTF-8 encoded, or is it up to the implementation? I assume that if
it's the latter, the standard doesn't provide any mechanism to specify
which encoding is being used. Or does it?
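To make the failure mode concrete, here is a small sketch (hypothetical, not
taken from any particular compiler): if a source file saved as UTF-8 contains
the character 'ø' (bytes 0xC3 0xB8) inside a wide literal, and the compiler
assumes Latin-1, it reads those two bytes as the two characters 'Ã' and '¸'
and converts *those* to wide chars:

    // What the author meant:
    const wchar_t* intended = L"\u00F8";        // single code point U+00F8, 'ø'

    // What a compiler assuming Latin-1 effectively produces from the
    // UTF-8 bytes 0xC3 0xB8 in the source file:
    const wchar_t* produced = L"\u00C3\u00B8";  // two code points, "Ã¸"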

Kli-Kla-Klawitter

Jun 30, 2021, 4:05:57 AM
Use UTF-16 source files.

Ralf Goertz

Jun 30, 2021, 4:08:54 AM
On Wed, 30 Jun 2021 07:57:14 +0000 (UTC),
Juha Nieminen <nos...@thanks.invalid> wrote:

> In most compilers (such as Visual Studio) you can specify which
> encoding to assume for source files, but this has to be done at the
> project settings level. I don't think there's any way to specify the
> encoding in the source code itself.

When using a UTF encoding there is always a BOM you could use. That
doesn't help much with ISO encodings, though. And I also just found out
that gcc doesn't notice that a source file with an appropriate byte order
mark is encoded in UTF-32 BE. That's a bit disappointing.

Ralf Goertz

Jun 30, 2021, 4:31:03 AM
On Wed, 30 Jun 2021 10:05:40 +0200,
Kli-Kla-Klawitter <kliklakl...@gmail.com> wrote:
That doesn't help with gcc. Even if you specify the encoding on the
command line with -finput-charset=utf16be you run into trouble, since
gcc then assumes that the include files (even those included implicitly)
are utf16be as well.

MrSpu...@ywn9entw2s.org

Jun 30, 2021, 4:39:34 AM
Why should it care? To C and C++, strings are just a sequence of bytes; the
encoding is irrelevant unless you're using functions specific to a particular
encoding, e.g. utf8_strlen() or similar.

Alf P. Steinbach

Jun 30, 2021, 4:55:48 AM
On 30 Jun 2021 09:57, Juha Nieminen wrote:
> Character encoding was a problem in the 1960's, and it's still a problem
> today, no matter how much computers advance. Sheesh.
>
> Problem is, how to reliably write wide char string literals that contain
> non-ascii characters?
>
> Suppose you write for example this:
>
> const wchar_t* str = L"???";
>
> In the *source code* that string literal may be eg. UTF-8 encoded. However,
> the compiler needs to convert it to wide chars.
>
> Problem is, how does the compiler know which encoding is being used in
> that 8-bit string literal in the source code, in order for it to convert
> it properly to wide chars?

The compiler necessarily assumes some source encoding.

g++ and Visual C++ use different schemes for determining the source code
encoding assumption.

g++ uses a single encoding assumption that you can change via options,
while Visual C++ by default determines the encoding for each individual
file, which is a much more flexible scheme. However, in modern
programming work you don't want to rely on that flexible Visual C++
scheme, because the base assumption, when no other indication is present,
is that a file is Windows ANSI encoded, while nowadays it's most likely
UTF-8 encoded. So it's now a good idea to use the Visual C++ `/utf-8`
option, plus some others, e.g.

/nologo /utf-8 /EHsc /GR /permissive- /FI"iso646.h" /std:c++17
/Zc:__cplusplus /Zc:externC- /W4 /wd4459 /D _CRT_SECURE_NO_WARNINGS=1 /D
_STL_SECURE_NO_WARNINGS=1


> Some compilers may assume it's UTF-8 encoded source code. Others may
> assume it's ISO-Latin-1 encoded (I'm looking at you, Visual Studio).
> Obviously the end result will be garbage if the wrong assumption is made.

Yes. You can to some extent prevent Visual C++ mis-interpretation by
using the UTF-8 BOM as an encoding indicator, and I recommend that.
However, there are costs, in particular that mindless Linux fanbois (all
fanbois are mindless, even C++ fanbois) hung up on supporting archaic
Linux tools that can't handle the BOM, can then brand you as this and
that; and that's not hypothetical, it's direct experience. Also, even
though using a BOM is a very strong convention in Windows the Cmd `type`
command can't handle it, so that one is nudged in the direction of
Powershell, which is a monstrosity that I really hate.


> In most compilers (such as Visual Studio) you can specify which encoding
> to assume for source files, but this has to be done at the project
> settings level.

Uhm, no, you can specify compiler options per file if you want, in each
file's properties.

Visual Studio 2019 screenshot: (https://ibb.co/tJ5jNJC)


> I don't think there's any way to specify the encoding
> in the source code itself.

Not in standard C++. For Visual C++ there is an undocumented (or used to
be undocumented) `#pragma` used e.g. in automatically generated resource
scripts, .rc files. I don't recall the name. Also, there is the UTF-8
BOM. A UTF-8 BOM is a pretty surefire way to force the UTF-8 assumption.


> What does the C++ standard say? Does it say that source code files are
> always UTF-8 encoded, or is it up to the implementation?

It's totally up to the implementation.

That wouldn't be so bad if the standard had addressed the issue of a
collection of source files, in particular headers, with different
encodings, e.g. if the standard had /required/ all source files in a
translation unit to have the same encoding.

That's the assumption of g++, but not of Visual C++.


> I assume that if
> it's the latter, the standard doesn't provide any mechanism to specify
> which encoding is being used. Or does it?

Right. It's a mess. :-o :-)

But, practical solutions:

• Use UTF-8 BOM and Just Ignore™ whining from Linux fanbois.
• For good measure also use `/utf-8` option with Visual C++.
• Where it matters you can /statically assert/ UTF-8 encoding.

A `static_assert` depends both on the compiler's source file encoding
assumption being correct, whatever it is, and on the basic execution
character set (the encoding of literals in the executable) being UTF-8.
These are separate encoding choices and can be specified separately with
both g++ and Visual C++. But assuming they both hold,

    constexpr inline auto utf8_is_the_execution_character_set()
        -> bool
    {
        constexpr auto& slashed_o = "ø";
        return (sizeof( slashed_o ) == 3 and slashed_o[0] == '\xC3' and
                slashed_o[1] == '\xB8');
    }

When a `static_assert(utf8_is_the_execution_character_set())` holds you
can be pretty sure that the source encoding assumption is correct.
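For instance, a hypothetical usage at namespace scope (assuming the function
above and a build where string literals are meant to be UTF-8 encoded):

    static_assert( utf8_is_the_execution_character_set(),
        "Compile with UTF-8 literals, e.g. /utf-8 (Visual C++)"
        " or -fexec-charset=utf-8 (g++)." );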

- Alf

David Brown

Jun 30, 2021, 5:23:40 AM
On 30/06/2021 10:55, Alf P. Steinbach wrote:

> Right. It's a mess. :-o :-)
>
> But, practical solutions:
>
> • Use UTF-8 BOM and Just Ignore™ whining from Linux fanbois.

Should you also ignore the recommendations from Unicode people? I say
you should ignore Windows Notepad fanbois, and drop the BOM.

Some programs can't handle a UTF-8 BOM. Some programs can't handle (or
at least, can't automatically recognise) UTF-8 encoding without a BOM.
Some programs add a UTF-8 BOM automatically, some remove it
automatically, some don't care whether it is there or not.

Like it or not, your success with or without a UTF-8 BOM is going to
depend on the programs you use.

If you use a lot of programs that can't work properly without it (such
as Windows Notepad), use a BOM. If you are able to live without it,
perhaps by telling your editor to assume UTF-8 or adding a compiler
switch to your build system, do so.

When you have the option, /always/ choose UTF-8 encoding /without/ a
BOM. It is far and away the most popular format for text, and it is the
format you are already using. It is the format used by all the include
files you have for all your libraries (including the standard library)
on your system, and on every other system. That is because plain ASCII
is also in UTF-8 with no BOM. It is the only unicode encoding that is
fully compatible with the files you have - it is therefore your only
option if your preference is to have a single encoding.


The sooner encodings other than BOM-less UTF-8 die out, the better.
That is the only way out of the mess.


> • For good measure also use `/utf-8` option with Visual C++.
> • Where it matters you can /statically assert/ UTF-8 encoding.
> > A `static_assert` depends on both that the the compiler's source file
> encoding assumption is correct, whatever it is, and that the basic
> execution character set (encoding of literals in the executable) is
> UTF-8. These are separate encoding choices and can be specified
> separately both with g++ and Visual C++. But assuming they both hold,
>
>     constexpr inline auto utf8_is_the_execution_character_set()
>         -> bool
>     {
>         constexpr auto& slashed_o = "ø";
>         return (sizeof( slashed_o ) == 3 and slashed_o[0] == '\xC3' and
> slashed_o[1] == '\xB8');
>     }
>
> When a `static_assert(utf8_is_the_execution_character_set())` holds you
> can be pretty sure that the source encoding assumption is correct.
>

Static assertions are always a good idea. So are pragmas forcing
options, for compilers that support that.


Kli-Kla-Klawitter

Jun 30, 2021, 6:38:11 AM
UTF-16 files have a byte-order mark at the start which helps the compiler
distinguish ASCII files from UTF-16 files.

Richard Damon

Jun 30, 2021, 6:51:53 AM
You have this backwards: by the Standard, you don't tell the
implementation what encoding the source files are in; the implementation
tells you what encoding it specifies that you should use.

The implementation is allowed to give you a way to tell it what to tell
you, but this is all implementation details.

There is a fundamental issue with trying to define an in-source way to
specify this, as we can't even assume that ASCII is part of the encoding,
since it could be EBCDIC.

Yes, if we got to throw out everything and start fresh, we might do
things differently.

As to the question of how to put the characters in a string, that is
what escape codes like \u and \U are for.
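For example (a minimal sketch; the particular characters in the comments are
just illustrations, not anything from the original post):

    const wchar_t*  w   = L"Copyright \u00A9 2021";   // U+00A9, the copyright sign
    const char16_t* u16 = u"\u00E9";                  // U+00E9, 'é', one UTF-16 unit
    const char32_t* u32 = U"\U0001F600";              // U+1F600, outside the BMP, needs \U

\u takes four hex digits and \U takes eight, and both name the code point
itself rather than any particular byte encoding of it.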

Juha Nieminen

Jun 30, 2021, 6:53:06 AM
MrSpu...@ywn9entw2s.org wrote:
> Why should it care? To C and C++ strings are just a sequence of bytes, the
> encoding is irrelevant unless you're using functions specific to a particular
> encoding, eg: utf8_strlen() or similar.

That would be correct if this were a char string literal. But it's not.
It's a wide char string literal. L"something".

This means that in the source file the stuff between the quotes is,
for example, UTF-8 encoded, but the compiler needs to produce a wide
char string into the compiled binary, so the compiler needs to perform
at compile time a string encoding conversion from 8-bit UTF-8 to
whatever a wchar_t* may be (most usually either UTF-16 or UTF-32).

Paavo Helde

Jun 30, 2021, 7:03:58 AM
On 30.06.2021 10:57, Juha Nieminen wrote:
> Character encoding was a problem in the 1960's, and it's still a problem
> today, no matter how much computers advance. Sheesh.
>
> Problem is, how to reliably write wide char string literals that contain
> non-ascii characters?
>
> Suppose you write for example this:
>
> const wchar_t* str = L"???";
>

I want my code to work anywhere with any compiler/framework conventions
and settings, and all my internal strings are in UTF-8 anyway, so I can
use strict ASCII source files with hardcoded UTF-8 characters, e.g.:

std::string s = "Copyright \xC2\xA9 2001-2020";

One can find the UTF-8 codes for such symbols quite easily from pages
like https://www.fileformat.info/info/unicode/char/a9/index.htm

For converting my strings to wide strings for Windows SDK functions, I
have small utility functions like Utf2Win():

::MessageBoxW(nullptr, Utf2Win(s).c_str(), L"About", MB_OK);

This setup means I do not have to worry about source code codepage
conventions *at all*. Fortunately I do not have many such texts. YMMV.
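For reference, a minimal sketch of what such a Utf2Win()-style helper could
look like on Windows (the actual helper isn't shown in this thread, so the
details below are just one plausible implementation built on the Win32
MultiByteToWideChar() call):

    #include <Windows.h>
    #include <string>

    std::wstring Utf2Win(const std::string& utf8)
    {
        if (utf8.empty()) { return std::wstring(); }
        // First call asks for the required length, second call converts.
        const int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                              static_cast<int>(utf8.size()), nullptr, 0);
        std::wstring wide(len, L'\0');
        ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                              static_cast<int>(utf8.size()), &wide[0], len);
        return wide;
    }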





Juha Nieminen

Jun 30, 2021, 7:10:02 AM
Richard Damon <Ric...@damon-family.org> wrote:
> As to the question of how to put the characters in a string, that is
> what escape codes like \u and \U are for.

While perhaps not ideal in terms of code readability (or writability, as
one needs to look up the unicode code points of each non-ascii character),
I suppose this is the best solution for portable code.

Juha Nieminen

Jun 30, 2021, 7:14:08 AM
Paavo Helde <myfir...@osa.pri.ee> wrote:
> I want my code to work anywhere with any compiler/framework conventions
> and settings, and all my internal strings are in UTF-8 anyway, so I can
> use strict ASCII source files with hardcoded UTF-8 characters, e.g.:
>
> std::string s = "Copyright \xC2\xA9 2001-2020";

Does that work for wide string literals? Because I don't think it does.
In other words:

std::wstring s = L"Copyright \xC2\xA9 2001-2020";

However, as suggested in another reply, using "\uXXXX" instead ought
to work just fine (regardless of whether it's a narrow or wide char
literal). As long as you don't need the readability, of course.

MrSpud...@0c9tv3nddl090w2ynhm.gov

Jun 30, 2021, 7:29:11 AM
I suspect that's something only Windows programmers have to worry about. UTF-8
has been the de facto standard on *nix for years.

Richard Damon

Jun 30, 2021, 7:36:15 AM
And, as is somewhat common in the standard, this sort of thing was designed
so that, if you wanted to, you could have written the file in some encoding
that the compiler doesn't know, and then run it through a pre-processor
that performs this transform to the 'ugly' codes. Thus the original
source file (the one to edit to make changes) would be very
readable, and the ugliness is hidden in a temporary intermediary file.

For example, the file might be a .cpp16 file that is UTF-16 encoded.
The makefile would have a recipe for converting .cpp16 to .cpp, switching
to the compiler's 'natural' character set.

Richard Damon

Jun 30, 2021, 7:39:50 AM
\x works in wide string literals too, and puts in a character with that
value. The difference is that if the wide string type isn't Unicode
encoded then it might get the wrong character in the string.

Alf P. Steinbach

Jun 30, 2021, 7:50:47 AM
It gets the wrong characters in the wide string literal, period.

- Alf

David Brown

Jun 30, 2021, 8:02:16 AM
UTF-8 has been the standard for most purposes for many years (with
UTF-32 used internally sometimes).

However, there are a few very important exceptions that use UTF-16,
because they started using Unicode in the early days when it looked like
UTF-16 (or in fact just UCS-2) would be sufficient. That includes
Windows, Java, Qt and JavaScript. Moving these to UTF-8 takes time.


Manfred

Jun 30, 2021, 10:54:52 AM
Yes, it is mandated by the standard. Ref. "universal-character-name"

> As long as you don't need the readability, of course.

But you have no readability with "\xHH" either, do you?

Paavo Helde

Jun 30, 2021, 10:56:54 AM
Sure, you can use \x also in wide strings. However, as a wide string is
not an 8-bit encoding, one should not use UTF-8 encoding, but either
UTF-16 or UTF-32, depending on sizeof(wchar_t).

A working example for Windows/MSVC++ with UTF-16 and sizeof(wchar_t)==2:


#include <Windows.h>
#include <string>

int main() {

    std::wstring s =
        L"Copyright \xA9 2001-2020\r\n"
        L"The smallest infinite cardinal number is \x2080\x5d0\r\n"
        L"A single hieroglyph not fitting in 16 bits: \xD840\xDC0F";

    ::MessageBoxW(nullptr, s.c_str(), L"Test", MB_OK);
}

This would not be portable to platforms where sizeof(wchar_t)==4. As far
as I gather, the \u and \U escapes ought to be more portable.
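For comparison, a hedged sketch of the same strings written with
universal-character-names instead, which avoids hand-encoding the surrogate
pair (\xD840\xDC0F above is the pair for U+2000F); with 16-bit wchar_t the
compiler is expected to produce the pair itself, and Visual C++ does:

    std::wstring s =
        L"Copyright \u00A9 2001-2020\r\n"
        L"The smallest infinite cardinal number is \u2080\u05D0\r\n"
        L"A single hieroglyph not fitting in 16 bits: \U0002000F";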




Öö Tiib

Jun 30, 2021, 12:22:18 PM
On Wednesday, 30 June 2021 at 10:57:30 UTC+3, Juha Nieminen wrote:
> Character encoding was a problem in the 1960's, and it's still a problem
> today, no matter how much computers advance. Sheesh.
>
> Problem is, how to reliably write wide char string literals that contain
> non-ascii characters?
>
> Suppose you write for example this:
>
> const wchar_t* str = L"???";
>
> In the *source code* that string literal may be eg. UTF-8 encoded. However,
> the compiler needs to convert it to wide chars.
>
> Problem is, how does the compiler know which encoding is being used in
> that 8-bit string literal in the source code, in order for it to convert
> it properly to wide chars?

By telling it to the compiler, or to the IDE that deals with the compiler.

> Some compilers may assume it's UTF-8 encoded source code. Others may
> assume it's ISO-Latin-1 encoded (I'm looking at you, Visual Studio).
> Obviously the end result will be garbage if the wrong assumption is made.

It was already there in VS2008, something like: open the file in VS,
File->Advanced Save Options, then the “Encoding” combo lets you select UTF-8.

> In most compilers (such as Visual Studio) you can specify which encoding
> to assume for source files, but this has to be done at the project
> settings level. I don't think there's any way to specify the encoding
> in the source code itself.

Maybe some compilers examine the BOM, but I have no knowledge there.
I keep all of my source files in UTF-8, which does not need a BOM. If I get
some UTF-16 or UTF-32 file then I turn it into UTF-8 first anyway, before
committing it anywhere.

> What does the C++ standard say? Does it say that source code files are
> always UTF-8 encoded, or is it up to the implementation? I assume that if
> it's the latter, the standard doesn't provide any mechanism to specify
> which encoding is being used. Or does it?

The compiler's command line is not standardized yet.

James Kuyper

Jun 30, 2021, 2:19:46 PM
On 6/30/21 7:50 AM, Alf P. Steinbach wrote:
> On 30 Jun 2021 13:39, Richard Damon wrote:
...
>> \x works in wide string literal too, and puts in a character with that
>> value. The difference is that if the wide string type isn't unicode
>> encoded then it might get the wrong character in the string.
>
> It gets the wrong characters in the wide string literal, period.
"The escape \ooo consists of the backslash followed by one, two, or
three octal digits that are taken to specify the value of the desired
character. ... The value of a character-literal is
implementation-defined if it falls outside of the implementation-defined
range defined for ... wchar_t (for character-literals prefixed by L)."
(5.13.3p7)

The value of a wide character is determined by the current encoding. For
wide character literals using the u or U prefixes, that encoding is
UTF-16 and UTF-32, respectively, making octal escapes redundant with and
less convenient than the use of UCNs. But as he said, they do work for
such strings.

Juha Nieminen

Jul 1, 2021, 12:42:18 AM
The problem is that "\xC2\xA9" in UTF-8 is not the same thing as
"\xC2\xA9" in UTF-16 or UTF-32 (whichever wchar_t happens to be).

"\uXXXX", however, ought to work regardless because it specifies the
actual unicode codepoint you want, rather than its encoding.

Christian Gollwitzer

Jul 1, 2021, 1:20:00 AM
On 30.06.21 09:57, Juha Nieminen wrote:
> Character encoding was a problem in the 1960's, and it's still a problem
> today, no matter how much computers advance. Sheesh.
>
> Problem is, how to reliably write wide char string literals that contain
> non-ascii characters?
>
> Suppose you write for example this:
>
> const wchar_t* str = L"???";
>
> In the *source code* that string literal may be eg. UTF-8 encoded. However,
> the compiler needs to convert it to wide chars.

I think it is best to avoid wide strings. Now that doesn't help you if
you need them to call native Windows functions which insist on wchar_t.

I'm still wondering why you need to put a Unicode string in the source
code at all. Could you use an i18n feature of Windows to look up the real
string? I'm not an expert of i18n on Windows, but using GNU gettext, you
would write some ASCII equivalent thing in the code and then have an
auxiliary translation file with a well defined encoding. At runtime the
ASCII string is merely a key into the table. Plus the added bonus that
you can support multiple languages.
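For what it's worth, a minimal sketch of the gettext approach being described
(this assumes GNU gettext is available, that the program's text domain is
called "myapp", and that a compiled .mo catalog has been installed; those
names and paths are made up for the example):

    #include <libintl.h>
    #include <clocale>
    #include <cstdio>
    #define _(str) gettext(str)

    int main()
    {
        std::setlocale(LC_ALL, "");                    // pick up the user's locale
        bindtextdomain("myapp", "/usr/share/locale");  // where the .mo catalogs live
        textdomain("myapp");
        std::printf("%s\n", _("Temperature"));         // ASCII key, translated at runtime
    }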

Christian

Ralf Goertz

Jul 1, 2021, 3:19:55 AM
On Wed, 30 Jun 2021 12:37:55 +0200:
I know that. It's called a byte order mark. And gcc ignores it.


David Brown

Jul 1, 2021, 4:30:02 AM
Code can require non-ASCII characters without needing internationalisation.
gettext and the like are certainly useful, but they are very heavy
tools compared to a fixed string or small table of strings in the code.
If you are writing a program for use in a single company in Germany
(since you have a German email address), with all the texts in German,
would you want to use an internationalisation framework just to make
"groß" turn out right?

The OP could also be working on embedded systems or some other code for
which having a single self-contained executable is important.

Juha Nieminen

Jul 1, 2021, 4:44:36 AM
David Brown <david...@hesbynett.no> wrote:
> Code can require non-ASCII characters without need internationalisation.
> gettext and the like are certainly useful, but they are very heavy
> tools compared to a fixed string or small table of strings in the code.
> If you are writing a program for use in single company in Germany
> (since you have a German email address), with all the texts in German,
> would you want to use internationalisation frameworks just to make
> "groß" turn out right?

There are also many situations where using non-ascii characters in
string literals may not be related to language and internationalization.
After all, Unicode contains loads of characters that are not related
to spoken languages, such as math symbols, and lots of other types
of symbols which are universal and don't require any sort of
internationalization. Sometimes these symbols may be used all on
their own, sometimes as part of text (eg. in labels and titles).

Also, unit tests for code supporting Unicode may benefit from
being able to use string literals with non-ascii characters.
(Of course, as noted in other posts in this thread, there is a
working solution to get around this, and it's the use of the \u
escape character.)

Kli-Kla-Klawitter

Jul 1, 2021, 5:11:11 AM
No, wrong - gcc has honored it since version 1.01.

Ralf Goertz

Jul 1, 2021, 5:36:46 AM
On Thu, 1 Jul 2021 11:10:55 +0200:
I created this file b.cc:

int main() {
return 0;
}

using vi with

:set fileencoding=utf16
:set bomb

Then

~/c> file b.cc
b.cc: C source, Unicode text, UTF-16, big-endian text

or

~/c> od -h b.cc
0000000 fffe 6900 6e00 7400 2000 6d00 6100 6900
0000020 6e00 2800 2900 2000 7b00 0a00 2000 2000
0000040 2000 2000 7200 6500 7400 7500 7200 6e00
0000060 2000 3000 3b00 0a00 7d00 0a00
0000074

where you can see the BOM fffe. Feeding this to gcc (or g++) you get:

~/c> gcc b.cc
b.cc:1:1: error: stray ‘\376’ in program
1 | �� i n t m a i n ( ) {
| ^
b.cc:1:2: error: stray ‘\377’ in program
1 | �� i n t m a i n ( ) {
| ^
b.cc:1:3: warning: null character(s) ignored
1 | �� i n t m a i n ( ) {
| ^
b.cc:1:5: warning: null character(s) ignored

etc.

How does that qualify as “gcc honoring the BOM”?

David Brown

Jul 1, 2021, 6:58:26 AM
On 01/07/2021 10:44, Juha Nieminen wrote:
> David Brown <david...@hesbynett.no> wrote:
>> Code can require non-ASCII characters without need internationalisation.
>> gettext and the like are certainly useful, but they are very heavy
>> tools compared to a fixed string or small table of strings in the code.
>> If you are writing a program for use in single company in Germany
>> (since you have a German email address), with all the texts in German,
>> would you want to use internationalisation frameworks just to make
>> "groß" turn out right?
>
> There are also many situations where using non-ascii characters in
> string literals may not be related to language and internationalization.

Good point.

> After all, Unicode contains loads of characters that are not related
> to spoken languages, such as math symbols, and lots of other types
> of symbols which are universal and don't require any sort of
> internationalization. Sometimes these symbols may be used all on
> their own, sometimes as part of text (eg. in labels and titles).
>
> Also, unit tests for code supporting Unicode may benefit from
> being able to use string literals with non-ascii characters.
> (Of course, as noted in other posts in this thread, there is a
> working solution to get around this, and it's the use of the \u
> escape character.)
>

Yes - but such workarounds are hideous compared to writing:

printf("Temperature %.1f °C\n", 123.4);


I am glad most of my code only has to compile with gcc, and I can ignore
such portability matters.

Alf P. Steinbach

Jul 1, 2021, 7:31:39 AM
You snipped some context, the example we're talking about.

That decidedly does not work, in the sense of producing the intended string.

Perhaps I can make you understand this by talking about source code in
general. Yes, that example code is valid C++, so a conforming compiler
shall compile it with no errors; and yes, that code has a well defined
meaning, look, here's the C++ standard, it spells it out, what the
meaning is. But no, it doesn't do what you intended.

- Alf

Christian Gollwitzer

Jul 1, 2021, 8:01:54 AM
On 01.07.21 10:29, David Brown wrote:
>
> Code can require non-ASCII characters without need internationalisation.
> gettext and the like are certainly useful, but they are very heavy
> tools compared to a fixed string or small table of strings in the code.
> If you are writing a program for use in single company in Germany
> (since you have a German email address), with all the texts in German,
> would you want to use internationalisation frameworks just to make
> "groß" turn out right?

I can see your point, but actually, most programs developed here in
Germany are still written in English. This is true for comments and
variable names etc., because there might be a non-German coworker
involved, and mostly because people are simply used to English as "the
computer language". I've seen German comments, variable names and
literal strings only at the university in introductory programming
courses etc. But admittedly, I never produced GUIs in C++, because there
are easier options available - and these usually come with good Unicode
support (e.g. Python). These smaller tools did not get i18n'ed.
I still think that if I had to make a program with a German interface,
it would make sense to write it with English strings and translate it
with a tool - because then, adding French or Turkish later on would be
easy.

>
> The OP could also be working on embedded systems or some other code for
> which having a single self-contained executable is important.
>

OK yes there are certainly points where this approach is not suitable.
Just wanted to bring another solution to the table.

Ceterum censeo wchar_t esse inutilem ;)

Christian

David Brown

Jul 1, 2021, 8:57:59 AM
On 01/07/2021 14:01, Christian Gollwitzer wrote:
> Am 01.07.21 um 10:29 schrieb David Brown:
>>
>> Code can require non-ASCII characters without need internationalisation.
>>   gettext and the like are certainly useful, but they are very heavy
>> tools compared to a fixed string or small table of strings in the code
>> >   If you are writing a program for use in single company in Germany
>> (since you have a German email address), with all the texts in German,
>> would you want to use internationalisation frameworks just to make
>> "groß" turn out right?
>
> I can see your point, but actually, most programs developed here in
> Germany are still written in English. This is true for comments and
> variable names etc., because there might be a non-German coworker
> involved, and mostly because people are simply used to English as "the
> computer language". I've seen German comments, variable names and
> literal strings only at the university in introductory programming
> courses etc.

Sure. The same is true here in Norway. But it is not true everywhere.
And even when you have English identifiers, comments, etc., the text
strings you show to users are often in a language other than English.

> But admittedly, I never produced GUIs in C++, because there
> are easier options available - and these usually come with good Unicode
> support (e.g. Python). These smaller tools did not get i18n'ed.

I do that too.

> I still think that if I had to make a program with a German interface,
> it would make sense to write it with English strings and translate it
> with a tool - because then, adding French or Turkish later on would be
> easy.

Most software is written for one or a few customers, and one language
will always be sufficient. Of course, such software is also almost
always written for one compiler and one target, and portability of
source code is not an issue. This is particularly true if you have good
modularisation in the code - the user-facing parts with the text strings
are less likely to be re-used elsewhere than the more library-like code
underneath.

>
>>
>> The OP could also be working on embedded systems or some other code for
>> which having a single self-contained executable is important.
>>
>
> OK yes there are certainly points where this approach is not suitable.
> Just wanted to bring another solution to the table.
>
> Ceterum censeo wchar_t esse inutilem ;)
>

Agreed - and salt the ground it was built on, to save future generations
from its curse!



Manfred

Jul 1, 2021, 9:44:36 AM
In Windows programming the need for wchar_t strings is relatively
common, since this is its native character set. Most APIs are provided with
both ANSI and WCHAR variants, however if you need more-than-plain-ASCII
text support you almost invariably end up #define'ing UNICODE and thus
defaulting to the wide char variants - moreover, some APIs are given for
WCHAR strings only.
In these cases, if you need string literals they are best expressed as
WCHAR strings directly, to avoid unnecessary conversions at runtime.


>
>     Christian

Kli-Kla-Klawitter

Jul 1, 2021, 10:17:47 AM
Use v1.01.

James Kuyper

Jul 1, 2021, 10:58:54 AM
On 7/1/21 12:42 AM, Juha Nieminen wrote:
> Richard Damon <Ric...@damon-family.org> wrote:
>> On 6/30/21 7:13 AM, Juha Nieminen wrote:
>>> Paavo Helde <myfir...@osa.pri.ee> wrote:
>>>> I want my code to work anywhere with any compiler/framework conventions
>>>> and settings, and all my internal strings are in UTF-8 anyway, so I can
>>>> use strict ASCII source files with hardcoded UTF-8 characters, e.g.:
>>>>
>>>> std::string s = "Copyright \xC2\xA9 2001-2020";
>>>
>>> Does that work for wide string literals? Because I don't think it does.
>>> In other words:
>>>
>>> std::wstring s = L"Copyright \xC2\xA9 2001-2020";
>>>
>>> However, as suggested in another reply, using "\uXXXX" instead ought
>>> to work just fine (regardless of whether it's a narrow or wide char
>>> literal). As long as you don't need the readability, of course.
>>>
>>
>> \x works in wide string literal too, and puts in a character with that
>> value. The difference is that if the wide string type isn't unicode
>> encoded then it might get the wrong character in the string.
>
> The problem is that "\xC2\xA9" in UTF-8 is not the same thing as
> "\xC2\xA9" in UTF-16 or UTF-32 (whichever wchar_t happens to be).

Why would you use wchar_t if you care about Unicode? You should be using
string literals with either the u8, u, or U prefixes, and store/access
the strings as arrays of char, char16_t, or char32_t, respectively. Such
literals are guaranteed to be in UTF-8, UTF-16, or UTF-32 encoding,
respectively.
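A small sketch of that advice (hypothetical, but the assertions should hold
on any conforming implementation; note that in C++20 a u8 literal has type
const char8_t[] rather than const char[]):

    const char16_t u16[] = u"\u00A9";   // UTF-16: one code unit for U+00A9
    const char32_t u32[] = U"\u00A9";   // UTF-32: one code unit for U+00A9

    static_assert( sizeof u16 == 2 * sizeof(char16_t), "1 code unit + terminator" );
    static_assert( sizeof u32 == 2 * sizeof(char32_t), "1 code unit + terminator" );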

Keith Thompson

Jul 1, 2021, 1:13:56 PM
Kli-Kla-Klawitter <kliklakl...@gmail.com> writes:
> On 01.07.2021 11:36, Ralf Goertz wrote:
>> On Thu, 1 Jul 2021 11:10:55 +0200,
>> Kli-Kla-Klawitter <kliklakl...@gmail.com> wrote:
[...]
>>>> I know that. It's called a byte order mark. And gcc ignores it.
>>>
>>> No, wrong - gcc honors it since vesion 1.01.
>> I created this file b.cc:
[...]
>> How does that qualify as “gcc honoring the BOM”?
>
> Use v1.01.

You think you're funny. You're not.

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Kli-Kla-Klawitter

Jul 1, 2021, 1:27:24 PM
On 01.07.2021 19:13, Keith Thompson wrote:
> Kli-Kla-Klawitter <kliklakl...@gmail.com> writes:
>> On 01.07.2021 11:36, Ralf Goertz wrote:
>>> On Thu, 1 Jul 2021 11:10:55 +0200,
>>> Kli-Kla-Klawitter <kliklakl...@gmail.com> wrote:
> [...]
>>>>> I know that. It's called a byte order mark. And gcc ignores it.
>>>>
>>>> No, wrong - gcc honors it since vesion 1.01.
>>> I created this file b.cc:
> [...]
>>> How does that qualify as “gcc honoring the BOM”?
>>
>> Use v1.01.
>
> You think you're funny. You're not.

v1.01 does honor the BOM.

Keith Thompson

Jul 1, 2021, 2:22:04 PM
At the risk of giving the impression I'm taking you seriously, the
oldest version of gcc available from gnu.org is 1.42, released in 1992.

I've seen no evidence that gcc v1.01 would have honored the BOM, but it
doesn't matter, since that version is obsolete and unavailable.

I conclude that you are a troll.

Real Troll

Jul 1, 2021, 2:47:49 PM
On 01/07/2021 19:21, Keith Thompson wrote:
>
> I conclude that you are a troll.
>
It takes one troll to know another. This is an expert opinion of a Real Troll!



Juha Nieminen

Jul 2, 2021, 1:27:06 AM
James Kuyper <james...@alumni.caltech.edu> wrote:
> Why would you use wchar_t if you char about unicode? You should be using
> string literal using either the u8, u, or U prefixes, and store/access
> the strings as arrays of char, char16_t, or char32_t, respectively. Such
> literals are guaranteed to be in UTF-8, UTF-16, or UTF-32 encoding,
> respectively.

No my choice in this case.

James Kuyper

Jul 2, 2021, 1:45:10 AM
Recent messages reminded me of Windows' strong incentives to use
wchar_t, something I've thankfully had no experience with. As a result,
what I'm about to say may be incorrect - but it seems to me that a
conforming implementation of C++ targeting Windows should have wchar_t
be the same as char16_t, so you should be able to freely use u prefixed
string literals and char16_t with code that needs to be portable to
Windows, but can also be used on other platforms. If your code didn't
need to be portable to other platforms, you could, by definition, rely
upon Windows' own guarantees about wchar_t.

Bo Persson

Jul 2, 2021, 2:33:17 AM
The problem is that the language explicitly requires wchar_t to be a
distinct type.

Once upon a time it was implemented as a typedef, but that option was
removed already in C++98. Overload resolution and things...

Kli-Kla-Klawitter

Jul 2, 2021, 4:03:51 AM
On 01.07.2021 20:21, Keith Thompson wrote:
> Kli-Kla-Klawitter <kliklakl...@gmail.com> writes:
>> On 01.07.2021 19:13, Keith Thompson wrote:
>>> Kli-Kla-Klawitter <kliklakl...@gmail.com> writes:
>>>> On 01.07.2021 11:36, Ralf Goertz wrote:
>>>>> On Thu, 1 Jul 2021 11:10:55 +0200,
>>>>> Kli-Kla-Klawitter <kliklakl...@gmail.com> wrote:
>>> [...]
>>>>>>> I know that. It's called a byte order mark. And gcc ignores it.
>>>>>>
>>>>>> No, wrong - gcc honors it since vesion 1.01.
>>>>> I created this file b.cc:
>>> [...]
>>>>> How does that qualify as “gcc honoring the BOM”?
>>>>
>>>> Use v1.01.
>>> You think you're funny. You're not.
>>
>> v1.01 does honor the BOM.
>
> At the risk of giving the impression I'm taking you seriously, the
> oldest version of gcc available from gnu.org is 1.42, released in 1992.

No, the oldest gcc version is v0.9, from March 22, 1987.

Keith Thompson

Jul 2, 2021, 5:31:41 AM
Ralf Goertz <m...@myprovider.invalid> writes:
[...]
On my system, gcc doesn't handle UTF-16 at all, with or without a BOM.
(I don't know whether there's a way to configure it to do so.)

It does handle UTF-8 with or without a BOM.

$ file b.cpp
b.cpp: C source, UTF-8 Unicode (with BOM) text
$ cat b.cpp
int main() { }
$ hd b.cpp
00000000 ef bb bf 69 6e 74 20 6d 61 69 6e 28 29 20 7b 20 |...int main() { |
00000010 7d 0a |}.|
00000012
$ gcc -c b.cpp
$

gcc 9.3.0 on Ubuntu 20.04. (There is, of course, no point in going back
to ancient versions of gcc.)

MrSpu...@92wvlb1hltq4dhc.gov.uk

Jul 2, 2021, 5:38:44 AM
On Fri, 02 Jul 2021 02:31:24 -0700
Keith Thompson <Keith.S.T...@gmail.com> wrote:
>Ralf Goertz <m...@myprovider.invalid> writes:
>[...]
>
>On my system, gcc doesn't handle UTF-16 at all, with or without a BOM.
>(I don't know whether there's a way to configure it to do so.)

Just out of interest, what byte order is the BOM in? Catch 22?

Paavo Helde

Jul 2, 2021, 6:53:12 AM
This is probably a troll question, but answering anyway: the BOM marker
U+FEFF is in the correct byte order, in both little-endian and
big-endian UTF-16. That's how you tell them apart.

The trick is that the reversed value U+FFFE is not a valid Unicode
character, so there is no possibility of a mixup. Neither byte sequence
is valid UTF-8 either, to avoid a mixup with that.
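A rough sketch of BOM sniffing along those lines (just an illustration; real
code would also want to consider the UTF-32 BOMs, and note that the UTF-32
little-endian BOM begins with the same two bytes as the UTF-16 little-endian
one):

    #include <cstddef>
    #include <string>

    std::string guess_encoding(const unsigned char* p, std::size_t n)
    {
        if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return "UTF-8 with BOM";
        if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return "UTF-16 big-endian";
        if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return "UTF-16 little-endian";
        return "no BOM: plain ASCII, UTF-8 without BOM, or something else";
    }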

Juha Nieminen

Jul 2, 2021, 7:52:24 AM
James Kuyper <james...@alumni.caltech.edu> wrote:
> so you should be able to freely use u prefixed
> string literals and char16_t with code that needs to be portable to
> Windows, but can also be used on other platforms.

But doesn't that have the exact same problem as in my original post?
In other words, in code like this:

const char16_t *str = u"something";

the stuff between the quotation marks in the source code will use
whatever encoding (ostensibly but not assuredly UTF-8), which the
compiler needs to convert to UTF-16 at compile time for writing
it to the output binary.

Or is it guaranteed that the characters between u" and " will
always be interpreted as UTF-8?

MrSpud...@nx8574z6ey2rc0y563k5i2.net

Jul 2, 2021, 8:53:29 AM
On Fri, 2 Jul 2021 13:52:53 +0300
Paavo Helde <myfir...@osa.pri.ee> wrote:
>02.07.2021 12:38 MrSpu...@92wvlb1hltq4dhc.gov.uk kirjutas:
>> On Fri, 02 Jul 2021 02:31:24 -0700
>> Keith Thompson <Keith.S.T...@gmail.com> wrote:
>>> Ralf Goertz <m...@myprovider.invalid> writes:
>>> [...]
>>>
>>> On my system, gcc doesn't handle UTF-16 at all, with or without a BOM.
>>> (I don't know whether there's a way to configure it to do so.)
>>
>> Just out of interest, what byte order is the BOM in? Catch 22?
>>
>
>This is probably a troll question, but answering anyway: the BOM marker

No, not a troll.

>U+FEFF is in the correct byte order, in both little-endian and
>big-endian UTF-16. That's how you tell them apart.

So it's just 2 bytes in sequence, not a 16-bit value?

Alf P. Steinbach

Jul 2, 2021, 9:22:31 AM
The BOM is a Unicode code point, U+FEFF as Paavo mentioned, originally
standing for an invisible zero-width hard space. It's encoded with
either little endian UTF-16 (then as two bytes), or as big endian UTF-16
(then as two bytes), or as endianness agnostic UTF-8 (then as three
bytes). The encoded BOM yields a reliable encoding indicator, though
pedantic people might argue that it's just statistical -- after all, one
just might happen to have a Windows 1252 encoded file with three
characters at the start with the same byte values as the UTF-8 BOM.

In the same vein, one just might happen to have a `.txt` file in Windows
with the letters "MZ" at the very start, like

MZ, Mishtara Zva'it, is the Military Police Corps of Israel. blah

and if you then try to open the file in your default text editor by just
typing the file name in old Cmd, those letters will be misinterpreted as
the initials of Mark Zbikowski, marking the file as an executable...

Since the chance of that happening isn't absolutely 0 one should never
use text file names as commands, or the UTF-8 BOM as an encoding marker.


- Alf

Alf P. Steinbach

Jul 2, 2021, 9:37:33 AM
I agree, but unfortunately both the C and C++ standards require
`wchar_t` to be able to represent all code points in the largest supported
extended character set, and while that requirement worked nicely with
original Unicode, already in 1992 or thereabouts (not quite sure) it was
in conflict with extremely firmly established practice in Windows.

Bringing the standards in agreement with actual practice should be a
goal, but for C++ it's seldom done.

It /was/ done, in C++11, for the hopeless idealistic C++98 goal of clean
C++-ish `<c...>` headers that didn't pollute the global namespace. But
C++17 added stuff to only some `<c...>` headers and not to the
corresponding `<... .h>` headers, and that fact was then used quite
recently in a proposal to un-deprecate the .h headers but add all new
stuff only to `<c...>` headers. Which satisfies the politics but not the
in-practice of using code that uses C libraries designed as
C++-compatible, without running afoul of qualification issues.

And it /was/ done, in C++11, for both `throw` specifications and for the
`export` keyword, not to mention the C++11 optional conversion between
function pointers and `void*`, in order to support the reality of Posix.

But it seems to me, it's very very hard to get such changes through the
committee, just as individual people generally don't like to admit that
they've been wrong. Instead, C++20 started on a path of /introducing/
more conflicts with reality. In particular for `std::filesystem::path`,
where they threw the baby out with the bathwater for purely academic
idealism reasons.

- Alf

Paavo Helde

Jul 2, 2021, 10:32:26 AM
The file contains bytes, it's up to the reading code how to interpret
the bytes. It can interpret the bytes as uint16_t, i.e. cast the file
buffer as 'const uint16_t*' and read the first 2-byte value. If it is
0xFEFF, then it knows this is an UTF-16 file in a matching byte order.
If it is 0xFFFE, then it knows it's an UTF-16 file in an opposite byte
order, and the rest of the file needs to be byte-swapped.

It can also interpret the buffer as containing uint8_t bytes, but then
the logic is a bit more complex, it must then know if it itself is
running on a big-endian or little-endian machine, and behave accordingly.
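A rough sketch of that uint16_t-based check (assuming the whole file has
already been read into a buffer holding an even number of bytes):

    #include <cstdint>
    #include <cstddef>

    enum class Utf16Order { native, swapped, unknown };

    Utf16Order classify(const std::uint16_t* buf, std::size_t units)
    {
        if (units == 0)       return Utf16Order::unknown;
        if (buf[0] == 0xFEFF) return Utf16Order::native;   // matches this machine's byte order
        if (buf[0] == 0xFFFE) return Utf16Order::swapped;  // opposite order: swap each unit
        return Utf16Order::unknown;
    }

    // In the swapped case each unit u would be fixed up as (u >> 8) | (u << 8).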

James Kuyper

Jul 2, 2021, 4:15:39 PM
On 7/2/21 7:52 AM, Juha Nieminen wrote:
> James Kuyper <james...@alumni.caltech.edu> wrote:
>> so you should be able to freely use u prefixed
>> string literals and char16_t with code that needs to be portable to
>> Windows, but can also be used on other platforms.
>
> But doesn't that have the exact same problem as in my original post?
No, because the original post used wchar_t and the L prefix, for which
the relevant encoding is implementation-defined. That's not the case for
char and the u8 prefix, char16_t and the u prefix, or for char32_t and
the U prefix.

I have a feeling that there's a misunderstanding somewhere in this
conversation, but I'm not sure yet what it is.

> In other words, in code like this:
>
> const char16_t *str = u"something";
>
> the stuff between the quotation marks in the source code will use
> whatever encoding (ostensibly but not assuredly UTF-8), which the

I don't understand why you think that the source code encoding matters.
The only thing that matters is what characters are encoded. As long as
those encodings are for the characters {'u', '"', '\\', 'x', 'C', '2',
'\\', 'x', 'A', '9', '"'}, any fully conforming implementation must give
you the standard defined behavior for u"\xC2\xA9".

> compiler needs to convert to UTF-16 at compile time for writing
> it to the output binary.


When using the u8 prefix, UTF-8 encoding is guaranteed, for which every
codepoint from U+0000 to U+007F is represented by a single character
with a numerical value matching the code point.

When using the u prefix, UTF-16 encoding is guaranteed, for which every
codepoint from U+0000 to U+D7FF, and from U+E000 to U+FFFF, is
represented by a single character with a numerical value matching the
codepoint.

When using the U prefix, UTF-32 encoding is guaranteed, for which every
codepoint from U+0000 to U+D7FF, and from U+E000 to U+10FFFF, is
represented by a single character with a numerical value matching the
codepoint.

Since the meanings of the octal and hexadecimal escape sequences are
defined in terms of the numerical values of the corresponding
characters, if you use any of those prefixes, specifying the value of a
character within the specified range by using an octal or hexadecimal
escape sequence is precisely as portable as using the UCN with the same
numerical value. Using UCNs would be better because they are less
restricted, working just as well with the L prefix and with no prefix.
However, within those ranges, octal and hexadecimal escapes will work
just as well.

Do you know of any implementation of C++ that claims to be fully
conforming, for which that is not the case? If so, how do they justify
that claim?

'\xC2' and '\xA9' are in the ranges for the u and U prefixes. They would
not be in the range for the u8 prefix, but since the context was wide
characters, the u8 prefix is not relevant.


> Or is it guaranteed that the characters between u" and " will
> always be interpreted as UTF-8?

Source code characters between the u" and the " will be interpreted
according to an implementation-defined character encoding. But so long
as they encode {'\\', 'x', 'C', '2', '\\', 'x', 'A', '9'}, you should
get the standard-defined behavior for u"\xC2\xA9".

Alf P. Steinbach

Jul 2, 2021, 9:30:32 PM
On 2 Jul 2021 22:15, James Kuyper wrote:
> On 7/2/21 7:52 AM, Juha Nieminen wrote:
>> James Kuyper <james...@alumni.caltech.edu> wrote:
>>> so you should be able to freely use u prefixed
>>> string literals and char16_t with code that needs to be portable to
>>> Windows, but can also be used on other platforms.
>>
>> But doesn't that have the exact same problem as in my original post?
> No, because the original post used wchar_t and the L prefix, for which
> the relevant encoding is implementation-defined. That's not the case for
> char and the u8 prefix, char16_t and the u prefix, or for char32_t and
> the U prefix.
>
> I have a feeling that there's a misunderstanding somewhere in this
> conversation, but I'm not sure yet what it is.

Juha is concerned about the compiler assuming some other source code
encoding than the actual one.

The implementation defined encoding of `wchar_t`, where in practice the
possibilities as of 2021 are either UTF-16 or UTF-32, doesn't matter.

A correct source code encoding assumption can be guaranteed by simply
statically asserting that the basic execution character set is UTF-8, as
I showed in my original answer in this thread.


>> In other words, in code like this:
>>
>> const char16_t *str = u"something";
>>
>> the stuff between the quotation marks in the source code will use
>> whatever encoding (ostensibly but not assuredly UTF-8), which the
>
> I don't understand why you think that the source code encoding matters.

It matters because if the compiler assumes wrong, and Visual C++
defaults to assuming Windows ANSI when no other indication is present
and it's not forced by options, then one gets incorrect literals.

Which may or may not be caught by unit testing.


> The only thing that matters is what characters are encoded. As long as
> those encodings are for the characters {'u', '""', '\\', 'x', 'C', '2',
> '\\', 'x', 'A', '9', '"'}, any fully conforming implementation must give
> you the standard defined behavior for u"\xC2\xA9".

Consider:

#include <iostream>
using std::cout, std::hex, std::endl;

auto main() -> int
{
    const char16_t s16[] = u"\xC2\xA9";
    for( const int code: s16 ) {
        if( code ) { cout << hex << code << " "; }
    }
    cout << endl;
}

The output of this program, i.e. the UTF-16 encoding values in `s16`, is

c2 a9

Since Unicode is an extension of Latin-1 the UTF-16 interpretation of
`\xC2` and `\xA9` is as Latin-1 characters, respectively "Â" and (not a
coincidence) "©", according to my Windows 10 console in codepage 1252.

Which is not the single "©" that an UTF-8 interpretation gives.


[snip]


- Alf

James Kuyper

Jul 3, 2021, 12:44:57 AM
On 7/2/21 9:30 PM, Alf P. Steinbach wrote:
> On 2 Jul 2021 22:15, James Kuyper wrote:
>> On 7/2/21 7:52 AM, Juha Nieminen wrote:
>>> James Kuyper <james...@alumni.caltech.edu> wrote:
>>>> so you should be able to freely use u prefixed
>>>> string literals and char16_t with code that needs to be portable to
>>>> Windows, but can also be used on other platforms.
>>>
>>> But doesn't that have the exact same problem as in my original post?
>> No, because the original post used wchar_t and the L prefix, for which
>> the relevant encoding is implementation-defined. That's not the case for
>> char and the u8 prefix, char16_t and the u prefix, or for char32_t and
>> the U prefix.
>>
>> I have a feeling that there's a misunderstanding somewhere in this
>> conversation, but I'm not sure yet what it is.

I now have a much better idea what the misunderstanding is. See below.
> Juha is concerned about the compiler assuming some other source code
> encoding than the actual one.
>
> The implementation defined encoding of `wchar_t`, where in practice the
> possibilities as of 2021 are either UTF-16 or UTF-32, doesn't matter.
>
> A correct source code encoding assumption can be guaranteed by simply
> statically asserting that the basic execution character set is UTF-8, as
> I showed in my original answer in this thread.

The encoding of the basic execution character set is irrelevant if the
string literals are prefixed with u8, u, or U, and use only valid escape
sequences to specify members of the extended character set. The encoding
for such literals is explicitly mandated by the standard. Are you (or
he) worrying about a failure to conform to those mandates?

...
>> I don't understand why you think that the source code encoding matters.
>
> It matters because if the compiler assumes wrong, and Visual C++
> defaults to assuming Windows ANSI when no other indication is present
> and it's not forced by options, then one gets incorrect literals.

Even when u8, u or U prefixes are specified?

...
>> The only thing that matters is what characters are encoded. As long as
>> those encodings are for the characters {'u', '""', '\\', 'x', 'C', '2',
>> '\\', 'x', 'A', '9', '"'}, any fully conforming implementation must give
>> you the standard defined behavior for u"\xC2\xA9".
>
> Consider:
>
> #include <iostream>
> using std::cout, std::hex, std::endl;
>
> auto main() -> int
> {
> const char16_t s16[] = u"\xC2\xA9";
> for( const int code: s16 ) {
> if( code ) { cout << hex << code << " "; }
> }
> cout << endl;
> }
>
> The output of this program, i.e. the UTF-16 encoding values in `s16`, is
>
> c2 a9


Yes, that's precisely what the C++ standard mandates, regardless of the
encoding of the source character set. Which is why I mistakenly thought
that's what he was trying to do.

> Since Unicode is an extension of Latin-1 the UTF-16 interpretation of
> `\xC2` and `xA9` is as Latin-1 characters, respectively "Â" and (not a
> coincidence) "©" according to my Windows 10 console in codepage 1252.
>
> Which is not the single "©" that an UTF-8 interpretation gives.

OK - it had not occurred to me that he was trying to encode "©", since
that is not the right way to do so. In a sense, I suppose that's the
point you're making.
My point is that all ten of the following escape sequences should be
perfectly portable ways of specifying that same code point in each of
three Unicode encodings:

UTF-8: u8"\u00A9\U000000A9"
UTF-16: u"\251\xA9\u00A9\U000000A9"
UTF-32: U"\251\xA9\u00A9\U000000A9"

Do you know of any implementation which is non-conforming because it
misinterprets any of those escape sequences?
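A quick compile-time spot check of part of that list (a sketch; these
assertions should pass on any conforming implementation):

    static_assert( sizeof(u8"\u00A9") == 3,
        "two UTF-8 code units plus the terminator" );
    static_assert( sizeof(u"\251\xA9\u00A9\U000000A9") == 5 * sizeof(char16_t),
        "four UTF-16 code units plus the terminator" );
    static_assert( sizeof(U"\251\xA9\u00A9\U000000A9") == 5 * sizeof(char32_t),
        "four UTF-32 code units plus the terminator" );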

Juha Nieminen

Jul 3, 2021, 3:00:16 AM
James Kuyper <james...@alumni.caltech.edu> wrote:
>> In other words, in code like this:
>>
>> const char16_t *str = u"something";
>>
>> the stuff between the quotation marks in the source code will use
>> whatever encoding (ostensibly but not assuredly UTF-8), which the
>
> I don't understand why you think that the source code encoding matters.

Because the source file will most often be a text file using 8-bit
characters and, in these situations most likely (although not assuredly)
using UTF-8 encoding for non-ascii characters.

However, when you write:

const char16_t *str = u"something";

if that "something" contains non-ascii characters, which in this case will
be (usually) UTF-8 encoded in this source code, the compiler will have to
interpret that UTF-8 string and convert it to UTF-16 for the output
binary.

So the problem is the same as with wchar_t: How does the compiler know
which encoding is being used in this source file? It needs to know that
since it has to generate an UTF-16 string literal into the output binary
from those characters appearing in the source code.

> When using the u8 prefix, UTF-8 encoding is guaranteed, for which every
> codepoint from U+0000 to U+007F is represented by a single character
> with a numerical value matching the code point.

UTF-8 encoding is guaranteed *for the result*, ie. what the compiler writes
to the output binary. Is it guaranteed to *read* the characters in the
source code between the quotation marks and interpret them as UTF-8?

> When using the u prefix, UTF-16 encoding is guaranteed, for which every
> codepoint from U+0000 to U+D7FF, and from U+E000 to U+FFFF, is
> represented by a single character with a numerical value matching the
> codepoint.

Same issue, even more relevantly here.

> Do you know of any implementation of C++ that claims to be fully
> conforming, for which that is not the case? If so, how do they justify
> that claim?

Visual Studio will, by default (ie. with default project settings after
having created a new project) interpret the source files as Windows-1252
(which is very similar to ISO-Latin-1).

This means that when you write L"something" or u"something", if there
are any non-ascii characters between the quotation marks, UTF-8 encoded,
then the result will be incorrect. (In order to make Visual Studio do
the correct conversion, you need to specify that the file is UTF-8 encoded
in the project settings).

>> Or is it guaranteed that the characters between u" and " will
>> always be interpreted as UTF-8?
>
> Source code characters between the u" and the " will be interpreted
> according to an implementation-defined character encoding. But so long
> as they encode {'\\', 'x', 'C', '2', '\\', 'x', 'A', '9'}, you should
> get the standard-defined behavior for u"\xC2\xA9".

Yes, but that's not the correct desired character in UTF-16, only in UTF-8.
You'll get garbage as your UTF-16 string literal.

Alf P. Steinbach

Jul 3, 2021, 7:31:18 AM
Having a handy simple way to guarantee a correct source code encoding
assumption doesn't seem irrelevant to me.

On the contrary it's directly a solution to the OP's problem, which to
me appears to be maximally relevant.


> The encoding
> for such literals is explicitly mandated by the standard. Are you (or
> he) worrying about a failure to conform to those mandates?

No, Juha is worrying about the compiler's source code encoding assumption.


> ...
>>> I don't understand why you think that the source code encoding matters.
>>
>> It matters because if the compiler assumes wrong, and Visual C++
>> defaults to assuming Windows ANSI when no other indication is present
>> and it's not forced by options, then one gets incorrect literals.
>
> Even when u8, u or U prefixes are specified?

Yes. As an example, consider

const auto& s = u"Blåbær, Mr. Watson.";

If the source is UTF-8 encoded, without a BOM or other encoding marker,
and if the Visual C++ compiler is not told to assume UTF-8 source code,
then it will incorrectly assume that this is Windows ANSI encoded.

The UTF-8 bytes in the source code will then be interpreted as Windows
ANSI character codes, e.g. as Windows ANSI Western, codepage 1252.

The compiler will then see this source code:

const auto& s = u"BlÃ¥bÃ¦r, Mr. Watson.";

And it will proceed to encode /that/ string as UTF-16 in the resulting
string value.
> Do you know of any implementation which is non-conforming because it
> misinterprets any of those escape sequences?

No, they should work. These escapes are an alternative solution to
Juha's problem. However, they lack readability and involve much more
work than necessary, so IMO the thing to do is to assert UTF-8.
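
One way to do that assertion, as a sketch (it catches the usual failure
mode where a BOM-less UTF-8 file is read as an 8-bit codepage, turning the
two UTF-8 bytes of "å" into two separate characters):

static_assert(u"å"[0] == u'\u00E5' && u"å"[1] == u'\0',
    "This file must be compiled with the source encoding assumed to be UTF-8.");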

- Alf

James Kuyper

unread,
Jul 3, 2021, 8:45:04 AM7/3/21
to
On 7/3/21 7:31 AM, Alf P. Steinbach wrote:
> On 3 Jul 2021 06:44, James Kuyper wrote:
>> On 7/2/21 9:30 PM, Alf P. Steinbach wrote:
...
>>> It matters because if the compiler assumes wrong, and Visual C++
>>> defaults to assuming Windows ANSI when no other indication is present
>>> and it's not forced by options, then one gets incorrect literals.
>>
>> Even when u8, u or U prefixes are specified?
>
> Yes. As an example, consider
>
> const auto& s = u"Blåbær, Mr. Watson.";

The comment that led to this sub-thread was specifically about the
usability of escape sequences to specify members of the extended
character set, and that's the only thing I was talking about. While that
string does contain such members, it contains not a single escape sequence.

...
>> OK - it had not occurred to me that he was trying to encode "©", since
>> that is not the right way to do so. In a sense, I suppose that's the
>> point you're making.
>> My point is that all ten of the following escape sequences should be
>> perfectly portable ways of specifying that same code point in each of
>> three Unicode encodings:
>>
>> UTF-8: u8"\u00A9\U000000A9"
>> UTF-16: u"\251\xA9\u00A9\U000000A9"
>> UTF-32: U"\251\xA9\u00A9\U000000A9"
>>
>> Do you know of any implementation which is non-conforming because it
>> misinterprets any of those escape sequences?
> No, they should work. These escapes are an alternative solution to
> Juha's problem. ...

They are the only solution that this sub-thread has been about.

> ... However, they lack readability and involve much more
> work than necessary, so IMO the thing to do is to assert UTF-8.

Those are reasonable concerns. That the system's assumptions about the
source character set would prevent those escapes from working is not.

James Kuyper

unread,
Jul 3, 2021, 9:03:23 AM7/3/21
to
On 7/3/21 2:59 AM, Juha Nieminen wrote:
> James Kuyper <james...@alumni.caltech.edu> wrote:
>>> In other words, in code like this:
>>>
>>> const char16_t *str = u"something";
>>>
>>> the stuff between the quotation marks in the source code will use
>>> whatever encoding (ostensibly but not assuredly UTF-8), which the
>>
>> I don't understand why you think that the source code encoding matters.
>
> Because the source file will most often be a text file using 8-bit
> characters and, in these situations most likely (although not assuredly)
> using UTF-8 encoding for non-ascii characters.

Every comment I made in this sub-thread was predicated on the absence of
any actual members of the extended character set - I was talking only
about the feasibility of using escape sequences to specify such members.

...
>> Do you know of any implementation of C++ that claims to be fully
>> conforming, for which that is not the case? If so, how do they justify
>> that claim?
>
> Visual Studio will, by default (ie. with default project settings after
> having created a new project) interpret the source files as Windows-1252
> (which is very similar to ISO-Latin-1).

So, that shouldn't cause a problem for escape sequences, which, as a
matter of deliberate design, consist entirely of characters from the
basic source character set.

>>> Or is it guaranteed that the characters between u" and " will
>>> always be interpreted as UTF-8?
>>
>> Source code characters between the u" and the " will be interpreted
>> according to an implementation-defined character encoding. But so long
>> as they encode {'\\', 'x', 'C', '2', '\\', 'x', 'A', '9'}, you should
>> get the standard-defined behavior for u"\xC2\xA9".
>
> Yes, but that's not the correct desired character in UTF-16, only in UTF-8.
> You'll get garbage as your UTF-16 string literal.

The 'u' mandates UTF-16, which is the only thing that's relevant to the
interpretation of that string literal. That is the correct pair of
characters, given that UTF-16 has been mandated. Whether or not it's the
intended character depends upon how well your code expresses your
intentions. Alf says that the character that was intended was U+00A9, so
that code does not correctly express that intention. The correct way to
specify it doesn't depend upon the source character set, it only depends
upon the desired encoding of the string. Each of the following ten
escape sequences is a different, portably correct way of expressing that
intention:

UTF-8: u8"\u00A9\U000000A9"
UTF-16: u"\251\xA9\u00A9\U000000A9"
UTF-32: U"\251\xA9\u00A9\U000000A9"

Chris Vine

unread,
Jul 3, 2021, 9:23:52 AM7/3/21
to
On Sat, 3 Jul 2021 06:59:59 +0000 (UTC)
Juha Nieminen <nos...@thanks.invalid> wrote:
[snip]
> So the problem is the same as with wchar_t: How does the compiler know
> which encoding is being used in this source file? It needs to know that
> since it has to generate an UTF-16 string literal into the output binary
> from those characters appearing in the source code.

For encodings other than the 96 characters of the basic source character
set (which map onto ASCII) that the C++ standard requires, this is
implementation defined and the compiler should document it. In the
case of gcc, it documents that the source character set is UTF-8 unless
a different source file encoding is indicated by the -finput-charset
option.

With gcc you can also set the narrow execution character set with the
-fexec-charset option. Presumably for any one string literal this can
be overridden by prefixing it with u8, or it wouldn't be consistent
with the standard, but I have never checked whether that is in fact the
case.
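
A quick way to check could be a translation unit like this (a sketch,
assuming a UTF-8 encoded source file and gcc's -finput-charset and
-fexec-charset options):

// g++ -finput-charset=UTF-8 -fexec-charset=ISO-8859-1 check.cpp
#include <cstdio>

int main() {
    // If the u8 prefix really overrides -fexec-charset, this prints "1 2":
    // one Latin-1 byte for the plain literal, two UTF-8 bytes for the u8 one.
    std::printf("%zu %zu\n", sizeof("é") - 1, sizeof(u8"é") - 1);
}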

This is what gcc says about character sets, which is somewhat divergent
from the C and C++ standards:
http://gcc.gnu.org/onlinedocs/cpp/Character-sets.html

I doubt this is often relevant. What most multi-lingual programs do is
have source strings in English using the ASCII subset of UTF-8 and
translate to UTF-8 dynamically by reference to the locale. Gnu's
gettext is a quite commonly used implementation of this approach.
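
A minimal sketch of that flow, assuming GNU gettext (libintl), a
hypothetical "myapp" text domain and an already compiled message catalog:

#include <libintl.h>
#include <clocale>
#include <cstdio>

int main() {
    std::setlocale(LC_ALL, "");               // pick up the user's locale
    bindtextdomain("myapp", "/usr/share/locale");
    textdomain("myapp");
    // The source string stays plain ASCII; gettext() returns the UTF-8
    // translation from the catalog at run time (or the original if none).
    std::puts(gettext("File not found"));
}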

Alf P. Steinbach

unread,
Jul 3, 2021, 10:28:20 AM7/3/21
to
On 3 Jul 2021 14:44, James Kuyper wrote:
> On 7/3/21 7:31 AM, Alf P. Steinbach wrote:
>> On 3 Jul 2021 06:44, James Kuyper wrote:
[snippety]
>>>
>>> UTF-8: u8"\u00A9\U000000A9"
>>> UTF-16: u"\251\xA9\u00A9\U000000A9"
>>> UTF-32: U"\251\xA9\u00A9\U000000A9"
>>>
>>> Do you know of any implementation which is non-conforming because it
>>> misinterprets any of those escape sequences?
>> No, they should work. These escapes are an alternative solution to
>> Juha's problem. ...
>
> They are the only solution that this sub-thread has been about.
>
>> ... However, they lack readability and involve much more
>> work than necessary, so IMO the thing to do is to assert UTF-8.
>
> Those are reasonable concerns. That the system's assumptions about the
> source character set would prevent those escapes from working is not.

As far as I know nobody's argued that the source encoding assumption
would prevent any escapes from working.

If I understand you correctly, your “this sub-thread” about escapes and
universal character designators -- let's just call them all escapes
-- started when you responded to my response to Richard Damon, who had
responded to Juha Nieminen, who wrote:


[>>]
Does that work for wide string literals? Because I don't think it does.
In other words:

std::wstring s = L"Copyright \xC2\xA9 2001-2020";
[<<]


Richard responded to that:


[>>]
\x works in wide string literal too, and puts in a character with that
value. The difference is that if the wide string type isn't unicode
encoded then it might get the wrong character in the string.
[<<]


I responded to Richard:


[>>]
It gets the wrong characters in the wide string literal, period.
[<<]


Which it decidedly does.

It's trivial to just try it out and see; QED.

You responded to that where you snipped Juha's example, indicating some
misunderstanding on your part:


[>>]
The value of a wide character is determined by the current encoding. For
wide character literals using the u or U prefixes, that encoding is
UTF-16 and UTF-32, respectively, making octal escapes redundant with and
less convenient than the use of UCNs. But as he said, they do work for
such strings.
[<<]


So, in your mind this sub-thread may have been about whether escape
sequences (including universal character designators) are affected by
the source encoding, but to me it has been about whether Juha's example
yields the desired string, as he correctly surmised that it didn't.

And the outer context from the top thread, is about the source
encoding's effect on string literals, which hopefully is now clear.


- Alf

Richard Damon

unread,
Jul 3, 2021, 11:48:26 AM7/3/21
to
It puts into the string exactly the characters that you specified, the
character of value 0x00C2 and then the character of value 0x00A9. THAT
is what it says to do. If you meant the \x escapes to spell out a UTF-8
encoded string, why would you expect that to work here?

The one issue with \x is it puts in the characters in whatever encoding
wide strings use, so you can't just assume unicode values unless you are
willing to assume wide string is unicode encoded.
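
To make that concrete, the earlier example really does expand to exactly
the two code units written, plus the terminator; a one-line sketch:

static_assert(sizeof(L"\xC2\xA9") == 3 * sizeof(wchar_t),
    "two wchar_t code units (0xC2, 0xA9) plus the terminator, not U+00A9");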

Juha Nieminen

unread,
Jul 3, 2021, 1:00:05 PM7/3/21
to
James Kuyper <james...@alumni.caltech.edu> wrote:
> The comment that led to this sub-thread was specifically about the
> usability of escape sequences to specify members of the extended
> character set, and that's the only thing I was talking about. While that
> string does contain such members, it contains not a single escape sequence.

The problem is that the "\xC2\xA9" was presented as a solution to
the compiler wrongly assuming some source file encoding other than UTF-8.
Those two bytes are the UTF-8 encoding of a non-ascii character.

In other words, it's explicitly entering the UTF-8 encoding of that
non-ascii character. This works if we are specifying a narrow string
literal (and we want it to be UTF-8 encoded).

My point is that it doesn't work for a wide string literal. If you
say L"\xC2\xA9" you will *not* get that non-ascii character you
wanted. Instead, you get two UTF-16 (or UTF-32, depending on
how large wchar_t is) characters which are completely different
from the one you wanted. You essentially get garbage.

Juha Nieminen

unread,
Jul 3, 2021, 1:06:17 PM7/3/21
to
Chris Vine <chris@cvine--nospam--.freeserve.co.uk> wrote:
> I doubt this is often relevant. What most multi-lingual programs do is
> have source strings in English using the ASCII subset of UTF-8 and
> translate to UTF-8 dynamically by reference to the locale. Gnu's
> gettext is a quite commonly used implementation of this approach.

It's quite relevant. For example, if you are writing unit tests for some
library dealing with wide strings (or UTF-16 strings), it's quite common
to write string literals in your tests, so you need to be aware of this
problem: What will work just fine with gcc might not work with Visual
Studio, and your unit test will succeed in one but not the other.

The solution offered elsewhere in this thread is the correct way to go,
ie. using the "\uXXXX" escape codes for such string literals, as they
will always be interpreted correctly by the compiler (even if the
readability of the source code suffers as a consequence).
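
For example, the strings from earlier in the thread can be written with
only ASCII characters in the source file, so the compiler's guess about the
file encoding no longer matters (a sketch):

const wchar_t*  ws  = L"Bl\u00E5b\u00E6r, Mr. Watson.";  // U+00E5, U+00E6
const char16_t* u16 = u"Copyright \u00A9 2001-2020";     // U+00A9 as one code unit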

Richard Damon

unread,
Jul 3, 2021, 2:06:35 PM7/3/21
to
And the solution for readability is to just write the code with native
literals, but NOT in the actual C++ file, and have a filter stage that
translates this file into the actual C++ code with the escapes.

The language was designed for this sort of functionality.
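
A rough sketch of such a filter: it reads UTF-8 on stdin and writes the
same text to stdout with every non-ASCII character replaced by a \uXXXX or
\UXXXXXXXX escape. It makes no attempt to handle malformed UTF-8 or to
limit itself to string literals, so treat it as a starting point only.

#include <cstdio>
#include <cstdint>

int main() {
    int c;
    while ((c = std::getchar()) != EOF) {
        if (c < 0x80) { std::putchar(c); continue; }
        // Number of continuation bytes, deduced from the lead byte.
        const int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
        std::uint32_t cp = c & (0x3F >> extra);
        for (int i = 0; i < extra; ++i)
            cp = (cp << 6) | (std::getchar() & 0x3F);
        if (cp <= 0xFFFF) std::printf("\\u%04X", static_cast<unsigned>(cp));
        else              std::printf("\\U%08X", static_cast<unsigned>(cp));
    }
}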

James Kuyper

unread,
Jul 3, 2021, 5:50:02 PM7/3/21
to
On 7/3/21 12:59 PM, Juha Nieminen wrote:
> James Kuyper <james...@alumni.caltech.edu> wrote:
>> The comment that led to this sub-thread was specifically about the
>> usability of escape sequences to specify members of the extended
>> character set, and that's the only thing I was talking about. While that
>> string does contain such members, it contains not a single escape sequence.
>
> The problem is that the "\xC2\xA9" was presented as a solution to
> the compiler wrongly assuming some source file encoding other than UTF-8.
> Those two bytes are the UTF-8 encoding of a non-ascii character.

That sequence specifies a character with a value of 0xC2 followed by a
character with a value of 0xA9. When the characters in question are
wider than 8 bits, that is NOT the UTF-8 encoding of the character you
want. Which just means you need to specify the right character.

> In other words, it's explicitly entering the UTF-8 encoding of that
> non-ascii character. This works if we are specifying a narrow string
> literal (and we want it to be UTF-8 encoded).
>
> My point is that it doesn't work for a wide string literal. If you
> say L"\xC2\xA9" you will *not* get that non-ascii character you
> wanted. ...

That's because you didn't specify what you wanted. You should have used
\u00A9 rather than \xC2\xA9.

> ... Instead, you get two UTF-16 (or UTF-32, depending on
> how large wchar_t is) characters which are completely different
> from the one you wanted. You essentially get garbage.

You got precisely what you specified - if it's not what you wanted, you
need to change your specification.

Juha Nieminen

unread,
Jul 8, 2021, 4:11:52 AM7/8/21
to
James Kuyper <james...@alumni.caltech.edu> wrote:
>> ... Instead, you get two UTF-16 (or UTF-32, depending on
>> how large wchar_t is) characters which are completely different
>> from the one you wanted. You essentially get garbage.
>
> You got precisely what you specified - if it's not what you wanted, you
> need to change your specification.

No, I didn't. I wanted a way to specify wide string literals, and that
solution was incorrect.

Juha Nieminen

unread,
Jul 8, 2021, 4:13:20 AM7/8/21
to
Clearly you have never written unit tests.

James Kuyper

unread,
Jul 8, 2021, 5:53:00 AM7/8/21
to
Paavo Helde's solution of using "\xC2\xA9" was correct for narrow string
literals (on systems with CHAR_BIT==8, a requirement that he didn't
bother mentioning). He was relying upon a UTF-8 => UTF-16 conversion
routine of his own creation to get the corresponding wide string.

You asked whether L"\xC2\xA9" would work, and the answer is "No",
because it specifies two wide characters when only one is desired. You
were aware that it wouldn't work, but seemed to be suggesting that
there's a potentially faulty UTF-8=>UTF-16 conversion involved in its
failure to be correct. There is no such conversion. L"\xC2\xA9"
specifies directly a wchar_t array of length 3 initialized with {0xC2,
0xA9, 0}, which is not what you wanted.

I initially didn't address that point properly because I hadn't realized
that only one character was desired.
However, u"\xA9" or U"\xA9" would work fine; L"\xA9" should produce the
desired result on systems where wchar_t uses UCS2 or UCS4 (==UTF-32)
encoding.
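
A compile-time spot check of those three, as a sketch:

static_assert(u"\xA9"[0] == 0x00A9 && u"\xA9"[1] == 0, "UTF-16: one code unit");
static_assert(U"\xA9"[0] == 0x000000A9, "UTF-32: one code unit");
static_assert(L"\xA9"[0] == 0xA9, "wchar_t: the code unit value as written");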

James Kuyper

unread,
Jul 8, 2021, 3:56:27 PM7/8/21
to
On 7/3/21 10:28 AM, Alf P. Steinbach wrote:
> On 3 Jul 2021 14:44, James Kuyper wrote:
>> On 7/3/21 7:31 AM, Alf P. Steinbach wrote:
>>> On 3 Jul 2021 06:44, James Kuyper wrote:
> [snippety]
>>>>
>>>> UTF-8: u8"\u00A9\U000000A9"
>>>> UTF-16: u"\251\xA9\u00A9\U000000A9"
>>>> UTF-32: U"\251\xA9\u00A9\U000000A9"
>>>>
>>>> Do you know of any implementation which is non-conforming because it
>>>> misinterprets any of those escape sequences?
>>> No, they should work. These escapes are an alternative solution to
>>> Juha's problem. ...
>>
>> They are the only solution that this sub-thread has been about.
>>
>>> ... However, they lack readability and involve much more
>>> work than necessary, so IMO the thing to do is to assert UTF-8.
>>
>> Those are reasonable concerns. That the system's assumptions about the
>> source character set would prevent those escapes from working is not.
>
> As far as I know nobody's argued that the source encoding assumption
> would prevent any escapes from working.

You said "It gets the wrong characters in the wide string literal,
period.", and other parts of the discussion implicated source encoding
assumptions as the reason why. The use of "period" implies no
exceptions, and there's a very large set of exceptions: at least two,
and as many as four, fully portable working escape sequences for every
single Unicode code point.

> If I understand you correctly your “this sub-thread” about escapes and
> universal character designators -- let's just call them all escapes
> -- started when you responded to my response to Richard Damon, who had
> responded to Juha Nieminen, who wrote:
>
>
> [>>]
> Does that work for wide string literals? Because I don't think it does.
> In other words:
>
> std::wstring s = L"Copyright \xC2\xA9 2001-2020";
> [<<]
>
>
> Richard responded to that:
>
>
> [>>]
> \x works in wide string literal too, and puts in a character with that
> value. The difference is that if the wide string type isn't unicode
> encoded then it might get the wrong character in the string.
> [<<]
>
>
> I responded to Richard:
>
>
> [>>]
> It gets the wrong characters in the wide string literal, period.
> [<<]
>
>
> Which it decidedly does.
>
> It's trivial to just try it out and see; QED.

I did try it: as he said, it can get the wrong character if the string
type isn't unicode encoded, and as I pointed out, it can also get the
wrong character if the wrong escape sequence is used (which seems
trivially obvious). But it's perfectly capable of giving the right
characters when the right escape sequence is used with a prefix that
mandates a unicode encoding.

By saying "... it gets the wrong characters ... period.", you were
denying that it's ever possible for it to get the right characters,
which is demonstrably false. I've tried out the sequences I specified in
the message you quoted above. They all work on my systems, and according
to my understanding of the standard, they're required to work on all
fully conforming implementations, regardless of source encoding
assumptions - if that's not the case, I want to know how the exceptions
can be justified.

...
> So, in your mind this sub-thread may have been about whether escape
> sequences (including universal character designators) are affected by
> the source encoding, but to me it has been about whether Juha's example
> yields the desired string, as he correctly surmised that it didn't.

Yes, but that's because it was the wrong escape sequence, not because
there's any inherent problem with using correct escape sequences for
that purpose.
