On 30 Jun 2021 09:57, Juha Nieminen wrote:
> Character encoding was a problem in the 1960's, and it's still a problem
> today, no matter how much computers advance. Sheesh.
>
> Problem is, how to reliably write wide char string literals that contain
> non-ascii characters?
>
> Suppose you write for example this:
>
> const wchar_t* str = L"???";
>
> In the *source code* that string literal may be eg. UTF-8 encoded. However,
> the compiler needs to convert it to wide chars.
>
> Problem is, how does the compiler know which encoding is being used in
> that 8-bit string literal in the source code, in order for it to convert
> it properly to wide chars?
The compiler necessarily assumes some source encoding.

g++ and Visual C++ use different schemes for determining that assumption.
g++ applies a single encoding assumption to all files, which you can
change via options, while Visual C++ by default determines the encoding
of each individual file, which is a much more flexible scheme. However,
in modern programming work you don't want that flexibility, because
Visual C++'s base assumption, when no other indication is present, is
that a file is Windows ANSI encoded, whereas a modern source file is
most likely UTF-8 encoded. So it's now a good idea to use the Visual C++
UTF-8 option, plus some others, e.g.
    /nologo /utf-8 /EHsc /GR /permissive- /FI"iso646.h" /std:c++17
    /Zc:__cplusplus /Zc:externC- /W4 /wd4459
    /D _CRT_SECURE_NO_WARNINGS=1 /D _STL_SECURE_NO_WARNINGS=1
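For comparison, g++'s corresponding knobs, as far as I recall the
spellings, are `-finput-charset` for the source encoding assumption and
`-fexec-charset` for the encoding of narrow literals, e.g.

    g++ -std=c++17 -Wall -finput-charset=UTF-8 -fexec-charset=UTF-8 app.cpp

though both default to UTF-8 in reasonably modern g++, so you rarely
need to spell them out.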
> Some compilers may assume it's UTF-8 encoded source code. Others may
> assume it's ISO-Latin-1 encoded (I'm looking at you, Visual Studio).
> Obviously the end result will be garbage if the wrong assumption is made.
Yes. You can to some extent prevent Visual C++ misinterpretation by
using the UTF-8 BOM as an encoding indicator, and I recommend that.
However, there are costs, in particular that mindless Linux fanbois (all
fanbois are mindless, even C++ fanbois) hung up on supporting archaic
Linux tools that can't handle the BOM can then brand you as this and
that; and that's not hypothetical, it's direct experience. Also, even
though using a BOM is a very strong convention in Windows, the Cmd `type`
command can't handle it, so one is nudged in the direction of
PowerShell, which is a monstrosity that I really hate.
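For concreteness, the UTF-8 BOM is just the three bytes EF BB BF at the
start of a file, so a tool can sniff for it. A minimal sketch of such a
check (my own illustration, not any particular tool's logic):

    #include <fstream>
    #include <string>

    // Yields true if the file starts with the UTF-8 BOM bytes EF BB BF.
    auto starts_with_utf8_bom( const std::string& path )
        -> bool
    {
        std::ifstream f( path, std::ios::binary );
        char bytes[3] = {};
        f.read( bytes, 3 );
        return f.gcount() == 3
            and bytes[0] == '\xEF' and bytes[1] == '\xBB' and bytes[2] == '\xBF';
    }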
> In most compilers (such as Visual Studio) you can specify which encoding
> to assume for source files, but this has to be done at the project
> settings level.
Uhm, no, you can specify compiler options per file if you want, in each
file's properties.

Visual Studio 2019 screenshot: https://ibb.co/tJ5jNJC
> I don't think there's any way to specify the encoding
> in the source code itself.
Not in standard C++. For Visual C++ there is an undocumented (or it used
to be undocumented) `#pragma` used e.g. in automatically generated
resource scripts, .rc files; I don't recall the name. Also, there is the
UTF-8 BOM. A UTF-8 BOM is a pretty surefire way to force the UTF-8
assumption.
> What does the C++ standard say? Does it say that source code files are
> always UTF-8 encoded, or is it up to the implementation?
It's totally up to the implementation.
That wouldn't be so bad if the standard had addressed the issue of a
collection of source files, in particular headers, with different
encodings, e.g. if the standard had /required/ all source files in a
translation unit to have the same encoding.
That's the assumption of g++, but not of Visual C++.
> I assume that if
> it's the latter, the standard doesn't provide any mechanism to specify
> which encoding is being used. Or does it?
Right. It's a mess. :-o :-)
But, practical solutions:
• Use a UTF-8 BOM and Just Ignore™ the whining from Linux fanbois.
• For good measure, also use the `/utf-8` option with Visual C++.
• Where it matters, you can /statically assert/ UTF-8 encoding.
Such a `static_assert` depends both on the compiler's source file
encoding assumption being correct, whatever it is, and on the basic
execution character set (the encoding of narrow literals in the
executable) being UTF-8. These are separate encoding choices and can be
specified separately with both g++ and Visual C++. But assuming both
hold:
    constexpr inline auto utf8_is_the_execution_character_set()
        -> bool
    {
        constexpr auto& slashed_o = "ø";
        return (sizeof( slashed_o ) == 3
            and slashed_o[0] == '\xC3' and slashed_o[1] == '\xB8');
    }
When a `static_assert(utf8_is_the_execution_character_set())` holds, you
can be pretty sure that the source encoding assumption is correct.
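For example, at namespace scope (a usage sketch; the message text is
just my own wording):

    static_assert(
        utf8_is_the_execution_character_set(),
        "This file requires UTF-8 source and execution character sets." );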
- Alf