
Character encoding conversion in wide string literals


Juha Nieminen

Dec 7, 2021, 6:37:29 AM
Recently I stumbled across a problem where I had wide string literals
with non-ascii characters UTF-8 encoded. In other words, I had code like
this (I'm using non-ascii in the code below, I hope it doesn't get
mangled up, but even if it does, it should nevertheless be clear what
I'm trying to express):

std::wstring str = L"non-ascii chars: ???";

The C++ source file itself uses UTF-8 encoding, meaning that that line
of code is likewise UTF-8 encoded. If it were a narrow string literal
(being assigned to a std::string) then it works just fine (primarily
because the compiler doesn't need to do anything to it, it can simply
take those bytes from the source file as is).

However, since it's a wide string literal (being assigned to a std::wstring)
it's not as clear-cut anymore. What does the standard say about this
situation?

The thing is that it works just fine in Linux using gcc. The compiler will
re-encode the UTF-8 encoded characters inside the quotes in the source
file into whatever encoding wide char strings use, so the correct
content will end up in the executable binary (and thus in the wstring).

Apparently it does not work correctly in (some recent version of)
Visual Studio, where it seems the compiler just takes the byte values
from the source file within the quotes as-is and assigns those values
directly to the wide chars that end up in the binary. (Or something like that.)

Does the standard specify what the compiler should do in this situation?
If not, then what is the proper way of specifying wide string literals
that contain non-ascii characters?

Alf P. Steinbach

Dec 7, 2021, 10:46:07 AM
On 7 Dec 2021 12:37, Juha Nieminen wrote:
> Recently I stumbled across a problem where I had wide string literals
> with non-ascii characters UTF-8 encoded. In other words, I had code like
> this (I'm using non-ascii in the code below, I hope it doesn't get
> mangled up, but even if it does, it should nevertheless be clear what
> I'm trying to express):
>
> std::wstring str = L"non-ascii chars: ???";
>
> The C++ source file itself uses UTF-8 encoding, meaning that that line
> of code is likewise UTF-8 encoded. If it were a narrow string literal
> (being assigned to a std::string) then it works just fine (primarily
> because the compiler doesn't need to do anything to it, it can simply
> take those bytes from the source file as is).
>
> However, since it's a wide string literal (being assigned to a std::wstring)
> it's not as clear-cut anymore. What does the standard say about this
> situation?
>
> The thing is that it works just fine in Linux using gcc. The compiler will
> re-encode the UTF-8 encoded characters in the source file inside the
> parentheses into whatever encoding wide char string use, so the correct
> content will end up in the executable binary (and thus in the wstring).
>
> Apparently it does not work correctly in (some recent version of)
> Visual Studio, where apparently it just takes the byte values from the
> source file within the parentheses as-is, and just assigns those values
> as-is to the wide chars that end up in the binary. (Or something like that.)

The Visual C++ compiler assumes that source code is Windows ANSI encoded
unless

* you use an encoding option such as `/utf-8`, or
* the source is UTF-8 with BOM, or
* the source is UTF-16.

Independently of that, Visual C++ assumes that the execution character
set (the byte-based encoding that should be used for text data in the
executable) is Windows ANSI, unless it's specified as something else.
The `/utf-8` option also specifies that. It's a combo option that
specifies both the source encoding and the execution character set as UTF-8.

Unfortunately as of VS 2022 `/utf-8` is not set by default in a VS
project, and unfortunately there's nothing you can just click to set it.
You have to type it in (right-click the project, then Properties ->
C/C++ -> Command Line). I usually set "/utf-8 /Zc:__cplusplus".
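
For what it's worth, here is a minimal sketch of the effect (the file
name and the extra options are just examples, not requirements):

// demo.cpp -- build with e.g.  cl /utf-8 /Zc:__cplusplus /EHsc demo.cpp
// With /utf-8 both the source character set and the execution character
// set are UTF-8, so literals in a UTF-8 encoded source file survive intact.
#include <iostream>
#include <string>

int main()
{
    // UCN escapes sidestep any doubt about the source file's encoding.
    std::wstring ws = L"non-ascii chars: \u00E4\u00F6\u00FC";
    std::string  ns = "narrow literal, stored as UTF-8 bytes under /utf-8";

    std::cout << ws.size() << '\n';   // 20 wide characters
    std::cout << ns << '\n';
}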


> Does the standard specify what the compiler should do in this situation?
> If not, then what is the proper way of specifying wide string literals
> that contain non-ascii characters?

I'll let others discuss that, but (1) it does, and (2) just so you're
aware: the main problem is that the C and C++ standards do not conform
to reality in their requirement that a `wchar_t` value should suffice to
encode all possible code points in the wide character set.

In Windows, wide text is UTF-16 with 16-bit `wchar_t`. This means that
some emojis etc. that appear as a single character and constitute one
21-bit code point can become a pair of two `wchar_t` values, a UTF-16
"surrogate pair".

That's probably not your problem though, but it is a/the problem.
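
A quick way to see that, if you want (just a sketch; the emoji code
point is an arbitrary example):

#include <iostream>
#include <string>

int main()
{
    std::wstring s = L"\U0001F600";   // GRINNING FACE, code point U+1F600
    // On Windows (16-bit wchar_t, UTF-16) this is a surrogate pair, so
    // s.size() == 2; on Linux (32-bit wchar_t, UTF-32) s.size() == 1.
    std::cout << s.size() << '\n';
}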


- Alf

James Kuyper

Dec 7, 2021, 11:39:00 AM
The standard says a great many things about it, but the most important
things it says are that the relevant character sets and encodings are
implementation-defined. If an implementation uses UTF-8 for its native
character encoding, your code should work fine. The most likely
explanation for why it doesn't work is that your UTF-8 encoded source
code file is being interpreted using some other encoding, probably ASCII
or one of its many variants.

I have relatively little experience programming for Windows, and
essentially none with internationalization. Therefore, the following
comments about Windows all convey second or third-hand information, and
should be treated accordingly. Many people posting on this newsgroup
know more than I do about such things - hopefully someone will correct
any errors I make:

* When Unicode first came out, Windows chose to use UCS-2 to support
it, and made that its default character encoding.
* When Unicode expanded beyond the capacity of UCS-2, Windows decided to
transition over to using UTF-16. There was an annoyingly long transition
period during which some parts of Windows used UTF-16, while other parts
still used UCS-2. I cannot confirm whether or not that transition period
has completed yet.
* I remember hearing rumors that modern versions of Windows do provide
some support for UTF-8, but that support is neither complete nor the
default. You have to know what you need to do to enable such support - I don't.

> If not, then what is the proper way of specifying wide string literals
> that contain non-ascii characters?

The most portable way of doing it is to use what the standard calls
Universal Character Names, or UCNs for short. "\u" followed by four
hexadecimal digits represents the character whose Unicode code point is
identified by those digits. "\U" followed by eight hexadecimal digits
represents the character whose Unicode code point is identified by those
digits.
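
For example (just a sketch, with arbitrary characters), the literal from
the original post could be written without any non-ASCII bytes in the
source file at all:

#include <string>

// U+00E4, U+00F6, U+00FC are a-umlaut, o-umlaut, u-umlaut; U+1F600 is an
// emoji outside the Basic Multilingual Plane, so it needs the 8-digit form.
std::wstring str1 = L"non-ascii chars: \u00E4\u00F6\u00FC";
std::wstring str2 = L"outside the BMP: \U0001F600";
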
Here are some key things to keep in mind when using UCNs (a short
example follows the list):

5.2p1: during translation phase 1, the implementation is required to
convert any source file character that is not in the basic source
character set into the corresponding UCN.
5.2p2: Interrupting a UCN with an escaped new-line has undefined behavior.
5.2p4: Creating something that looks like a UCN by using the ## operator
has undefined behavior.
5.2p5: During translation phase 5, UCNs are converted to the execution
character set.
5.3p2: A UCN whose hexadecimal digits don't represent a code point or
which represents a surrogate code point renders the program ill-formed.
A UCN that represents a control character or a member of the basic
character set renders the program ill-formed unless it occurs in a
character literal or string literal.
5.4p3: The conversion to UCNs is reverted in raw string literals.
5.10p1: UCNs are allowed in identifiers, but only if they fall into one
of the ranges listed in Table 2 of the standard.
5.13.3p8: Any UCN for which there is no corresponding member of the
execution character set is translated to an implementation-defined encoding.
5.13.5p13: A UCN occurring in a UTF-16 string literal may yield a
surrogate pair. A UCN occurring in a narrow string literal may map to
one or more char or char8_t elements.
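
To make a couple of those points concrete, here is a small sketch (the
code points are arbitrary examples):

#include <string>

std::wstring a = L"\u00E4";       // one wide character, a-umlaut (U+00E4)
std::wstring b = LR"(\u00E4)";    // raw literal: the UCN conversion is
                                  // reverted, so this holds the six
                                  // characters \ u 0 0 E 4
// std::u16string c = u"\uD800";  // ill-formed: D800 is a surrogate code point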

Here's a more detailed explanation of what the standard says about this
situation:
The standard talks about three different implementation-defined
character sets:
* The physical source character set which is used in your source code file.
* The source character set which is used internally by the compiler
while processing your code.
* The execution character set used by your program when it is executed.

The standard talks about five different character encodings:
* The implementation-defined narrow and wide native encodings, used by
character constants and string literals with no prefix or with the "L"
prefix, respectively. These are stored in arrays of char and wchar_t,
respectively.
* The UTF-8, UTF-16, and UTF-32 encodings, used by character constants
and string literals with the u8, u, and U prefixes, respectively. These
are stored in arrays of char8_t, char16_t, and char32_t, respectively.

Virtually every standard library template that handles characters is
required to support specializations for wchar_t, char8_t, char16_t, and
char32_t.

The standard mandates support for std::codecvt facets enabling
conversion between the narrow and wide native encodings, and facets for
converting between UTF-8 and either UTF-16 or UTF-32.
The standard specifies the <cuchar> header, which incorporates routines
from the C standard library header <uchar.h> for converting between the
narrow native encoding and either UTF-16 or UTF-32.
Therefore, conversion between wchar_t and either char16_t or char32_t
requires three conversion steps.
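
To illustrate the <cuchar> half of that, here is a sketch of a
narrow-native-to-UTF-32 conversion using std::mbrtoc32. It assumes the
current C locale's narrow encoding matches the input string's encoding,
and the error handling is deliberately minimal:

#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

std::u32string narrow_to_utf32(const std::string& in)
{
    std::u32string out;
    std::mbstate_t state{};
    const char* p = in.data();
    std::size_t left = in.size();
    while (left > 0) {
        char32_t c = 0;
        // Consumes a variable number of bytes and returns how many were used.
        std::size_t n = std::mbrtoc32(&c, p, left, &state);
        if (n == 0 || n == (std::size_t)-1 || n == (std::size_t)-2)
            break;   // embedded '\0', invalid sequence, or incomplete sequence
        out.push_back(c);
        p += n;
        left -= n;
    }
    return out;
}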

James Kuyper

Dec 7, 2021, 11:49:18 AM
On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
...
> aware: the main problem is that the C and C++ standards do not conform
> to reality in their requirement that a `wchar_t` value should suffice to
> encode all possible code points in the wide character set.

The purpose of the C and C++ standards is prescriptive, not descriptive.
It's therefore missing the point to criticize them for not conforming to
reality. Rather, you should say that some popular implementations fail
to conform to the standards.

> In Windows wide text is UTF-16, with 16-bit `wchar_t`. Which means that
> some emojis etc. that appear as a single character and constitute one
> 21-bit code point, can become a pair of two `wchar_t` values, an UTF-16
> "surrogate pair".

The C++ standard explicitly addresses that point, though the C standard
does not.

Keith Thompson

Dec 7, 2021, 12:42:00 PM
What exactly do you mean by "Windows ANSI"? Windows-1252 or something
else? (Microsoft doesn't call it "ANSI", because it isn't.)

[...]

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Alf P. Steinbach

Dec 7, 2021, 1:00:13 PM
On 7 Dec 2021 17:48, James Kuyper wrote:
> On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
> ...
>> aware: the main problem is that the C and C++ standards do not conform
>> to reality in their requirement that a `wchar_t` value should suffice to
>> encode all possible code points in the wide character set.
>
> The purpose of the C and C++ standards is prescriptive, not descriptive.
> It's therefore missing the point to criticize them for not conforming to
> reality. Rather, you should say that some popular implementations fail
> to conform to the standards.

No, in this case it's the standard's fault. They failed to standardize
existing practice and instead standardized a completely unreasonable
requirement, given that 16-bit `wchar_t` was established as the API
foundation in the most widely used OS on the platform, something that
could not easily be changed. In particular this was the C standard
committee: their choice here was as reasonable and practical as their
choice of not supporting pointers outside of original (sub-) array.

It was idiotic. These were simple blunders. But in both cases, as I
recall, they tried to cover up the blunder by writing a rationale; they
took the blunders to heart and made them into great obstacles, so as not
to lose face.


>> In Windows wide text is UTF-16, with 16-bit `wchar_t`. Which means that
>> some emojis etc. that appear as a single character and constitute one
>> 21-bit code point, can become a pair of two `wchar_t` values, an UTF-16
>> "surrogate pair".
>
> The C++ standard explicitly addresses that point, though the C standard
> does not.

Happy to hear that but some more specific information would be welcome.


- Alf

Alf P. Steinbach

Dec 7, 2021, 1:07:18 PM
"Windows ANSI" is the encoding specified by the `GetACP` API function,
which, but as I recall that's more or less undocumented, just serves up
the codepage number specified by registry value

Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage@ACP

This means that "Windows ANSI" is a pretty dynamic thing. Not just
system-dependent, but at-the-moment-configuration dependent. Though in
English-speaking countries it's Windows 1252 by default.

And that in turn means that using the defaults with Visual C++, you can
end up with pretty much any encoding whatsoever of narrow literals.

Which means that it's a good idea to take charge.

Option `/utf-8` is one way to take charge.
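
As an aside, you can check at run time what the current "Windows ANSI"
code page is (a small sketch):

#include <windows.h>
#include <cstdio>

int main()
{
    // 1252 on a default English-language system, but it can be anything,
    // including 65001 (UTF-8) if the system is configured that way.
    std::printf("ANSI code page: %u\n", GetACP());
    std::printf("OEM code page:  %u\n", GetOEMCP());
}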


- Alf

Manfred

Dec 7, 2021, 1:07:34 PM
On 12/7/2021 5:38 PM, James Kuyper wrote:
> * I remember hearing rumors that modern versions of Windows do provide
> some support for UTF-8, but that support is neither complete, nor the
> default. You have know what you need to do to enable such support - I don't.

One relevant addition that is relatively recent is support for
conversion to/from UTF-8 in the WideCharToMultiByte and
MultiByteToWideChar APIs. These make it possible to handle UTF-8
programmatically in code.
Windows itself still uses UTF-16 internally.
I don't know how filenames are stored on disk.
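
For example, a UTF-16-to-UTF-8 conversion with that API might look
roughly like this (a sketch with minimal error handling, assuming the
input is valid UTF-16):

#include <windows.h>
#include <string>

std::string utf16_to_utf8(const std::wstring& ws)
{
    if (ws.empty()) return {};
    // First call computes the required size, second call does the conversion.
    int n = WideCharToMultiByte(CP_UTF8, 0, ws.data(), (int)ws.size(),
                                nullptr, 0, nullptr, nullptr);
    std::string out(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, ws.data(), (int)ws.size(),
                        &out[0], n, nullptr, nullptr);
    return out;
}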

Paavo Helde

Dec 7, 2021, 1:33:16 PM
On 07.12.2021 19:41, Keith Thompson wrote:
>
> What exactly do you mean by "Windows ANSI"? Windows-1252 or something
> else? (Microsoft doesn't call it "ANSI", because it isn't.)

It does. From
https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp

"Retrieves the current Windows ANSI code page identifier for the
operating system."

This is in contrast to the GetOEMCP() function which is said to return
"OEM code page", not "ANSI code page". Both terms are misnomers from the
previous century.

Both these codepage settings traditionally refer to some narrow char
codepage identifiers, which will vary depending on the user regional
settings and are thus unpredictable and unusable for basically anything
related to internationalization.

The only meaningful strategy is to set both of these to UTF-8, which
now finally has some (beta-stage?) support in Windows 10, and to upgrade
all affected software to properly support this setting.


James Kuyper

Dec 7, 2021, 1:55:46 PM
On 12/7/21 12:59 PM, Alf P. Steinbach wrote:
> On 7 Dec 2021 17:48, James Kuyper wrote:
>> On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
...
>> The purpose of the C and C++ standards is prescriptive, not descriptive.
>> It's therefore missing the point to criticize them for not conforming to
>> reality. Rather, you should say that some popular implementations fail
>> to conform to the standards.
>
> No, in this case it's the standard's fault. They failed to standardize
> existing practice and instead standardized a completely unreasonable
> requirement, given that 16-bit `wchar_t` was established as the API
> foundation in the most widely used OS on the platform, something that
> could not easily be changed. In particular this was the C standard
> committee: their choice here was as reasonable and practical as their
> choice of not supporting pointers outside of original (sub-) array.

It was existing practice. From the very beginning, wchar_t was supposed
to be "an integral type whose range of values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales". When char32_t was added to the language,
moving that specification to char32_t might have been a reasonable thing
to do, but continuing to apply that specification to wchar_t was NOT an
innovation. The same version of the standard that added char32_t also
added char16_t, which is what should now be used for UTF-16 encoding,
not wchar_t.

It's an abuse of what wchar_t was intended for, to use it for a
variable-length encoding. None of the functions in the C or C++ standard
library for dealing with wchar_t values has ever had the right kind of
interface to allow it to be used as a variable-length encoding. To see
what I'm talking about, look at the mbrto*() and *tomb() functions from
the C standard library, which have been incorporated by reference into
the C++ standard library. Those functions do have interfaces designed to
handle a variable-length encoding.

...
>>> In Windows wide text is UTF-16, with 16-bit `wchar_t`. Which means that
>>> some emojis etc. that appear as a single character and constitute one
>>> 21-bit code point, can become a pair of two `wchar_t` values, an UTF-16
>>> "surrogate pair".
>>
>> The C++ standard explicitly addresses that point, though the C standard
>> does not.
>
> Happy to hear that but some more specific information would be welcome.

5.3p2:
"A universal-character-name designates the character in ISO/IEC 10646
(if any) whose code point is the hexadecimal number represented by the
sequence of hexadecimal-digits in the universal-character-name. The
program is ill-formed if that number ... is a surrogate code point. ...
A surrogate code point is a value in the range [D800, DFFF] (hexadecimal)."

5.13.5p8: "[Note: A single c-char may produce more than one char16_t
character in the form of surrogate pairs. A surrogate pair is a
representation for a single code point as a sequence of two 16-bit code
units. — end note]"

5.13.5p13: "a universal-character-name in a UTF-16 string literal may
yield a surrogate pair. ... The size of a UTF-16 string literal is the
total number of escape sequences, universal-character-names, and other
characters, plus one for each character requiring a surrogate pair, plus
one for the terminating u’\0’."

Note that it's UTF-16, which should be encoded using char16_t, for which
this issue is acknowledged. wchar_t is not, and never was, supposed to
be a variable-length encoding like UTF-8 and UTF-16.
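
Concretely (a small sketch), the 5.13.5p13 wording covers cases like this:

#include <string>

// U+1F600 is outside the BMP, so in a UTF-16 string literal the single
// UCN yields a surrogate pair: two char16_t code units.
std::u16string s = u"\U0001F600";
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3,
              "two code units plus the terminating u'\\0'");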

James Kuyper

Dec 7, 2021, 1:56:10 PM
Note that it was referred to as "ANSI" because Microsoft proposed it for
ANSI standardization, but that proposal was never approved. Continuing
to refer to it as "ANSI" decades later is a rather sad failure to
acknowledge that rejection.

Keith Thompson

Dec 7, 2021, 3:22:04 PM
It appears my previous statement was incorrect. At least some Microsoft
documentation does still (incorrectly) refer to "Windows ANSI".

https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp

The history, as I recall, is that Microsoft proposed one or more 8-bit
extensions of the 7-bit ASCII character set as ANSI standards.
Windows-1252, which has various accented letters and other symbols in
the range 128-255, is the best known variant. But Microsoft's proposal
was never adopted by ANSI, leaving us with a bunch of incorrect
documentation. Instead, ISO created the 8859-* 8-bit character sets,
including 8859-1, or Latin-1. Latin-1 differs from Windows-1252 in that
Latin-1 has control characters in the range 128-159, while
Windows-1252 has printable characters.

https://en.wikipedia.org/wiki/Windows-1252

Keith Thompson

Dec 7, 2021, 3:26:53 PM
Paavo Helde <ees...@osa.pri.ee> writes:
> On 07.12.2021 19:41, Keith Thompson wrote:
>> What exactly do you mean by "Windows ANSI"? Windows-1252 or
>> something
>> else? (Microsoft doesn't call it "ANSI", because it isn't.)
>
> It does. From
> https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
>
> "Retrieves the current Windows ANSI code page identifier for the
> operating system."

Yes, I had missed that.

But Microsoft has also said:

The term ANSI as used to signify Windows code pages is a historical
reference, but is nowadays a misnomer that continues to persist in
the Windows community.

https://en.wikipedia.org/wiki/Windows-1252
https://web.archive.org/web/20150204175931/http://download.microsoft.com/download/5/6/8/56803da0-e4a0-4796-a62c-ca920b73bb17/21-Unicode_WinXP.pdf

Microsoft's mistake was to start using the term "ANSI" before it
actually became an ANSI standard. Once that mistake was in place,
cleaning it up was very difficult.

> This is in contrast to the GetOEMCP() function which is said to return
> "OEM code page", not "ANSI code page". Both terms are misnomers from
> the previous century.
>
> Both these codepage settings traditionally refer to some narrow char
> codepage identifiers, which will vary depending on the user regional
> settings and are thus unpredictable and unusable for basically
> anything related to internationalization.
>
> The only meaningful strategy is to set these both to UTF-8 which now
> finally has some (beta stage?) support in Windows 10, and to upgrade
> all affected software to properly support this setting.

Yes, I advocate using UTF-8 whenever practical.

Öö Tiib

Dec 7, 2021, 7:39:15 PM
On Tuesday, 7 December 2021 at 20:00:13 UTC+2, Alf P. Steinbach wrote:
> On 7 Dec 2021 17:48, James Kuyper wrote:
> > On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
> > ...
> >> aware: the main problem is that the C and C++ standards do not conform
> >> to reality in their requirement that a `wchar_t` value should suffice to
> >> encode all possible code points in the wide character set.
> >
> > The purpose of the C and C++ standards is prescriptive, not descriptive.
> > It's therefore missing the point to criticize them for not conforming to
> > reality. Rather, you should say that some popular implementations fail
> > to conform to the standards.
> No, in this case it's the standard's fault. They failed to standardize
> existing practice and instead standardized a completely unreasonable
> requirement, given that 16-bit `wchar_t` was established as the API
> foundation in the most widely used OS on the platform, something that
> could not easily be changed. In particular this was the C standard
> committee: their choice here was as reasonable and practical as their
> choice of not supporting pointers outside of original (sub-) array.
>
> It was idiotic. It was simple blunders. But inn both cases, as I recall,
> they tried to cover up the blunder by writing a rationale; they took the
> blunders to heart and made them into great obstacles, to not lose face.

If the C and/or C++ committee had standardized that wchar_t means
precisely "UTF-16 LE code unit" and nothing else, then it would be
something different on Windows by now.

In the case of Microsoft, the only way to make it change its idiotic
"existing practices" appears to be to standardize them. Once an idiotic
practice of Microsoft's is standardized, Microsoft finds the resources
to switch from it to some reasonable one (as its "innovative"
extension).

David Brown

Dec 8, 2021, 3:18:30 AM
My understanding is that at that time, the Windows wide character set
was UCS-2, not UTF-16. Thus a 16-bit wchar_t was sufficient to encode
all wide characters.

It turned out that UCS-2 was a dead end, and now UTF-16 is a hack job
that combines all the disadvantages of UTF-8 with all the disadvantages
of UTF-32, and none of the benefits of either. We can't blame MS for
going for UCS-2 - they were early adopters and Unicode was 16-bit, so it
was a good choice at the time. They, and therefore their users, were
unlucky (along with Java, Qt, Python, and no doubt others). Changing is
not easy - you have to make everything UTF-8 and yet still support a
horrible mix of wchar_t, char16_t, UCS-2, and UTF-16 for legacy.

But as far as I can see, the C and C++ standards were fine with 16-bit
wchar_t when they were written. I have heard, but have no reference or
source, that the inclusion of 16-bit wchar_t in the standards was
promoted by MS in the first place.