The standard says a great many things about it, but the most important
things it says are that the relevant character sets and encodings are
implementation-defined. If an implementation uses UTF-8 for its native
character encoding, your code should work fine. The most likely
explanation for why it doesn't work is that your UTF-8 encoded source
code file is being interpreted using some other encoding, probably one
of the many 8-bit extensions of ASCII.
I have relatively little experience programming for Windows, and
essentially none with internationalization. Therefore, the following
comments about Windows all convey second or third-hand information, and
should be treated accordingly. Many people posting on this newsgroup
know more than I do about such things - hopefully someone will correct
any errors I make:
* When Unicode first came out, Windows chose to use UCS-2 to support
it, and made that its default character encoding.
* When Unicode expanded beyond the capacity of UCS-2, Windows decided to
transition over to using UTF-16. There was an annoyingly long transition
period during which some parts of Windows used UTF-16, while other parts
still used UCS-2. I cannot confirm whether or not that transition period
has completed yet.
* I remember hearing rumors that modern versions of Windows do provide
some support for UTF-8, but that support is neither complete nor the
default. You have to know what you need to do to enable such support - I don't.
> If not, then what is the proper way of specifying wide string literals
> that contain non-ascii characters?
The most portable way of doing it is to use what the standard calls
Universal Character Names, or UCNs for short. "\u" followed by four
hexadecimal digits represents the character whose code point is
identified by those digits. "\U" followed by eight hexadecimal digits
represents the character whose Unicode code point is identified by those
digits.
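For example, assuming the wide execution character set can represent
these characters, a wide string literal containing a non-ASCII character
and a non-BMP character can be written portably like this (the variable
name is mine):

    // U+00E9 LATIN SMALL LETTER E WITH ACUTE, via \u;
    // U+1F600 GRINNING FACE, via \U.
    const wchar_t *greeting = L"caf\u00E9 \U0001F600";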
Here are some key things to keep in mind when using UCNs; a short
demonstration follows the list:
5.2p1: During translation phase 1, the implementation is required to
convert any source file character that is not in the basic source
character set into the corresponding UCN.
5.2p2: Interrupting a UCN with an escaped new-line has undefined behavior.
5.2p4: Creating something that looks like a UCN by using the ## operator
has undefined behavior.
5.2p5: During translation phase 5, UCNs are converted to the execution
character set.
5.3p2: A UCN whose hexadecimal digits don't represent a code point or
which represents a surrogate code point renders the program ill-formed.
A UCN that represents a control character or a member of the basic
character set renders the program ill-formed unless it occurs in a
character literal or string literal.
5.4p3: The conversion to UCNs is reverted in raw string literals.
5.10p1: UCNs are allowed in identifiers, but only if they fall into one
of the ranges listed in Table 2 of the standard.
5.13.3p8: Any UCN for which there is no corresponding member of the
execution character set is translated to an implementation-defined encoding.
5.13.5p13: A UCN occurring in a UTF-16 string literal may yield a
surrogate pair. A UCN occurring in a narrow string literal may map to
one or more char or char8_t elements.
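To make a couple of these rules concrete, here's a small sketch (the
assertions reflect my reading of 5.4p3 and 5.13.5p13):

    #include <cassert>
    #include <cstring>

    int main() {
        // 5.13.5p13: U+1F600 lies outside the BMP, so in a UTF-16
        // string literal the UCN yields a surrogate pair.
        const char16_t s[] = u"\U0001F600";
        assert(sizeof s / sizeof s[0] == 3); // two code units + terminator

        // 5.4p3: in a raw string literal the UCN is not processed;
        // the six source characters survive verbatim.
        assert(std::strcmp(R"(\u00E9)", "\\u00E9") == 0);
    }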
Here's a more detailed explanation of what the standard says about this
situation:
The standard talks about three different implementation-defined
character sets:
* The physical source character set, which is used in your source code file.
* The source character set, which is used internally by the compiler
while processing your code.
* The execution character set, which is used by your program when it is
executed.
The standard talks about five different character encodings:
* The implementation-defined narrow and wide native encodings, used by
character constants and string literals with no prefix or with the "L"
prefix, respectively. These are stored in arrays of char and wchar_t,
respectively.
* The UTF-8, UTF-16, and UTF-32 encodings, used by character constants
and string literals with the u8, u, and U prefixes, respectively. These
are stored in arrays of char8_t, char16_t, and char32_t, respectively.
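Side by side, the five kinds of literals look like this (assuming a
C++20 compiler, since char8_t is new in C++20, and an execution
character set that can represent the character):

    const char     *n   = "caf\u00E9";   // narrow native encoding
    const wchar_t  *w   = L"caf\u00E9";  // wide native encoding
    const char8_t  *s8  = u8"caf\u00E9"; // UTF-8
    const char16_t *s16 = u"caf\u00E9";  // UTF-16
    const char32_t *s32 = U"caf\u00E9";  // UTF-32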
Virtually every standard library template that handles characters is
required to support specializations for wchar_t, char8_t, char16_t, and
char32_t.
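For example, std::basic_string can be specialized for each of them, and
the standard provides the usual aliases:

    #include <string>

    std::wstring   ws  = L"wide";
    std::u8string  s8  = u8"utf-8";  // C++20
    std::u16string s16 = u"utf-16";
    std::u32string s32 = U"utf-32";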
The standard mandates support for std::codecvt facets enabling
conversion between the narrow and wide native encodings, and facets for
converting between UTF-8 and either UTF-16 or UTF-32.
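Here's a minimal sketch of using the mandated narrow/wide facet; the
helper name narrow() and the error handling are mine:

    #include <locale>
    #include <stdexcept>
    #include <string>

    // Convert a wide string to the narrow native encoding using the
    // std::codecvt<wchar_t, char, std::mbstate_t> facet.
    std::string narrow(const std::wstring &ws,
                       const std::locale &loc = std::locale()) {
        using F = std::codecvt<wchar_t, char, std::mbstate_t>;
        const F &f = std::use_facet<F>(loc);
        std::string out(ws.size() * f.max_length(), '\0');
        std::mbstate_t st{};
        const wchar_t *from_next = nullptr;
        char *to_next = nullptr;
        if (f.out(st, ws.data(), ws.data() + ws.size(), from_next,
                  &out[0], &out[0] + out.size(), to_next) != F::ok)
            throw std::runtime_error("wide->narrow conversion failed");
        out.resize(to_next - &out[0]);
        return out;
    }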
The standard specifies the <cuchar> header, which incorporates routines
from the C standard library header <uchar.h> for converting between the
narrow native encoding and either UTF-16 or UTF-32.
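For instance, std::mbrtoc16 converts one character at a time from the
narrow native encoding to UTF-16; here's a sketch of a whole-string
conversion (the helper name to_utf16() is mine):

    #include <cstring>
    #include <cuchar>
    #include <string>

    // Convert a string in the narrow native (multibyte) encoding
    // to UTF-16, one character at a time.
    std::u16string to_utf16(const char *s) {
        std::u16string out;
        std::mbstate_t st{};
        const char *p = s;
        std::size_t left = std::strlen(s);
        char16_t c16;
        while (left > 0) {
            std::size_t n = std::mbrtoc16(&c16, p, left, &st);
            if (n == static_cast<std::size_t>(-3)) {
                out.push_back(c16); // second half of a surrogate pair;
                continue;           // no input was consumed
            }
            if (n == 0 || n > left) // embedded null, or error/incomplete
                break;
            out.push_back(c16);
            p += n;
            left -= n;
        }
        return out;
    }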
Therefore, conversion between wchar_t and either char16_t or char32_t
cannot be done directly: you have to go through the narrow native
encoding as an intermediate step, converting first between the wide and
narrow encodings, and then between the narrow encoding and UTF-16 or
UTF-32 (or the same two steps in the other direction).
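Putting the two sketches above together (narrow() and to_utf16() are the
hypothetical helpers defined earlier):

    std::u16string wide_to_utf16(const std::wstring &ws) {
        return to_utf16(narrow(ws).c_str()); // wide -> narrow -> UTF-16
    }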