"Supporting Unicode" means that the extended character set is Unicode.
At the moment Unicode contains more than 128,000 characters, so a 16-bit
integer cannot encode all distinct values of Unicode.
And yes, Microsoft is well aware of this and thus MSVC supports 32-bit
\UNNNNNNNN Unicode character literals, for example. Here is a demo program:
#include <iostream>

int main() {
    // elephant-camel-ant
    wchar_t message[] = L"\U0001F418\U0001F42B\U0001F41C";
    std::cout << "Size in wchar_t elements: "
              << sizeof(message)/sizeof(message[0]) << "\n";
}
The wide string literal is specified as 3 Unicode characters (elephant,
camel, and ant) plus the terminating zero. On Windows/MSVC the program
output is:
Size in wchar_t elements: 7
This is because each character has been encoded as a UTF-16 surrogate
pair: 3 characters * 2 code units each, plus the single terminating zero
wchar_t, gives 7.
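The encoding rule itself is simple enough to show. The following sketch
(my own illustration, not anything from the MSVC sources) applies the
standard UTF-16 algorithm to the three code points and prints the
resulting surrogate pairs:

#include <cstdint>
#include <cstdio>

int main() {
    // The three code points from the literal above.
    const char32_t points[] = { 0x1F418, 0x1F42B, 0x1F41C };
    for (char32_t cp : points) {
        // Code points above U+FFFF are split into two code units.
        std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
        unsigned high = 0xD800 + (v >> 10);   // lead surrogate
        unsigned low  = 0xDC00 + (v & 0x3FF); // trail surrogate
        std::printf("U+%05X -> %04X %04X\n", (unsigned)cp, high, low);
    }
}

This prints D83D DC18, D83D DC2B and D83D DC1C, i.e. two code units per
character, which is where the element count of 7 comes from.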
If MSVC did not support full Unicode, then either:
a) it would not recognize the 32-bit \UNNNNNNNN universal-character-names
(the C++ standard also has 16-bit \uNNNN universal-character-names), or
b) it would somehow truncate the values into 16-bit wchar_t instead of
translating them into valid UTF-16.
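Neither happens. One can check (b) directly by dumping the individual
wchar_t code units of the literal; a minimal sketch:

#include <cstdio>

int main() {
    wchar_t message[] = L"\U0001F418\U0001F42B\U0001F41C";
    // Print each code unit in hex (wchar_t is 16-bit on Windows).
    for (wchar_t wc : message)
        std::printf("%04X ", (unsigned)wc);
    std::printf("\n");
}

On Windows/MSVC this should print D83D DC18 D83D DC2B D83D DC1C 0000:
well-formed surrogate pairs, not truncated 16-bit values. (On platforms
with a 32-bit wchar_t it would print the full code points instead.)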
As it stands now, MSVC apparently contains special support for full
Unicode, translating it into its native UTF-16. This is a reasonable
implementation for Windows. However, it plainly contradicts the C++
standard:
2.13.5/12: "A string-literal that begins with L, such as L"asdf", is a
wide string literal. A wide string literal has type
“array of n const wchar_t”."
2.13.5/15: "The size of a char32_t or wide string literal is the total
number of escape sequences, universal-character-names, and other
characters, plus one for the terminating U’\0’ or L’\0’."
It appears MSVC has implemented what the C++ standard calls char16_t
string literals ( u"asdf" ) (2.13.5/15: "a universal-character-name in a
char16_t string literal may yield a surrogate pair. [...] The size of a
char16_t string literal is the total number of escape sequences,
universal-character-names, and other characters, plus one for each
character requiring a surrogate pair, plus one for the terminating
u’\0’. [ Note: The size of a char16_t string literal is the number of
code units, not the number of characters. —end note ]")
If MSVC renamed its wchar_t to char16_t and L"abc" to u"abc", then it
would become conforming, as far as I can see. Of course they are not
going to do that, as it would break a mountain range of existing code.
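This correspondence is easy to demonstrate: with a C++11 compiler the
u"" and L"" forms of the same string can be compared side by side (a
minimal sketch; on Windows/MSVC both counts should come out as 7):

#include <iostream>

int main() {
    char16_t u_msg[] = u"\U0001F418\U0001F42B\U0001F41C";
    wchar_t  w_msg[] = L"\U0001F418\U0001F42B\U0001F41C";
    std::cout << "char16_t code units: "
              << sizeof(u_msg)/sizeof(u_msg[0]) << "\n";
    std::cout << "wchar_t  code units: "
              << sizeof(w_msg)/sizeof(w_msg[0]) << "\n";
}

Identical counts would be expected exactly because MSVC's wide literals
already behave like the standard's char16_t literals.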
Cheers
Paavo