On 11/21/21 9:27 AM, Alf P. Steinbach wrote:
> On 20 Nov 2021 21:43, James Kuyper wrote:
>> C++ has been changing a little faster than I can easily keep up with. I
>> only recently noticed that C++ now (since 2017?) seems to require
>> support for UTF-8, UTF-16, and UTF-32 encodings, which used to be
>> optional (and even earlier, was non-existent). Investigating further, I
>> was surprised by something that seems to be missing.
>>
>> The standard describes five different character encodings used at
>> execution time. Two of them are implementation-defined native encodings
>> for narrow and wide characters, stored in char and wchar_t respectively.
>> The other three are Unicode encodings, UTF-8, UTF-16, and UTF-32, stored
>> in char8_t, char16_t, and char32_t respectively. The native encodings
>> could both also be Unicode encodings, but the following question is
>> specifically about implementations where that is not the case.
>>
>> There are codecvt facets (28.3.1.1.1) for converting between the native
>> encodings, and between char8_t and the other Unicode encodings, but as
>> far as I can tell, the only way to convert between native and Unicode
>> encodings are the character conversion functions in <cuchar> (21.5.5)
>> incorporated from the C standard library.
...
> The std::codecvt stuff is probably/maybe what you're looking for.

As indicated above, I'm quite aware of the existence of codecvt.
However, in the latest draft version of the standard that I have,
n4860.pdf, the codecvt facets listed in table 102 (28.3.1.1.1p2) are:
  codecvt<char, char, mbstate_t>
  codecvt<char16_t, char8_t, mbstate_t>
  codecvt<char32_t, char8_t, mbstate_t>
  codecvt<wchar_t, char, mbstate_t>
Which one should I use to convert between native and Unicode encodings?
None of them seem suitable, which was the point of my message.
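
To make the gap concrete, here's a rough sketch (mine, not from the
standard text) of driving one of the facets that *is* mandated -
codecvt<char16_t, char8_t, mbstate_t>, i.e. UTF-8 <-> UTF-16 - through
use_facet. Nothing analogous is required with the native narrow
encoding on the external side:

#include <cstddef>
#include <locale>
#include <string>

int main() {
    using Cvt = std::codecvt<char16_t, char8_t, std::mbstate_t>;
    const Cvt& cvt = std::use_facet<Cvt>(std::locale());

    std::u8string  utf8 = u8"caf\u00e9";       // UTF-8 input
    std::u16string utf16(utf8.size(), u'\0');  // UTF-16 never needs more code
                                               // units than UTF-8 has bytes
    std::mbstate_t st{};
    const char8_t* from_next = nullptr;
    char16_t*      to_next   = nullptr;
    auto res = cvt.in(st,
                      utf8.data(),  utf8.data()  + utf8.size(),  from_next,
                      utf16.data(), utf16.data() + utf16.size(), to_next);
    if (res == Cvt::ok)
        utf16.resize(static_cast<std::size_t>(to_next - utf16.data()));

    // But there is no required facet pairing char16_t or char32_t with the
    // native narrow encoding (plain char), which is exactly the missing piece.
}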

The change from char to char8_t in the char16_t and char32_t facets
occurred between n4659.pdf (2017-03-21) and n4849.pdf (2020-01-04).
If I'm correct that the routines in <cuchar> convert between the native
encoding for narrow characters and the Unicode encodings, it should have
been straightforward to implement corresponding codecvt facets - so why
didn't they mandate them?
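
For what it's worth, the sort of thing I have in mind is roughly this
(a sketch only, with minimal error handling), using mbrtoc32 from
<cuchar> to go from the native narrow encoding to char32_t:

#include <clocale>
#include <cstring>
#include <cuchar>
#include <string>

// Native narrow (locale-dependent multibyte) text -> char32_t code points,
// one at a time, via mbrtoc32.
std::u32string narrow_to_u32(const char* s, std::size_t n) {
    std::u32string out;
    std::mbstate_t st{};
    while (n != 0) {
        char32_t c;
        std::size_t rc = std::mbrtoc32(&c, s, n, &st);
        if (rc == 0)                                  // hit a null terminator
            break;
        if (rc == static_cast<std::size_t>(-1) ||     // invalid sequence
            rc == static_cast<std::size_t>(-2))       // incomplete sequence
            break;
        if (rc == static_cast<std::size_t>(-3)) {     // character produced from
            out.push_back(c);                         // pending state; no input
            continue;                                 // bytes consumed
        }
        out.push_back(c);
        s += rc;
        n -= rc;
    }
    return out;
}

int main() {
    std::setlocale(LC_ALL, "");   // adopt the environment's native encoding
    const char* text = "some native-encoded text";
    std::u32string u32 = narrow_to_u32(text, std::strlen(text));
    (void)u32;
}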

> For conversion UTF-8 -> UTF-16, MSVC and MinGW g++ yield different
> results wrt. endianness, and wrt. state after conversion failure.

Well, Unicode leaves the endianness of UTF-16 unspecified, provides a BOM
to resolve the ambiguity, and recommends assuming big-endian if no BOM is
present. Windows decided to go with little-endian. That's a problem, but
it's a Unicode problem, not a C++ problem; C++ is doing nothing more than
failing to resolve an ambiguity that Unicode itself left open.
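
(And for anyone handling raw UTF-16 byte streams, the usual dodge is
just to check for the BOM up front - a sketch, defaulting to big-endian
as Unicode recommends:)

#include <cstddef>

enum class Utf16Order { BigEndian, LittleEndian };

// Pick a byte order for a raw UTF-16 stream from its BOM (U+FEFF), if any.
Utf16Order detect_utf16_order(const unsigned char* p, std::size_t n) {
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return Utf16Order::BigEndian;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return Utf16Order::LittleEndian;
    return Utf16Order::BigEndian;   // no BOM: assume big-endian
}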