conversions between native and Unicode encodings

James Kuyper

Nov 20, 2021, 3:44:00 PM
C++ has been changing a little faster than I can easily keep up with. I
only recently noticed that C++ now (since 2017?) seems to require
support for UTF-8, UTF-16, and UTF-32 encodings, which used to be
optional (and, even earlier, non-existent). Investigating further, I
was surprised by something that seems to be missing.

The standard describes five different character encodings used at
execution time. Two of them are implementation-defined native encodings
for narrow and wide characters, stored in char and wchar_t respectively.
The other three are Unicode encodings, UTF-8, UTF-16, and UTF-32, stored
in char8_t, char16_t, and char32_t respectively. The native encodings
could both also be Unicode encodings, but the following question is
specifically about implementations where that is not the case.

There are codecvt facets (28.3.1.1.1) for converting between the native
encodings, and between char8_t and the other Unicode encodings, but as
far as I can tell, the only way to convert between native and Unicode
encodings are the character conversion functions in <cuchar> (21.5.5)
incorporated from the C standard library.

Is it correct that the <cuchar> routines do in fact perform such
conversions? It's hard to be sure, because the detailed description is
only cross-referenced from the C standard, which doesn't use the term
"native encoding", and allows __STDC_UTF_16__ and __STDC_UTF_32__ to not
be pre#defined.

Is it correct that the <cuchar> routines are the only way to perform
such conversions? It seems odd to me that the only way to perform such
conversions uses a C style interface.

Sam

Nov 20, 2021, 4:35:51 PM
James Kuyper writes:

> Is it correct that the <cuchar> routines are the only way to perform
> such conversions? It seems odd to me that the only way to perform such
> conversions uses a C style interface.

The C++ library's support for transcoding between Unicode and various
character sets has always sucked. It still does.

Alf P. Steinbach

Nov 21, 2021, 9:27:57 AM
The std::codecvt stuff is probably/maybe what you're looking for.

For conversion UTF-8 -> UTF-16, MSVC and MinGW g++ yield different
results wrt. endianness, and wrt. state after conversion failure.

When you have to compensate for compiler differences in order to get
portable code that uses standard library stuff, for the same platform,
then you know it's really BAD.

The UTF-8 specializations were deprecated in C++17, and one would
naturally think that was in order to replace them with something better,
not suffering from all that badness, in e.g. C++20.

But no, the idiots (pardon the expression) only wanted to introduce
overloads with `char8_t` instead of `char`; that's what C++20 offered. I
haven't tested, because I refuse to "upgrade" to C++20. But I presume
these academic overloads suffer from all the badness of the old.


- Alf

James Kuyper

Nov 22, 2021, 12:05:58 AM
On 11/21/21 9:27 AM, Alf P. Steinbach wrote:
> On 20 Nov 2021 21:43, James Kuyper wrote:
>> C++ has been changing a little faster than I can easily keep up with. I
>> only recently noticed that C++ now (since 2017?) seems to require
>> support for UTF-8, UTF-16, and UTF-32 encodings, which used to be
>> optional (and, even earlier, non-existent). Investigating further, I
>> was surprised by something that seems to be missing.
>>
>> The standard describes five different character encodings used at
>> execution time. Two of them are implementation-defined native encodings
>> for narrow and wide characters, stored in char and wchar_t respectively.
>> The other three are Unicode encodings, UTF-8, UTF-16, and UTF-32, stored
>> in char8_t, char16_t, and char32_t respectively. The native encodings
>> could both also be Unicode encodings, but the following question is
>> specifically about implementations where that is not the case.
>>
>> There are codecvt facets (28.3.1.1.1) for converting between the native
>> encodings, and between char8_t and the other Unicode encodings, but as
>> far as I can tell, the only way to convert between native and Unicode
>> encodings are the character conversion functions in <cuchar> (21.5.5)
>> incorporated from the C standard library.
...
> The std::codecvt stuff is probably/maybe what you're looking for.

As indicated above, I'm quite aware of the existence of codecvt.
However, in the latest draft version of the standard that I have,
n4860.pdf, the codecvt facets listed in table 102 (28.3.1.1.1p2) are:

codecvt<char, char, mbstate_t>
codecvt<char16_t, char8_t, mbstate_t>
codecvt<char32_t, char8_t, mbstate_t>
codecvt<wchar_t, char, mbstate_t>

Which one should I use to convert between native and Unicode encodings?
None of them seem suitable, which was the point of my message.
The change from char to char8_t occurred between n4659.pdf (2017-03-21)
and n4849.pdf (2020-01-04).

If I'm correct about the routines in <cuchar> converting between the
native encoding for narrow characters and Unicode encodings, it should
have been straightforward to implement corresponding codecvt facets, so
why didn't they mandate them?

> For conversion UTF-8 -> UTF-16 MSVC and MinGW g++ yield different
> results wrt. endianess, and wrt. to state after conversion failure.

Well, Unicode leaves the endianness of UTF-16 unspecified, provides a BOM
to clarify the ambiguity, and recommends assuming big-endian if no BOM is
present. Windows decided to go for little-endian. This is a problem, but
it's a Unicode problem, not a C++ problem; C++ is doing nothing more than
failing to resolve an ambiguity that Unicode itself left open.

daniel...@gmail.com

Nov 22, 2021, 8:30:34 AM
On Sunday, November 21, 2021 at 9:27:57 AM UTC-5, Alf P. Steinbach wrote:
> The std::codecvt stuff ...
>
> For conversion UTF-8 -> UTF-16, MSVC and MinGW g++ yield different
> results wrt. endianness, and wrt. state after conversion failure.
>
> When you have to compensate for compiler differences in order to get
> portable code that uses standard library stuff, for the same platform,
> then you know it's really BAD.
>
> The UTF-8 specializations were deprecated in C++17, and one would
> naturally think it was in order to replace with something better, not
> suffering from all that badness, in e.g. C++20.
>
But if the committee spent time on practical things like Unicode encoding
conversion and validation, which have massive prior experience to
draw on, where would they find the time to spend on, say, ranges?

Daniel