On 10/3/23 12:37, MarioCCCP wrote:
>
>
> I have been away from C++ for too long and I am very confused
> (particularly with the differences between wchar_t strings and unicode
> strings, like UTF-8).
The execution character set has a multibyte encoding that is used by
most standard library routines that take arguments as strings of
[[un]signed] char. It could use UTF-8 (or even UTF-16, if CHAR_BIT >=
16), but it is not required to do so.
wchar_t uses a fixed-length encoding capable of encoding every supported
character. It could have a Unicode encoding, either UTF-32, or UCS-2 if
the set of supported characters is sufficiently restricted, but it is
not required to.
The C++ standard incorporates by reference functions for converting
between those encodings:
The <cstdlib> header declares routines for converting between wchar_t
and multi-byte strings: std::mbstowcs(), std::wcstombs(), std::mbtowc(),
and std::wctomb(). The <cwchar> header declares their restartable
counterparts: std::mbrtowc(), std::wcrtomb(), std::mbsrtowcs(), and
std::wcsrtombs(). Their definitions are not in the C++ standard itself,
but incorporated by reference from the C standard.
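As a rough sketch of how the restartable routines are typically used,
here is a multi-byte to wchar_t conversion built on std::mbsrtowcs().
The helper name to_wide() is my own invention, not anything from the
standard, and the result depends on the current C locale (in the default
"C" locale only the basic characters are guaranteed to convert).

```cpp
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a multi-byte string, in the encoding of
// the current C locale, to a wide string. Conversion stops at the first
// null byte, so embedded nulls are not preserved.
std::wstring to_wide(const std::string& mb) {
    std::mbstate_t state{};            // initial conversion state
    const char* src = mb.c_str();

    // First pass: a null destination only counts the wide characters.
    std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multi-byte sequence");

    // Second pass: convert into a buffer of exactly that size.
    std::wstring out(len, L'\0');
    state = std::mbstate_t{};
    src = mb.c_str();
    std::mbsrtowcs(&out[0], &src, len, &state);
    return out;
}
```

To convert anything beyond the basic characters you would first select
a suitable locale, for example with std::setlocale(LC_ALL, "") from
<clocale>.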
char8_t, char16_t, and char32_t are newer types (distinct built-in types
in C++, typedefs in C) used to store strings in UTF-8, UTF-16, or UTF-32
format.
The <cuchar> header declares routines for converting between the
multi-byte encoding and any of the UTF-N encodings, with names like
mbrtoc8() and c8rtomb() (and similarly for c16 and c32; the c8 versions
require C++20). Their definitions are also incorporated by reference
from the C standard.
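For illustration, a minimal sketch of the UTF-32 to multi-byte direction
using std::c32rtomb(). The helper name from_utf32() is hypothetical, and
which characters can be converted depends on the current C locale.

```cpp
#include <climits>    // MB_LEN_MAX
#include <cuchar>     // std::c32rtomb
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-32 text to a multi-byte string in
// the encoding of the current C locale, one character at a time.
// For a stateful encoding, a final call with U'\0' would be needed to
// return to the initial shift state; that is omitted here.
std::string from_utf32(const std::u32string& text) {
    std::string out;
    std::mbstate_t state{};            // initial conversion state
    char buf[MB_LEN_MAX];              // large enough for any one character
    for (char32_t c : text) {
        std::size_t n = std::c32rtomb(buf, c, &state);
        if (n == static_cast<std::size_t>(-1))
            throw std::runtime_error("character not representable");
        out.append(buf, n);
    }
    return out;
}
```

In the default "C" locale this only handles the basic characters; with
a UTF-8 locale selected, the result would be UTF-8.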
Combining those routines, you can use multi-byte strings as an
intermediary to convert between any pair of the other encodings
mentioned above.
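A sketch of what that combination might look like for wchar_t to UTF-32:
each wide character goes through std::wcrtomb() into a small multi-byte
buffer, which is then fed to std::mbrtoc32(). The helper name
wide_to_utf32() is made up for this example, and both steps depend on
the current C locale.

```cpp
#include <climits>    // MB_LEN_MAX
#include <cuchar>     // std::mbrtoc32
#include <cwchar>     // std::wcrtomb
#include <stdexcept>
#include <string>

// Hypothetical helper: wchar_t -> multi-byte -> UTF-32.
std::u32string wide_to_utf32(const std::wstring& wide) {
    const std::size_t error = static_cast<std::size_t>(-1);
    const std::size_t incomplete = static_cast<std::size_t>(-2);
    const std::size_t pending = static_cast<std::size_t>(-3);

    std::u32string out;
    std::mbstate_t to_mb{};            // state for wchar_t -> multi-byte
    std::mbstate_t from_mb{};          // state for multi-byte -> UTF-32
    char buf[MB_LEN_MAX];

    for (wchar_t wc : wide) {
        std::size_t len = std::wcrtomb(buf, wc, &to_mb);
        if (len == error)
            throw std::runtime_error("wide character not representable");

        const char* p = buf;
        const char* end = buf + len;
        while (p < end) {
            char32_t c;
            std::size_t n = std::mbrtoc32(
                &c, p, static_cast<std::size_t>(end - p), &from_mb);
            if (n == error)
                throw std::runtime_error("invalid multi-byte sequence");
            if (n == incomplete)
                break;                 // bytes were absorbed into from_mb
            if (n == pending) {        // not expected for char32_t output
                out.push_back(c);
                continue;
            }
            out.push_back(c);
            if (n == 0)
                break;                 // a null character was converted
            p += n;
        }
    }
    return out;
}
```

Going through the multi-byte encoding means that any character it
cannot represent is lost (here, reported as an error), so this route is
only lossless when the locale's encoding covers everything you need.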