part of the issue with C in these regards is that it aliases characters
and bytes.
having separate char and byte types could have made more sense.
granted... 'wchar_t'...
in this case, 'char' more typically represents a byte, and it could make
more sense simply to nail down the byte size as 8 bits, and by
extension, 'char'.
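as a rough sketch of what "nailing down" the byte size can look like in today's C (the 'byte' typedef name is my own placeholder, not something the language provides), a compile-time check can reject any target where this does not hold:

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* uint8_t */

/* refuse to compile on targets where 'char' is not 8 bits */
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

/* 'byte' is a hypothetical name used for illustration only */
typedef uint8_t byte;

/* sizeof counts in units of 'char', so this checks that 'byte' is
   exactly one char wide; the assert above pins that unit to 8 bits */
static_assert(sizeof(byte) == 1, "byte must be one char wide");
```

this only documents and enforces the assumption; it still leaves 'char' and the byte type aliased as far as the language is concerned.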
many of us have good results using UTF-8 for nearly everything. those
things which don't work well in UTF-8 can typically use UTF-16 or UTF-32.
UTF-32 is often unnecessary:
it is rare to find text using any characters outside the BMP;
it is also rare to find fonts which support it (hard, actually, even to
find fonts which effectively support most of the Unicode BMP);
...
so, while some people may object, naive UCS-2 may actually work pretty
well in many cases involving internationalized text.
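to make the BMP point concrete, here is a minimal sketch of a UTF-16 encoder (the function name 'utf16_encode' is hypothetical). for anything inside the BMP the output is a single 16-bit unit, which is exactly what naive UCS-2 would store; only characters outside the BMP need the surrogate-pair path:

```c
#include <stddef.h>
#include <stdint.h>

/* encode one Unicode code point as UTF-16.
   returns the number of 16-bit units written (1 or 2), or 0 if the
   value is not an encodable code point. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* surrogate range, not a character */
    if (cp < 0x10000) {                /* BMP: identical to UCS-2 */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                      /* beyond the Unicode range */
    cp -= 0x10000;                     /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

for example, U+20AC comes out as the single unit 0x20AC, while U+1F600 comes out as the pair 0xD83D 0xDE00.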
however, a person can use 32-bit characters, and treat the high bits as
formatting data (text color and style).
for example, in my case I have a tweaked character encoding which uses
32 bits per character:
if the character fits in 16 bits, then the high 16 bits are used as
formatting;
if it does not, the character gets 20 bits and loses its background color
(the background color then comes from a prior character).
some combinations of formatting options are also assumed to be mutually
exclusive to help save bits (such as
superscript/subscript/strikethrough, ...).
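a rough sketch of how such a packing could look in C. the exact bit layout is not spelled out here, so everything below is an assumption made for illustration: bit 31 is used as a tag to tell the two forms apart, the formatting bits are opaque, and all of the 'fch_' names are hypothetical:

```c
#include <stdint.h>

/* assumed tag bit: set when the character field is 20 bits wide */
#define FCH_WIDE 0x80000000u

/* pack a code point together with formatting bits.
   narrow form: 16-bit character, 15 formatting bits (bits 16..30).
   wide form:   20-bit character, 11 formatting bits (bits 20..30);
                the 4 formatting bits that are dropped stand in for
                the lost background color.
   note: 20 bits cannot hold code points above 0xFFFFF. */
uint32_t fch_pack(uint32_t cp, uint32_t fmt)
{
    if (cp <= 0xFFFF)
        return ((fmt & 0x7FFFu) << 16) | cp;
    return FCH_WIDE | ((fmt & 0x7FFu) << 20) | (cp & 0xFFFFFu);
}

/* extract the character from a packed value */
uint32_t fch_char(uint32_t v)
{
    return (v & FCH_WIDE) ? (v & 0xFFFFFu) : (v & 0xFFFFu);
}

/* extract the formatting bits that survived packing */
uint32_t fch_format(uint32_t v)
{
    return (v & FCH_WIDE) ? ((v >> 20) & 0x7FFu) : ((v >> 16) & 0x7FFFu);
}
```

with this layout, packing 'A' with formatting 0x1234 gives back 0x1234 unchanged, whereas packing U+1F600 with the same formatting keeps only the low 11 bits (0x234).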
in the source-text form (UTF-8), this information is generally
represented using ANSI-codes (though other options could be possible).
FWIW (OT):
in my own (scripting) language, it goes more the route of making bytes
and characters semantically different types:
byte/sbyte/ubyte: bytes, defined as always 8 bits (sbyte = signed byte);
char: default character (*1);
cchar: C character, defined as being (by default) 8 bits;
char8: explicit 8-bit character;
char16: explicit 16-bit character;
char32: explicit 32-bit character.
*1: generally, it is 16 bits in storage (arrays or structs), but 32 bits
when in 'working' forms (in variables or function arguments). elsewhere,
it will try to align with 'wchar_t'.
they also differ partly in that they represent different parts of the
numeric tower (byte and friends are part of the integer tower, with
'char' and friends as a partially disjoint character tower, where casts
are used to convert between them).
within the FFI (C <-> BS):
'char' <-> 'cchar';
'unsigned char' <-> 'byte/ubyte';
'signed char' <-> 'sbyte';
'wchar_t' <-> 'char';
...
note that the sizes of byte/short/int/long/... are explicitly defined as
8/16/32/64 bits. on targets where C differs, they will not line up with
their C name-equivalents (for example, a hypothetical implementation on
a 16-bit target would still use a 32-bit 'int', even if C were using a
16-bit 'int').
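on the C side, one way to mirror this is to bind the fixed-size names to the exact-width types from <stdint.h> rather than to C's own 'int' and friends. a minimal sketch, where the 'bs_' typedef names are hypothetical and only there for illustration:

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* exact-width integer types */

/* hypothetical C-side names for the fixed-size script types */
typedef uint8_t bs_byte;    /* always 8 bits, unsigned */
typedef int8_t  bs_sbyte;   /* always 8 bits, signed   */
typedef int16_t bs_short;   /* always 16 bits */
typedef int32_t bs_int;     /* always 32 bits, even if C 'int' is 16 */
typedef int64_t bs_long;    /* always 64 bits, even if C 'long' is 32 */

/* compile-time checks that the widths are what the names promise */
static_assert(sizeof(bs_byte)  * CHAR_BIT == 8,  "byte must be 8 bits");
static_assert(sizeof(bs_short) * CHAR_BIT == 16, "short must be 16 bits");
static_assert(sizeof(bs_int)   * CHAR_BIT == 32, "int must be 32 bits");
static_assert(sizeof(bs_long)  * CHAR_BIT == 64, "long must be 64 bits");
```

the exact-width types are optional in C, so on a target lacking one of them this fails to compile rather than silently picking a different size.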
...