On 2017-10-05 17:43, Daniel wrote:
> On Thursday, October 5, 2017 at 2:32:03 PM UTC-4, James R. Kuyper wrote:
>>
>> Actually, there's no such requirement for any of those types, char16_t
>> and char32_t are required to be typedefs for uint_least16_t and
>> uint_least32_t, respectively, which need not have a size of exactly 16
That's not quite right - I was thinking of C, where that statement was
perfectly correct. In C++, char16_t and char32_t are their own distinct
types. But it's still correct to say that 16 and 32 bits, respectively,
are only minimum values for the widths of those types. There's no
requirement that they be exactly that size.
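A minimal sketch of that difference, which should compile on any
conforming C++11 implementation. The two constant names are mine,
chosen only for illustration:

```cpp
#include <climits>
#include <cstdint>
#include <type_traits>

// In C++ (unlike C), char16_t and char32_t are distinct types rather
// than typedefs, so both of these constants are false.
constexpr bool char16_is_typedef =
    std::is_same<char16_t, std::uint_least16_t>::value;
constexpr bool char32_is_typedef =
    std::is_same<char32_t, std::uint_least32_t>::value;

static_assert(!char16_is_typedef, "char16_t is its own type in C++");
static_assert(!char32_is_typedef, "char32_t is its own type in C++");

// 16 and 32 bits are only lower bounds, so portable code can rely on
// ">=" here, but not on "==".
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "at least 16 bits");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "at least 32 bits");
```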
>> or 32 bits, respectively. It's extremely likely that each of those four
>> types will have one of those three sizes, but it's not a requirement.
>
> Thanks for remarking on that, I'd overlooked that. I find it lacking that
> there's nothing in basic_string that tags the encoding, and have
> been using sizeof(CharT) as an indicator of that, e.g. assuming
> wchar_t holds utf16 if sizeof(wchar_t) == 16, or utf32 if sizeof(wchar_t)
> == 32. ...
I presume you mean sizeof(...)*CHAR_BIT?
The encodings used for narrow (char) and wide (wchar_t) strings and
characters are completely implementation-defined. There's no guarantee
that they have anything to do with either ASCII or Unicode. I gather that,
particularly in Japan, it is (or at least used to be) commonplace for
both of them to use an encoding that is neither ASCII nor Unicode.
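To make the sizeof point concrete, here is a small sketch. storage_bits
is a hypothetical helper of my own, not anything from the standard
library. Note that it only reports how much storage a character type
occupies; it tells you nothing about which encoding is in use.

```cpp
#include <climits>
#include <cstddef>

// sizeof yields a count of bytes, so it has to be scaled by CHAR_BIT
// to get the number of bits of storage occupied by one CharT.
template <typename CharT>
constexpr std::size_t storage_bits()
{
    return sizeof(CharT) * CHAR_BIT;
}

// By definition, a char occupies exactly CHAR_BIT bits.
static_assert(storage_bits<char>() == CHAR_BIT, "char is one byte");
```

A test such as storage_bits<wchar_t>() == 16 is then at least comparing
bits with bits, but a true result still does not establish that wchar_t
strings hold UTF-16.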
> ... I realize this isn't technically correct. Is there at least a
> presumption that char32_t holds utf32? as there's nothing that prevents
> you from stuffing utf8 or utf16 into it.
You're right - there's nothing to prevent you from stuffing an arbitrary
numeric value that's within range into any object of either type.
However, there are facilities for creating and interpreting UTF-8,
UTF-16 and UTF-32 strings, and those facilities use char, char16_t, and
char32_t, respectively.
"A string literal that begins with u8, such as u8"asdf", is a UTF-8
string literal and is initialized with the given characters as encoded
in UTF-8.
Ordinary string literals and UTF-8 string literals are also referred to
as narrow string literals. A narrow string literal has type “array of n
const char”, where n is the size of the string as defined below, and has
static storage duration (3.7).
A string literal that begins with u, such as u"asdf", is a char16_t
string literal. A char16_t string literal has type “array of n const
char16_t”, where n is the size of the string as defined below; it has
static storage duration and is initialized with the given characters. A
single c-char may produce more than one char16_t character in the form
of surrogate pairs.
A string literal that begins with U, such as U"asdf", is a char32_t
string literal. A char32_t string literal has type “array of n const
char32_t”, where n is the size of the string as defined below; it has
static storage duration and is initialized with the given characters."
(2.14.5p7-10)
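The effect of those paragraphs can be checked with a short sketch.
element_count is a hypothetical helper of mine, and U+1F600 is just a
convenient character outside the Basic Multilingual Plane. I count
elements rather than naming the element type of the u8 literal, so the
checks do not depend on what that type is.

```cpp
#include <cstddef>
#include <type_traits>

// Number of elements in a string literal's array, including the
// terminating null character.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N])
{
    return N;
}

// The literals from the quoted text have the array types it describes.
static_assert(std::is_same<std::remove_reference<decltype(u"asdf")>::type,
                           const char16_t[5]>::value, "u literal");
static_assert(std::is_same<std::remove_reference<decltype(U"asdf")>::type,
                           const char32_t[5]>::value, "U literal");

// One c-char, three different code unit counts (each plus one null):
static_assert(element_count(u8"\U0001F600") == 5, "four UTF-8 bytes");
static_assert(element_count(u"\U0001F600") == 3, "one surrogate pair");
static_assert(element_count(U"\U0001F600") == 2, "one UTF-32 unit");
```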
"... The specialization codecvt<char16_t, char, mbstate_t> converts
between the UTF-16 and UTF-8 encoding forms, and the specialization
codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and
UTF-8 encoding forms." (22.4.1.4p3).
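A minimal sketch of using that specialization directly, assuming the
facet is obtained from the classic locale. The function name
utf8_to_utf32 is hypothetical, and the error handling is deliberately
crude:

```cpp
#include <cstddef>
#include <cwchar>      // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code
// units using the codecvt specialization quoted above.
std::u32string utf8_to_utf32(const std::string& utf8)
{
    if (utf8.empty())
        return std::u32string();

    typedef std::codecvt<char32_t, char, std::mbstate_t> Facet;
    const Facet& facet = std::use_facet<Facet>(std::locale::classic());

    // Every code point takes at least one byte of UTF-8, so the output
    // never needs more elements than the input has bytes.
    std::u32string out(utf8.size(), U'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    char32_t* to_next = nullptr;

    const Facet::result r = facet.in(
        state,
        utf8.data(), utf8.data() + utf8.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r != Facet::ok)
        throw std::runtime_error("invalid or incomplete UTF-8 input");

    // Trim the buffer down to the code units actually written.
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

For example, utf8_to_utf32("\xF0\x9F\x98\x80") should yield a string
containing the single code unit 0x1F600.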
"For the facet codecvt_utf8:
— The facet shall convert between UTF-8 multibyte sequences and UCS2 or
UCS4 (depending on the size of Elem) within the program.
...
For the facet codecvt_utf16:
— The facet shall convert between UTF-16 multibyte sequences and UCS2 or
UCS4 (depending on the size of Elem) within the program.
...
For the facet codecvt_utf8_utf16:
— The facet shall convert between UTF-8 multibyte sequences and UTF-16
(one or two 16-bit codes) within the program." (22.5p4-6)
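As a sketch of how codecvt_utf8_utf16 is typically used, here it is
paired with std::wstring_convert. The two function names are
hypothetical. Be aware that <codecvt> and std::wstring_convert have been
deprecated since C++17, so a newer compiler may warn about this code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helpers: convert between UTF-8 bytes and UTF-16 code
// units. wstring_convert throws std::range_error if conversion fails.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

std::string utf16_to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

The four bytes F0 9F 98 80 (U+1F600 in UTF-8) should come back from
utf8_to_utf16 as the surrogate pair D83D DE00, which is the "one or two
16-bit codes" wording above in action.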