On 18/08/2019 01:33, Keith Thompson wrote:
> David Brown <david...@hesbynett.no> writes:
>> On 17/08/2019 18:26, Bonita Montero wrote:
>>>> Exactly, yes. 16-bit wchar_t can't do that on a system that
>>>> supports Unicode (regardless of the encoding).
>>>
>>> There will be no more Unicode codepoints populated than could be
>>> adressed by UTF-16. At least until we gonna support alien languages.
>>
>> Does your ignorance know no bounds?
>>
>> Aren't you even capable of the simplest of web searches or references
>> before talking rubbish in public?
>
> I think you have incorrectly assumed that Bonita is asserting that there
> are no more than 65536 Unicode codepoints.
Possibly. If I have misinterpreted her, then I will be glad to be
corrected.
>
> UTF-16 can represent all Unicode codepoints. It cannot represent each
> Unicode codepoint in 16 bits; some of them require two 16-bit values.
Correct - and that is something I have said several times.
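To make that concrete, here is a rough sketch of the surrogate-pair
arithmetic for a code point above U+FFFF (a quick illustration, not
production code):

#include <cstdint>
#include <cstdio>

int main() {
    // U+1F600 is above U+FFFF, so UTF-16 needs two code units for it.
    std::uint32_t cp = 0x1F600;
    std::uint32_t v  = cp - 0x10000;                    // 20 bits left to split
    char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));   // leading surrogate
    char16_t low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF)); // trailing surrogate
    std::printf("U+%04X -> 0x%04X 0x%04X\n",
                (unsigned)cp, (unsigned)high, (unsigned)low);    // D83D DE00
}
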
>
> I presume Bonita meant that UTF-16 can represent all of Unicode (which
> is true). But a 16-bit wchar_t cannot, because the standard requires
> wchar_t to be "able to represent all members of the execution
> wide-character set" (that's the point Bonita missed or ignored).
Agreed - and again, it is something I have said several times.
It is fine (both in the sense of working practically and in being
compliant with the standards) to use char16_t and u16string to handle
all Unicode strings and characters. You can't store all Unicode
characters in a /single/ char16_t object - but a char16_t is for storing
Unicode code /units/, not code /points/, so it is fine for the job.
But a wchar_t has to be able to store any /character/ - for Unicode,
that means 21 bits for the code point.
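A quick illustration of the code unit / code point distinction (just a
sketch, nothing clever):

#include <cassert>
#include <string>

int main() {
    // U+1D11E (musical symbol G clef) lies outside the BMP.
    std::u16string s16 = u"\U0001D11E";
    std::u32string s32 = U"\U0001D11E";
    assert(s16.size() == 2);        // two UTF-16 code units
    assert(s32.size() == 1);        // one code point, one char32_t
    char32_t c = U'\U0001D11E';     // a single char32_t holds the whole code point
    (void)c;
}
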
>
> To use wchar_t to represent all Unicode code points, you either have to
> make wchar_t at least 21 bits (more likely 32 bits) *or* you have to use
> it in a way that doesn't satisfy the requirements of the standard, such
> as using UTF-16 to encode some characters in more than one wchar_t.
>
In practice, I think people on Windows use wchar_t strings (or arrays)
for holding UTF-16 encoded strings. That will work fine. But it
encourages mistaken assumptions - such as that ws[9] holds the tenth
Unicode character in the string, or that wcslen returns the number of
characters in the string. These assumptions hold for a proper wchar_t,
such as the 32-bit wchar_t on Unix systems (or the 16-bit wchar_t on
Windows back when it used UCS-2 rather than UTF-16).
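To see how far apart the code-unit count and the character count can get,
here is a quick sketch (it assumes the string is well-formed UTF-16):

#include <cstdio>
#include <string>

// Count code points rather than code units.  In well-formed UTF-16, every
// code point starts with something other than a trailing (low) surrogate.
static std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF)
            ++n;
    }
    return n;
}

int main() {
    std::u16string s = u"a\U0001F600b";                       // 'a', U+1F600, 'b'
    std::printf("code units:  %zu\n", s.size());              // 4
    std::printf("code points: %zu\n", count_code_points(s));  // 3
}
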
I think the sensible practice would be to deprecate the use of wchar_t
as much as possible, using instead char8_t for UTF-8 strings when
dealing with string data (and especially for data interchange), and
char32_t for UTF-32 encoding internally if you need
character-by-character access. These are unambiguous and function
identically across platforms (except perhaps for the endianness of
char32_t). For interaction with legacy code and APIs on Windows,
char16_t is a better choice than wchar_t.
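As a rough sketch of what I mean (hand-rolled, assumes valid UTF-8, no
error handling): store and exchange text as char8_t, and decode to
char32_t only when you genuinely need character-at-a-time access.

#include <cstdio>
#include <string>
#include <vector>

// Decode UTF-8 (char8_t, C++20) into code points (char32_t).  Bare-bones:
// it assumes the input is valid UTF-8 and does no error checking.
static std::vector<char32_t> to_utf32(const std::u8string& in) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if      (b < 0x80) { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

int main() {
    std::u8string text = u8"a\u00E9\U0001F600";        // 'a', e-acute, U+1F600
    std::vector<char32_t> cps = to_utf32(text);
    std::printf("bytes: %zu, code points: %zu\n", text.size(), cps.size());  // 7, 3
}
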
Going forward, C++ could drop wchar_t and support for execution
wide-character sets other than Unicode, just as it is dropping support for
signed integer representations other than two's complement. This kind
of thing limits flexibility in theory, but not in practice, and it would
simplify things a bit.