On 11/16/2015 04:12 AM, Tijl Coosemans wrote:
> On Sun, 15 Nov 2015 18:08:15 -0500 James Kuyper <james...@verizon.net> wrote:
>> On 11/15/2015 05:34 PM, Tijl Coosemans wrote:
...
>>> The word character is used with two different meanings here I believe.
>>> There's the common definition of character as in a character of a
>>> character set and there's the term wide character which is defined in
>>> 3.7 as a wchar_t value. There's no mention of charXX_t in 3.7 so the
>>> meaning of wide character in the description of the unicode functions
>>> is not entirely clear but it seems to me that it is meant to mean
>>> charXX_t value and not a character from a character set. Otherwise
>>> the description of mbrtocXX() doesn't make any sense because it says
>>> that one multibyte character (a character from a character set) can
>>> produce multiple wide characters as output (charXX_t values which
>>> together encode a character from a character set).
>>
>> What makes you think that this is not in fact the case?
>
> I do think this is the case.
Sorry - I guess I got confused as to what you were claiming.
>> What would you expect the following code to do on a platform that
>> pre#defines __STDC_UTF_16__?
>>
>> mbstate_t state = {0};
>
> mbstate_t isn't necessarily a struct. You need to initialise it to zero
> using memset.
What data type could mbstate_t be that it can't be zero-initialized by
using {0}? mbstate_t is required to be "a complete object type other
than an array type" (7.29.1p2). 0 is a permitted initializer for any
arithmetic or pointer type. Braces are optionally permitted around the
initializer for any scalar type (6.7.9p11). mbstate_t is not allowed to
be an array type, but if it were, {0} would be allowed for that too.
The only types for which {0} is not a permitted initializer are, as far
as I can figure out, incomplete types and function types - mbstate_t is
not allowed to be either of those.
Note that the standard frequently uses the phrase "an mbstate_t object
initialized to zero". If this could not be achieved by initialization,
but only by a call to memset(), that phrase wouldn't seem appropriate.
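
To make the comparison concrete, here is a minimal sketch that zeroes an
mbstate_t both ways and asks mbsinit() whether each result describes the
initial conversion state. The function name zeroed_states_are_initial is
my own, purely for illustration. I am also assuming that all-bits-zero
is a valid initial state on the implementation at hand, which is true of
the ones I know of but is not something I can point to a guarantee for.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns nonzero if both ways of zeroing an
   mbstate_t produce an object that mbsinit() reports as describing
   the initial conversion state. */
int zeroed_states_are_initial(void)
{
    /* Brace initializer: accepted whether mbstate_t is a scalar,
       a struct, or a union. */
    mbstate_t by_init = {0};

    /* memset alternative: sets every byte to zero. Assumed (not
       guaranteed by anything I can cite) to also be an initial state. */
    mbstate_t by_memset;
    memset(&by_memset, 0, sizeof by_memset);

    return mbsinit(&by_init) != 0 && mbsinit(&by_memset) != 0;
}
```

If the function returns nonzero, the two forms are interchangeable on
that implementation, at least as far as mbsinit() can tell.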
>> char mb[] = "\U00010437";
>> char16_t c16[2];
>>
>> mbrtoc16(c16, mb, sizeof mb - 1, &state);
>> mbrtoc16(c16+1, mb, sizeof mb - 1, &state);
>>
>> If I understand 7.28.1.1 correctly (which is not guaranteed), I would
>> expect c16[0] == 0xD801 && c16[1] == 0xDC37, based upon
>> <https://en.wikipedia.org/wiki/UTF-16#Examples>. I'd expect both calls
>> to return "sizeof mb - 1".
>
> The first call reads sizeof(mb)-1 bytes that together form one multibyte
> character (here character means member of a character set). Then it
> determines how many wide characters (here character means char16_t object)
> are needed to encode this character (from a character set). It stores
> the first wide character in c16[0] and returns sizeof(mb)-1 (the number
> of bytes read). The second call sees that there are more wide characters
> in mbstate and stores the next one in c16[1]. Then it returns -3.
Not quite: it returns (size_t)(-3), or in other words, SIZE_MAX-2.
I was confused by the description of the return values from mbrtoc16(),
but your assertion that the second call returns -3 seems to make sense
(after correction).
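
For what it's worth, here is how I would sort the return values listed
in 7.28.1.1, as a sketch only. The enum and the function name
classify_c16_result are mine, not anything from <uchar.h>. Because
size_t is unsigned, the special values have to be compared as
(size_t)(-1), (size_t)(-2) and (size_t)(-3), not as negative numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the cases described in 7.28.1.1. */
enum c16_result {
    C16_NUL,          /* 0: the null character was converted           */
    C16_CONSUMED,     /* 1..n: that many bytes completed a character   */
    C16_STORED_NEXT,  /* (size_t)(-3): next char16_t stored from state,
                         no input bytes consumed                       */
    C16_INCOMPLETE,   /* (size_t)(-2): incomplete multibyte character  */
    C16_ERROR         /* (size_t)(-1): encoding error                  */
};

/* Classify a value returned by mbrtoc16(). */
enum c16_result classify_c16_result(size_t r)
{
    if (r == (size_t)(-1)) return C16_ERROR;
    if (r == (size_t)(-2)) return C16_INCOMPLETE;
    if (r == (size_t)(-3)) return C16_STORED_NEXT;   /* same as SIZE_MAX - 2 */
    if (r == 0)            return C16_NUL;
    return C16_CONSUMED;
}
```

Passing SIZE_MAX - 2 to this function gives C16_STORED_NEXT, which is
the case your description of the second call corresponds to.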
I'm still unclear about how you're supposed to know whether a second
call to mbrtoc16() with the same input will be needed.
--
James Kuyper