On 15/03/2019 14.20, Thiago Macieira wrote:
> On Friday, 15 March 2019 09:41:20 PDT Matthew Woehlke wrote:
>> I'm curious whether the standard requires this code to compile:
>>
>> #define UY(str) u"" str
>> #define UX(str) UY(str)
>> #define UL(str) UX(str)
>>
>> constexpr auto x = u"\u269E \U0001f387 \u269F";
>> constexpr auto y = UL("\u269E \U0001f387 \u269F");
>>
>> static_assert(x[0] == y[0], "oh noes!");
>> static_assert(sizeof(x) == sizeof(y), "oh noes!");
>>
>> It does on Linux/GCC and probably on any setup where the local code page
>> is UTF-8. However, it does not compile on Visual Studio with CP-1252.
>
> You've complicated the problem way beyond what the real issue is, with macros,
> and by removing the actual issue from visibility.
Agreed in hindsight. Originally, I wasn't convinced that the
preprocessor wasn't part of the problem.
(Moreover, I'd originally believed the ultimate macro was defined as
`u##s`, not `u"" s`. In fact, MSVC does get *that* case right, as expected.)
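(For comparison, a minimal sketch of that token-pasting form; the macro name
UP is mine, not from the real code:)

  // The u prefix is pasted onto the literal token itself, so the whole
  // argument becomes a single u"..." literal and never exists as a narrow
  // string in translation phase 5.
  #define UP(str) u##str

  constexpr auto& z = UP("\u269E \U0001f387 \u269F");
  static_assert(z[0] == u'\u269E', "token-pasted literal is translated as UTF-16");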
> The problem can be summarised with this:
>
> auto x = u"" "\u0102";
> static_assert(x[0] = 0x0102, "oh noes");
>
> Which I think almost everyone will agree is what we expect the compiler to do.
> And it's what GCC does, but not Visual Studio. I'm going to argue that Visual
> Studio's behaviour is buggy, regardless of what the standard says.
...even if the execution character set is *not* UTF-8? If you're arguing
that, I'm going to have to disagree, at least inasmuch as I feel the
standard is quite clear that that case won't work.
That said, I could see an argument that that's stupid, and we should
consider the *final* character set before that mangling happens. I think
this would ultimately amount to switching (or maybe interleaving) phases
5 and 6, and as previously stated, I think it would be interesting to
entertain a proposal along those lines.
> When run with the /utf-8 option (which is the only sane way to run VS where
> Unicode is concerned), the compiler interprets "\u0102" from narrow, using the
> input character set, and converts to the execution character set, as required
> by Phase 5. Given the /utf-8 option, both the input and the execution
> character sets are UTF-8. So in this step, "\u0102" becomes execution bytes
> 0xC4 0x82.
>
> Then it sees u"" and needs to concatenate, as per Phase 6. Since Table 9 is
> very clear saying that u"a" "b" is the same as u"ab" after concatenation, it
> needs to produce UTF-16. So the only reasonable thing for it to do would be to
> take the 0xC4 0x82 bytes and interpret according to the execution character
> set, yielding U+0102 and then writing UTF-16 word 0x0102.
Agreed, but now we're talking about something different from my original
case. (Disclaimer: I haven't actually tried this case myself.)
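(As a sanity check, the Phase 5 half of that can be verified directly,
assuming a UTF-8 execution character set, i.e. /utf-8 on MSVC or the
default on GCC/Clang:)

  // With a UTF-8 execution character set, the narrow literal "\u0102"
  // must be exactly the two bytes 0xC4 0x82, plus the terminator.
  constexpr auto& n = "\u0102";
  static_assert(sizeof(n) == 3, "two UTF-8 bytes plus the terminator");
  static_assert(static_cast<unsigned char>(n[0]) == 0xC4
             && static_cast<unsigned char>(n[1]) == 0x82,
                "UTF-8 encoding of U+0102");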
> That's not what it did. It interpreted 0xC4 0x82 using CP-1252, which is
> neither the input nor the execution character set. That meant it converted to
> an internal representation of U+00C4 U+201A. That's mojibake.
Yup :-).
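(If it helps anyone reproduce this, here's a quick sketch to dump what a
given compiler actually emits; the expected values are taken from your
analysis above:)

  // A conforming compiler prints "0x0102 0x0000"; the mojibake described
  // above would show up as "0x00C4 0x201A 0x0000".
  #include <cstdio>

  int main()
  {
      for (char16_t c : u"" "\u0102")
          std::printf("0x%04X ", static_cast<unsigned>(c));
      std::printf("\n");
  }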
--
Matthew