Preprocessor and fun with string literals

Matthew Woehlke

Mar 15, 2019, 12:41:25 PM
to std-dis...@isocpp.org
I'm curious whether the standard requires this code to compile:

#define UY(str) u"" str
#define UX(str) UY(str)
#define UL(str) UX(str)

constexpr auto x = u"\u269E \U0001f387 \u269F";
constexpr auto y = UL("\u269E \U0001f387 \u269F");

static_assert(x[0] == y[0], "oh noes!");
static_assert(sizeof(x) == sizeof(y), "oh noes!");

It does on Linux/GCC and probably on any setup where the local code page
is UTF-8. However, it does not compile on Visual Studio with CP-1252.

Does the standard specify what is supposed to happen here? If so, where?

(p.s. Some of the macro indirection may not matter; this is a simplified
version of Qt5's QStringLiteral, which has the same number of levels of
indirection.)

--
Matthew

Tadeus Prastowo

Mar 15, 2019, 1:26:25 PM
to std-discussion
On Fri, Mar 15, 2019 at 5:41 PM Matthew Woehlke
<mwoehlk...@gmail.com> wrote:
>
> I'm curious whether the standard requires this code to compile:
>
> #define UY(str) u"" str
> #define UX(str) UY(str)
> #define UL(str) UX(str)
>
> constexpr auto x = u"\u269E \U0001f387 \u269F";
> constexpr auto y = UL("\u269E \U0001f387 \u269F");
>
> static_assert(x[0] == y[0], "oh noes!");
> static_assert(sizeof(x) == sizeof(y), "oh noes!");
>
> It does on Linux/GCC and probably on any setup where the local code page
> is UTF-8. However, it does not compile on Visual Studio with CP-1252.
>
> Does the standard specify what is supposed to happen here? If so, where?

http://eel.is/c++draft/lex.string#12

That refers to http://eel.is/c++draft/lex.phases, which in turn
specifies that macro expansion is performed first before concatenating
adjacent string literals. In other words, you are actually seeing:

constexpr auto x = u"\u269E \U0001f387 \u269F";
constexpr auto y = u"" "\u269E \U0001f387 \u269F";

That, according to http://eel.is/c++draft/lex.string#12, results in:

constexpr auto x = u"\u269E \U0001f387 \u269F";
constexpr auto y = u"\u269E \U0001f387 \u269F";

So Visual Studio (which version?) errs, because the concatenation of
adjacent string literals is, in this case, not implementation-defined.

--
Best regards,
Tadeus

Matthew Woehlke

Mar 15, 2019, 1:45:19 PM
to std-dis...@isocpp.org, Tadeus Prastowo
On 15/03/2019 13.26, Tadeus Prastowo wrote:
> On Fri, Mar 15, 2019 at 5:41 PM Matthew Woehlke wrote:
>> I'm curious whether the standard requires this code to compile:
>>
>> #define UY(str) u"" str
>> #define UX(str) UY(str)
>> #define UL(str) UX(str)
>>
>> constexpr auto x = u"\u269E \U0001f387 \u269F";
>> constexpr auto y = UL("\u269E \U0001f387 \u269F");
>>
>> static_assert(x[0] == y[0], "oh noes!");
>> static_assert(sizeof(x) == sizeof(y), "oh noes!");
>>
>> Does the standard specify what is supposed to happen here? If so, where?
>
> http://eel.is/c++draft/lex.string#12
>
> That refers to http://eel.is/c++draft/lex.phases, which in turn
> specifies that macro expansion is performed first before concatenating
> adjacent string literals. In other words, you are actually seeing:
>
> constexpr auto x = u"\u269E \U0001f387 \u269F";
> constexpr auto y = u"" "\u269E \U0001f387 \u269F";

Right. However...

> That, according to http://eel.is/c++draft/lex.string#12, results in:
>
> constexpr auto x = u"\u269E \U0001f387 \u269F";
> constexpr auto y = u"\u269E \U0001f387 \u269F";

...I don't think this is correct. Again, according to [lex.phases]:

5. Each source character set member in a character literal or a string
literal, as well as each escape sequence and universal-character-
name in a character literal or a non-raw string literal, is
converted to the corresponding member of the execution character
set; if there is no corresponding member, it is converted to an
implementation-defined member

Note that this is phase *5*. Literal concatenation is phase *6*. So it
would seem that the MSVC behavior is not only permissible, it is
*required*¹.

IOW, I have (after phase 4):

constexpr auto x = u"\u269E \U0001f387 \u269F";
constexpr auto y = u"" "\u269E \U0001f387 \u269F";

...which, after phase 5, is:

constexpr auto x = u"\u269E \U0001f387 \u269F";
constexpr auto y = u"" "? ? ?";

...and after phase 6 is:

constexpr auto x = u"\u269E \U0001f387 \u269F";
constexpr auto y = u"? ? ?";

...which trips the static_assert because the strings are not the same.

(¹ ...given an execution character set that cannot encode the characters
in question.)
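
(On such an implementation, the substitution is easy to probe directly;
given the "? ? ?" shown above, a sketch:

constexpr auto p = u"" "\u269E";
static_assert(p[0] == u'?', "got the implementation-defined replacement");

...where u'?' is of course specific to implementations that pick '?' as
their replacement character.)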

I wonder how a proposal to swap phases 5 and 6 would go over...

Anyway, thanks for the pointers!

--
Matthew

Tadeus Prastowo

Mar 15, 2019, 2:03:45 PM
to Matthew Woehlke, std-discussion
I agree with your analysis. In the same way, compiling the program for
an execution character set that can encode the characters should not
trip any static assertion. In other words, GCC is also correct.

> I wonder how a proposal to swap phases 5 and 6 would go over...
>
> Anyway, thanks for the pointers!

Thanks for correcting my analysis.

> --
> Matthew

--
Best regards,
Tadeus

Thiago Macieira

Mar 15, 2019, 2:20:37 PM
to std-dis...@isocpp.org
On Friday, 15 March 2019 09:41:20 PDT Matthew Woehlke wrote:
> I'm curious whether the standard requires this code to compile:
>
> #define UY(str) u"" str
> #define UX(str) UY(str)
> #define UL(str) UX(str)
>
> constexpr auto x = u"\u269E \U0001f387 \u269F";
> constexpr auto y = UL("\u269E \U0001f387 \u269F");
>
> static_assert(x[0] == y[0], "oh noes!");
> static_assert(sizeof(x) == sizeof(y), "oh noes!");
>
> It does on Linux/GCC and probably on any setup where the local code page
> is UTF-8. However, it does not compile on Visual Studio with CP-1252.

You've complicated the problem way beyond the real issue with the
macros, which hide the actual issue from view.

The problem can be summarised with this:

auto x = u"" "\u0102";
static_assert(x[0] = 0x0102, "oh noes");

Which I think almost everyone will agree is what we expect the compiler to do.
And it's what GCC does, but not Visual Studio. I'm going to argue that Visual
Studio's behaviour is buggy, regardless of what the standard says. My
argument:

When run with the /utf-8 option (which is the only sane way to run VS where
Unicode is concerned), the compiler interprets "\u0102" from narrow, using the
input character set, and converts to the execution character set, as required
by Phase 5. Given the /utf-8 option, both the input and the execution
character sets are UTF-8. So in this step, "\u0102" becomes execution bytes
0xC4 0x82.

Then it sees u"" and needs to concatenate, as per Phase 6. Since Table 9 is
very clear saying that u"a" "b" is the same as u"ab" after concatenation, it
needs to produce UTF-16. So the only reasonable thing for it to do would be to
take the 0xC4 0x82 bytes and interpret according to the execution character
set, yielding U+0102 and then writing UTF-16 word 0x0102.

That's not what it did. It interpreted 0xC4 0x82 using CP-1252, which is
neither the input nor the execution character set. That meant it converted to
an internal representation of U+00C4 U+201A. That's mojibake.
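
To make that concrete, here's a minimal sketch modelling the
byte-by-byte misinterpretation (the CP-1252 lookup helper is
hypothetical, covering only the two bytes this example needs):

#include <cstdint>

// Hypothetical helper: CP-1252 byte -> Unicode code point. Outside the
// 0x80-0x9F block CP-1252 matches Latin-1, so the byte value *is* the
// code point.
constexpr char32_t cp1252_to_unicode(std::uint8_t b) {
    return b == 0x82 ? U'\u201A'   // SINGLE LOW-9 QUOTATION MARK
                     : char32_t(b);
}

// "\u0102" encoded as UTF-8 under /utf-8 is the byte pair C4 82.
// Decoded as one UTF-8 sequence, that is U+0102; decoded byte-by-byte
// through CP-1252 it becomes the mojibake pair below.
static_assert(cp1252_to_unicode(0xC4) == U'\u00C4', "A with diaeresis");
static_assert(cp1252_to_unicode(0x82) == U'\u201A', "low-9 quote");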

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel System Software Products



Matthew Woehlke

Mar 15, 2019, 3:18:19 PM
to std-dis...@isocpp.org, Thiago Macieira
On 15/03/2019 14.20, Thiago Macieira wrote:
> On Friday, 15 March 2019 09:41:20 PDT Matthew Woehlke wrote:
>> I'm curious whether the standard requires this code to compile:
>>
>> #define UY(str) u"" str
>> #define UX(str) UY(str)
>> #define UL(str) UX(str)
>>
>> constexpr auto x = u"\u269E \U0001f387 \u269F";
>> constexpr auto y = UL("\u269E \U0001f387 \u269F");
>>
>> static_assert(x[0] == y[0], "oh noes!");
>> static_assert(sizeof(x) == sizeof(y), "oh noes!");
>>
>> It does on Linux/GCC and probably on any setup where the local code page
>> is UTF-8. However, it does not compile on Visual Studio with CP-1252.
>
> You've complicated the problem way beyond what the real issue is, with macros,
> and by removing the actual issue from visibility.

Agreed in hindsight. Originally, I wasn't convinced that the
preprocessor wasn't part of the problem.

(Moreover, I'd originally believed the ultimate macro was defined as
`u##s`, not `u"" s`. In fact, MSVC does get *that* case right, as expected.)
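
For the record, here's a sketch of that pasting variant (the *2 names
are made up for illustration). Because ## pastes during phase 4, the u
prefix is attached before phase 5 ever sees a narrow literal, so there
is no narrow-then-convert step to go wrong:

#define UY2(str) u##str
#define UX2(str) UY2(str)
#define UL2(str) UX2(str)

constexpr auto z = UL2("\u269E \U0001f387 \u269F"); // one u"..." token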

> The problem can be summarised with this:
>
> constexpr auto x = u"" "\u0102";
> static_assert(x[0] == 0x0102, "oh noes");
>
> Which I think almost everyone will agree is what we expect the compiler to do.
> And it's what GCC does, but not Visual Studio. I'm going to argue that Visual
> Studio's behaviour is buggy, regardless of what the standard says.

...even if the execution character set is *not* UTF-8? If you're arguing
that, I'm going to have to disagree, at least inasmuch as I feel the
standard is quite clear that that case won't work.

That said, I could see an argument that that's stupid, and we should
consider the *final* character set before that mangling happens. I think
this would ultimately amount to switching (or maybe interleaving) phases
5 and 6, and as previously stated, I think it would be interesting to
entertain a proposal along those lines.

> When run with the /utf-8 option (which is the only sane way to run VS where
> Unicode is concerned), the compiler interprets "\u0102" from narrow, using the
> input character set, and converts to the execution character set, as required
> by Phase 5. Given the /utf-8 option, both the input and the execution
> character sets are UTF-8. So in this step, "\u0102" becomes execution bytes
> 0xC4 0x82.
>
> Then it sees u"" and needs to concatenate, as per Phase 6. Since Table 9 is
> very clear saying that u"a" "b" is the same as u"ab" after concatenation, it
> needs to produce UTF-16. So the only reasonable thing for it to do would be to
> take the 0xC4 0x82 bytes and interpret according to the execution character
> set, yielding U+0102 and then writing UTF-16 word 0x0102.

Agreed, but now we're talking about something different from my original
case. (Disclaimer: I haven't actually tried this case myself.)

> That's not what it did. It interpreted 0xC4 0x82 using CP-1252, which is
> neither the input nor the execution character set. That meant it converted to
> an internal representation of U+00C4 U+201A. That's mojibake.

Yup :-).

--
Matthew

Thiago Macieira

Mar 15, 2019, 4:05:51 PM
to Matthew Woehlke, std-dis...@isocpp.org
On Friday, 15 March 2019 12:18:15 PDT Matthew Woehlke wrote:
> > The problem can be summarised with this:
> > auto x = u"" "\u0102";
> > static_assert(x[0] = 0x0102, "oh noes");
> >
> > Which I think almost everyone will agree is what we expect the compiler to
> > do. And it's what GCC does, but not Visual Studio. I'm going to argue
> > that Visual Studio's behaviour is buggy, regardless of what the standard
> > says.
> ...even if the execution character set is *not* UTF-8? If you're arguing
> that, I'm going to have to disagree, at least in as much as I feel the
> standard is quite clear that that case won't work.

No, but note the part where I said "which is the only sane way to run VS where
Unicode is concerned". I quite frankly don't care anymore about anyone running
MSVC without that option. With that option, the behaviour is *clearly* buggy.

Without the option, you can argue that the compiler is doing the right thing
as per the standard text. And that the standard should be fixed so that the
compiler is required to do what we expect it to do. I agree with both. But I
don't care to spend time on this, since using /utf-8 should be mandatory.

Richard Smith

Mar 15, 2019, 10:26:08 PM
to std-dis...@isocpp.org, Thiago Macieira
Swapping phases 5 and 6 is certainly wrong. See the example in [lex.string]p12:

"[Example:
"\xA" "B"
contains the two characters ’\xA’ and ’B’ after concatenation (and not
the single hexadecimal character
’\xAB’). — end example]

If you concatenate and then interpret escape sequences, you
misinterpret escape sequences.
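
A compilable restatement of that example, assuming nothing beyond the
quoted wording:

constexpr char s[] = "\xA" "B";
static_assert(s[0] == '\xA' && s[1] == 'B', "two characters, per p12");
static_assert(sizeof(s) == 3, "'\\xA', 'B', and the trailing NUL");

Concatenating first would instead lex a single "\xAB" escape out of the
merged spelling.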

Instead, I think we should remove phases 5 and 6 entirely, parse one
or more string-literal tokens as a string literal expression, and only
perform the translation from the contents of the string literal tokens
into characters in the execution character set as part of specifying
the semantics of a string literal expression.

> > When run with the /utf-8 option (which is the only sane way to run VS where
> > Unicode is concerned), the compiler interprets "\u0102" from narrow, using the
> > input character set, and converts to the execution character set, as required
> > by Phase 5. Given the /utf-8 option, both the input and the execution
> > character sets are UTF-8. So in this step, "\u0102" becomes execution bytes
> > 0xC4 0x82.
> >
> > Then it sees u"" and needs to concatenate, as per Phase 6. Since Table 9 is
> > very clear saying that u"a" "b" is the same as u"ab" after concatenation, it
> > needs to produce UTF-16. So the only reasonable thing for it to do would be to
> > take the 0xC4 0x82 bytes and interpret according to the execution character
> > set, yielding U+0102 and then writing UTF-16 word 0x0102.
>
> Agreed, but now we're talking about something different from my original
> case. (Disclaimer: I haven't actually tried this case myself.)
>
> > That's not what it did. It interpreted 0xC4 0x82 using CP-1252, which is
> > neither the input nor the execution character set. That meant it converted to
> > an internal representation of U+00C4 U+201A. That's mojibake.
>
> Yup :-).
>
> --
> Matthew

Matthew Woehlke

Mar 19, 2019, 11:10:20 AM
to std-dis...@isocpp.org, Richard Smith, Thiago Macieira
On 15/03/2019 22.25, Richard Smith wrote:
> Swapping phases 5 and 6 is certainly wrong. See the example in [lex.string]p12:
>
> "[Example:
> "\xA" "B"
> contains the two characters ’\xA’ and ’B’ after concatenation (and not
> the single hexadecimal character
> ’\xAB’). — end example]
>
> If you concatenate and then interpret escape sequences, you
> misinterpret escape sequences.

You're right, of course. Considering that, the status quo makes sense
for a world that has only "regular" string literals. I wonder if anyone
thought about these questions when we started adding u"", U"", etc.
literals?

> Instead, I think we should remove phases 5 and 6 entirely, parse one
> or more string-literal tokens as a string literal expression, and only
> perform the translation from the contents of the string literal tokens
> into characters in the execution character set as part of specifying
> the semantics of a string literal expression.

I *think* that makes sense... not sure I'm following 100%.

Here's my thinking; is this the same effect?

- *New* phase 5: identify adjacent string literals (i.e. literals that
will be concatenated) and replace their encoding specifiers with the
"merged" specifier as determined by current phase 6.

- "New" phase 6, 8, ...: same as current phases 5, 7, ... respectively
(i.e. existing phases 5, 7+ all get bumped one number).

- New(ish) phase 7: simplified version of current phase 6, because the
rules for how to combine literals with different prefixes are no longer
needed in this phase.

So, looking at the original example, we would have:

Input:
u"" "\u2605"

Phase 5: literal encoding promotion
u"" u"\u2605"

Phase 6: literal encoding
char16_t[]{0} char16_t[]{0x2605, 0}

Phase 7: concatenation
char16_t[]{0x2605, 0}

(NULs are illustrative only; it's not obvious at what stage they are
required to be added, or even whether such a strict requirement exists.)
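
In code, the guarantee this reordering is meant to provide would be
something like the following (a sketch of intent, not of any current
compiler's behavior):

constexpr auto a = u"" "\u2605";
constexpr auto b = u"\u2605";
static_assert(a[0] == b[0],
              "concatenation must not depend on the narrow charset");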

--
Matthew

Richard Smith

Mar 19, 2019, 2:46:44 PM
to std-dis...@isocpp.org, Thiago Macieira
This makes sense to me.

> (NULs are illustrative only; it's not obvious at what stage they are
> required to be added, or even whether such a strict requirement exists.)

See [lex.string]p7: "After any necessary concatenation, in translation
phase 7 (5.2), ’\0’ is appended to every string literal so that
programs that scan a string can find its end."
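
That terminator is observable in the array size of a concatenated
literal; a trivial check:

static_assert(sizeof(u"a" "b") == 3 * sizeof(char16_t),
              "'a', 'b', and the '\\0' appended in phase 7");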

Matthew Woehlke

Mar 19, 2019, 3:16:59 PM
to std-dis...@isocpp.org, Richard Smith, Thiago Macieira
On 19/03/2019 14.46, Richard Smith wrote:
> On Tue, 19 Mar 2019 at 08:10, Matthew Woehlke wrote:
>> (NUL's are illustrative only; it's not obvious at what stage they are
>> required to be added, or even if such a strict requirement exists.)
>
> See [lex.string]p7: "After any necessary concatenation, in translation
> phase 7 (5.2), ’\0’ is appended to every string literal so that
> programs that scan a string can find its end."

Ah, of course... stuff that happens in phase 7 that's *not described in
lex.phases*.

Well, at least Giuseppe's comment (same subject, qt-devel list) makes
sense now.

Ah, well, at least we seem to have a plausible idea for reducing user
surprise. Anyone want to (co)write a paper? (I don't promise at this
time either to write, or to not write, such a paper...)

--
Matthew

sdo...@gmail.com

Mar 20, 2019, 11:17:34 AM
to ISO C++ Standard - Discussion, ric...@metafoo.co.uk, thi...@macieira.org
The important part, at least for not getting broken results, is doing
the conversion from the internal representation to the required encoding
only after deciding what kind of string literal the end result will be.
That does leave another avenue for getting arbitrary bytes into Unicode
literals via hex escape sequences, potentially producing an ill-formed
encoding, but in principle that is no worse than writing the bytes
directly.
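
For example (a sketch; how, or whether, a given compiler diagnoses this
may vary):

// A hex escape smuggles an arbitrary byte into a UTF-8 literal,
// yielding an ill-formed encoding; no worse than writing the byte
// directly, but possible.
constexpr auto bad = u8"\xFF"; // 0xFF never appears in valid UTF-8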