Preprocessor arithmetic

Neil Booth

Apr 20, 2008, 2:19:12 AM

Regarding pp-arithmetic, 6.10.1p4 says:

For the purposes of this token conversion and evaluation, all signed
integer types and all unsigned integer types act as if they have the
same representation as, respectively, the types intmax_t and
uintmax_t defined in the header <stdint.h>.145) This includes
interpreting character constants, ....

I wonder if this blanket change in type semantics was really intended.
There are many references to types in the section about character
constants. For example, the above text would appear to require a
conforming implementation to accept

#if '\x7fffffff'
#endif

because constraint 6.4.4.4p9 is not violated and there is no other
reason to reject the lines. However GCC, Comeau, and others all reject
this with a diagnostic complaining the hexadecimal escape sequence is
out of range.

Are they non-conforming, have I misunderstood the standard, or is
this a defect in the standard?
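
The widening rule quoted above is easiest to see with plain integer constants, where its effect is not in dispute. The following is only a sketch, assuming a C99 implementation (so intmax_t is at least 64 bits wide); the names PP_SUM_IS_POSITIVE, pp_sum_is_positive and check_pp_widening are made up for illustration.

```c
#include <assert.h>

/* Inside #if, the constant 0x7fffffff acts as if it had type intmax_t,
   so adding 1 cannot overflow: C99 makes intmax_t at least 64 bits wide. */
#if 0x7fffffff + 1 > 0
#define PP_SUM_IS_POSITIVE 1
#else
#define PP_SUM_IS_POSITIVE 0
#endif

/* Hypothetical helper: reports which branch the preprocessor took. */
int pp_sum_is_positive(void)
{
    return PP_SUM_IS_POSITIVE;
}

/* Self-check: aborts if the preprocessor did not widen the arithmetic. */
void check_pp_widening(void)
{
    assert(pp_sum_is_positive() == 1);
}
```

Written in ordinary code with a 32-bit int, the same addition would overflow. The open question is whether the same widening also relaxes the range check on '\x7fffffff'.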

Neil.

WANG Cong

Apr 20, 2008, 10:02:09 AM

Neil Booth wrote:

But the standard also states:

This includes interpreting character constants, which may involve converting
escape sequences into execution character set members.

According to my own understanding, the reason why gcc complains about this
is that '\x7fffffff' is beyond the range of the execution character set.

Thanks.

--
Hi, I'm a .signature virus, please copy/paste me to help me spread
all over the world.

Neil Booth

Apr 20, 2008, 11:27:49 AM

WANG Cong wrote:

> But the standard also states:
>
> This includes interpreting character constants, which may involve converting
> escape sequences into execution character set members.
>
> According to my own understanding, the reason why gcc complains about this
> is that '\x7fffffff' is beyond the range of the execution character set.

As the co-author of that part of GCC, I can say your understanding is mistaken.
It's complaining about the escape sequence being outside the range
of the target's "unsigned char". However the standard's wording
appears to require that to be at least 64 bits wide, because "unsigned
char" must have the representation of uintmax_t for this pptoken to
token conversion.

Neil.

cr88192

Apr 21, 2008, 8:16:31 PM

"Neil Booth" <dev@null> wrote in message
news:480b60f1$0$285$44c9...@news2.asahi-net.or.jp...

actually, I would think that the above token is incorrect:
'\xHH', aka, it only takes 2 hex digits;
'\uHHHH', aka, 4 hex digits;
'\UHHHHHHHH', taking 8 hex digits.

if '\x' can take more than 2 hex chars, it is a mystery then how it is
unambiguously parsed in strings?...

for example, "\x27abadexample", ...

as further detail:
the way my compiler deals with character escapes, is that currently strings
are internally assumed to always be UTF-8 ('long' strings are, likewise,
internally UTF-8 until the final output is generated).

I will assume then, that GCC does something different (such as leaving
escapes as escapes until some later stage of the compilation process?...).

or such...

> Neil.

Harald van Dĳk

Apr 22, 2008, 12:52:26 AM

On Tue, 22 Apr 2008 10:16:31 +1000, cr88192 wrote:
> actually, I would think that the above token is incorrect: '\xHH', aka,
> it only takes 2 hex digits;
>

> if '\x' can take more than 2 hex chars, it is a mystery then how it is
> unambiguously parsed in strings?...
>
> for example, "\x27abadexample", ...

This is a string consisting of 8 characters, { '\x27abade', 'x', 'a',
'm', 'p', 'l', 'e', '\0' }. On implementations where UCHAR_MAX <
0x27abade, this violates a constraint, but does not introduce any parsing
ambiguity any more than a+++++b does.

Richard Tobin

Apr 22, 2008, 10:05:37 AM

In article <b446e$480d2aed$ca83b482$32...@saipan.com>,
cr88192 <cr8...@NOSPAM.hotmail.com> wrote:

>if '\x' can take more than 2 hex chars, it is a mystery then how it is
>unambiguously parsed in strings?...
>
>for example, "\x27abadexample", ...

C99 says (6.4.4.4):

Each octal or hexadecimal escape sequence is the longest sequence of
characters that can constitute the escape sequence.

I suppose you could argue that "can constitute" is not completely
unambiguous, but I think the intent is clear.
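
For anyone writing a lexer, the "longest sequence" rule could be sketched roughly as below. This is an illustration only, not code from any compiler mentioned in this thread; hex_escape_length and check_hex_escape_length are hypothetical names, and the input is raw source text starting at the backslash.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical lexer helper: given source text starting at a backslash,
   return how many characters a hexadecimal escape consumes, or 0 if the
   text does not start with one. */
size_t hex_escape_length(const char *src)
{
    size_t n;

    if (src[0] != '\\' || src[1] != 'x' || !isxdigit((unsigned char)src[2]))
        return 0;
    n = 3;                                   /* backslash, 'x', first digit */
    while (isxdigit((unsigned char)src[n]))  /* take every hex digit available */
        n++;
    return n;
}

/* Self-check using the string discussed above (backslash doubled so that
   the C literal holds the raw source characters). */
void check_hex_escape_length(void)
{
    assert(hex_escape_length("\\x27abadexample") == 9);
}
```

The escape stops at the second 'x' only because 'x' is not a hexadecimal digit, so it covers \x27abade. Whether the resulting value is in range is a separate check made afterwards.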

-- Richard
--
:wq

lawrenc...@siemens.com

Apr 22, 2008, 10:01:43 AM

Right. If you need to stop a hex escape from consuming more than you
want, use multiple adjacent string literals: "\x27" "abadexample".
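
A small sketch of that workaround, assuming an ASCII execution character set (so \x27 is an apostrophe); the names split and check_split_literal are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Escape sequences are converted before adjacent literals are
   concatenated, so "\x27" stays a single character here. */
static const char split[] = "\x27" "abadexample";

/* Self-check: one escaped character, eleven letters, one terminator. */
void check_split_literal(void)
{
    assert(strlen(split) == 12);
    assert(sizeof split == 13);
    assert(split[1] == 'a');
}
```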

-Larry Jones

What's the matter? Don't you trust your own kid?! -- Calvin

Keith Thompson

Apr 22, 2008, 11:38:38 AM

Right, but the syntax limits octal escape sequences to 3 digits, while
hexadecimal escape sequences can be arbitrarily long:

octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit

hexadecimal-escape-sequence:
\x hexadecimal-digit
hexadecimal-escape-sequence hexadecimal-digit

So this:
"\00000"
is a null character followed by two digits '0', whereas this:
"\x00000"
is just a null character (followed, in both cases, by another
null character to terminate the string).
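
The difference can be checked with sizeof. This is a minimal sketch; the array and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Octal escapes stop after three digits: '\000', '0', '0', terminator. */
static const char octal_str[] = "\00000";

/* Hex escapes take every hex digit: '\x00000', terminator. */
static const char hex_str[] = "\x00000";

/* Self-check of the array sizes implied by the grammar. */
void check_escape_lengths(void)
{
    assert(sizeof octal_str == 4);
    assert(sizeof hex_str == 2);
}
```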

--
Keith Thompson (The_Other_Keith) <ks...@mib.org>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Douglas A. Gwyn

Apr 22, 2008, 4:03:02 PM

Neil Booth wrote:
> #if '\x7fffffff' ...

> However the standard's wording
> appears to require that to be at least 64 bits wide, because "unsigned
> char" must have the representation of uintmax_t for this pptoken to
> token conversion.

I don't follow the reasoning. Nor why you didn't apply it to
'xFFFFFFFFFFFFFFFFFFFF'
Anyway, I think you understand what the intent really was.

cr88192

Apr 22, 2008, 5:10:50 PM

"Keith Thompson" <ks...@mib.org> wrote in message
news:87k5ipx...@kvetch.smov.org...

yes, ok.

I had missed these, and as such may need to fix my parser.
I guess I had assumed Java syntax or something, where \x only takes 2 chars
(\u and \U being used for more). since, elsewhere, \u and \U had been
specified, and took the expected number of chars, I had mistakenly assumed
that \x was similar, not having read the C standard "in detail" (my compiler
was written mostly from skimming over the standard and from my personal
experience, which it seems contains many minor errors...).

Neil Booth

Apr 22, 2008, 7:05:38 PM

Your example is a multicharacter constant. Mine is a single-character
constant. That's a big difference. Also mine has a hex escape
sequence, for which there is a constraint that it be in the range
of unsigned char. There is no such clear constraint on multicharacter
constants that don't get into implementation-defined territory.

The preprocessor arithmetic language states that, when converting
pptokens to tokens, all types have the range of [u]intmax_t. Hence
my example would not be violating the hex escape constraint, as it is a
31-bit number and uintmax_t is at least 64 bits.
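
The two readings of the constraint can be put side by side in ordinary C. This is only a sketch of the comparison, not of how any preprocessor implements it; escape_in_range, ESCAPE_VALUE and the other names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Value of the escape sequence in '\x7fffffff': a 31-bit number. */
#define ESCAPE_VALUE UINTMAX_C(0x7fffffff)

/* Hypothetical helper: does an escape value fit a type whose maximum is max? */
int escape_in_range(uintmax_t value, uintmax_t max)
{
    return value <= max;
}

/* Reading 1: the constraint is checked against the real unsigned char. */
int in_range_unwidened(void)
{
    return escape_in_range(ESCAPE_VALUE, UCHAR_MAX);
}

/* Reading 2: unsigned char acts as uintmax_t during #if evaluation. */
int in_range_widened(void)
{
    return escape_in_range(ESCAPE_VALUE, UINTMAX_MAX);
}

/* Self-check: the widened reading always accepts the value. */
void check_widened_reading(void)
{
    assert(in_range_widened() == 1);
}
```

On a platform with 8-bit char, in_range_unwidened() returns 0, which matches the diagnostics reported for GCC and Comeau, while in_range_widened() returns 1 on any C99 implementation.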

However, many (all?) compilers have chosen to ignore the standard's
wording here and still treat the constraint as being on the unmodified
type, which presumably is the intended behaviour.

The wording in the standard is particularly poor here, as it even
singles out character constants saying "yes, this widening rule
really does apply to them", viz:

For the purposes of this token conversion and evaluation, all signed
integer types and all unsigned integer types act as if they have the
same representation as, respectively, the types intmax_t and uintmax_t
defined in the header <stdint.h>.145) This includes interpreting
character constants...

The point of my post was that, "no, I suspect it only applies to some
parts of character constant conversion, but which parts precisely?".
You will note there are several references to types in 6.4.4.4; only
some of those are probably intended to be "widened".

One cannot, on the one hand, point out that the standard is carefully and
precisely worded, and that it really does mean what it says, when dealing
with issues related to its meaning and wording (a point I see made a lot),
and then, on the other hand, say in cases like this that whilst it doesn't
mean precisely what it says the intent was nevertheless clear.

Neil.

Charlie Gordon

Apr 23, 2008, 9:24:59 AM

"Neil Booth" <dev@null> a écrit dans le message de news:
480e6f42$0$281$44c9...@news2.asahi-net.or.jp...

> Douglas A. Gwyn wrote:
>> Neil Booth wrote:
>>> #if '\x7fffffff' ...
>>> However the standard's wording
>>> appears to require that to be at least 64 bits wide, because "unsigned
>>> char" must have the representation of uintmax_t for this pptoken to
>>> token conversion.
>>
>> I don't follow the reasoning. Nor why you didn't apply it to
>> 'xFFFFFFFFFFFFFFFFFFFF'
>> Anyway, I think you understand what the intent really was.
>
> Your example is a multicharacter constant. Mine is a single-character
> constant. That's a big difference. Also mine has a hex escape
> sequence, for which there is a constraint that it be in the range
> of unsigned char. There is no such clear constraint on multicharacter
> constants that don't get into implementation-defined territory.
>
> The preprocessor arithmetic language states that, when converting
> pptokens to tokens, all types have the range of [u]intmax_t. Hence
> my example would not be violating the hex escape constraint, as it is a
> 31-bit number and uintmax_t is at least 64 bits.
>
> However, many (all?) compilers have chosen to ignore the standard's
> wording here and still treat the constraint as being on the unmodified
> type, which presumably is the intended behaviour.

A more telling test is this:

#if !'\x100000000'
#error "preprocessor is not C99 compliant"
#endif

gcc versions 3.4.4 through 4.2.3 give this output:
ppchar.c:1:6: warning: hex escape sequence out of range
ppchar.c:2:2: #error "preprocessor is not C99 compliant"

I'd be surprised to see a compiler accept it ;-)

--
Chqrlie.

Douglas A. Gwyn

Jul 5, 2008, 8:30:49 PM

"Neil Booth" <dev@null> wrote in message
news:480e6f42$0$281$44c9...@news2.asahi-net.or.jp...
> Your example is a multicharacter constant. ...

Obviously the \ character got lost somewhere during editing.