
[JW] skipped group which contains #nonstandard


Jun Woong

Mar 1, 2001, 9:07:16 AM
Hi...

C90 6.8.1:
"... skipped: directives are processed only through the name that
determines the directive in order to keep track of the level of
nested conditionals; ..."

#define MU_SO_YOU
#ifdef MU_SO_YOU
... /* this group behaves in s.c. manner */
#else
#nonstandard /* is this group successfully skipped? */
#endif


Thanks in advance...


--
Jun Woong (myco...@hanmail.net)
Dept. of Physics, Univ. of Seoul

Christian Bau

Mar 1, 2001, 11:08:31 AM
In article <UBsn6.130511$My5.5...@news.hananet.net>, "Jun Woong"
<myco...@hanmail.net> wrote:

> Hi...
>
> C90 6.8.1:
> "... skipped: directives are processed only through the name that
> determines the directive in order to keep track of the level of
> nested conditionals; ..."
>
> #define MU_SO_YOU
> #ifdef MU_SO_YOU
> ... /* this group behaves in s.c. manner */
> #else
> #nonstandard /* is this group successfully skipped? */
> #endif

I'll just add another question to this. In the C99 Standard (not in the
first draft, and not in the final draft) a new line popped up in the
description of "preprocessor groups". Up to C99 Final Draft a "group part"
is one of the following:

1. pp-tokens followed by newline
2. if-section (#if/#elif/#endif etc. )
3. control-line (#include, #line, #pragma etc. )

In the C99 Standard, there is an additional choice

4. # <non-directive>

The line
#nonstandard
obviously matches # <non-directive> in the C99 Standard. Apart from this
entry in the syntax in Chapter 6.10, <non-directive> is not mentioned
anywhere. So what does that mean? Is it undefined behaviour since its
behaviour is not defined, or is it a syntax error that should be
diagnosed, or is it passed on to the next translation phase unchanged like
"pp-tokens followed by newline"? If that is the case, is the following
fragment ok:

#define nothing(x) /* nothing */
nothing (
#nonstandard
)

Clive D. W. Feather

May 1, 2001, 3:47:24 AM
In article <UBsn6.130511$My5.5...@news.hananet.net>, Jun Woong
<myco...@hanmail.net> writes

>C90 6.8.1:
>"... skipped: directives are processed only through the name that
> determines the directive in order to keep track of the level of
> nested conditionals; ..."
>
>#define MU_SO_YOU
>#ifdef MU_SO_YOU
> ... /* this group behaves in s.c. manner */
>#else
> #nonstandard /* is this group successfully skipped? */
>#endif

The syntax was altered in C99 to make it clear. #nonstandard is a
"non-directive", so it doesn't take part in preprocessing (just like
text lines don't); it just gets skipped.

We also tidied up:

[#4] When in a group that is skipped (6.10.1), the directive
syntax is relaxed to allow any sequence of preprocessing
tokens to occur between the directive name and the following
new-line character.

--
Clive D.W. Feather, writing for himself | Home: <cl...@davros.org>
Tel: +44 20 8371 1138 (work) | Web: <http://www.davros.org>
Fax: +44 20 8371 4037 (D-fax) | Work: <cl...@demon.net>
Written on my laptop; please observe the Reply-To address

Clive D. W. Feather

May 1, 2001, 4:57:24 AM
In article
<christian.bau-0...@christian-mac.isltd.insignia.com>,
Christian Bau <christ...@isltd.insignia.com> writes

>In the C99 Standard, there is an additional choice
>
> 4. # <non-directive>
>
>The line
> #nonstandard
>obviously matches # <non-directive> in the C99 Standard. Apart from this
>entry in the syntax in Chapter 6.10, <non-directive> is not mentioned
>anywhere. So what does that mean? Is it undefined behaviour since its
>behaviour is not defined, or is it a syntax error that should be
>diagnosed, or is it passed on to the next translation phase unchanged like
>"pp-tokens followed by newline"?

It is passed on to the next translation phase unchanged (assuming it
survives TP4). At that point it will almost certainly produce a syntax
error.

> If that is the case, is the following
>fragment ok:
>
> #define nothing(x) /* nothing */
> nothing (
> #nonstandard
> )

It depends. From 6.10.3#10:

Within the sequence of preprocessing tokens making up an
invocation of a function-like macro, new-line is considered
a normal white-space character.

But from #11:

If there are sequences of preprocessing tokens within the list
of arguments that would otherwise act as preprocessing
directives, the behavior is undefined.

So:

nothing (
#include <stdio.h>
)

is forbidden. But I would argue that a non-directive is allowed in this
situation. To argue otherwise would be to say that a non-directive would
"act as a preprocessing directive". On the third hand, I don't think
this was intended when we were cleaning up this bit.

Nick Maclaren

May 1, 2001, 5:28:17 AM

In article <fb7F1lVM...@romana.davros.org>,

"Clive D. W. Feather" <cl...@on-the-train.demon.co.uk> writes:
|> In article <UBsn6.130511$My5.5...@news.hananet.net>, Jun Woong
|> <myco...@hanmail.net> writes
|> >C90 6.8.1:
|> >"... skipped: directives are processed only through the name that
|> > determines the directive in order to keep track of the level of
|> > nested conditionals; ..."
|> >
|> >#define MU_SO_YOU
|> >#ifdef MU_SO_YOU
|> > ... /* this group behaves in s.c. manner */
|> >#else
|> > #nonstandrad /* is this group successfully skipped? */ [sic]
|> >#endif
|>
|> The syntax was altered in C99 to make it clear. #nonstandard is a
|> "non-directive", so it doesn't take part in preprocessing (just like
|> text lines don't); it just gets skipped.

Yes. It was sufficiently unclear in C90 that several vendors
misunderstood the requirement.

|> We also tidied up:
|>
|> [#4] When in a group that is skipped (6.10.1), the directive
|> syntax is relaxed to allow any sequence of preprocessing
|> tokens to occur between the directive name and the following
|> new-line character.

This is better, but still not entirely clear.

#if 0
"
#endif

Now, a text-line is 'pp-tokens-opt new-line', but is the '"' a
valid pp-token? A pp-token is one of various things, including a
string-literal and:

each non-white-space character that cannot be one of the above

Now, if you interpret this as meaning only characters that by their
very nature cannot be any part of one of the other types of token,
this is a syntax error in a string literal, and therefore must be
diagnosed.

If you interpret it as meaning that it includes all characters
that are not part of a valid occurrence of one of the other types
of token, then it is permitted.

However, UNLIKE with the C90 confusion, I think that this is
pretty harmless. I.e. I cannot think of a reasonable construction
that would be rejected by a vendor misunderstanding the standard.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679

Clive D. W. Feather

May 1, 2001, 3:18:02 PM
In article <9clvjh$6jc$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren
<nm...@cus.cam.ac.uk> writes

> each non-white-space character that cannot be one of the above
>
>Now, if you interpret this as meaning only characters that by their
>very nature cannot be any part of one of the other types of token,
>this is a syntax error in a string literal, and therefore must be
>diagnosed.
>
>If you interpret it as meaning that it includes all characters
>that are not part of a valid occurrence of one of the other types
>of token, then it is permitted.

I'd never imagined it could be interpreted other than as the second.
That is, a lone ' or " is a pp-token. I'm sure I've seen support for
that viewpoint somewhere (though I forget where). In particular:
- Why is " a syntax error in a string literal, while <stdio.h is not a
syntax error in a header-name ?
- I don't think that the split into pp-tokens (in TP3) is allowed to
fail. That forces the second interpretation.

Nick Maclaren

May 1, 2001, 4:41:39 PM
In article <IrCApnMq...@romana.davros.org>,

Clive D. W. Feather <cl...@davros.org> wrote:
>In article <9clvjh$6jc$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren
><nm...@cus.cam.ac.uk> writes
>> each non-white-space character that cannot be one of the above
>>
>>Now, if you interpret this as meaning only characters that by their
>>very nature cannot be any part of one of the other types of token,
>>this is a syntax error in a string literal, and therefore must be
>>diagnosed.
>>
>>If you interpret it as meaning that it includes all characters
>>that are not part of a valid occurrence of one of the other types
>>of token, then it is permitted.
>
>I'd never imagined it could be interpreted other than as the second.
>That is, a lone ' or " is a pp-token. I'm sure I've seen support for
>that viewpoint somewhere (though I forget where). In particular:
>- Why is " a syntax error in a string literal, while <stdio.h is not a
> syntax error in a header-name ?
>- I don't think that the split into pp-tokens (in TP3) is allowed to
> fail. That forces the second interpretation.

All that may be true, but I don't think that the standard says so :-(

I have certainly seen the first interpretation, but that was in the
context of the C90 standard. No, I can't tell you if <stdio.h would
also have been rejected!

I don't frankly think that it matters, UNLIKE the ambiguity in the
C90 standard. People just don't include nonsense like isolated
double quotes and <stdio.h, though there is no particular reason
why a compiler couldn't allow them as an extended syntax.

Zack Weinberg

May 1, 2001, 7:22:52 PM
Nick Maclaren <nm...@cus.cam.ac.uk> writes:
>In article <IrCApnMq...@romana.davros.org>,
>Clive D. W. Feather <cl...@davros.org> wrote:
>>In article <9clvjh$6jc$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren
>><nm...@cus.cam.ac.uk> writes
>>> each non-white-space character that cannot be one of the above
>>>
>>>Now, if you interpret this as meaning only characters that by their
>>>very nature cannot be any part of one of the other types of token,
>>>this is a syntax error in a string literal, and therefore must be
>>>diagnosed.
>>>
>>>If you interpret it as meaning that it includes all characters
>>>that are not part of a valid occurrence of one of the other types
>>>of token, then it is permitted.
>>
>>I'd never imagined it could be interpreted other than as the second.
>>That is, a lone ' or " is a pp-token. I'm sure I've seen support for
>>that viewpoint somewhere (though I forget where). In particular:
>>- Why is " a syntax error in a string literal, while <stdio.h is not a
>> syntax error in a header-name ?
>>- I don't think that the split into pp-tokens (in TP3) is allowed to
>> fail. That forces the second interpretation.
>
>All that may be true, but I don't think that the standard says so :-(
>
>I have certainly seen the first interpretation, but that was in the
>context of the C90 standard. No, I can't tell you if <stdio.h would
>also have been rejected!

Data point: GCC has always taken the first interpretation. A stray
' or " will cause an error (modulo the stupid multi-line string constant
extension, which is going to be removed.) #include <stdio.h will also
give an error. It had never occurred to me that it could be the other
way.

>I don't frankly think that it matters, UNLIKE the ambiguity in the
>C90 standard. People just don't include nonsense like isolated
>double quotes and <stdio.h, though there is no particular reason
>why a compiler couldn't allow them as an extended syntax.

Actually, we do get the occasional complaint from people who have
written English text inside an #if 0 as a creative sort of comment.
Inevitably they write an apostrophe and are very surprised when the
compiler thinks it's an incomplete character constant.

No one's ever argued with "don't do that, use /* */", though.

Another place this comes up is in the argument of #error:

#if 0
#error This program doesn't work...
#endif

-> unclosed character constant, you lose. The manual therefore suggests
that you put your #error messages inside a string literal; again, no
one's argued with that.

zw

Thomas Pornin

May 2, 2001, 6:23:01 AM
According to Zack Weinberg <za...@stanford.edu>:

> Data point: GCC has always taken the first interpretation. A stray
> ' or " will cause an error (modulo the stupid multi-line string constant
> extension, which is going to be removed.) #include <stdio.h will also
> give an error. It had never occurred to me that it could be the other
> way.

Do not Fear, For the Standard is One and Great.

6.4/3 says:
<< The categories of preprocessing tokens are: header names, identifiers,
preprocessing numbers, character constants, string literals, punctuators,
and single non-white-space characters that do not lexically match the
other preprocessing token categories. If a ' or a " matches the last
category, the behaviour is undefined. >>

So I believe that a lone ' or ", even in code compiled out through a
#if, triggers the dreadful "undefined behaviour".


As for the:

#include <stdio.h

It does not match the form:

#include <h-char-sequence> new-line

because the h-char-sequence may not contain any new-line character.
So it is an instance of the:

#include pp-tokens new-line

form. So the tricky point is:

#define h h>
#include <stdio.h

which should imply macro-replacement, and the resulting combination of
the following pp-tokens: < stdio . h >

Yet the way those tokens are merged into one valid header name is
implementation-defined (6.10.2/4), so I guess that a compiler may refuse
this construction as long as it documents it. Quite incidentally, my own
preprocessor rejects it (whereas the cpp from gcc-2.95.2 accepts it as
<stdio.h>), but it should accept it in about half an hour, for this
construction has tremendous hacking value.


--Thomas Pornin

Jun Woong

May 3, 2001, 9:56:00 AM
In article <9cnggc$si8$1...@nntp.Stanford.EDU>, Zack Weinberg says...
[...]

>
>Data point: GCC has always taken the first interpretation. A stray
>' or " will cause an error (modulo the stupid multi-line string constant
>extension, which is going to be removed.)

As Thomas Pornin said, it can cause undefined behavior.

>#include <stdio.h will also
>give an error. It had never occurred to me that it could be the other
>way.
>

Try this with gcc.

#if 0
#include <stdio.h
#endif

Jun Woong

May 3, 2001, 10:10:46 AM
In article <saFZVuW0...@romana.davros.org>, Clive D. W. Feather says...

>
>In article
><christian.bau-0...@christian-mac.isltd.insignia.com>,
>Christian Bau <christ...@isltd.insignia.com> writes
>>In the C99 Standard, there is an additional choice
>>
>> 4. # <non-directive>
>>
>>The line
>> #nonstandard
>>obviously matches # <non-directive> in the C99 Standard. Apart from this
>>entry in the syntax in Chapter 6.10, <non-directive> is not mentioned
>>anywhere. So what does that mean? Is it undefined behaviour since its
>>behaviour is not defined, or is it a syntax error that should be
>>diagnosed, or is it passed on to the next translation phase unchanged like
>>"pp-tokens followed by newline"?
>
>It is passed on to the next translation phase unchanged (assuming it
>survives TP4). At that point it will almost certainly produce a syntax
>error.

Can a C99-conforming implementation define an additional preprocessing
directive "#nonstandard" as an extension?

>
>> If that is the case, is the following
>>fragment ok:
>>
>> #define nothing(x) /* nothing */
>> nothing (
>> #nonstandard
>> )
>

[...]

> But I would argue that a non-directive is allowed in this
>situation. To argue otherwise would be to say that a non-directive would
>"act as a preprocessing directive". On the third hand, I don't think
>this was intended when we were cleaning up this bit.

If the answer to my question above is "yes", then in an implementation
which defines the "#nonstandard" pp-directive, is #nonstandard scanned
as a pp-directive rather than as a non-directive?

If so, I think that the above construction (nothing (...)) can't be
guaranteed to be allowed in all implementations, because any
conforming implementation may define it as a pp-directive.

Nick Maclaren

May 3, 2001, 10:24:42 AM

In article <9con65$14c0$1...@nef.ens.fr>,

por...@bolet.ens.fr (Thomas Pornin) writes:
|>
|> Do not Fear, For the Standard is One and Great.
|>
|> 6.4/3 says:
|> << The categories of preprocessing tokens are: header names, identifiers,
|> preprocessing numbers, character constants, string literals, punctuators,
|> and single non-white-space characters that do not lexically match the
|> other preprocessing token categories. If a ' or a " matches the last
|> category, the behaviour is undefined. >>
|>
|> So I believe that a lone ' or ", even in code compiled out through a
|> #if, triggers the dreadful "undefined behaviour".

I had forgotten that. Yes, that does. That is most interesting,
because it eliminates all of the characters that are affected by
the ambiguity that I pointed out. The ambiguity remains, but only
as a trap for some later language extension ....

Zack Weinberg

May 3, 2001, 12:47:24 PM
Nick Maclaren <nm...@cus.cam.ac.uk> writes:
>
>In article <9con65$14c0$1...@nef.ens.fr>,
>por...@bolet.ens.fr (Thomas Pornin) writes:
>|>
>|> Do not Fear, For the Standard is One and Great.
>|>
>|> 6.4/3 says:
>|> << The categories of preprocessing tokens are: header names, identifiers,
>|> preprocessing numbers, character constants, string literals, punctuators,
>|> and single non-white-space characters that do not lexically match the
>|> other preprocessing token categories. If a ' or a " matches the last
>|> category, the behaviour is undefined. >>
>|>
>|> So I believe that a lone ' or ", even in code compiled out through a
>|> #if, triggers the dreadful "undefined behaviour".
>
>I had forgotten that. Yes, that does. That is most interesting,
>because it eliminates all of the characters that are affected by
>the ambiguity that I pointed out. The ambiguity remains, but only
>as a trap for some later language extension ....

This strikes me as an excellent example of there being too much of the
syntax buried in the semantics.

zw

Zack Weinberg

May 3, 2001, 12:48:10 PM
Jun Woong <myco...@hanmail.net> writes:
>In article <9cnggc$si8$1...@nntp.Stanford.EDU>, Zack Weinberg says...
>[...]
>>
>>Data point: GCC has always taken the first interpretation. A stray
>>' or " will cause an error (modulo the stupid multi-line string constant
>>extension, which is going to be removed.)
>
>As Thomas Pornin said, it can cause undefined behavior.
>
>>#include <stdio.h will also
>>give an error. It had never occurred to me that it could be the other
>>way.
>>
>
>Try this with gcc.
>
>#if 0
>#include <stdio.h
>#endif

test.c:2:10: missing terminating > character

(with 3.0 prerelease. Older gcc seems to have silently ignored it.)

zw

Jun Woong

May 3, 2001, 12:56:02 PM
In article <9cs24a$cl9$1...@nntp.Stanford.EDU>, Zack Weinberg says...

I think gcc (3.0 pre) is broken in this regard.
If my interpretation of C90 and C99 is correct, the construction above
must be skipped silently under both standards.

Nick Maclaren

May 3, 2001, 1:03:17 PM

In article <C4gI6.8564$SZ5.7...@www.newsranger.com>,

Jun Woong<myco...@hanmail.net> writes:
|> In article <9cs24a$cl9$1...@nntp.Stanford.EDU>, Zack Weinberg says...
|> >>
|> >>Try this with gcc.
|> >>
|> >>#if 0
|> >>#include <stdio.h
|> >>#endif
|> >
|> >test.c:2:10: missing terminating > character
|> >
|> >(with 3.0 prerelease. Older gcc seems to have silently ignored it.)
|>
|> I think gcc (3.0 pre) broken in this regard.
|> If my interpretation of C90 and C99 is correct, the construction above
|> must be skipped silently in both standards.

No. A compiler is allowed to warn on legal constructions.

Jun Woong

May 3, 2001, 1:11:04 PM
In article <9cs30l$fb2$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren says...

>
>
>In article <C4gI6.8564$SZ5.7...@www.newsranger.com>,
>Jun Woong<myco...@hanmail.net> writes:
>|> In article <9cs24a$cl9$1...@nntp.Stanford.EDU>, Zack Weinberg says...
>|> >>
>|> >>Try this with gcc.
>|> >>
>|> >>#if 0
>|> >>#include <stdio.h
>|> >>#endif
>|> >
>|> >test.c:2:10: missing terminating > character
>|> >
>|> >(with 3.0 prerelease. Older gcc seems to have silently ignored it.)
>|>
>|> I think gcc (3.0 pre) broken in this regard.
>|> If my interpretation of C90 and C99 is correct, the construction above
>|> must be skipped silently in both standards.
>
>No. A compiler is allowed to warn on legal constructions.
>

Yes, I should have removed "silently".

Zack Weinberg

May 3, 2001, 1:11:04 PM
Jun Woong <myco...@hanmail.net> writes:
>In article <9cs24a$cl9$1...@nntp.Stanford.EDU>, Zack Weinberg says...
>>Jun Woong <myco...@hanmail.net> writes:
>>>In article <9cnggc$si8$1...@nntp.Stanford.EDU>, Zack Weinberg says...
>>>[...]
>>>>
>>>>Data point: GCC has always taken the first interpretation. A stray
>>>>' or " will cause an error (modulo the stupid multi-line string constant
>>>>extension, which is going to be removed.)
>>>
>>>As Thomas Pornin said, it can cause undefined behavior.
>>>
>>>>#include <stdio.h will also
>>>>give an error. It had never occurred to me that it could be the other
>>>>way.
>>>>
>>>
>>>Try this with gcc.
>>>
>>>#if 0
>>>#include <stdio.h
>>>#endif
>>
>>test.c:2:10: missing terminating > character
>>
>>(with 3.0 prerelease. Older gcc seems to have silently ignored it.)
>>
>
>I think gcc (3.0 pre) broken in this regard.
>If my interpretation of C90 and C99 is correct, the construction above
>must be skipped silently in both standards.

Your original argument has expired from my news server. Can you explain
why you think this? I do not see any reason why the syntax of #include
should be relaxed inside #if 0.

zw

Zack Weinberg

May 3, 2001, 1:11:30 PM
Nick Maclaren <nm...@cus.cam.ac.uk> writes:
>
>In article <C4gI6.8564$SZ5.7...@www.newsranger.com>,
>Jun Woong<myco...@hanmail.net> writes:
>|> In article <9cs24a$cl9$1...@nntp.Stanford.EDU>, Zack Weinberg says...
>|> >>
>|> >>Try this with gcc.
>|> >>
>|> >>#if 0
>|> >>#include <stdio.h
>|> >>#endif
>|> >
>|> >test.c:2:10: missing terminating > character
>|> >
>|> >(with 3.0 prerelease. Older gcc seems to have silently ignored it.)
>|>
>|> I think gcc (3.0 pre) broken in this regard.
>|> If my interpretation of C90 and C99 is correct, the construction above
>|> must be skipped silently in both standards.
>
>No. A compiler is allowed to warn on legal constructions.

That's not a warning, the code is rejected.

zw

Nick Maclaren

May 3, 2001, 2:11:05 PM

In article <9cs3g2$d85$1...@nntp.Stanford.EDU>,
Zack Weinberg <za...@stanford.edu> writes:

|> Nick Maclaren <nm...@cus.cam.ac.uk> writes:
|> >
|> >|> >>Try this with gcc.
|> >|> >>
|> >|> >>#if 0
|> >|> >>#include <stdio.h
|> >|> >>#endif
|> >|> >
|> >|> >test.c:2:10: missing terminating > character
|> >|> >
|> >|> >(with 3.0 prerelease. Older gcc seems to have silently ignored it.)
|> >|>
|> >|> I think gcc (3.0 pre) broken in this regard.
|> >|> If my interpretation of C90 and C99 is correct, the construction above
|> >|> must be skipped silently in both standards.
|> >
|> >No. A compiler is allowed to warn on legal constructions.
|>
|> That's not a warning, the code is rejected.

Oh, lovely! Yet another syntactic ambiguity of a class that
I have pointed out before :-(

The relevant definitions are:

group-part:    if-section
               control-line
               text-line
               # non-directive

control-line: # include pp-tokens new-line
. . .

text-line: pp-tokens-opt new-line

non-directive: pp-tokens new-line

Now, '#include <stdio.h' matches ALL of control-line, text-line
and '# non-directive'. Much later, the first match then fails,
because the text does not macro expand into a correct header name.
So far, so good. But what semantics does the language we are
using to specify the syntax have?

Operator precedence clearly isn't relevant here.

'Greedy matching' as in tokenisation isn't, either.

The 'all rules on the text, sequentially with no backtracking'
as in identifier recognition and parameter adjustment might
seem to support gcc, but it isn't entirely clear.

If we use the rule that if there is a single valid parsing,
then that is used and all invalid ones are ignored (as in the
widespread language recognition theory), then gcc is wrong.

gcc seems to use the first match in a section of specified
alternatives. And who am I to say that it is wrong? But I don't
think that that model is used anywhere else in the standard.

Zack Weinberg

May 3, 2001, 7:51:05 PM

GCC's logic is somewhat different. I'm prepared to change it if it
turns out to be wrong - within limits. The lexer irrevocably commits
to tokenizing a header-name when it encounters a < or " as the first
nonwhitespace character after # include. If it reaches the end of the
line without finding the matching > or ", an error issues. It does
not attempt to go back and re-parse as "# include pp-tokens new-line,
which does not match one of the preceding two forms" (6.10.2p2).
(error is defined as a diagnostic which will lead to rejection of the
translation unit.)

This is the case even inside a failed conditional group.

I can see a definite case that inside a failed conditional group,
arbitrary pp-tokens should be acceptable after # include even if they
initially appear to be, and then fail to match, a header-name. That would be
a straightforward change (most simply, the error could downgrade
to a warning or be suppressed entirely while inside a failed conditional
group, since the directive won't execute anyway).

A more disturbing thought is that implementations might be required to
accept e.g.

#define h h>
#include <stdio.h

i.e. that one is required to re-tokenize most of the line if initial
matching as a header-name fails. This would require being able to back
up an arbitrary distance. Implementing this in GCC would take major
architectural changes.

I don't mean to argue for an interpretation solely on the basis of
implementor convenience; however, it is a factor which the standard
considers elsewhere. As a matter of *user* expectations, it seems to
me that being stringent about what can appear after #include is the
least surprising thing.

>So far, so good. But what semantics does the language we are
>using to specify the syntax have?
>
> Operator precedence clearly isn't relevant here.
>
> 'Greedy matching' as in tokenisation isn't, either.

Actually it may be. See 6.4.7 (header names). They aren't mentioned
anywhere in 6.10 or 6.10.2, which is probably a mistake...

zw

Nick Maclaren

May 4, 2001, 3:37:47 AM
In article <9csqt9$omd$1...@nntp.Stanford.EDU>,

Zack Weinberg <za...@stanford.edu> wrote:
>
>GCC's logic is somewhat different. I'm prepared to change it if it
>turns out to be wrong - within limits. The lexer irrevocably commits
>to tokenizing a header-name when it encounters a < or " as the first
>nonwhitespace character after # include. If it reaches the end of the
>line without finding the matching > or ", an error issues. It does
>not attempt to go back and re-parse as "# include pp-tokens new-line,
>which does not match one of the preceding two forms" (6.10.2p2).
>(error is defined as a diagnostic which will lead to rejection of the
>translation unit.)

I suggest waiting until someone finds a good reason to change it!

The only point at issue here is precisely what the standard permits
and requires, and it is clear that the situation is so messy that
no programmer in his right mind is going to rely on any particular
behaviour. And there doesn't seem to be much of a requirement for
the construction.

>I don't mean to argue for an interpretation solely on the basis of
>implementor convenience; however, it is a factor which the standard
>considers elsewhere. As a matter of *user* expectations, it seems to
>me that being stringent about what can appear after #include is the
>least surprising thing.

I am certainly not dissenting.

>>So far, so good. But what semantics does the language we are
>>using to specify the syntax have?
>>
>> Operator precedence clearly isn't relevant here.
>>
>> 'Greedy matching' as in tokenisation isn't, either.
>
>Actually it may be. See 6.4.7 (header names). They aren't mentioned
>anywhere in 6.10 or 6.10.2, which is probably a mistake...

Oh, yes, ONCE you start considering a header name, THEN it is relevant.
But that is not the issue that I was thinking of. The point is that
group-part has four alternatives, of which the tokens '#' and 'include'
match three. Greedy matching doesn't help to distinguish BETWEEN
alternatives, but only how far to proceed once you have committed.

Jun Woong

May 4, 2001, 10:33:33 AM
In article <9cs3f8$d75$1...@nntp.Stanford.EDU>, Zack Weinberg says...
>Jun Woong <myco...@hanmail.net> writes:
[...]

>>
>>I think gcc (3.0 pre) broken in this regard.
>>If my interpretation of C90 and C99 is correct, the construction above
>>must be skipped silently in both standards.
>
>Your original argument has expired from my news server. Can you explain
>why you think this? I do not see any reason why the syntax of #include
>should be relaxed inside #if 0.
>

I think the given grammar for preprocessing is not suitable for
parsing per se; however, we can interpret the behavior of the pp-parser
unambiguously from the Standard's sections related to preprocessing
(at least in this case).

#if 0
#include <stdio.h
#endif

The fact that the group containing the probably-invalid #include directive
must be skipped (with or without a warning) is clear in both C90 and C99:

The line is split into valid pp-tokens in TP3: # include < stdio . h
Thus, 6.10.2p4 applies:
"A preprocessing directive of the form
# include pp-tokens new-line
(THAT DOES NOT MATCH ONE OF THE TWO PREVIOUS FORMS) is permitted. The
preprocessing tokens after include in the directive are processed
just as in normal text." [emphasis mine]

Then, the following wordings apply:

"When in a group that is skipped (6.10.1), the directive syntax is
relaxed to allow any sequence of preprocessing tokens to occur
between the directive name and the following new-line character."

"If it evaluates to false (zero), the group that it controls is
skipped: directives are processed only through the name that
determines the directive in order to keep track of the level of
nested conditionals; the rest of the directives' preprocessing tokens
are ignored, as are the other preprocessing tokens in the group."

The result in C90 does not differ from that in C99, though C90 does not
have the second wording. Allowing pp-tokens subject to macro expansion in
#include directives might require implementors to re-parse them into a
header name.

Anyway, I believe all conforming implementations must skip the group
above successfully. Of course, a diagnostic message is another matter.

Nick Maclaren

May 4, 2001, 11:16:24 AM

In article <15zI6.1095$vg1....@www.newsranger.com>,

Jun Woong<myco...@hanmail.net> writes:
|>
|> The fact that the group with probably invalid #include directive must
|> be skipped (with or without warning) is clear in both C90 and C99:

Er, no. It may be true, but it is not clear.

|> The line is split into valid pp-tokens in TP3: # include < stdio . h
|> thus, 6.10.2p4:
|> "A preprocessing directive of the form
|> # include pp-tokens new-line
|> (THAT DOES NOT MATCH ONE OF THE TWO PREVIOUS FORMS) is permitted. The
|> preprocessing tokens after include in the directive are processed
|> just as in normal text." [emphasis mine]

Why did you omit the rest of the paragraph:

        (that does not match one of the two previous forms) is
        permitted. The preprocessing tokens after include in the
        directive are processed just as in normal text. (Each
        identifier currently defined as a macro name is replaced by
        its replacement list of preprocessing tokens.) THE
        DIRECTIVE RESULTING AFTER ALL REPLACEMENTS SHALL MATCH ONE
        OF THE TWO PREVIOUS FORMS. The method by which a
        sequence of preprocessing tokens between a < and a >
        preprocessing token pair or a pair of " characters is
        combined into a single header name preprocessing token is
        implementation-defined. [Emphasis mine]

Jun Woong

May 4, 2001, 11:55:57 AM
In article <9cuh48$pf3$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren says...

>In article <15zI6.1095$vg1....@www.newsranger.com>,
>Jun Woong<myco...@hanmail.net> writes:
>|>
>|> The fact that the group with probably invalid #include directive must
>|> be skipped (with or without warning) is clear in both C90 and C99:
>
>Er, no. It may be true, but it is not clear.

It's clear at least in this case.
Could you point out a case where you think the Standard is unclear?

>
>|> The line is split into valid pp-tokens in TP3: # include < stdio . h
>|> thus, 6.10.2p4:
>|> "A preprocessing directive of the form
>|> # include pp-tokens new-line
>|> (THAT DOES NOT MATCH ONE OF THE TWO PREVIOUS FORMS) is permitted. The
>|> preprocessing tokens after include in the directive are processed
>|> just as in normal text." [emphasis mine]
>
>Why did you omit the rest of the paragraph:
>
> (that does not match one of the two previous forms) is
> permitted. The preprocessing tokens after include in the
> directive are processed just as in normal text. (Each
> identifier currently defined as a macro name is replaced by
> its replacement list of preprocessing tokens.) THE
> DIRECTIVE RESULTING AFTER ALL REPLACEMENTS SHALL MATCH ONE
> OF THE TWO PREVIOUS FORMS. The method by which a

When pp-directives such as #include <stdio.h appear in a skipped
group, they are interpreted according to 6.10.1 (conditional
inclusion), I think.

[...]

Zack Weinberg

May 4, 2001, 12:14:34 PM
Jun Woong <myco...@hanmail.net> writes:
>In article <9cs3f8$d75$1...@nntp.Stanford.EDU>, Zack Weinberg says...
>>Jun Woong <myco...@hanmail.net> writes:
>[...]
>>>
>>>I think gcc (3.0 pre) broken in this regard.
>>>If my interpretation of C90 and C99 is correct, the construction above
>>>must be skipped silently in both standards.
>>
>>Your original argument has expired from my news server. Can you explain
>>why you think this? I do not see any reason why the syntax of #include
>>should be relaxed inside #if 0.
>>
>
>I think, the given grammar for preprocessing is not suitable for
>parsing per se, however we can interpret the behavior of pp-parser
>unambiguously with the Standard's sections related to preprocessing
>(at least in this case).
>
>#if 0
>#include <stdio.h
>#endif
>
>The fact that the group with probably invalid #include directive must
>be skipped (with or without warning) is clear in both C90 and C99:
>
>The line is split into valid pp-tokens in TP3: # include < stdio . h

That is not the tokenization chosen by GCC. As I said upthread, this
line is tokenized as {#} {include} {<stdio.h} where the last token is
an invalid header-name (6.4.7). The invalidity of that third token then
causes an error.

The rest of your argument relates to choice of higher-level grammar rules,
not tokenization, and is therefore moot.

Now, if you can find evidence that that line must be tokenized as
{#} {include} {<} {stdio} {.} {h}
then we can debate that. Keep in mind that, since tokenization is
TP3, it is not aware of whether each individual line is skipped or not
(consider the hypothetical many-pass implementation that completes each
phase sequentially before proceeding to the next).

zw

Jun Woong

May 4, 2001, 1:29:07 PM
In article <9cukha$g8j$1...@nntp.Stanford.EDU>, Zack Weinberg says...
[...]
>

>That is not the tokenization chosen by GCC. As I said upthread, this
>line is tokenized as {#} {include} {<stdio.h} where the last token is
>an invalid header-name (6.4.7). The invalidity of that third token then
>causes an error.
>
>The rest of your argument relates to choice of higher-level grammar rules,
>not tokenization, and is therefore moot.

Although it is related to the higher-level grammar, it (specifically,
"... (that does not match one of the two previous forms) is permitted")
implies that <stdio.h must not be tokenized into an invalid header
name, even though that imposes more work on implementors, I think.

I agree with Clive's interpretation: The Standard does not permit a
conforming implementation to fail tokenization in TP3. Note that lone
" or ' causes undefined behavior AFTER tokenization into the last
category of pp-tokens.

Thomas Pornin

May 5, 2001, 8:39:33 AM
According to Zack Weinberg <za...@stanford.edu>:
> A more disturbing thought is that implementations might be required to
> accept e.g.
>
> #define h h>
> #include <stdio.h
>
> i.e. that one is required to re-tokenize most of the line if initial
> matching as a header-name fails. This would require being able to back
> up an arbitrary distance. Implementing this in GCC would take major
> architectural changes.

I did that for my own preprocessor, and it turned out to be quite
easy. I believe the lexer aggregates characters as it reads them, so
the string `<stdio.h' must be somewhere in memory at the point the
lexer reaches the end of the line. Besides, I am pretty sure the GCC
preprocessor has a way to lex a string "out of band", to handle the
`##' operator for instance. So this should be doable.

Disclaimer: I have not looked at GCC code.


Yet I don't think it could be required from the implementation. The way
the tokens are merged to get a header name is implementation-defined.
All that is needed is to put some sort of disclaimer in the GCC
documentation.
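The "out of band" re-lexing Thomas mentions is exactly what the ## operator already forces on any preprocessor: the pasted characters must be re-scanned as a single new token. A minimal sketch (the macro and identifier names are mine):

```c
/* a ## b pastes two preprocessing tokens; the result is re-lexed and
   must itself form one valid preprocessing token (here, the identifier
   foobar). The extra PASTE_ level lets the arguments expand first. */
#define PASTE_(a, b) a ## b
#define PASTE(a, b)  PASTE_(a, b)

int foobar = 42;

int pasted(void) { return PASTE(foo, bar); }  /* expands to: return foobar; */
```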


--Thomas Pornin

Jun Woong

May 5, 2001, 9:41:44 AM
In article <9d0sa5$1l37$1...@nef.ens.fr>, Thomas Pornin says...

Although it's up to the implementation whether tokens resulting from
macro expansion are merged into a header name, tokenization must not
fail in TP3 and macro expansion must be performed. Besides, after
macro expansion, the tokens (< stdio . h >) must not be considered
undefined behavior.

Thomas Pornin

May 8, 2001, 3:24:26 PM
According to Jun Woong <myco...@hanmail.net>:

> I agree with Clive's interpretation: The Standard does not permit a
> conforming implementation to fail tokenization in TP3. Note that lone
> " or ' causes undefined behavior AFTER tokenization into the last
> category of pp-tokens.

There is no standard way to distinguish the different phases of
processing of the source file; as long as there is undefined behaviour,
the implementation could just as well retroactively suppress what
looked like a diagnostic emitted earlier. The standard does not require
undefined behaviour to comply with the laws of physics. I would say
that not only is your point moot, but the wording of the standard about
a lone " or ' was introduced explicitly to allow tokenization to fail
on those characters.


Yet I would say that on:

#include <stdio.h

ideally, a good preprocessor should tokenize and macro-replace. Then
it has no obligation to merge the tokens {<} {stdio} {.} {h} {>} into
something which would have the same effect as a standard include of
<stdio.h>. You know, the classical song: as long as a diagnostic is
emitted, whatever a diagnostic may be... In that matter, I would say
that the standard achieves the astonishing goal of stating something
that was impossible to avoid, and yet making it absolutely useless, by
all possible meanings of usefulness.

A very good preprocessor would merge the tokens, find stdio.h, and
emit a "warning" (the concept of "warning" is not in the standard, and
yet it is useful; maybe there is a connection there).


--Thomas Pornin

Jun Woong

May 9, 2001, 7:26:26 AM
In article <9d9h5a$q45$1...@nef.ens.fr>, Thomas Pornin says...

>According to Jun Woong <myco...@hanmail.net>:
>> I agree with Clive's interpretation: The Standard does not permit a
>> conforming implementation to fail tokenization in TP3. Note that lone
>> " or ' causes undefined behavior AFTER tokenization into the last
>> category of pp-tokens.
>
>There is no standard way to distinguish the different phases of
>processing of the source file; as long as there is undefined behaviour,
>the implementation could as well retroactively suppress what could have
>looked like a diagnostic emitted before. The standard does not require
>undefined behaviour to comply to the laws of physics. I would say that
>not only your point is moot, but the wording of the standard about lone
>" or ' was introduced explicitely to allow tokenization to fail on those
>characters.
>

You seem to mean that an implementation can notice that a ' or " is
unmatched without tokenization, which is impossible in the sense of
the Standard. The Standard says that undefined behavior occurs WHEN
the ' or " is tokenized into the last category of pp-tokens; namely, a
conforming implementation can't determine whether undefined behavior
occurs BEFORE the tokenization of the lone ' or " is done. Of course,
undefined behavior due to a lone ' or " may look to users like a
failure of tokenization, but in the conceptual model the Standard
describes, its tokenization must be performed successfully in TP3 in
order to detect whether it belongs to that category.

Clive D. W. Feather

May 16, 2001, 9:18:00 AM
In article <15zI6.1095$vg1....@www.newsranger.com>, Jun Woong
<myco...@hanmail.net> writes
>#if 0
>#include <stdio.h
>#endif
>
>The fact that the group with probably invalid #include directive must
>be skipped (with or without warning) is clear in both C90 and C99:
[...]

I agree.

The purpose of 6.10#2-4 is to clarify any ambiguities in the grammar and
explain how to actually use it.

>"When in a group that is skipped (6.10.1), the directive syntax is
>relaxed to allow any sequence of preprocessing tokens to occur
>between the directive name and the following new-line character."

The alternative would have been a syntax that has separate "skipped" and
"retained" cases for conditionals. That would have been far more
dangerous.

--
Clive D.W. Feather, writing for himself | Home: <cl...@davros.org>
Tel: +44 20 8371 1138 (work) | Web: <http://www.davros.org>
Fax: +44 20 8371 4037 (D-fax) | Work: <cl...@demon.net>
Written on my laptop; please observe the Reply-To address

Clive D. W. Feather

May 16, 2001, 9:06:54 AM
In article <9con65$14c0$1...@nef.ens.fr>, Thomas Pornin
<por...@bolet.ens.fr> writes

>6.4/3 says:
><< The categories of preprocessing tokens are: header names, identifiers,
>preprocessing numbers, character constants, string literals, punctuators,
>and single non-white-space characters that do not lexically match the
>other preprocessing token categories. If a ' or a " matches the last
>category, the behaviour is undefined. >>

Oops, I missed that one.

>So I believe that a lone ' or ", even in code compiled out through a
>#if, triggers the dreadful "undefined behaviour".

Agreed.

>As for the:
>
>#include <stdio.h

>So it is an instance of the:
>
>#include pp-tokens new-line

Also agreed. So it tokenises as {<} {stdio} {.} {h}.

Clive D. W. Feather

May 16, 2001, 9:23:52 AM
In article <9cs6vp$j5r$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren
<nm...@cus.cam.ac.uk> writes

>|> >|> >>#if 0
>|> >|> >>#include <stdio.h
>|> >|> >>#endif

That is legitimate code.

>Oh, lovely! Yet another syntactic ambiguity of a class that
>I have pointed out before :-(
>
>The relevant definitions are:
>
>group-part: if-section
> control-line
> text-line
> # non-directive
>
>control-line: # include pp-tokens new-line
> . . .
>
>text-line: pp-tokens-opt new-line
>
>non-directive: pp-tokens new-line
>
>Now, '#include <stdio.h' matches ALL of control-line, text-line
>and '# non-directive'.

Nope: 6.10 also says:

[#3] A text line shall not begin with a # preprocessing
token. A non-directive shall not begin with any of the
directive names appearing in the syntax.

This is "description", meant to explain how syntactic ambiguities are to
be resolved. It can *only* be a control-line. Just as 6.4#4 explains how
the modified-maximal-munch rule resolves the ambiguity of that syntax,
or 6.7.5.3#11 explains how to resolve another ambiguity.

The alternative was to have grammar like:

text-line: new-line
non-hash pp-tokens-opt new-line

non-directive: non-directive-name pp-tokens-opt new-line

non-hash: [any pp-token except #]

non-directive-name: [any pp-token except for if, ifdef, ifndef, ...]

WG14 didn't think this was a good idea.

Clive D. W. Feather

May 16, 2001, 9:30:55 AM
In article <9cukha$g8j$1...@nntp.Stanford.EDU>, Zack Weinberg
<za...@stanford.edu> writes

>>The line is split into valid pp-tokens in TP3: # include < stdio . h
>
>That is not the tokenization chosen by GCC. As I said upthread, this
>line is tokenized as {#} {include} {<stdio.h} where the last token is
>an invalid header-name (6.4.7). The invalidity of that third token then
>causes an error.

Then GCC is plain and simply wrong. {<stdio.h} is not a pp-token, so GCC
*can't* legitimately tokenize it that way.

>Now, if you can find evidence that that line must be tokenized as
>{#} {include} {<} {stdio} {.} {h}
>then we can debate that.

That's a valid tokenization of that line. There is no other valid
tokenization. So what more evidence do you need ?

Clive D. W. Feather

May 16, 2001, 9:41:58 AM
In article <GFdI6.8328$SZ5.6...@www.newsranger.com>, Jun Woong
<myco...@hanmail.net> writes

>Can a C99-conforming implementation define an additional preprocessing
>directive "#nonstandard" as extension?

Yes, so long as that does not affect any strictly-conforming program.

>>> #define nothing(x) /* nothing */
>>> nothing (
>>> #nonstandard
>>> )

>> But I would argue that a non-directive is allowed in this
>>situation. To argue otherwise would be to say that a non-directive would
>>"act as a preprocessing directive". On the third hand, I don't think
>>this was intended when we were cleaning up this bit.
>
>If the answer to my question above is "yes", in the implementation
>which defines the "#nonstandard" pp-directive, is #nonstandard
>scanned as # <pp-directive> rather than as # <non-directive>?

It can't change meaning, since we know what effect it has on a system
that doesn't define it.

Clive D. W. Feather

May 16, 2001, 9:38:23 AM
In article <9d9h5a$q45$1...@nef.ens.fr>, Thomas Pornin

>There is no standard way to distinguish the different phases of
>processing of the source file; as long as there is undefined behaviour,
>the implementation could as well retroactively suppress what could have
>looked like a diagnostic emitted before.

No: a mandatory diagnostic overrides undefined behaviour.

Nick Maclaren

May 16, 2001, 1:30:13 PM

In article <IdJgUtJo...@romana.davros.org>,

"Clive D. W. Feather" <cl...@on-the-train.demon.co.uk> writes:
|> >
|> >Now, '#include <stdio.h' matches ALL of control-line, text-line
|> >and '# non-directive'.
|>
|> Nope: 6.10 also says:
|>
|> [#3] A text line shall not begin with a # preprocessing
|> token. A non-directive shall not begin with any of the
|> directive names appearing in the syntax.

I missed that! Yes, that fixes it.

Jun Woong

May 17, 2001, 3:24:50 AM
In article <ZyaUvlLm...@romana.davros.org>, Clive D. W. Feather says...

>
>In article <GFdI6.8328$SZ5.6...@www.newsranger.com>, Jun Woong
><myco...@hanmail.net> writes
>>Can a C99-conforming implementation define an additional preprocessing
>>directive "#nonstandard" as extension?
>
>Yes, so long as that does not affect any strictly-conforming program.
>
>>>> #define nothing(x) /* nothing */
>>>> nothing (
>>>> #nonstandard
>>>> )
>
>>> But I would argue that a non-directive is allowed in this
>>>situation. To argue otherwise would be to say that a non-directive would
>>>"act as a preprocessing directive". On the third hand, I don't think
>>>this was intended when we were cleaning up this bit.
>>
>>If the answer to my question above is "yes", in the implementation
>>which defines the "#nonstandard" pp-directive, is #nonstandard
>>scanned as # <pp-directive> rather than as # <non-directive>?
>
>It can't change meaning, since we know what effect it has on a system
>that doesn't define it.
>

Could you explain in full detail?
You seem to mean that even in C99, whether #nonstandard acts as a pp-
directive or not is undefined (or implementation-dependent) in the above
case, so it makes no difference whether a conforming implementation
recognizes "nonstandard" as a pp-directive or as a non-directive. Do I
understand your point correctly?

And, you said that "if #nonstandard survives in TP4, ..." in the previous
post, does this mean that a conforming implementation can define it
as a valid pp-token, recognize it, do some work, and remove it in TP4?

Thanks in advance...

James Kuyper Jr.

May 17, 2001, 9:44:34 AM

Not quite. A conforming implementation can recognise #nonstandard as a
pp-directive; but if it does, that directive cannot have any effect that
isn't allowed to an implementation by the standard. The directive could
change which implementation-defined choice is chosen, so long as it's a
choice which can change within a single translation unit. Whenever the
standard says that the behavior is undefined, #nonstandard could control
that behavior. It could turn an optimization on or off, as long as that
optimization is covered by the as-if rule; the output from a strictly
conforming program cannot depend upon whether or not that optimization
is turned on.
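The same constraint already governs the one extension hook the standard does provide: an unrecognized #pragma is ignored (C99 6.10.6), so it cannot change what a strictly conforming program outputs. A sketch (the pragma and function names are made up):

```c
/* A hypothetical vendor pragma. An implementation that does not
   recognize it must simply ignore it, so the function below behaves
   identically either way -- the same rule James describes for a
   hypothetical #nonstandard directive. */
#pragma hypothetical_vendor_tuning on

int still_conforming(void) { return 1; }
```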

Jun Woong

May 17, 2001, 1:07:51 PM
In article <3B03D5C2...@wizard.net>, James Kuyper Jr. says...
>Jun Woong wrote:
[...]

>> >>
>> >>If the answer to my question above is "yes", in the implementation
>> >>which defines the "#nonstandard" pp-directive, is #nonstandard
>> >>scanned as # <pp-directive> rather than as # <non-directive>?
>> >
>> >It can't change meaning, since we know what effect it has on a system
>> >that doesn't define it.
>> >
>>
>> Could you explain the full detail?
>> You seem to mean that even in C99, whether #nonstandard acts as a pp-
>> directive or not is undefined (or implementation-dependent) in the above
>> case, thus it does not change anything for a conforming implementation
>> to recognize "nonstandard" as pp-directive or as non-directive. Do I
>> understand your point correctly?
>
>Not quite. A conforming implementation can recognise #nonstandard as a
>pp-directive; but if it does, that directive cannot have any effect that
>isn't allowed to an implementation by the standard.

Thus, #nonstandard cannot have any non-conforming behavior (which affects
s.c.program's behavior) like interpreting the rest of the source file as
FORTRAN code. Right?

>The directive could
>change which implementation-defined choice is chosen, so long as it's a
>choice which can change within a single translation unit. Whenever the
>standard says that the behavior is undefined, #nonstandard could control
>that behavior.

You mean, #nonstandard itself is well-defined in C99 (as opposed to
C90), so a conforming implementation does not have the same latitude
as it would if #nonstandard caused undefined behavior?

>It could turn an optimization on or off, as long as that
>optimization is covered by the as-if rule; the output from a strictly
>conforming program cannot depend upon whether or not that optimization
>is turned on.

Therefore, if C99 says:

#define nothing(x)
nothing (
#nonstandard
)

results in no tokens at all (as I understand Clive's wording, that is
C99's intent), then a conforming implementation that recognizes
it as a valid pp-directive and gives it some meaning must produce the
same result and can add *only* extension behavior
which does not affect any s.c. program's behavior (output).

Is this correct, now? :)

James Kuyper Jr.

May 18, 2001, 7:27:07 PM
Jun Woong wrote:
>
> In article <3B03D5C2...@wizard.net>, James Kuyper Jr. says...
> >Jun Woong wrote:
...
> >Not quite. A conforming implementation can recognise #nonstandard as a
> >pp-directive; but if it does, that directive cannot have any effect that
> >isn't allowed to an implementation by the standard.
>
> Thus, #nonstandard cannot have any non-conforming behavior (which affects
> s.c.program's behavior) like interpreting the rest of the source file as
> FORTRAN code. Right?

Correct. Of course, you'd have a hard time writing code that looks both
like valid FORTRAN and valid C. It's not impossible - I've seen a short
program that was written to compile in six different languages without
modification, two of which were C and Fortran. I wish I knew where to
find a copy of that program; I could never write it myself.

> >The directive could
> >change which implementation-defined choice is chosen, so long as it's a
> >choice which can change within a single translation unit. Whenever the
> >standard says that the behavior is undefined, #nonstandard could control
> >that behavior.
>
> You mean, #nonstandard itself is well-defined in C99 (as opposed to in
> C90) and a conforming implementation does not have the same power as
> when it'd cause undefined behavior?

#nonstandard is not so much well defined as tolerated. C99 specifically
defines a non-directive as one kind of group-part during the
pre-processing phase, and that's all it can be as far as strict
conformance is concerned. However, as a non-directive, it does nothing
in itself but wait to be recognised after the pre-processing phase as a
syntax error. If it doesn't survive past the pre-processing phase, as in
your case, all is fine. If, as an extension, it's recognised as a
pp-directive before it disappears, it can control any decision that the
implementation was already free to make. By definition, that can't be
any decision that would affect the output of a strictly conforming
program (except insofar as it changes the characteristics of the ""
locale).

...


> Therefore, if C99 says:
>
> #define nothing(x)
> nothing (
> #nonstandard
> )
>
> results in no tokens (nothing) (- as I understand Clive's wording,
> C99's intent says so), a conforming implementation that recognizes
> it as a valid pp-directive and gives some meaning to it must have the
> same result as the above and can have *only* some extension behavior
> which does not affect any s.c.program's behavior (output).
>
> Is this correct, now? :)

Yes.

Zack Weinberg

May 18, 2001, 8:26:28 PM
Clive D. W. Feather <cl...@davros.org> writes:
>In article <9cukha$g8j$1...@nntp.Stanford.EDU>, Zack Weinberg
><za...@stanford.edu> writes
>>>The line is split into valid pp-tokens in TP3: # include < stdio . h
>>
>>That is not the tokenization chosen by GCC. As I said upthread, this
>>line is tokenized as {#} {include} {<stdio.h} where the last token is
>>an invalid header-name (6.4.7). The invalidity of that third token then
>>causes an error.
>
>Then GCC is plain and simply wrong. {<stdio.h} is not a pp-token, so GCC
>*can't* legitimately tokenize it that way.

I should rephrase. It *attempts* to tokenize it that way, fails since
there is no close > on the line, and issues an error.

>>Now, if you can find evidence that that line must be tokenized as
>>{#} {include} {<} {stdio} {.} {h}
>>then we can debate that.
>
>That's a valid tokenization of that line. There is no other valid
>tokenization. So what more evidence do you need ?

Since 'there is no other valid tokenization' is not a reason for doing
anything anywhere _else_ in phase 3 (remember a+++++b) I don't
see why it should apply here. Everywhere else, phase 3 tokenization
proceeds by maximal munch. You never have to back up more than two
characters upon finding that some sequence is not completed. Here,
you're saying that phase 3 has to back up all the way to the < and
start over - potentially unlimited distance.

If the literal wording does require this, I'm inclined to consider it a
bug in the standard. The simplest way to fix the bug would be to insert
a new paragraph 6.10.2 p1.1:

If the first nonwhitespace character appearing in the argument
of an #include directive is either < or ", then the complete
argument shall match the syntax described in 6.10.2 paragraphs
2 or 3, respectively.

If one would rather not create a new constraint, then instead as 6.10.2 p3.1,

If the first nonwhitespace character appearing in the argument of
an #include directive is either < or ", but the complete argument
does not match the syntax described in 6.10.2 paragraphs 2 or 3,
respectively, the behavior is undefined.

zw
--
zw It may of course be possible that risks-awareness and extreme care are
developed in the course of dancing with the fuckup fairy in the pale
moonlight.
-- Anthony de Boer

Christian Bau

May 21, 2001, 4:00:20 AM
"James Kuyper Jr." wrote:
>
> Correct. Of course, you'd have a hard time writing code that looks both
> like valid FORTRAN and valid C. It's not impossible - I've seen a short
> program that was written to compile in six different languages without
> modification, two of which were C and Fortran. I wish I knew where to
> find a copy of that program; I could never write it myself.

C and FORTRAN 77 is not very difficult. FORTRAN 77 ignores everything
beyond column 72, that is past the 72nd character in a line. Just write
a program that looks like

/*
PROGRAM MYPROGRAM */ int main (void) /*
etc */ { etc. /*
END */ }
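A completed C-side version of this sketch (function name mine; whether the Fortran side is valid FORTRAN 77 is disputed in the next post) compiles as ordinary C, with the Fortran text hidden inside comments:

```c
/*
      PROGRAM MYPROG                                                    */
int polyglot_value(void) { /*
      MYPROG = 7                                                        */
    return 7; /*
      END                                                               */
}
```

The C compiler sees only the function; a fixed-form Fortran compiler following the column-72 convention would see only the indented statements.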

Nick Maclaren

May 21, 2001, 4:09:35 AM
In article <3B08CB14...@isltd.insignia.com>,

Er, no, it doesn't. Fortran 77 specifies that the program must lie
entirely in columns 1 to 72 - if there is any text outside that,
the behaviour is undefined. This caused considerable trouble with
some people assuming what you did (i.e. the 'IBM' camp) and some
systems allowing lines to exceed 72 characters (i.e. the 'DEC'
camp).

The correct approach is to start it:

C /*

This is a Fortran example of why I loathe C's catch-all undefined
behaviour so much. Without even a hint of whether it means the
program is erroneous or whether the situation is just too complex
to specify, many people assume specifications that don't exist.

Clive D. W. Feather

May 23, 2001, 10:04:12 PM
In article <61LM6.5204$6j3.4...@www.newsranger.com>, Jun Woong
<myco...@hanmail.net> writes

>>>>> #define nothing(x) /* nothing */
>>>>> nothing (
>>>>> #nonstandard
>>>>> )

>>>If the answer to my question above is "yes", in the implementation
>>>which defines the "#nonstandard" pp-directive, is #nonstandard
>>>scanned as # <pp-directive> rather than as # <non-directive>?
>>
>>It can't change meaning, since we know what effect it has on a system
>>that doesn't define it.
>
>Could you explain the full detail?

The above code will do nothing on an implementation that doesn't define
a #nonstandard pp-directive. Therefore any implementation that *does*
define one as an extension has to do so in a way that doesn't affect
that program.

Therefore the implementation must act AS IF the directive was
interpreted *after* translation phase 4. At that point there is no
meaning for #nonstandard in the Standard, so a syntax error will happen
and a diagnostic is required. Once this has happened, the implementer
can do what they like.

The other option would be to claim that this is a defect and the last
sentence of 6.10.3#11 ought to apply. I think I might write this up.
[That sentence says that, if we replace #nonstandard by #define or
#include in the above example, the behaviour is undefined.]

Clive D. W. Feather

May 23, 2001, 11:52:14 PM
In article <9e4ejk$d06$1...@nntp.Stanford.EDU>, Zack Weinberg
<za...@stanford.edu> writes
>>>>The line is split into valid pp-tokens in TP3: # include < stdio . h
>>>That is not the tokenization chosen by GCC.
>I should rephrase. It *attempts* to tokenize it that way, fails since
>there is no close > on the line, and issues an error.

What if "h" is a macro expanding to something ending in ">" ?

>>>Now, if you can find evidence that that line must be tokenized as
>>>{#} {include} {<} {stdio} {.} {h}
>>>then we can debate that.
>>That's a valid tokenization of that line. There is no other valid
>>tokenization. So what more evidence do you need ?
>Since 'there is no other valid tokenization' is not a reason for doing
>anything anywhere _else_ in phase 3 (remember a+++++b) I don't
>see why it should apply here.

Huh ?

"a+++++b" must be tokenised as {a}{++}{++}{+}{b}; there is no other
valid tokenisation. Phase 3 doesn't have a failure mode: every possible
source has exactly one tokenisation.

>Everywhere else, phase 3 tokenization
>proceeds by maximal munch. You never have to back up more than two
>characters upon finding that some sequence is not completed.

Unterminated string and character constants require arbitrary backup as
well, but that's undefined behaviour.

One approach is to implement a normal #include as if header-name tokens
didn't exist:

{#}{include} {<}{stdio}{.}{h}{>}

then, once you've identified that you have a valid sequence, paste it
together.

>If the literal wording does require this, I'm inclined to consider it a
>bug in the standard.

None of this is new in C99; why has it suddenly become an issue ?

>The simplest way to fix the bug would be to insert
>a new paragraph 6.10.2 p1.1:
>
> If the first nonwhitespace character appearing in the argument
> of an #include directive is either < or ", then the complete
> argument shall match the syntax described in 6.10.2 paragraphs
> 2 or 3, respectively.

#define FOO stdio.h>
#include <FOO

is currently valid code (modulo the last sentence of #4).
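The uncontroversial variant of this pattern, where the entire header name including its delimiters comes from macro expansion, is the computed include that 6.10.2p4 sanctions; a minimal sketch (macro and function names mine):

```c
/* 6.10.2p4: the pp-tokens after #include are macro-expanded, and the
   resulting directive must match one of the two standard forms. */
#define HDR <stdio.h>
#include HDR            /* expands to: #include <stdio.h> */

int uses_stdio(void) {
    char buf[16];
    return sprintf(buf, "%d", 42);  /* usable only if stdio.h was included */
}
```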

Zack Weinberg

May 24, 2001, 1:52:21 AM
Clive D. W. Feather <cl...@davros.org> writes:
>In article <9e4ejk$d06$1...@nntp.Stanford.EDU>, Zack Weinberg
><za...@stanford.edu> writes
>>>>>The line is split into valid pp-tokens in TP3: # include < stdio . h
>>>>That is not the tokenization chosen by GCC.
>>I should rephrase. It *attempts* to tokenize it that way, fails since
>>there is no close > on the line, and issues an error.
>
>What if "h" is a macro expanding to something ending in ">" ?

That's too bad.

>Huh ?
>
>"a+++++b" must be tokenised as {a}{++}{++}{+}{b}; there is no other
>valid tokenisation.

{a}{++}{+}{++}{b} is an equally valid-in-phase-3 sequence of tokens,
which could be the tokenization of that line. The standard nevertheless
requires a different choice, even though the choice it mandates leads
to a parse error in phase 7 and the alternative doesn't.

I am arguing that this situation is conceptually no different from the
situation of # include <stdio.h
ignoring the artificial division of parsing into phases (one could perfectly
well write a phase 7 parser that operated on a stream of phase 2 source
characters, and parsed "a+++++b" as {a}{++}{+}{++}{b}...)

>Phase 3 doesn't have a failure mode: every possible source has exactly
>one tokenisation.
>
>>Everywhere else, phase 3 tokenization
>>proceeds by maximal munch. You never have to back up more than two
>>characters upon finding that some sequence is not completed.
>
>Unterminated string and character constants require arbitrary backup as
>well, but that's undefined behaviour.

... which therefore lets an implementation reject code in phase 3 if
you hit an unterminated string or character constant. How is this not
a failure mode?

>>If the literal wording does require this, I'm inclined to consider it a
>>bug in the standard.
>
>None of this is new in C99; why has it suddenly become an issue ?

It isn't an issue, in the sense that GCC has had the behavior I outline
for at least five years and no one has complained, including a couple of
validation suites.

It is an issue because one would like the standard to reflect reality.

>>The simplest way to fix the bug would be to insert
>>a new paragraph 6.10.2 p1.1:
>>
>> If the first nonwhitespace character appearing in the argument
>> of an #include directive is either < or ", then the complete
>> argument shall match the syntax described in 6.10.2 paragraphs
>> 2 or 3, respectively.
>
> #define FOO stdio.h>
> #include <FOO
>
>is currently valid code (modulo the last sentence of #4).

I don't think it should be.

--
zw But then one day I came up with a radical new paradigm for my business...
I decided that from now on I would only sell boring stuff that people
actually need.
-- Garry Trudeau, _Doonesbury_

Jun Woong

May 24, 2001, 3:14:32 AM
In article <56c89eac...@romana.davros.org>, Clive D. W. Feather says...

>
>In article <61LM6.5204$6j3.4...@www.newsranger.com>, Jun Woong
><myco...@hanmail.net> writes
>>>>>> #define nothing(x) /* nothing */
>>>>>> nothing (
>>>>>> #nonstandard
>>>>>> )
>
>>>>If the answer to my question above is "yes", in the implementation
>>>>which defines the "#nonstandard" pp-directive, does #nonstandard be
>>>>scanned as # <pp-directive> rather than as # <non-directive>?
>>>
>>>It can't change meaning, since we know what effect it has on a system
>>>that doesn't define it.
>>
>>Could you explain the full detail?
>
>The above code will do nothing on an implementation that doesn't define
>a #nonstandard pp-directive. Therefore any implementation that *does*
>define one as an extension has to do so in a way that doesn't affect
>that program.

You mean, since #nonstandard does not survive TP4 (because of the
expansion of the nothing() macro), a conforming implementation can't
perform any extension behavior even though the extension does *not*
change the output of any s.c. program?

[To the best of my knowledge, "No."
I think you are talking about extensions which affect s.c.
programs]

>
>Therefore the implementation must act AS IF the directive was
>interpreted *after* translation phase 4. At that point there is no
>meaning for #nonstandard in the Standard, so a syntax error will happen
>and a diagnostic is required. Once this has happened, the implementer
>can do what they like.

Can a conforming implementation turn on some extension behavior
ONLY when the non-directive (which is recognized as a pp-directive
in that implementation) survives TP4, and report a syntax error
in a later phase, even if the implementation IS already free
to activate the extension since it does not affect any s.c. program?

[To the best of my knowledge, "No."
I think you are talking about extensions which affect s.c.
programs]

James Kuyper Jr.

May 24, 2001, 8:13:47 AM
Zack Weinberg wrote:
>
> Clive D. W. Feather <cl...@davros.org> writes:
> >In article <9e4ejk$d06$1...@nntp.Stanford.EDU>, Zack Weinberg
> ><za...@stanford.edu> writes
> >>>>>The line is split into valid pp-tokens in TP3: # include < stdio . h
> >>>>That is not the tokenization chosen by GCC.
> >>I should rephrase. It *attempts* to tokenize it that way, fails since
> >>there is no close > on the line, and issues an error.
> >
> >What if "h" is a macro expanding to something ending in ">" ?
>
> That's too bad.

As far as I can see, the macro expansion is required:

"<stdio.h" can't be parsed as a single preprocessing-token, because it
doesn't match any of the permitted patterns. Rather, it must be parsed
as {<} {stdio} {.} {h}, which qualify as preprocessing tokens as
{punctuator} {identifier} {punctuator} {identifier}. With that parse, it
matches the form given in 6.10.2p4:

"A preprocessing directive of the form

# include pp-tokens new-line

(that does not match one of the two previous forms) is permitted. The


preprocessing
tokens after include in the directive are processed just as in normal
text. (Each
identifier currently defined as a macro name is replaced by its
replacement list of

preprocessing tokens.) The directive resulting after all replacements
shall match one of
the two previous forms.143) ..."


If, as postulated, h is a macro whose expansion terminates in a ">",
then after that replacement, the directive does indeed "match one of the
two previous forms".

...


> >"a+++++b" must be tokenised as {a}{++}{++}{+}{b}; there is no other
> >valid tokenisation.
>
> {a}{++}{+}{++}{b} is an equally valid-in-phase-3 sequence of tokens,
> which could be the tokenization of that line. The standard requires
> one to pick a different possibility. It requires that even though the

I thought we were using "valid" in this context, to refer to
tokenizations that conform to the C standard. You yourself point out
that the C standard requires a different tokenization, so if we're
talking about C, that isn't a valid tokenization.

> possibility it requires leads to a parse error in phase 7, and the
> alternative doesn't.
>
> I am arguing that this situation is conceptually no different from the
> situation of # include <stdio.h
> ignoring the artificial division of parsing into phases (one could perfectly
> well write a phase 7 parser that operated on a stream of phase 2 source
> characters, and parsed "a+++++b" as {a}{++}{+}{++}{b}...)

Well yes; you could also write a phase 7 parser that parsed the code as
Fortran. We are, however, talking about the C standard, and the division
of parsing into phases is NOT an artificial feature of the language. An
implementation doesn't have to actually carry out the phases separately,
in the specified order, but whenever it makes a difference (which is
frequently) it must produce the same effect as-if it had done them
separately, in the order specified.

...


> >None of this is new in C99; why has it suddenly become an issue ?
>
> It isn't an issue, in the sense that GCC has had the behavior I outline
> for at least five years and no one has complained, including a couple of
> validation suites.
>
> It is an issue because one would like the standard to reflect reality.

One would also like GCC to conform to the published standard.

...


> > #define FOO stdio.h>
> > #include <FOO
> >
> >is currently valid code (modulo the last sentence of #4).
>
> I don't think it should be.

That's your opinion, and you're entitled to it. However, as a simple
statement of opinion, it lacks force. Would you care to give a reason
why someone else should share that opinion? Why would it be a good idea
to change the standard to prohibit that?

Christian Bau

May 24, 2001, 8:59:04 AM
"James Kuyper Jr." wrote:

>
> Zack Weinberg wrote:
> > > #define FOO stdio.h>
> > > #include <FOO
> > >
> > >is currently valid code (modulo the last sentence of #4).
> >
> > I don't think it should be.
>
> That's your opinion, and you're entitled to it. However, as a simple
> statement of opinion, it lacks force. Would you care to give a reason
> why someone else should share that opinion? Why would it be a good idea
> to change the standard to prohibit that?

The argument seems to be that it is MUCH too difficult to find out that
<FOO is not a valid preprocessing token (along with some comments that
everything else can be tokenised with at most two characters lookahead).
However, the gnu preprocessor at this point has the whole source file
available in an array of char so I cannot see why needing more than two
characters lookahead is any problem; special handling is necessary
anyway because <stdio.h> for example is only a preprocessing token in a
line starting with #include. And handling macros is required anyway, for
example to handle

#define FOO <stdio.h>
#include FOO

Once the preprocessor has recognized an "#include" directive, it must
check whether this is followed by a single "file.h" or <file.h>
preprocessing token or not, and if it is followed by anything else then
everything on the line is translated to preprocessing tokens in the
"standard" way without special handling of include file names, macro
evaluation takes place, and the line is interpreted again. The whole
argument is about the intellectual challenge involved in figuring out
that <FOO is not a preprocessing token. I think this check is quite
trivial; no more than 10 lines of code.

(Algorithm: If a < is found, skip characters that would be acceptable in
this preprocessing token until > or end-of-line is encountered.
end-of-line means no valid preprocessing token. If a > is found check
that the rest of the line is only white space, otherwise no
preprocessing token. )


(In the <FOO example above, there is no > before the end of the line,
therefore it is not a preprocessing token.)

Nick Maclaren

May 24, 2001, 9:03:48 AM

In article <3B0D0598...@isltd.insignia.com>,

Christian Bau <christ...@isltd.insignia.com> writes:
|>
|> The argument seems to be that it is MUCH too difficult to find out that
|> <FOO is not a valid preprocessing token (along with some comments that
|> everything else can be tokenised with at most two characters lookahead).
|> ... I think this check is quite

|> trivial; no more than 10 lines of code.

Quite probably, but I don't think that IS the argument!

The argument isn't that it cannot be done, but it is code whose
sole purpose is to support a more-or-less completely useless and
definitely completely unused feature of the standard. And every
complication is a place for bugs to hide in - as every gardener
and programmer knows ....

Clive D. W. Feather

May 25, 2001, 5:06:17 AM
In article <sx2P6.6000$r4.3...@www.newsranger.com>, Jun Woong
<myco...@hanmail.net> writes
>>>>>>> #define nothing(x) /* nothing */
>>>>>>> nothing (
>>>>>>> #nonstandard
>>>>>>> )

>>The above code will do nothing on an implementation that doesn't define


>>a #nonstandard pp-directive. Therefore any implementation that *does*
>>define one as an extension has to do so in a way that doesn't affect
>>that program.
>
>You mean, since #nonstandard does not survive TP4 (because of
>expansion of nothing() macro), a conforming implementation can't
>do any extension behavior even though the extension does *not*
>change the output of any s.c.programs?

Not quite. I am saying that the above is an s.c.program, and therefore
the implementation must not do anything special with it (exactly as if
the # sign was not there).

>Can a conforming implementation turn on some extension behavior
>ONLY when the non-directive (which is recognized as pp-directive
>in that implementation) survives TP4 and count a syntax error
>in the later phase,

That is right.

>even if the implementation IS already free
>to active the extension since it does not affect any s.c.program?

If #nonstandard has no effect on any s.c.program (for example, if it
changes the effects of signed integer overflow) then it *can* be
activated even if it would disappear in TP4.

For example, consider two possible directives:

#oflozero // all signed integer overflow results in 0
#divroundup // integer division rounds up

Then in the following code the assertions will all succeed:

#define nothing(x) // Hides x
nothing (
#divzero
)
assert (INTMAX * 2 == 0);
nothing (
#divroundup
)
assert (3 / 2 == 1);
#divroundup // Diagnostic required
assert (3 / 2 == 2);

Clive D. W. Feather

May 25, 2001, 5:12:34 AM
In article <9ei7il$q82$1...@nntp.Stanford.EDU>, Zack Weinberg
<za...@stanford.edu> writes

>{a}{++}{+}{++}{b} is an equally valid-in-phase-3 sequence of tokens,
>which could be the tokenization of that line. The standard requires
>one to pick a different possibility. It requires that even though the
>possibility it requires leads to a parse error in phase 7, and the
>alternative doesn't.
>
>I am arguing that this situation is conceptually no different from the
>situation of # include <stdio.h

But it isn't, because that isn't automatically a parse error in phase 4
in this case; such a line *can* appear in a correct program.

>(one could perfectly
>well write a phase 7 parser that operated on a stream of phase 2 source
>characters, and parsed "a+++++b" as {a}{++}{+}{++}{b}...)

But this would not correctly parse the language as defined.

>>Phase 3 doesn't have a failure mode: every possible source has exactly
>>one tokenisation.

>>Unterminated string and character constants require arbitrary backup as


>>well, but that's undefined behaviour.
>... which therefore lets an implementation reject code in phase 3 if
>you hit an unterminated string or character constant. How is this not
>a failure mode?

Because it happens in phase 4, I would argue. But, more importantly,
this is a case of certain token sequences invoking undefined behaviour;
it is not a case of a failure to tokenise. It has exactly the same
properties as the sequence "#if 1/0".

>It isn't an issue, in the sense that GCC has had the behavior I outline
>for at least five years and no one has complained, including a couple of
>validation suites.

So they didn't run into this situation.

>It is an issue because one would like the standard to reflect reality.

I would prefer GCC to actually implement C, rather than not-quite-C.

>>>The simplest way to fix the bug would be

[...]


>>is currently valid code (modulo the last sentence of #4).
>I don't think it should be.

I might even agree with you, but right now it is. Asking for a change in
the language is not fixing a bug.

Clive D. W. Feather

May 25, 2001, 5:15:53 AM
In article <9ej0rk$d43$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren
<nm...@cus.cam.ac.uk> writes

>Quite probably, but I don't think that IS the argument!
>
>The argument isn't that it cannot be done, but it is code whose
>sole purpose is to support a more-or-less completely useless and
>definitely completely unused feature of the standard.

It might be useless, and I could well be convinced that it should be
changed. But it does the GCC people no credit to claim it's a "bug" when
the Standard is perfectly clear, nor to claim that it's a hard task when
it clearly isn't.

Zack Weinberg

May 25, 2001, 3:54:37 PM
Clive D. W. Feather <cl...@davros.org> writes:
>In article <9ej0rk$d43$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren
><nm...@cus.cam.ac.uk> writes
>>Quite probably, but I don't think that IS the argument!
>>
>>The argument isn't that it cannot be done, but it is code whose
>>sole purpose is to support a more-or-less completely useless and
>>definitely completely unused feature of the standard.
>
>It might be useless, and I could well be convinced that it should be
>changed. But it does the GCC people no credit to claim it's a "bug" when
>the Standard is perfectly clear, nor to claim that it's a hard task when
>it clearly isn't.

This seems to be a debate over the definition of "bug"...

I am not arguing that the standard is unclear, or that GCC correctly
implements the language as specified. I am arguing that the
language-as-specified should be changed.

My justification for so arguing is first that the specification in this
instance is inconsistent with the rest of the specification for things
which act like string constants. String literals, character constants,
and "-delimited header names all behave one way; <>-delimited header names
are specified to behave a different way. Irrespective of whether or not
the way <>-headers are specified is *useful*, I hope you can see the value
in having them act as much like other string constants as possible. I do
think that the current spec is useless, but that's a secondary concern.

The issue of implementation difficulty is somewhat of a red herring, and
is certainly secondary to the question of inconsistency. Anything *can*
of course be implemented. I brought it up because in many other areas C
caters at great length to implementation convenience, but not in this one.
Further, your repeated assertion that this is easy to implement assumes
a particular architecture for the preprocessor which is not the one GCC
uses. If you're not prepared to believe me that this would take major
structural changes, I invite you to go look for yourself.

--
zw I'm on a spaceship full of college students.
-- Martin "PCHammer" Rose

Jun Woong

May 26, 2001, 8:47:40 PM
In article <9eaifv$6lb$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren says...
[...]

>
>Er, no, it doesn't. Fortran 77 specifies that the program must lie
>entirely in columns 1 to 72 - if there is any text outside that,
>the behaviour is undefined. This caused considerable trouble with
>some people assuming what you did (i.e. the 'IBM' camp) and some
>systems allowing lines to exceed 72 characters (i.e. the 'DEC'
>camp).
>
>The correct approach is to start it:
>
>C /*
>

Could you elaborate on this?
I agree that it is one of those approaches, but I can't understand why
it is the CORRECT one.

Jun Woong

May 27, 2001, 2:29:47 AM
In article <0CJ$jjpJCi...@romana.davros.org>, Clive D. W. Feather says...

>In article <sx2P6.6000$r4.3...@www.newsranger.com>, Jun Woong
>>Can a conforming implementation turn on some extension behavior
>>ONLY when the non-directive (which is recognized as pp-directive
>>in that implementation) survives TP4 and count a syntax error
>>in the later phase,
>
>That is right.
>
>>even if the implementation IS already free
>>to active the extension since it does not affect any s.c.program?
>
>If #nonstandard has no effect on any s.c.program (for example, if it
>changes the effects of signed integer overflow) then it *can* be
>activated even if it would disappear in TP4.
>
>For example, consider two possible directives:
>
> #oflozero // all signed integer overflow results in 0
> #divroundup // integer division rounds up
>
>Then in the following code the assertions will all succeed:
>
>#define nothing(x) // Hides x
>nothing (
>#divzero

I think you meant to write "#oflozero".

>)
>assert (INTMAX * 2 == 0);
>nothing (
>#divroundup

As in the #oflozero case above, it activates integer-division-rounds-up,
right? (see below)

>)
>assert (3 / 2 == 1);

Then why did you write this?
assert (3 / 2 == 2) or assert (2 / 3 == 1) seems to be what you meant
to write.

>#divroundup // Diagnostic required
>assert (3 / 2 == 2);
>


--

Jun Woong

May 27, 2001, 3:52:02 AM
In article <v91Q6.1412$rn5....@www.newsranger.com>, Jun Woong says...
[...]

>>
>>For example, consider two possible directives:
>>
>> #oflozero // all signed integer overflow results in 0
>> #divroundup // integer division rounds up
>>
>>Then in the following code the assertions will all succeed:
>>
>>#define nothing(x) // Hides x
>>nothing (
>>#divzero
>
>I think, you meant to write "#oflozero".
>
>>)
>>assert (INTMAX * 2 == 0);
>>nothing (
>>#divroundup
>
>As in #oflozero case above, it actives interger-division-rounds-up,
>right? (see below)
>

Ooops, nope.

>>)
>>assert (3 / 2 == 1);
>
>Then, why did you write this thing?
>assert (3 / 2 == 2) or assert (2 / 3 == 1) seem to be what you meant
>to write.

Nope; you are right on this. I awoke from an illusion after eating
between meals :)

Nick Maclaren

May 27, 2001, 4:57:28 AM
In article <M8YP6.1207$rn5....@www.newsranger.com>,

Jun Woong <myco...@hanmail.net> wrote:
>In article <9eaifv$6lb$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren says...
>[...]
>>
>>Er, no, it doesn't. Fortran 77 specifies that the program must lie
>>entirely in columns 1 to 72 - if there is any text outside that,
>>the behaviour is undefined. This caused considerable trouble with
>>some people assuming what you did (i.e. the 'IBM' camp) and some
>>systems allowing lines to exceed 72 characters (i.e. the 'DEC'
>>camp).
>>
>>The correct approach is to start it:
>>
>>C /*
>
>Could you elaborate on this?
>I agree that it is one of those approachs, but can't understand why
>CORRECT one.

It is the only way to start a combined C90 and Fortran 77 program with
a comment in one of those two languages, that is correct in both
(unextended) languages.

Jun Woong

May 27, 2001, 5:03:57 AM
In article <4yL8P7pC...@romana.davros.org>, Clive D. W. Feather says...

>
>Because it happens in phase 4, I would argue. But, more importantly,

That's important! *Conceptually*, applying the Standard literally to
the case of an unterminated string literal (or character constant),

"It's unmatched string literal \n
"It's unmatched string literal's example \n
"/* ... */ is this comment or not?

in TP3, the above examples are tokenized as

{"} {It} {'} {s} {unmatched} {string} {literal} {\} {n}
{"} {It} {'unmatched string literal'} {s} {example} {\} {n}
{"} {is} {this} {comment} {or} {not} {?}

respectively.

Of course, since the first token in each is a lone " matching the last
category of pp-tokens, they invoke undefined behavior after TP3
(tokenization). Although an implementation may turn on the undefined
behavior during TP3 (tokenization) itself, it does not make a
significant difference in the real world ("undefined behavior" covers
everything). But the literal wording of the Standard requires that
there be no failure during TP3.

Jun Woong

May 27, 2001, 5:08:59 AM
In article <9eqfho$pnc$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren says...

>In article <M8YP6.1207$rn5....@www.newsranger.com>,
>Jun Woong <myco...@hanmail.net> wrote:
>>In article <9eaifv$6lb$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren says...
>>[...]
>>>
>>>C /*
>>
>>Could you elaborate on this?
>>I agree that it is one of those approachs, but can't understand why
>>CORRECT one.
>
>It is the only way to start a combined C90 and Fortran 77 program with
>a comment in one of those two languages, that is correct in both
>(unextended) languages.
>

But how does the first "C" fit into valid C code?
(The opening comment "/*" starts after the "C".)

Jun Woong

May 27, 2001, 6:11:11 AM
In article <4yL8P7pC...@romana.davros.org>, Clive D. W. Feather says...
>In article <9ei7il$q82$1...@nntp.Stanford.EDU>, Zack Weinberg
><za...@stanford.edu> writes
>>>Unterminated string and character constants require arbitrary backup as
>>>well, but that's undefined behaviour.
>>... which therefore lets an implementation reject code in phase 3 if
>>you hit an unterminated string or character constant. How is this not
>>a failure mode?
>
>Because it happens in phase 4, I would argue. But, more importantly,

After re-reading the Standard, 6.4p3:
"If a ' or a " character matches the last category, the behavior is
undefined."

It can be interpreted as meaning that a conforming implementation can
invoke undefined behavior as soon as it detects a lone ' or "
character, even before it has finished tokenizing the rest of the
source file *in TP3*. Therefore it can be argued that the literal
wording of the Standard permits undefined behavior to happen in TP3
(tokenization).

If it is the very intent that there be no failure mode in TP3, why is
wording such as "... is undefined (after translation phase 3)" or
"(in translation phase 4)" not added? Is there another interpretation
or explicit description which makes this clear?


Thanks in advance...

Nick Maclaren

May 27, 2001, 9:56:18 AM
In article <Lu3Q6.1476$rn5....@www.newsranger.com>,

Jun Woong <myco...@hanmail.net> wrote:
>In article <9eqfho$pnc$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren says...
>>In article <M8YP6.1207$rn5....@www.newsranger.com>,
>>Jun Woong <myco...@hanmail.net> wrote:
>>>In article <9eaifv$6lb$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren says...
>>>[...]
>>>>
>>>>C /*
>>>
>>>Could you elaborate on this?
>>>I agree that it is one of those approachs, but can't understand why
>>>CORRECT one.
>>
>>It is the only way to start a combined C90 and Fortran 77 program with
>>a comment in one of those two languages, that is correct in both
>>(unextended) languages.
>>
>
>But, how is the first "C" replaced by a valid C code?
>(The opening comment "/*" starts after "C")

C /* lots of stuff */ () { ... }

Valid in C90.

James Kuyper Jr.

May 27, 2001, 2:27:53 PM
Nick Maclaren wrote:
>
> In article <Lu3Q6.1476$rn5....@www.newsranger.com>,
> Jun Woong <myco...@hanmail.net> wrote:
> >In article <9eqfho$pnc$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren says...
> >>In article <M8YP6.1207$rn5....@www.newsranger.com>,
> >>Jun Woong <myco...@hanmail.net> wrote:
> >>>In article <9eaifv$6lb$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren says...
> >>>[...]
> >>>>
> >>>>C /*
> >>>
> >>>Could you elaborate on this?
> >>>I agree that it is one of those approachs, but can't understand why
> >>>CORRECT one.
> >>
> >>It is the only way to start a combined C90 and Fortran 77 program with
> >>a comment in one of those two languages, that is correct in both
> >>(unextended) languages.
> >>
> >
> >But, how is the first "C" replaced by a valid C code?
> >(The opening comment "/*" starts after "C")
>
> C /* lots of stuff */ () { ... }
>
> Valid in C90.

So, having dropped implicit int in C99, have we thereby lost the ability
to write code that is both valid C and valid Fortran? I don't mean to
imply that this would be a bad thing!

If I'm reading the grammar correctly, after phase 4 the first
non-whitespace in a C program must be a declaration-specifier, and
without a typedef or #define, "C" doesn't qualify. On the other hand,
I'm no longer as familiar with Fortran 77 as I used to be, and know
almost nothing about Fortran 90. Is there an alternative approach?

Nick Maclaren

May 27, 2001, 2:28:39 PM
In article <3B114729...@wizard.net>,

James Kuyper Jr. <kuy...@wizard.net> wrote:
>
>So, having dropped implicit int in C99, have we thereby lost the ability
>to write code that is both valid C and valid Fortran? I don't mean to
>imply that this would be a bad thing!

I believe so, but would not bet on it.

>If I'm reading the grammar correctly, after phase 4 the first
>non-whitespace in a C program must be a declaration-specifier, and
>without a typedef or #define, "C" doesn't qualify. On the other hand,
>I'm no longer as familiar with Fortran 77 as I used to be, and know
>almost nothing about Fortran 90. Is there an alternative approach?

I don't think so, but am not sure. Like you, I am not as familiar
with Fortran 90 as I was/am with Fortran 77. It is also possible
that columns 73 onwards are now specified to be ignored in Fortran
(i.e. 90), which would bring the other trick back :-)

Neil Booth

May 27, 2001, 5:28:09 PM
Zack Weinberg <za...@stanford.edu> writes:

> My justification for so arguing is first that the specification in this
> instance is inconsistent with the rest of the specification for things
> which act like string constants. String literals, character constants,
> and "-delimited header names all behave one way; <>-delimited header names
> are specified to behave a different way. Irrespective of whether or not
> the way <>-headers are specified is *useful*, I hope you can see the value
> in having them act as much like other string constants as possible. I do
> think that the current spec is useless, but that's a secondary concern.

You can also get stupid things like

#define G >
#define foo maybe a macro containing a '>'

#include <foo /* G
*/

In this line preprocessing stage 3 doesn't know whether the /* is the
beginning of a comment or not, because it doesn't know whether it is
going to meet a '>' in the future, and if it doesn't meet such a '>'
in the future, being preprocessing phase 3, it still has *no idea*
whether foo expands to a macro containging a > and thus whether it's a
comment or not, or whether G might provide the '>'. So preprocessing
phase 3 gets conceptually entangled with phase 4. If you see what I
mean.

I think the above example demonstrates why the current rule has not
been thought about, and is unimplementable in an intelligent way. IMO
GCC does the right thing <tm>.

Neil.

Jun Woong

May 27, 2001, 8:43:54 PM
In article <87snhqx...@monkey.daikokuya.demon.co.uk>, Neil Booth says...
[...]

>
>You can also get stupid things like
>
>#define G >
>#define foo maybe a macro containg a '>'
>
>#include <foo /* G
>*/
>
>In this line preprocessing stage 3 doesn't know whether the /* is the
>beginning of a comment or not, because it doesn't know whether it is
>going to meet a '>' in the future, and if it doesn't meet such a '>'
>in the future, being preprocessing phase 3, it still has *no idea*
>whether foo expands to a macro containging a > and thus whether it's a
>comment or not, or whether G might provide the '>'.

Can G provide the '>'?
Comments are processed in TP3 (i.e., during tokenization), thus the
#include line above is tokenized as

{#} {include} {<} {foo} [/* G */]

where [/* G */] indicates just token-separation by the comment and is
replaced by one space character *before* TP4 (the comment does not
survive TP3).

In TP4, the sequence of the pp-tokens is only

{#} {include} {<} {foo}

, and the resulting tokens of the invocation of foo are

{#} {include} {<} {maybe} {a} {macro} {containing} {a} {'>'}

After all, this does not match either of the two forms (<>, "") and
causes undefined behavior. [A conforming implementation can't even
merge the tokens into one <>-header name in a legal way (I mean,
without this undefined behavior)]

Wclodius

May 27, 2001, 11:34:32 PM
In Fortran 90 there are now two source forms: fixed and free.

Fixed source form is roughly as before. 72 characters is defined as the maximum
length of a statement and not as the statement length. Compilers are required
to accept, but warn about, statements longer than 72 characters. Comments as in
F77 may start with a "C" or "*" in column 1, but may also start with a "c" in
column 1, or with a "!" in any position except column 6 (outside comments or
string literals). String literals are delimited by matching "'" or '"'.
Statement labels appear only in positions 1-5. Any character other than a blank
in column 6 continues the previous statement. Blanks remain insignificant in
most contexts. END statements have special restrictions.

Free source form only has one form of comment, which starts with an "!" and goes
to the end of the line, which can be up to 132 characters long. Blanks are
significant outside of character literals and comments. Source, but not comment,
lines are continued with an "&" at the end of the line to be continued.
Multiple statements may appear on a line separated by ";". Most statements may
begin with a numeric label, separated from the main part of the statement by
one or more blanks. A blank line is treated as a comment.

It is possible, but a pain, to write code that is compatible with both source
forms, e.g., put an "&" if necessary beyond column 72 and another one in column
6 in order to continue statements and not use significant blanks.

William B. Clodius

Thomas...@cologne.de

May 28, 2001, 5:04:08 AM
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:

(valid for both C and Fortran)


>The correct approach is to start it:
>
>C /*

What about

*/

This is valid Fortran 77 (a * in the first column is a comment),
but is it valid C?

Thomas Pornin

May 28, 2001, 5:35:52 AM
According to Neil Booth <ne...@daikokuya.demon.co.uk>:

> #include <foo /* G
> */

6.4.9/1 states that:
<< Except within a character constant, a string literal, or a comment,
the characters /* introduce a comment. The contents of such a comment
are examined only to identify multibyte characters and to find the
characters */ that terminate it. >>

However, the '# include <h-char-sequence> new-line' specified in
6.10.2/2 implies that, in your example:

#include <foo /* G

the '/*' cannot introduce a comment, since its contents would have been
examined (to look for a '>'). Therefore it must be tokenized that way:

{#} {include} {whitespace} {<} {foo} {whitespace} {/} {*} {whitespace} {G}

but, then again, this is contradictory with 6.4.9/1. Therefore this
sequence of characters is not compatible with the grammar as described
in the standard.


In my opinion, either a diagnostic is required, or the standard is
self-contradictory, which is a gruesome thought.


--Thomas Pornin

Nick Maclaren

unread,
May 28, 2001, 7:29:43 AM5/28/01
to
In article <9et4a8$b0o$1...@mvmap66.ciw.uni-karlsruhe.de>,

A good point - but Fortran 90 introduced that, I think. However, it
does provide a starting place for C99 and Fortran 90:

* /* ... */ int fred (void) { ; }

Jun Woong

unread,
May 28, 2001, 10:17:02 AM5/28/01
to
In article <9et65o$vlq$1...@nef.ens.fr>, Thomas Pornin says...

>
>According to Neil Booth <ne...@daikokuya.demon.co.uk>:
>> #include <foo /* G
>> */
>
>6.4.9/1 states that:
><< Except within a character constant, a string literal, or a comment,
>the characters /* introduce a comment. The contents of such a comment
>are examined only to identify multibyte characters and to find the
>characters */ that terminate it. >>
>
>However, the '# include <h-char-sequence> new-line' specified in
>6.10.2/2 implies that, in your example:
>
>#include <foo /* G
>
>the '/*' cannot introduce a comment, since its contents would have been
>examined (to look for a '>').

No, I don't think so. The '# include <h-char-sequence> new-line' form
does not match the #include line above (in TP3), which means that the
pp-tokens "<foo /* G" following the #include directive are treated as
ordinary text. As I wrote in another posting, "/* G */" is recognized
simply as a comment: it performs only token separation and does not
survive TP3.

Note that, in the following example, "stdio" is recognized as an
identifier, which triggers the macro expansion of "stdio".

#define nothing
#define stdio stdio.h
#include <stdio> nothing

Of course, the resulting token sequence {<} {stdio} {.} {h} {>}
matches one of the two forms, but the way in which it is merged into
a header name is, as you know, implementation-defined.

Niklas Matthies

unread,
May 28, 2001, 2:20:50 PM5/28/01
to
On 28 May 2001 11:29:43 GMT, Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
> In article <9et4a8$b0o$1...@mvmap66.ciw.uni-karlsruhe.de>,
> <Thomas...@cologne.de> wrote:
> >Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
> >
> >(valid for both C and Fortran)
> >>The correct approach is to start it:
> >>
> >>C /*
> >
> >What about
> >
> >*/
> >
> >This is valid Fortran 77 (a * in the first column is a comment),
> >but is it valid C?
>
> A good point - but Fortran 90 introduced that, I think. However, it
> does provide a starting place for C99 and Fortran 90:
>
> * /* ... */ int fred (void) { ; }

And how, exactly, is that valid C99?

-- Niklas Matthies

Nick Maclaren

unread,
May 28, 2001, 4:50:23 PM5/28/01
to
In article <slrn9h55mr.ja.news/comp....@ns.nmhq.net>,

Niklas Matthies <news/comp....@nmhq.net> wrote:
>
>And how, exactly, is that valid C99?

Er, you have a point :-) The type declarator syntax is confusing,
but I managed to get even more confused on my own!

Wclodius

unread,
May 28, 2001, 6:54:58 PM5/28/01
to
No, a '*' in the first column indicating a comment was introduced in Fortran 77.
It was recommended in some texts as a way of indicating that the code was not
F66 compatible, and I believe was introduced in the language with that intent,
but was rarely adopted by programmers.

William B. Clodius

Christian Bau

unread,
May 29, 2001, 3:53:18 AM5/29/01
to
Neil Booth wrote:
>
> Zack Weinberg <za...@stanford.edu> writes:
>
> > My justification for so arguing is first that the specification in this
> > instance is inconsistent with the rest of the specification for things
> > which act like string constants. String literals, character constants,
> > and "-delimited header names all behave one way; <>-delimited header names
> > are specified to behave a different way. Irrespective of whether or not
> > the way <>-headers are specified is *useful*, I hope you can see the value
> > in having them act as much like other string constants as possible. I do
> > think that the current spec is useless, but that's a secondary concern.
>
> You can also get stupid things like
>
> #define G >
> #define foo maybe a macro containing a '>'
>
> #include <foo /* G
> */
>
> In this line preprocessing stage 3 doesn't know whether the /* is the
> beginning of a comment or not, because it doesn't know whether it is
> going to meet a '>' in the future, and if it doesn't meet such a '>'
> in the future, being preprocessing phase 3, it still has *no idea*
> whether foo expands to a macro containing a > and thus whether it's a
> comment or not, or whether G might provide the '>'. So preprocessing
> phase 3 gets conceptually entangled with phase 4. If you see what I
> mean.

You are quite wrong. Following a #include is either one of two special
preprocessing tokens that are only allowed in an #include directive
("file.h" or <header.h>) or an arbitrary sequence of preprocessing
tokens.

The line
#include <foo /* G
doesn't match the first syntax because there is no character sequence
that matches <header.h> or "file.h". Therefore everything is translated
into preprocessing tokens in the normal way, so the preprocessing tokens
are < and foo. It is absolutely clear that /* G */ is a comment. Now
since you have an #include preprocessing directive followed by arbitrary
preprocessing tokens, those preprocessing tokens are macro processed and
then they must match one of the forms "file.h" or <header.h>.

Thomas Pornin

unread,
May 29, 2001, 6:49:58 AM5/29/01
to
According to Christian Bau <christ...@isltd.insignia.com>:

> It is absolutely clear that /* G */ is a comment.

That's not that clear. To decide it had to be a comment, the
implementation must have scanned its contents (in case it would contain
a > and a new-line), and 6.4.9/1 is quite clear about the fact that the
contents of a /* ... */ comment are scanned only to identify multibyte
characters and find the sequence */.


--Thomas Pornin

Christian Bau

unread,
May 29, 2001, 7:02:23 AM5/29/01
to

I think you will find that if a header name contains // or /* then
behaviour is undefined.

For example:

#include <myheader/*>
or
#include <myheader/*comment*/.h>

So there is no need to examine what follows /* unless you want to
identify undefined behaviour.

Thomas Pornin

unread,
May 29, 2001, 7:19:47 AM5/29/01
to
According to Christian Bau <christ...@isltd.insignia.com>:
> I think you will find that if a header name contains // or /* then
> behaviour is undefined.

I think I will not find it, and actually I do not find it. The standard
states that, besides standard header names such as <stdio.h>, the way a
header is found from its name is implementation defined; but it is in no
way undefined behaviour. The implementation must either find a file or
issue a diagnostic (for violation of the 6.10.2/1 constraint).

Unless I have overlooked some sentence that you could show me in
the standard, of course.


--Thomas Pornin

Jun Woong

unread,
May 29, 2001, 11:09:49 AM5/29/01
to
por...@bolet.ens.fr (Thomas Pornin) wrote in message news:<9evusm$2t66$1...@nef.ens.fr>...

> According to Christian Bau <christ...@isltd.insignia.com>:
> > It is absolutely clear that /* G */ is a comment.
>
> That's not that clear.

In this case, that is.

#include <foo /* G
*/

is tokenized in TP3 as

{#} {include} {<} {foo} [/* G */]

And [/* G */] is a comment: it performs only token separation and
does not survive TP3, as I said in another posting.

Jun Woong

unread,
May 29, 2001, 12:02:50 PM5/29/01
to
Jun Woong<myco...@hanmail.net> wrote in message news:<y5tQ6.2564$rn5.1...@www.newsranger.com>...

Hmm, I'm very sure that the #include line above does not match "#
include <h-char-seq> new-line", but I am not sure that the Standard
allows "stdio" to be macro-expanded. (I don't have C90/C99 handy now)

The line matches "# include pp-tokens new-line", and pp-tokens include
header names (the <> and "" forms). Besides, the Standard (C90+COR/C99)
says that a header name is recognized only in an #include directive line.

#define nothing
#define stdio string
#include <stdio.h> nothing

If an implementation merges {<} {string} {.} {h} {>} into a valid
header name <string.h>, is the Standard (C90/C99) clear about which
of the two headers, <stdio.h> or <string.h>, is #included?

[GCC 2.95.3 #includes <string.h> in that case]

Max TenEyck Woodbury

unread,
May 29, 2001, 5:42:42 PM5/29/01
to Nick Maclaren
Nick Maclaren wrote:
>
> Er, no, it doesn't. Fortran 77 specifies that the program must lie
> entirely in columns 1 to 72 - if there is any text outside that,
> the behaviour is undefined. This caused considerable trouble with
> some people assuming what you did (i.e. the 'IBM' camp) and some
> systems allowing lines to exceed 72 characters (i.e. the 'DEC'
> camp).

This is getting off topic, but if the FORTRAN 77 standard specified
that no text could exist beyond column 72, it did NOT codify existing
practice. In the original FORTRAN punch card environment, the field
73-80 was frequently used for deck sequence numbers. Hard experience
taught many of us the necessity of sequence numbering production
decks. Specifically it was the difference between a mindless half hour
with a card sorting machine and days of mind wrenching redebugging the
scrambled deck. (Pray that you had a recent enough listing!)

As for DEC's allowing source text beyond column 72, this was
explicitly stated in their documentation as one of their extensions,
among quite a few other useful extensions, and could be turned off
if required.

mt...@cds.duke.edu

Nick Maclaren

unread,
May 29, 2001, 5:46:50 PM5/29/01
to
In article <3B1417D2...@cds.duke.edu>,

Max TenEyck Woodbury <mt...@cds.duke.edu> wrote:
>Nick Maclaren wrote:
>>
>> Er, no, it doesn't. Fortran 77 specifies that the program must lie
>> entirely in columns 1 to 72 - if there is any text outside that,
>> the behaviour is undefined. This caused considerable trouble with
>> some people assuming what you did (i.e. the 'IBM' camp) and some
>> systems allowing lines to exceed 72 characters (i.e. the 'DEC'
>> camp).
>
>This is getting off topic, but if the FORTRAN 77 standard specified
>that no text could exist beyond column 72, it did NOT codify existing
>practice. In the original FORTRAN punch card environment, the field
>73-80 was frequently used for deck sequence numbers. Hard experience
>taught many of us the necessity of sequence numbering production
>decks. Specifically it was the difference between a mindless half hour
>with a card sorting machine and days of mind wrenching redebugging the
>scrambled deck. (Pray that you had a recent enough listing!)

One of the great advantages of paper tape :-)

>As for DEC's allowing source text beyond column 72, this was
>explicitly stated in their documentation as one of their extensions,
>among quite a few other useful extensions, and could be turned off
>if required.

Quite so. But both the use of columns 73-80 for sequence numbers
and allowing longer lines were undefined behaviour according to
the Fortran 77 standard.

This is one of the examples where Fortran started slipping down
the path that has caused so much trouble in C. It didn't make
it clear exactly what the constraints were on BOTH the programmer
AND the implementor, and different people assumed different things.

Max TenEyck Woodbury

unread,
May 29, 2001, 6:08:18 PM5/29/01
to Wclodius
A more flexible approach is to pull the common code in with
#include "file". In that environment, you can define macros to
stretch the 'C' language to the point where it looks a lot more
like FORTRAN than it normally does. This does put significant
restrictions on how you write your code.

To start, you need:

#define C

to eliminate all those FORTRAN comment delimiters; this
implies that none of your variables can simply be named 'C'.
More macros can be used to translate FORTRAN array references
to 'C'.

On the other hand, the same file can be INCLUDEd into a
FORTRAN template without modification too.

However, the whole process with all its restrictions is
incredibly painful, and not really generalizable.

mt...@cds.duke.edu

Thomas Pornin

unread,
May 30, 2001, 8:52:53 AM5/30/01
to
According to Jun Woong <myco...@hanmail.net>:

> #include <foo /* G
> */
>
> is tokenized in TP3 as
>
> {#} {include} {<} {foo} [/* G */]

And my point is that this tokenization is contradictory with 6.4.9/1.
6.4.9/1 states that the contents of a comment are not examined, except
for finding the */. And to decide that there is no trailing '>' on the
line containing '#include', the contents of /* G */ must be examined,
and for finding something else than the */. So this cannot be a comment.

And the whole thing happens during TP3. What happens afterwards is
irrelevant.


--Thomas Pornin

Jun Woong

unread,
May 30, 2001, 11:10:28 PM5/30/01
to
por...@bolet.ens.fr (Thomas Pornin) wrote in message news:<9f2qf5$24ge$1...@nef.ens.fr>...

> According to Jun Woong <myco...@hanmail.net>:
> > #include <foo /* G
> > */
> >
> > is tokenized in TP3 as
> >
> > {#} {include} {<} {foo} [/* G */]
>
> And my point is that this tokenization is contradictory with 6.4.9/1.

Yes, I was able to see your point.

> 6.4.9/1 states that the contents of a comment are not examined, except
> for finding the */. And to decide that there is no trailing '>' on the
> line containing '#include', the contents of /* G */ must be examined,
> and for finding something else than the */. So this cannot be a comment.

The Standard also says that if a header name contains /* in it, the
behavior is undefined. Thus, an implementation can simply decline to
examine whether the comment contains a > or not.

Thomas Pornin

unread,
May 31, 2001, 9:13:51 AM5/31/01
to
According to Jun Woong <myco...@hanmail.net>:
> The Standard also says, if a header name contains /* in it,. the behavior
> is undefined.

Where ? I could not find any such instance. Maybe I missed some crucial
paragraph.


--Thomas Pornin

Nick Maclaren

unread,
May 31, 2001, 9:32:43 AM5/31/01
to

6.4.7 #3 in C99. Also there in C90.

Jun Woong

unread,
May 31, 2001, 12:54:21 PM5/31/01
to
por...@bolet.ens.fr (Thomas Pornin) wrote in message news:<9f5g2f$3a7$1...@nef.ens.fr>...

6.4.7p3:
"If the characters ', \, ", //, or /* occur in the sequence between
the < and > delimiters, the behavior is undefined. Similarly, if the
characters ', \, //, or /* occur in the sequence between the "
delimiters, the behavior is undefined."

Note that the Standard uses the term "the sequence between ...
delimiters" rather than "... header name".


#define G >
#define foo stdio.h>


#include <foo /* G
*/

is tokenized as follows *in TP3*. [there is no > delimiter
in TP3]

{#} {include} {<} {foo} [/* G */]

And macro-expanded to

{#} {include} {<} {stdio} {.} {h} {>}

#define foo stdio.h>
#include <foo /* > */

has undefined behavior.
[GCC acts as /* > */ is just a comment]

#define foo stdio.h>
#include <foo /* > */ >

has undefined behavior.

#define foo stdio.h"
#include "foo /* comment
*/

has undefined behavior due to lone " character.

Both

#include "foo /* " */

and

#include "foo /* " */"

have undefined behavior.

Larry Jones

unread,
May 29, 2001, 12:28:50 PM5/29/01
to
Thomas Pornin (por...@bolet.ens.fr) wrote:
>
> I think I will not find it, and actually I do not find it. The standard
> states that, besides standard header names such as <stdio.h>, the way a
> header is found from its name is implementation defined; but it is in no
> way undefined behaviour. The implementation must either find a file or
> issue a diagnostic (for violation of the 6.10.2/1 constraint).

6.4.7p3:

If the characters ', \, ", //, or /* occur in the sequence
between the < and > delimiters, the behavior is undefined.
Similarly, if the characters ', \, //, or /* occur in the
sequence between the " delimiters, the behavior is undefined.

-Larry Jones

Good gravy, whose side are you on?! -- Calvin

Jun Woong

unread,
Jun 3, 2001, 11:47:34 PM6/3/01
to
myco...@hanmail.net (Jun Woong) wrote in message news:<94f0654c.01053...@posting.google.com>...

In the above examples, does it affect their conformance whether
there is a new-line within /* .. */ or not?

Almost all of the compilers I've tested fail to compile the
following, which I believe are strictly conforming (modulo the
implementation-defined merging of a header name).

#define G >
#define foo stdio.h>
#include <foo /*
G */

#define foo stdio.h>
#include <foo /*
> */

[note that the > delimiter does not appear on the same line
as #include directive]

Are they simply not conforming to the C90/C99 standard?

Clive D. W. Feather

unread,
Jun 5, 2001, 9:36:37 AM6/5/01
to
In article <9f2qf5$24ge$1...@nef.ens.fr>, Thomas Pornin
<por...@bolet.ens.fr> writes

>And my point is that this tokenization is contradictory with 6.4.9/1.
>6.4.9/1 states that the contents of a comment are not examined, except
>for finding the */. And to decide that there is no trailing '>' on the
>line containing '#include', the contents of /* G */ must be examined,
>and for finding something else than the */. So this cannot be a comment.

Not a problem. It is undefined behaviour for a header-name to contain a
/* sequence. So given a line like:

# include < stdio.h /* > */

it is unspecified whether it tokenises as:

{#} {include} {< stdio.h /* >} */
or:
{#} {include} {<} {stdio}{.}{h} {/* > */}

The former case invokes undefined behaviour, and so the whole line is
undefined.

--
Clive D.W. Feather, writing for himself | Home: <cl...@davros.org>
Tel: +44 20 8371 1138 (work) | Web: <http://www.davros.org>
Fax: +44 20 8371 4037 (D-fax) | Work: <cl...@demon.net>
Written on my laptop; please observe the Reply-To address

Clive D. W. Feather

unread,
Jun 5, 2001, 9:17:06 AM6/5/01
to
In article <9f15ca$or2$1...@pegasus.csx.cam.ac.uk>, Nick Maclaren
<nm...@cus.cam.ac.uk> writes

>>Specifically it was the difference between a mindless half hour
>>with a card sorting machine and days of mind wrenching redebugging the
>>scrambled deck. (Pray that you had a recent enough listing!)
>
>One of the great advantages of paper tape :-)

Only if you had a goodly supply of mylar patches for those times you
dropped the (tightly-wound) reel and it got itself knotted while
untangling.

Clive D. W. Feather

unread,
Jun 5, 2001, 9:42:31 AM6/5/01
to
In article <94f0654c.01060...@posting.google.com>, Jun Woong
<myco...@hanmail.net> writes

>Almost all of the compilers I've tested fail to compile the
>following, which I believe are strictly conforming (modulo the
>implementation-defined merging of a header name).
>
>#define G >
>#define foo stdio.h>
>#include <foo /*
>G */

I think you meant to have G outside the comment there.

>Are they simply not conforming to the C90/C99 standard?

That would be my belief, yes.

Clive D. W. Feather

unread,
Jun 5, 2001, 9:29:43 AM6/5/01
to
In article <9emd9t$d3k$1...@nntp.Stanford.EDU>, Zack Weinberg
<za...@stanford.edu> writes
>I am not arguing that the standard is unclear, or that GCC correctly
>implements the language as specified. I am arguing that the
>language-as-specified should be changed.

Okay. Others appeared to be claiming that GCC was correct.

>Further, your repeated assertion that this is easy to implement assumes
>a particular architecture for the preprocessor which is not the one GCC
>uses. If you're not prepared to believe me that this would take major
>structural changes, I invite you to go look for yourself.

I based that assertion on another posting in this thread.

Jun Woong

unread,
Jun 6, 2001, 6:50:37 AM6/6/01
to
"Clive D. W. Feather" <cl...@on-the-train.demon.co.uk> wrote in message news:<ELFDR22H...@romana.davros.org>...

> In article <94f0654c.01060...@posting.google.com>, Jun Woong
> <myco...@hanmail.net> writes
> >Almost all of the compilers I've tested fail to compile the
> >following, which I believe are strictly conforming (modulo the
> >implementation-defined merging of a header name).
> >
> >#define G >
> >#define foo stdio.h>
> >#include <foo /*
> >G */
>
> I think you meant to have G outside the comment there.

No. As I described above,

#define foo stdio.h>
#include <foo /* > */

has undefined behavior. But, I think

#define foo stdio.h>
#include <foo /*

> */

does not have undefined behavior because the > delimiter
appears on the next line (which is not the same line as
the < delimiter). Note that a header name token cannot
contain a new-line character.

>
> >Are they simply not conforming to the C90/C99 standard?
>
> That would be my belief, yes.

Do you mean that the compilers are not conforming?
I used "they" to refer to the compilers, not to the examples
which I wrote.

Jun Woong

unread,
Jun 7, 2001, 1:04:28 AM6/7/01
to
"Clive D. W. Feather" <cl...@on-the-train.demon.co.uk> wrote in message news:<OrtBZE2l...@romana.davros.org>...

> In article <9f2qf5$24ge$1...@nef.ens.fr>, Thomas Pornin
> <por...@bolet.ens.fr> writes
> >And my point is that this tokenization is contradictory with 6.4.9/1.
> >6.4.9/1 states that the contents of a comment are not examined, except
> >for finding the */. And to decide that there is no trailing '>' on the
> >line containing '#include', the contents of /* G */ must be examined,
> >and for finding something else than the */. So this cannot be a comment.
>
> Not a problem. It is undefined behaviour for a header-name to contain a
> /* sequence. So given a line like:
>
> # include < stdio.h /* > */
>
> it is unspecified whether it tokenises as:
>
> {#} {include} {< stdio.h /* >} */
> or:
> {#} {include} {<} {stdio}{.}{h} {/* > */}
>
> The former case invokes undefined behaviour, and so the whole line is
> undefined.

This is a trivial question, but when interpreting the Standard strictly,
is a conforming implementation required to tokenize it as only one of
the two? The Standard does not use the term "header name" in describing
the undefined behavior in this case; it uses "the < and > delimiters".

I think the undefined behavior is not tied to the way of
tokenization; one can detect it simply by checking for the existence
of a /* sequence between the < and > delimiters within #include
directives, without tokenizing header names.
