__LINE__ and backslash-newline

Daniel Villeneuve

Jan 19, 1999, 3:00:00 AM
In the section 6.10.4, the line number is defined as follows: ``The line
number of the current source line is one greater than the number of
new-line characters read or introduced in translation phase 1 (5.1.1.2)
while processing the source file to the current token.''

In order to keep track of this number, the lexer has to identify the end
of the current token (this was the subject of another discussion some
while ago). Given that the line number of the first source line is 1,
what should be the value of i in the following?
int i = __LINE__\
;

The backslash-newline pair being ``non-existent'', could it be
``attached'' to __LINE__, thus producing the value `2'? Obviously, the
value `1' is reasonable as well. Does the Standard make the effect of
such a statement implementation-defined?
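
One way to see what a particular compiler does with this case is a small probe. This is only a sketch: the names base_line, spliced_line and spliced_offset are invented for illustration, and the two declarations are assumed to sit on adjacent physical lines so that the difference is independent of where the fragment lands in a file.

```c
/* Two adjacent file-scope constants. The second places a backslash-newline
   directly after __LINE__, as in the question. */
static const int base_line = __LINE__;
static const int spliced_line = __LINE__\
;

/* 1 if the implementation reports the line the token sits on,
   2 if it reports the line on which the token was seen to end. */
int spliced_offset(void)
{
    return spliced_line - base_line;
}
```

Printing spliced_offset() from a test program shows which reading the implementation at hand has chosen; it says nothing about what the Standard requires.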

--
Daniel Villeneuve
Graduate student in Operations Research
GERAD/Mathématiques Appliquées
École Polytechnique de Montréal

Pete Becker

Jan 19, 1999, 3:00:00 AM
Daniel Villeneuve wrote:
>
> In the section 6.10.4, the line number is defined as follows: ``The line
> number of the current source line is one greater than the number of
> new-line characters read or introduced in translation phase 1 (5.1.1.2)
> while processing the source file to the current token.''
>
> In order to keep track of this number, the lexer has to identify the end
> of the current token (this was the subject of another discussion some
> while ago). Given that the line number of the first source line is 1,
> what should be the value of i in the following?
> int i = __LINE__\
> ;
>
> The backslash-newline pair being ``non-existent'', could it be
> ``attached'' to __LINE__, thus producing the value `2'? Obviously, the
> value `1' is reasonable as well. Does the Standard make the effect of
> such a statement implementation-defined?

The standard is quite clear: its value is 1. Read about "phases of
translation", and note that the line number is generated after phase 1,
but backslash newline is not processed until phase 2.

--
Pete Becker
Dinkumware, Ltd.
http://www.dinkumware.com

Clive D.W. Feather

Jan 19, 1999, 3:00:00 AM
In article <36A4BA62...@crt.umontreal.ca>, Daniel Villeneuve
<dan...@crt.umontreal.ca> writes

>In order to keep track of this number, the lexer has to identify the end
>of the current token (this was the subject of another discussion some
>while ago). Given that the line number of the first source line is 1,
>what should be the value of i in the following?
>int i = __LINE__\
>;
>
>The backslash-newline pair being ``non-existent'', could it be
>``attached'' to __LINE__, thus producing the value `2'? Obviously, the
>value `1' is reasonable as well. Does the Standard make the effect of
>such a statement implementation-defined?

I would argue (weakly) that it is to the start of the current token, and
thus 1. However, if you want to play safe you should assume it is to an
unspecified point between the start and end of the current token, so in
that case it would be 2.

[This was addressed in a UK Defect Report ages ago, but I forget what
the answer was.]

--
Clive D.W. Feather | Director of | Work: <cl...@demon.net>
Tel: +44 181 371 1138 | Software Development | Home: <cl...@davros.org>
Fax: +44 181 371 1037 | Demon Internet Ltd. | Web: <http://www.davros.org>
Written on my laptop; please observe the Reply-To address

David R Tribble

Jan 19, 1999, 3:00:00 AM
Daniel Villeneuve wrote:
>
> In the section 6.10.4, the line number is defined as follows:
> ``The line number of the current source line is one greater than the
> number of new-line characters read or introduced in translation phase
> 1 (5.1.1.2) while processing the source file to the current token.''
>
> In order to keep track of this number, the lexer has to identify the
> end of the current token (this was the subject of another discussion
> some while ago). Given that the line number of the first source line
> is 1, what should be the value of i in the following?
> int i = __LINE__\
> ;
>
> The backslash-newline pair being ``non-existent'', could it be
> ``attached'' to __LINE__, thus producing the value `2'? Obviously,
> the value `1' is reasonable as well. Does the Standard make the
> effect of such a statement implementation-defined?

I'm not sure, but I think the standard doesn't have much to say
about this.

I personally believe it's just as reasonable for a compiler to keep
track of the *beginning* of each token instead. (The lexers I
write do just that, and I haven't had any complaints.) It's
about the same amount of work for the lexer in either case.

With that in mind, what is the result of this code:

1: int i = __LI\
2: NE__;

The standard would probably classify this as implementation-defined,
if it says anything about it at all. (But clearly there are only
two valid answers: 'i' is either 1 or 2.)

-- David R. Tribble, dtri...@technologist.com --

Paul Eggert

Jan 20, 1999, 3:00:00 AM
Daniel Villeneuve <dan...@crt.umontreal.ca> writes:

>int i = __LINE__\
>;

>The backslash-newline pair being ``non-existent'', could it be
>``attached'' to __LINE__, thus producing the value `2'?

I don't see why not. The standard doesn't specify exactly what happens
here, so the implementation has some wiggle room. Similarly,

int i = __LINE__\
\
;

might cause i to have the value 1, 2, or 3.

Daniel Villeneuve

Jan 20, 1999, 3:00:00 AM
Clive D.W. Feather wrote:
> I would argue (weakly) that it is to the start of the current token, and
> thus 1.

This has the merit of being uniquely defined. However, my question
arose because, in a previous discussion which implied the use of
__LINE__ as an argument of a function-like macro, ...

Clive D.W. Feather wrote:
> In this case, I wasn't proposing a change to make it be standardised. I
> was pointing out that that's how I read the definition of __LINE__. Look
> at 6.8.4 paragraph 2 - the line number is based on the number of
> newlines up to the "current token". When processing a macro expansion,
> this surely is the ) at the end of the invocation.

So sometimes, it is expected that it is the _end_ of a construct that
defines the associated value of __LINE__.

Maybe it's simpler to agree on the start of the current token, and that
for function-like macros, this means the start of the identifier token
that introduces the macro.

Clive D.W. Feather

Jan 20, 1999, 3:00:00 AM
In article <36A4D7EF...@acm.org>, Pete Becker <peteb...@acm.org>
writes

>> int i = __LINE__\
>> ;

>The standard is quite clear: its value is 1. Read about "phases of
>translation", and note that the line number is generated after phase 1,
>but backslash newline is not processed until phase 2.

That's irrelevant. Labelling each character with its source line number
you get:

int i = __LINE__;
11111111111111112

The question is: is the line number of a token the line number of the
first character, the last character, the first character not part of the
token, or an unspecified character within the token ? Or what ?
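
The labelling above can be reproduced mechanically. The following is a minimal sketch, not taken from any real preprocessor: splice_with_lines is a hypothetical helper that removes each backslash-newline pair while remembering the physical line of every character that survives.

```c
#include <stddef.h>

/* Remove each backslash-newline pair from src, writing the surviving
   characters to out and the physical line of each one to lines.
   Returns the number of characters written; out is NUL-terminated.
   The caller supplies buffers of at least strlen(src) + 1 elements. */
size_t splice_with_lines(const char *src, char *out, int *lines)
{
    size_t n = 0;
    int line = 1;
    for (size_t i = 0; src[i] != '\0'; i++) {
        if (src[i] == '\\' && src[i + 1] == '\n') {
            i++;        /* skip the newline as well as the backslash */
            line++;     /* the newline still counts as a line break */
            continue;
        }
        out[n] = src[i];
        lines[n] = line;
        n++;
        if (src[i] == '\n')
            line++;
    }
    out[n] = '\0';
    return n;
}
```

For the input "int i = __LINE__\\\n;" the last character of the token is labelled 1 and the semicolon is labelled 2, which is exactly the table shown. Which of those labels __LINE__ should use is the open question.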

Pete Becker

Jan 20, 1999, 3:00:00 AM
Clive D.W. Feather wrote:
>
> In article <36A4D7EF...@acm.org>, Pete Becker <peteb...@acm.org>
> writes
>
> >> int i = __LINE__\
> >> ;
>
> >The standard is quite clear: its value is 1. Read about "phases of
> >translation", and note that the line number is generated after phase 1,
> >but backslash newline is not processed until phase 2.
>
> That's irrelevant. Labelling each character with its source line number
> you get:
>
> int i = __LINE__;
> 11111111111111112
>
> The question is: is the line number of a token the line number of the
> first character, the last character, the first character not part of the
> token, or an unspecified character within the token ? Or what ?

Since every character in the token occurs in line 1, how can its line
number possibly be anything other than 1?

Peter Seebach

Jan 21, 1999, 3:00:00 AM
In article <36A69211...@acm.org>,
Pete Becker <peteb...@acm.org> wrote:
>> int i = __LINE__;
>> 11111111111111112

>Since every character in the token occurs in line 1, how can its line
>number possibly be anything other than 1?

blah blah blah
... \n ...
++line
...
...
end of token
unpush character
...
val = line;

:)

It's a real question. No one has ever adequately explained how __LINE__
works, and I believe the committee has decided it's not worth deciding.
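
The pseudocode above can be made concrete. This is a sketch of one naive design, with invented names (struct reader, next_char, naive_identifier_line); it is not a description of any actual compiler. The line counter is bumped whenever a newline is consumed, including one inside a backslash-newline pair, and the token's line is taken from the counter only after the lookahead character has shown that the identifier is over.

```c
#include <ctype.h>

/* Hypothetical reader state: a position in the text and a running line count. */
struct reader { const char *p; int line; };

/* Return the next character after splicing backslash-newline pairs,
   bumping the line counter for every newline consumed. Returns 0 at end. */
static int next_char(struct reader *r)
{
    while (r->p[0] == '\\' && r->p[1] == '\n') {
        r->p += 2;
        r->line++;
    }
    if (*r->p == '\0')
        return 0;
    int c = (unsigned char)*r->p++;
    if (c == '\n')
        r->line++;
    return c;
}

/* Scan to the first identifier and report the counter's value at the
   moment a character that cannot extend the identifier has been read. */
int naive_identifier_line(const char *src)
{
    struct reader r = { src, 1 };
    int c = next_char(&r);
    while (c != 0 && !(isalpha(c) || c == '_'))
        c = next_char(&r);
    while (c != 0 && (isalnum(c) || c == '_'))
        c = next_char(&r);
    /* c is the lookahead; a real lexer would push it back here, but the
       line counter has already moved past any spliced newlines. */
    return r.line;
}
```

With this design, "__LINE__;" yields 1 and "__LINE__\\\n;" yields 2. It also yields 2 for "__LINE__\n", because the real newline is consumed as lookahead in the same way.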

-s
--
Copyright 1999, All rights reserved. Peter Seebach / se...@plethora.net
C/Unix wizard, Pro-commerce radical, Spam fighter. Boycott Spamazon!
Send me money - get cool programs and hardware! No commuting, please.
Visit my new ISP <URL:http://www.plethora.net/> --- More Net, Less Spam!

Clive D.W. Feather

Jan 21, 1999, 3:00:00 AM
In article <36A5ED25...@crt.umontreal.ca>, Daniel Villeneuve
<dan...@crt.umontreal.ca> writes
>Clive D.W. Feather wrote:
[something]

>This has the merit of being uniquely defined. However, my question
>arose because, in a previous discussion which implied the use of
>__LINE__ as an argument of a function-like macro, ...
>
>Clive D.W. Feather wrote:

[something inconsistent]

>Maybe it's simpler to agree on the start of the current token, and that
>for function-like macros, this means the start of the identifier token
>that introduces the macro.

I suspect that the best answer is to say that it is the line number of
some unspecified point between the first character of the token and the
character immediately following the token.

Clive D.W. Feather

Jan 21, 1999, 3:00:00 AM
In article <dizp2.228$Oi....@ptah.visi.com>, Peter Seebach
<se...@plethora.net> writes

>It's a real question. No one has ever adequately explained how __LINE__
>works, and I believe the committee has decided it's not worth deciding.

It's not even that: the DR that asked the question never got answered.

Pete Becker

Jan 21, 1999, 3:00:00 AM
Peter Seebach wrote:
>
> In article <36A69211...@acm.org>,
> Pete Becker <peteb...@acm.org> wrote:
> >> int i = __LINE__;
> >> 11111111111111112
>
> >Since every character in the token occurs in line 1, how can its line
> >number possibly be anything other than 1?
>
> blah blah blah
> ... \n ...
> ++line
> ...
> ...
> end of token
> unpush character
> ...
> val = line;
>
> :)
>
> It's a real question. No one has ever adequately explained how __LINE__
> works, and I believe the committee has decided it's not worth deciding.

I don't understand the point of this example. Yes, if a token begins on
one line and ends on another, there's an ambiguity about which line it's
on. That's not the case in the example that I responded to: every
character in __LINE__ is on the same line. Is this supposed to
illustrate that a naive implementation can get this wrong? If so, that's
not particularly relevant. What is the issue here?

Larry Jones

Jan 21, 1999, 3:00:00 AM
Pete Becker wrote:
>
> > >> int i = __LINE__\
> > >> ;

>
> Since every character in the token occurs in line 1, how can its line
> number possibly be anything other than 1?

Because the implementation doesn't know it's at the end of the token
until it's on line 2.

-Larry Jones

I suppose if I had two X chromosomes, I'd feel hostile too. -- Calvin

Peter Seebach

Jan 21, 1999, 3:00:00 AM
In article <xkwFdkMc...@romana.davros.org>,
Clive D.W. Feather <cl...@davros.org> wrote:
>In article <dizp2.228$Oi....@ptah.visi.com>, Peter Seebach
><se...@plethora.net> writes
>>It's a real question. No one has ever adequately explained how __LINE__
>>works, and I believe the committee has decided it's not worth deciding.

>It's not even that: the DR that asked the question never got answered.

Yes. The committee decided that it was not worth answering. In a vote.
I think it may even have been a formal vote, but it may have just been
a show of hands.

Peter Seebach

Jan 21, 1999, 3:00:00 AM
In article <36A72523...@acm.org>,
Pete Becker <peteb...@acm.org> wrote:
>I don't understand the point of this example. Yes, if a token begins on
>one line and ends on another, there's an ambiguity about which line it's
>on. That's not the case in the example that I responded to: every
>character in __LINE__ is on the same line. Is this supposed to
>illustrate that a naive implementation can get this wrong? If so, that's
>not particularly relevant. What is the issue here?

My point is that there's nothing unlikely about counting the lines first,
and *then* noticing that the next character (which was on the next line)
was really no longer part of this token.

Basically, given
__LINE__\
;
you can't tell, until you've seen the newline, that you're done with __LINE__,
so you're on line 2 when you definitely see that you're done with the token,
so I'd consider that a reasonable answer to the question "what line were you
on when you saw this token".

David R Tribble

Jan 21, 1999, 3:00:00 AM
Clive D.W. Feather wrote:
> I suspect that the best answer is to say that it is the line number of
> some unspecified point between the first character of the token and
> the character immediately following the token.

This sounds reasonable, and it probably covers most existing
implementations.

Thus this fragment:

1: int i =\
2: \
3: __L\
4: IN\
5: E__\
6: \
7: ;

which results in these characters and source lines:

int i =__LINE__;
1111111333445557

would result in one of the following acceptable values of 'i':
3, 4, 5, 6, or 7.
I consider any other values as broken.

Clive D.W. Feather

Jan 21, 1999, 3:00:00 AM
In article <eULp2.16$NN....@ptah.visi.com>, Peter Seebach
<se...@plethora.net> writes

>>It's not even that: the DR that asked the question never got answered.
>Yes. The committee decided that it was not worth answering. In a vote.
>I think it may even have been a formal vote, but it may have just been
>a show of hands.

You might be right, though I don't remember it.

On the other hand, I note that none of DRs 173 to 178 ever got answered
officially, though it looks like the other 5 have all been solved in C9X.

James Kuyper

Jan 21, 1999, 3:00:00 AM
Pete Becker wrote:
>
> Peter Seebach wrote:
> >
> > In article <36A69211...@acm.org>,
> > Pete Becker <peteb...@acm.org> wrote:
> > >> int i = __LINE__;
> > >> 11111111111111112
> >
> > >Since every character in the token occurs in line 1, how can its line
> > >number possibly be anything other than 1?
> >
> > blah blah blah
> > ... \n ...
> > ++line
> > ...
> > ...
> > end of token
> > unpush character
> > ...
> > val = line;
> >
> > :)
> >
> > It's a real question. No one has ever adequately explained how __LINE__
> > works, and I believe the committee has decided it's not worth deciding.
>
> I don't understand the point of this example. Yes, if a token begins on
> one line and ends on another, there's an ambiguity about which line it's
> on. That's not the case in the example that I responded to: every
> character in __LINE__ is on the same line. Is this supposed to
> illustrate that a naive implementation can get this wrong? If so, that's
> not particularly relevant. What is the issue here?

issue 1: does the standard really specify that this naive implementation
is wrong? I'm agnostic on that issue.

issue 2: should the standard be so specific that it does prohibit this
naive implementation? I can't come up with a good argument why __LINE__
needs to be that well defined. AFAIK, its main use is as a debugging
aid, intended to point the programmer at the right piece of code. Anyone
who writes code such that __LINE__'s value leaves them unclear about the
place at which it was evaluated, deserves what they get.
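
For context, the debugging use mentioned above usually looks something like the following. This is a generic sketch; format_where and WHERE are names made up here, not part of any library.

```c
#include <stdio.h>

/* Format "file:line: message" into buf; returns snprintf's result. */
int format_where(char *buf, size_t size, const char *file, int line,
                 const char *msg)
{
    return snprintf(buf, size, "%s:%d: %s", file, line, msg);
}

/* WHERE expands __FILE__ and __LINE__ at the point of use, so the
   message names the source line of the call, not of this definition. */
#define WHERE(buf, size, msg) \
    format_where((buf), (size), __FILE__, __LINE__, (msg))
```

As long as each use of WHERE sits on a single physical line, every reading of the line-number rule discussed in this thread gives the same result.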

Douglas A. Gwyn

Jan 22, 1999, 3:00:00 AM
"Clive D.W. Feather" wrote:
> [This was addressed in a UK Defect Report ages ago, but I forget what
> the answer was.]

When the __LINE__ token occurs entirely *on* one line,
that is obviously the appropriate line number.
When __LINE__ spans multiple lines (via \new-line etc.),
as I recall we didn't care to specify which of the
possibilities it had to be, since it doesn't matter in practice.

Dennis Ritchie

Jan 22, 1999, 3:00:00 AM
For what it's worth, my cpp (the one packaged with lcc,
and used on Plan 9 when Ken's built-in and deliberately
enfeebled version won't do) treats Tribble's example

>
> 1: int i =\
> 2: \
> 3: __L\
> 4: IN\
> 5: E__\
> 6: \
> 7: ;
>
as expanding to
int i =1;

It is resolute in considering the line number of
\-pasted things as belonging to the first line with
the \. (In trying the example, I trimmed the initial
[0-9]: and white-space, of course.)

Dennis

Andreas Schwab

Jan 22, 1999, 3:00:00 AM

In which way is this:

int i = __LINE__\
;

different from this:

int i =\
\
__L\
IN\
E__\
\
;

(GCC emits `int i = 2' for the first and `int i = 7;' for the second).
Although in the first example __LINE__ occurs entirely on one line the
compiler still has to look ahead beyond \ to find the end of the token.

--
Andreas Schwab "And now for something
sch...@issan.cs.uni-dortmund.de completely different"
sch...@gnu.org

Paul Eggert

Jan 22, 1999, 3:00:00 AM
Andreas Schwab <sch...@issan.cs.uni-dortmund.de> writes:
>"Douglas A. Gwyn" <DAG...@null.net> writes:
>|> When the __LINE__ token occurs entirely *on* one line,
>|> that is obviously the appropriate line number.

>int i = __LINE__\
>;

>... GCC emits `int i = 2'

Conversely, for

int i = \
__LINE__;

Ritchie's preprocessor presumably emits `int i = 1;'.
So we have two practical counterexamples to Gwyn's suggestion.

I think both implementations conform to the standard
in this amusingly trivial matter.

John Hauser

Jan 22, 1999, 3:00:00 AM
Dennis Ritchie wrote:
> For what it's worth, my cpp [...]

> is resolute in considering the line number of
> \-pasted things as belonging to the first line with
> the \.

That's what the preprocessor I wrote does, too. It's become clear that
I'm going to have to change it, and the change will involve keeping
additional line number information around just to satisfy this detail
for `__LINE__'. I hope it's going to be worth the trouble.

- John Hauser

John Hauser

Jan 22, 1999, 3:00:00 AM
Paul Eggert wrote:

>
> Andreas Schwab wrote:
> >
> >int i = __LINE__\
> >;
> >
> >... GCC emits `int i = 2'
>
> Conversely, for
>
> int i = \
> __LINE__;
>
> Ritchie's preprocessor presumably emits `int i = 1;'.

> I think both implementations conform to the standard
> in this amusingly trivial matter.

I don't think so. Section 6.8.8 states that `__LINE__' expands to
``the line number of the current source line'', and Section 6.8.4
defines the line number of the current line as ``one greater than the
number of new-line characters read or introduced in translation phase 1
while processing the source file to the current token''. There is
clearly 1 new-line before the `__LINE__' token in the second example
above, so `__LINE__' cannot validly expand to `1'; it has to be
at least `2', and in this case exactly `2', surely. So Ritchie's
preprocessor (and my own, too, as I've noted) is not conforming in this
regard.
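
The counting argument can be written out literally. The sketch below assumes the reading used in this post, namely that "to the current token" means up to the token's first character, and it assumes the token itself is not split by a backslash-newline. The function name line_by_newline_count is invented for illustration.

```c
#include <string.h>

/* One greater than the number of new-line characters that precede the
   first character of `token` in `src`; 0 if the token is not found.
   Assumes the token is not itself split by a backslash-newline. */
int line_by_newline_count(const char *src, const char *token)
{
    const char *hit = strstr(src, token);
    if (hit == NULL)
        return 0;
    int line = 1;
    for (const char *p = src; p < hit; p++)
        if (*p == '\n')
            line++;
    return line;
}
```

Under that reading, "int i = \\\n__LINE__;" gives 2 and "int i = __LINE__\\\n;" gives 1. A reading that counts up to the end of the token, or to the character after it, would change only the second result.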

- John Hauser

Douglas A. Gwyn

Jan 23, 1999, 3:00:00 AM
Andreas Schwab wrote:
> int i = __LINE__\
> ;
> (GCC emits `int i = 2' ...)

That demonstrates that lookahead needs to track line number
*and so does pushback of an unused lookahead*. GCC has it wrong.

Pete Becker

Jan 23, 1999, 3:00:00 AM
Peter Seebach wrote:
>
>
> My point is that there's nothing unlikely about counting the lines first,
> and *then* noticing that the next character (which was on the next line)
> was really no longer part of this token.

That's a legitimate rationalization, but I don't see any words in the
standard that support it.

>
> Basically, given
> __LINE__\
> ;
> you can't tell, until you've seen the newline, that you're done with __LINE__,
> so you're on line 2 when you definitely see that you're done with the token,
> so I'd consider that a reasonable answer to the question "what line were you
> on when you saw this token".

My copy of the standard says that __LINE__ is "the line number of the
current source line", not "the line number of the current source line,
or maybe one more, depending on whether it's followed by an escaped
newline." <g>

I agree that this isn't an earth-shattering issue from the perspective
of standards conformance, but it is distressing to read what look like
clear words in the standard, and to be told that they don't mean what
they say, and get such a vague explanation.

Peter Seebach

Jan 23, 1999, 3:00:00 AM
In article <36A962FD...@cs.berkeley.edu>,
John Hauser <jha...@cs.berkeley.edu> wrote:
>That's what the preprocessor I wrote does, too. It's become clear that
>I'm going to have to change it, and the change will involve keeping
>additional line number information around just to satisfy this detail
>for `__LINE__'. I hope it's going to be worth the trouble.

I'm not convinced that your decision is wrong... But on the other hand,
it seems to me that in

/* line 1 */
int \
j; /* line 3 */
int i = __LINE__; /* line 4 */

i has to be 4. So, you have to count those backslash-newlines *somewhere*.

On the other hand, if you said
/* line 1 */
i\
nt i = __LINE__\
; int j = __LINE__;

I would think anything from 2-4 would be okay for i, and j would be 4.
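
The uncontroversial half of this, that spliced lines earlier in the file must still be counted, can be checked with a probe along these lines. The names anchor_line, later_line and lines_after_splice are illustrative only.

```c
static const int anchor_line = __LINE__;  /* physical line N */
static const int \
j = 0;                                    /* one declaration spliced over N+1 and N+2 */
static const int later_line = __LINE__;   /* physical line N+3 */

/* Distance in physical lines between the two __LINE__ tokens. */
int lines_after_splice(void)
{
    (void)j;
    return later_line - anchor_line;
}
```

Both __LINE__ tokens here sit wholly on one line with nothing spliced next to them, so the difference should be 3 on any implementation that counts the backslash-newline somewhere.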

Peter Seebach

Jan 23, 1999, 3:00:00 AM
In article <36A9CE8E...@acm.org>,
Pete Becker <peteb...@acm.org> wrote:
>Peter Seebach wrote:
>> My point is that there's nothing unlikely about counting the lines first,
>> and *then* noticing that the next character (which was on the next line)
>> was really no longer part of this token.

>That's a legitimate rationalization, but I don't see any words in the
>standard that support it.

I also don't see anything contradicting it.

Ugh.

I just realized,

[someone quoted]

>one greater than the
>number of new-line characters read or introduced in translation phase 1
>while processing the source file to the current token

But since you're allowed to do all of TP1, then all of TP2, etcetera...

You could make the case that either *every* newline in the file was "read or
introduced in translation phase 1" while processing the source, or *none*
were. More likely "all". Because you processed them all in TP1, then you
started TP2, and thus, to process to the current token, you read 'em all.

>My copy of the standard says that __LINE__ is "the line number of the
>current source line", not "the line number of the current source line,
>or maybe one more, depending on whether it's followed by an escaped
>newline." <g>

The problem is "what's current"? Is current where you were "inside" the
token, or where you were when you got confirmation that you had a token?

>I agree that this isn't an earth-shattering issue from the perspective
>of standards conformance, but it is distressing to read what look like
>clear words in the standard, and to be told that they don't mean what
>they say, and get such a vague explanation.

Here's my thinking:

int i = __LINE_\
_;

it's fairly clear that either of these lines could be considered "current".
My claim is that, when the 'cursor' is just *past* a token, that's also
"current". Or could be, legitimately.

Pete Becker

Jan 24, 1999, 3:00:00 AM
Peter Seebach wrote:
>
> Here's my thinking:
>
> int i = __LINE_\
> _;
>
> it's fairly clear that either of these lines could be considered "current".
> My claim is that, when the 'cursor' is just *past* a token, that's also
> "current". Or could be, legitimately.

Well, I think that's a stretch, influenced by knowing how compilers are
often written. I don't think a programmer who hasn't worked with the
innards of compilers would read it that way.

David R Tribble

Jan 25, 1999, 3:00:00 AM
John Hauser wrote:
>
> Dennis Ritchie wrote:
> > For what it's worth, my cpp [...]
> > is resolute in considering the line number of
> > \-pasted things as belonging to the first line with
> > the \.
>
> That's what the preprocessor I wrote does, too. It's become clear
> that I'm going to have to change it, and the change will involve
> keeping additional line number information around just to satisfy this
> detail for `__LINE__'. I hope it's going to be worth the trouble.

FWIW, the preprocessors I write keep track of the beginning of each
token. They do this by calling a low-level "getchar" function that
returns the character code along with its (physical) line number
(and column position and include-file index as well). I stick the
position info of the first character of a new token into the token
info before I collect characters for the rest of the token.
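
A minimal sketch of that arrangement follows. It is a guess at the shape of such a design, not the actual code: the types located_char and source and the functions get_located and identifier_start_line are hypothetical, only the line number is tracked (no column or include-file index), and pp-numbers, strings and comments are ignored.

```c
#include <ctype.h>
#include <string.h>

/* A character paired with the physical line it was read from. */
struct located_char { int ch; int line; };

struct source { const char *p; int line; };

/* Low-level reader: splices backslash-newline pairs and returns each
   surviving character with its own physical line. ch is 0 at end. */
static struct located_char get_located(struct source *s)
{
    struct located_char lc;
    while (s->p[0] == '\\' && s->p[1] == '\n') {
        s->p += 2;
        s->line++;
    }
    lc.line = s->line;
    lc.ch = (unsigned char)*s->p;
    if (lc.ch != 0) {
        s->p++;
        if (lc.ch == '\n')
            s->line++;
    }
    return lc;
}

/* Return the line of the first character of the first identifier
   spelled `name`, or 0 if it does not occur. */
int identifier_start_line(const char *src, const char *name)
{
    struct source s = { src, 1 };
    struct located_char lc = get_located(&s);
    while (lc.ch != 0) {
        if (isalpha(lc.ch) || lc.ch == '_') {
            char text[64];
            size_t n = 0;
            int start = lc.line;   /* stamp before collecting the rest */
            while (isalnum(lc.ch) || lc.ch == '_') {
                if (n + 1 < sizeof text)
                    text[n++] = (char)lc.ch;
                lc = get_located(&s);
            }
            text[n] = '\0';
            if (strcmp(text, name) == 0)
                return start;
        } else {
            lc = get_located(&s);
        }
    }
    return 0;
}
```

On the seven-line fragment posted earlier in the thread, with `__LINE__` split across lines 3 to 5, this reports 3, because the stamp is taken from the first character of the token and lookahead never changes it.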

Obviously, this is only one approach; it's perfectly reasonable
for a token's line number to be the position of its last character.
I would argue, though, that it's misguided to associate a line
number with a token in which none of the characters of the token
actually appear on that line. But then there's the special case
of splicing lines together separated by \-newlines in a separate
pass, which would seem to make it okay to treat such meta-lines
as a single source line (just as long as the next line resumes
with the correct line number, such as by generating extra newlines
in the preprocessed output).

But I wouldn't sweat it; it's a minor issue anyway. It's something
that's best designed in from the beginning, before you write your
lexer, rather than adding later.

And don't forget the other issues that our lexers must deal with
if they are going to properly handle C9X-compliant (and C++
compliant) source. Things like trigraphs, digraphs, UCNs,
alternate punctuation keywords, wide characters, and hex float
literals, to name a few.

Douglas A. Gwyn

Jan 26, 1999, 3:00:00 AM
David R Tribble wrote:
> I would argue, though, that it's misguided to associate a line
> number with a token in which none of the characters of the token
> actually appear on that line.

I agree. The arguments about lookahead etc. could just as well
be applied to
__LINE__
as to
__LINE__\
We sure didn't want the first case to expand to a line number
other than that of the line that the token is embedded within.

Larry Jones

Jan 26, 1999, 3:00:00 AM
Douglas A. Gwyn wrote:
>
> I agree. The arguments about lookahead etc. could just as well
> be applied to
> __LINE__
> as to
> __LINE__\

I disagree. In the first case, the token clearly ends before the
newline. In the second case, the backslash-newline does *not* end
the token, you have to look at the first character on the next line
before you know whether you've got an entire token or not, and thus
the token ends on the next line even if no characters from that line
are actually part of it.

> We sure didn't want the first case to expand to a line number
> other than that of the line that the token is embedded within.

That, I'll certainly agree with.

-Larry Jones

I don't want to be THIS good! -- Calvin

Douglas A. Gwyn

Jan 28, 1999, 3:00:00 AM
Larry Jones wrote:
> Douglas A. Gwyn wrote:
> > ... The arguments about lookahead etc. could just as well
> > be applied to
> > __LINE__
> > as to
> > __LINE__\
> I disagree. In the first case, the token clearly ends before the
> newline. In the second case, the backslash-newline does *not* end
> the token, you have to look at the first character on the next line
> before you know whether you've got an entire token or not, and thus
> the token ends on the next line even if no characters from that line
> are actually part of it.

You (the LR parser) don't know that the first token has ended *until
after you have read the newline*.

Pete Becker

Jan 28, 1999, 3:00:00 AM

My copy of the standard defines the meaning of __LINE__ in terms of the
source line on which it occurs, not when an LR parser might recognize
that the end of a token occurs. Does yours say something different?

Larry Jones

Jan 28, 1999, 3:00:00 AM
Douglas A. Gwyn wrote:
>
> Larry Jones wrote:
> > Douglas A. Gwyn wrote:
> > > ... The arguments about lookahead etc. could just as well
> > > be applied to
> > > __LINE__
> > > as to
> > > __LINE__\
> > I disagree. In the first case, the token clearly ends before the
> > newline. In the second case, the backslash-newline does *not* end
> > the token, you have to look at the first character on the next line
> > before you know whether you've got an entire token or not, and thus
> > the token ends on the next line even if no characters from that line
> > are actually part of it.
>
> You (the LR parser) don't know that the first token has ended *until
> after you have read the newline*.

Yes, but it's easy enough to delay processing that newline until after
you've associated the current line number with the token. In the other
case, you have to read the backslash, the newline, and at least one
following character (more if that character is also a backslash), and
it *isn't* easy to delay processing an arbitrarily long sequence of
characters.

-Larry Jones

OK, what's the NEXT amendment say? I know it's in here someplace. --
Calvin

Pete Becker

Jan 28, 1999, 3:00:00 AM

But there is no "arbitrarily long sequence of characters" here. There's
only backslash followed by an escaped character. Once it's determined
that this can't be part of __LINE__ you know you're at the end of the
token.

Clive D.W. Feather

Jan 28, 1999, 3:00:00 AM
In article <36ACC850...@technologist.com>, David R Tribble
<dtri...@technologist.com> writes

>And don't forget the other issues that our lexers must deal with
>if they are going to properly handle C9X-compliant (and C++
>compliant) source. Things like trigraphs,

were in C89

>digraphs,

just new tokens for the list in the lexer's source code

>UCNs,

hmm

>alternate punctuation keywords,

what ?

>wide characters,

were in C89 (though "xxx" L"yyy" is new, I'll admit)

>and hex float
>literals,

New, I agree.

Clive D.W. Feather

Jan 28, 1999, 3:00:00 AM
In article <36B0849B...@sdrc.com>, Larry Jones
<larry...@sdrc.com> writes

[...]
>> > > __LINE__
and
>> > > __LINE__\

>> You (the LR parser) don't know that the first token has ended *until
>> after you have read the newline*.
>
>Yes, but it's easy enough to delay processing that newline until after
>you've associated the current line number with the token. In the other
>case, you have to read the backslash, the newline, and at least one
>following character (more if that character is also a backslash), and
>it *isn't* easy to delay processing an arbitrarily long sequence of
>characters.

Furthermore, in the first case the newline has survived to the relevant
phase of translation. In the second it's already disappeared.

Clive D.W. Feather

Jan 28, 1999, 3:00:00 AM
In article <36B0A199...@acm.org>, Pete Becker <peteb...@acm.org>
writes

>But there is no "arbitrarily long sequence of characters" here. There's
>only backslash followed by an escaped character. Once it's determined
>that this can't be part of __LINE__ you know you're at the end of the
>token.

__LINE__\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
TWO__

[How long is your lexer input buffer ?]

David R Tribble

Jan 28, 1999, 3:00:00 AM
"Clive D.W. Feather" wrote:
>
> David R Tribble <dtri...@technologist.com> writes
> >And don't forget the other issues that our lexers must deal with
> >if they are going to properly handle C9X-compliant (and C++
> >compliant) source. Things like trigraphs,
...

> >alternate punctuation keywords,
>
> what ?

These are the alternate spellings found in <iso646.h> (such as 'and'
and 'or'). They aren't a problem in C (C89 or C9X) since they're
just macros, but they are reserved/predefined keywords in C++
(which are meaningful in the preprocessor phase and beyond).
Which means that they must be dealt with (at some level) if your
lexer is used for both C and C++.

David R Tribble

Jan 28, 1999, 3:00:00 AM
"Douglas A. Gwyn" wrote:
>
> Larry Jones wrote:
> > Douglas A. Gwyn wrote:
> > > ... The arguments about lookahead etc. could just as well
> > > be applied to
> > > __LINE__
> > > as to
> > > __LINE__\
> > I disagree. In the first case, the token clearly ends before the
> > newline. In the second case, the backslash-newline does *not* end
> > the token, you have to look at the first character on the next line
> > before you know whether you've got an entire token or not, and thus
> > the token ends on the next line even if no characters from that line
> > are actually part of it.
>
> You (the LR parser) don't know that the first token has ended *until
> after you have read the newline*.

But is there a difficulty in saving the source line number for the
last non-white (and non-backslash-newline) source character that was
read? Once you're past the backslash-newline(s), you know the line
number of the last (non-white) character of the token, right?

Pete Becker

Jan 28, 1999, 3:00:00 AM
Clive D.W. Feather wrote:
>
> In article <36B0A199...@acm.org>, Pete Becker <peteb...@acm.org>
> writes
> >But there is no "arbitrarily long sequence of characters" here. There's
> >only backslash followed by an escaped character. Once it's determined
> >that this can't be part of __LINE__ you know you're at the end of the
> >token.
>
> __LINE__\
> \
> \
> \
> \

You're right, of course. But it's still a false issue: it's simple to
keep track of the line number where the backslash-newline sequence
began. The length of the sequence doesn't matter.

Paul Eggert

Jan 28, 1999, 3:00:00 AM
"Clive D.W. Feather" <cl...@on-the-train.demon.co.uk> writes:

>>digraphs,

>just new tokens for the list in the lexer's source code

That depends on the implementation. For a character-based
preprocessor, digraphs can take quite a bit more work than that.

For example, when I added digraph support to GCC's preprocessor, I used
the fact that `%:' is not a digraph if preceded by an odd number of
`<'s, because the code scans backwards at that point! The preprocessor
normally needn't do anything special about `<' when expanding macros,
but it must do so in the presence of digraphs, and it's more efficient
for it to worry about this only when a potential digraph is discovered
than to worry about it whenever `<' is discovered. Therefore GCC's
preprocessor scans backwards through the input when %: is discovered,
counting `<'s as it goes, to see whether the %: is really a #.

This is not the only bit of hairy digraph code that appears in the GCC
preprocessor. I admit that my thoughts at the time were less than kind
about the people on the standardization committee who foisted digraphs
on the rest of us. I wouldn't mind so much if digraphs were actually
used in practice, but they aren't.

mar...@my-dejanews.com

Jan 29, 1999, 3:00:00 AM
In article <36B06692...@null.net>,

"Douglas A. Gwyn" <DAG...@null.net> wrote:
> Larry Jones wrote:
> > Douglas A. Gwyn wrote:
> > > ... The arguments about lookahead etc. could just as well
> > > be applied to
> > > __LINE__
> > > as to
> > > __LINE__\
> > I disagree. In the first case, the token clearly ends before the
> > newline. In the second case, the backslash-newline does *not* end
> > the token, you have to look at the first character on the next line
> > before you know whether you've got an entire token or not, and thus
> > the token ends on the next line even if no characters from that line
> > are actually part of it.
>
> You (the LR parser) don't know that the first token has ended *until
> after you have read the newline*.

Yes, but the newline character is on the same line as the token. You have not
yet read any characters from the _next_ line. In the second case you have
actually read a character from the next line before you know you are done.

Mark Williams


mar...@my-dejanews.com

Jan 29, 1999, 3:00:00 AM
In article <36B0A199...@acm.org>,

Pete Becker <peteb...@acm.org> wrote:
> Larry Jones wrote:
> >
> > Douglas A. Gwyn wrote:
> > >
> > > Larry Jones wrote:
> > > > Douglas A. Gwyn wrote:
> > > > > ... The arguments about lookahead etc. could just as well
> > > > > be applied to
> > > > > __LINE__
> > > > > as to
> > > > > __LINE__\
> > > > I disagree. In the first case, the token clearly ends before the
> > > > newline. In the second case, the backslash-newline does *not* end
> > > > the token, you have to look at the first character on the next line
> > > > before you know whether you've got an entire token or not, and thus
> > > > the token ends on the next line even if no characters from that line
> > > > are actually part of it.
> > >
> > > You (the LR parser) don't know that the first token has ended *until
> > > after you have read the newline*.
> >
> > Yes, but it's easy enough to delay processing that newline until after
> > you've associated the current line number with the token. In the other
> > case, you have to read the backslash, the newline, and at least one
> > following character (more if that character is also a backslash), and
> > it *isn't* easy to delay processing an arbitrarily long sequence of
> > characters.
>
> But there is no "arbitrarily long sequence of characters" here. There's
> only backslash followed by an escaped character. Once it's determined
> that this can't be part of __LINE__ you know you're at the end of the
> token.

__LINE__\
\
\
\
\
\
\
\
\
\
\


I could go on :-)

Clive D.W. Feather

Jan 29, 1999, 3:00:00 AM
In article <36B1090F...@technologist.com>, David R Tribble
<dtri...@technologist.com> writes

>> You (the LR parser) don't know that the first token has ended *until
>> after you have read the newline*.
>
>But is there a difficulty in saving the source line number for the
>last non-white (and non-backslash-newline) source character that was
>read? Once you're past the backslash-newline(s), you know the line
>number of the last (non-white) character of the token, right?

There's not a particular difficulty (I think), but the question is what
behaviours should and should not be allowed.

Douglas A. Gwyn

Jan 29, 1999, 3:00:00 AM
Pete Becker wrote:
> My copy of the standard defines the meaning of __LINE__ in terms of the
> source line on which it occurs, not when an LR parser might recognize
> that the end of a token occurs. Does yours say something different?

Please don't jump into a conversation that you have not been tracking.

Clive D.W. Feather

Jan 29, 1999, 3:00:00 AM
In article <36B1083D...@technologist.com>, David R Tribble
<dtri...@technologist.com> writes

>>>alternate punctuation keywords,
>> what ?
>These are the alternate spellings found in <iso646.h> (such as 'and'
>and 'or'). They aren't a problem in C (C89 or C9X) since they're
>just macros, but they are reserved/predefined keywords in C++

Ow. I forgot that particular gratuitous C++ change.

Clive D.W. Feather

Jan 29, 1999, 3:00:00 AM
In article <78rnqq$bor$1...@shade.twinsun.com>, Paul Eggert
<egg...@twinsun.com> writes

>For example, when I added digraph support to GCC's preprocessor, I used
>the fact that `%:' is not a digraph if preceded by an odd number of
>`<'s, because the code scans backwards at that point!

Yuk. What's wrong with maximal munch in the forward direction ?

> The preprocessor
>normally needn't do anything special about `<' when expanding macros,
>but it must do so in the presence of digraphs, and it's more efficient
>for it to worry about this only when a potential digraph is discovered
>than to worry about it whenever `<' is discovered.

I don't even understand this comment. Can you please give an example ?

>This is not the only bit of hairy digraph code that appears in the GCC
>preprocessor. I admit that my thoughts at the time were less than kind
>about the people on the standardization committee who foisted digraphs
>on the rest of us.

You aren't the only one. [My personal take on their history is: "Denmark
forced them on us, USA refused to fight".]

Douglas A. Gwyn

Jan 29, 1999, 3:00:00 AM
David R Tribble wrote:
> But is there a difficulty in saving the source line number for the
> last non-white (and non-backslash-newline) source character that was
> read? Once you're past the backslash-newline(s), you know the line
> number of the last (non-white) character of the token, right?

In effect, that's what the lexer has to do to get it right.

    ...
    if ( alphanumeric( c = getnext() ) )
        token_line_number = line_number;
    ...

instead of just

    c = getnext();

somewhere deep inside the parser.
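
A minimal sketch of that bookkeeping, written out as compilable C. Everything here is an assumption made for illustration: the `Lexer` struct, `get_next`, and `lex_identifier` are hypothetical names, the source is taken to be a NUL-terminated buffer in memory, and only identifiers are handled. It is not code from any real compiler.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical reader state: a NUL-terminated source buffer, a read
   position, and the physical line of that position (first line is 1). */
typedef struct {
    const char *src;
    size_t pos;
    int line_number;
} Lexer;

/* Return the next character after deleting any backslash-newline pairs,
   counting every newline that is consumed along the way. */
static int get_next(Lexer *lx)
{
    while (lx->src[lx->pos] == '\\' && lx->src[lx->pos + 1] == '\n') {
        lx->pos += 2;
        lx->line_number++;
    }
    int c = (unsigned char)lx->src[lx->pos];
    if (c != '\0')
        lx->pos++;
    if (c == '\n')
        lx->line_number++;
    return c;
}

/* Scan an identifier into buf and return the line number of its last
   character, or 0 if no identifier starts at the current position. */
static int lex_identifier(Lexer *lx, char *buf, size_t cap)
{
    size_t n = 0;
    int token_line_number = 0;

    for (;;) {
        Lexer saved = *lx;          /* lets us push the lookahead back */
        int c = get_next(lx);
        if (c != '_' && !isalnum(c)) {
            *lx = saved;            /* un-read the character and any splices */
            break;
        }
        if (n + 1 < cap)
            buf[n++] = (char)c;
        /* Record the line only when a character joins the token. */
        token_line_number = lx->line_number;
    }
    if (cap > 0)
        buf[n] = '\0';
    return token_line_number;
}
```

With the fragment from the original question (`int i = __LINE__\` followed by `;` on the next line) and the lexer positioned at `__LINE__`, this sketch reports line 1, because the line number is captured when the last identifier character is accepted, not when the lookahead past the backslash-newline is read.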

Douglas A. Gwyn

Jan 29, 1999, 3:00:00 AM
Paul Eggert wrote:
> ... I admit that my thoughts at the time were less than kind
> about the people on the standardization committee who foisted digraphs
> on the rest of us. I wouldn't mind so much if digraphs were actually
> used in practice, but they aren't.

How do you know that they aren't?
Both trigraphs and digraphs were forced on us to obtain the support
of the Danish delegation to WG14, who insisted that they had a lot of
keyboards that didn't provide any convenient way to enter certain C
source characters.

Paul Eggert

Jan 29, 1999, 3:00:00 AM
"Clive D.W. Feather" <cl...@on-the-train.demon.co.uk> writes:

>Yuk. What's wrong with maximal munch in the forward direction ?

Nothing's _incorrect_ about maximal munch. It's just harder to write,
and slower, that's all.

>In article <78rnqq$bor$1...@shade.twinsun.com>, Paul Eggert

>> The preprocessor
>>normally needn't do anything special about `<' when expanding macros,
>>but it must do so in the presence of digraphs, and it's more efficient
>>for it to worry about this only when a potential digraph is discovered
>>than to worry about it whenever `<' is discovered.

>I don't even understand this comment. Can you please give an example ?

Normally, when the preprocessor is processing a macro, it just copies
the definiens, looking for identifiers and `#'; the other characters
can just be copied through as-is, without tokenization. I'm omitting
details about whitespace, backslash-newline, strings, trigraphs,
multibyte characters, and so on; but the basic idea is that a
character-oriented preprocessor needn't worry about tokenizing when
it's analyzing the definiens of, say, `#define f(a,b) (a--<<--b)';
it can just copy the `--<<--' through without worrying about token
boundaries. This simplifies the writing of the preprocessor, and
makes it a tad faster.

The same basic idea can be used even in the presence of digraphs like
`%:', but it gets trickier. When you're analyzing
`#define f(a,b) (a<<<<<<<<<<<<%:b)', you must count the number of
`<'s before the `%:' to see whether the `%:' is really a `#'. Ugh.
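
As a rough illustration of that backward scan (a sketch only, not the actual GCC code): the hypothetical helper `percent_colon_is_hash` below assumes `text[pos]` is `%` and `text[pos + 1]` is `:`, and it ignores strings, comments, header names, trigraphs, and backslash-newline entirely. The parity rule follows from maximal munch: a run of `<` pairs up into `<<` tokens from the left, so an odd run leaves one `<` over to combine with `%` as the `<%` digraph.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: given text[pos] == '%' and text[pos + 1] == ':',
   decide whether that pair is the digraph spelling of '#'.
   Counts the run of '<' characters immediately before pos; an odd run
   means the last '<' and the '%' form the digraph "<%" instead. */
static bool percent_colon_is_hash(const char *text, size_t pos)
{
    size_t run = 0;

    while (run < pos && text[pos - 1 - run] == '<')
        run++;
    return run % 2 == 0;
}
```

For example, in `a<<%:b` the `%:` is a `#` (two `<`s), while in `a<<<%:b` it is not (three `<`s, so the text reads `<<`, `<%`, `:`).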

I have the distinct impression that the people who proposed digraphs
never implemented a character-oriented preprocessor for them, and I
have the sneaking suspicion that they didn't build a token-based
preprocessor either. I'm afraid that digraphs were an example of
specify now, implement later -- which is backwards from what the C
standard ought to be.

Paul Eggert

Jan 29, 1999, 3:00:00 AM
"Douglas A. Gwyn" <DAG...@null.net> writes:

>Paul Eggert wrote:
>> I wouldn't mind so much if digraphs were actually
>> used in practice, but they aren't.

>How do you know that they aren't?

I know because I get the GCC bug reports, and nobody ever complains
about digraphs not working. :-)

We've had discussions before like this.

I recall your claiming that trigraphs don't break existing programs,
except for perhaps a ``handful'' of chess programs.
I showed several instances of breakage in widely used programs,
including GDB, f2c, and the JPEG library.
You said that I didn't provide enough examples,
and anyway the breakages weren't all that big a deal.

I also recall claiming that `long long' system types could break a lot
of existing code. You expressed skepticism, and said ``show me''.
So I showed you, with several examples of widely used code, including Apache.
I distinctly recall your pooh-poohing the evidence,
and saying that it was no big deal.

And this was for discussions where I was proving a positive,
and could show hard evidence. Now you're asking me to prove a negative!

I doubt whether you would be convinced by any evidence that I can supply.
I could inspect all the source code at my site for digraphs and
come up empty (as I'm sure that I would, except for the GCC test cases),
and you'd still say that perhaps some Dane somewhere
might be using digraphs in the corner of his garage.

Jerry Coffin

Jan 30, 1999, 3:00:00 AM
In article <78sfam$fso$1...@nnrp1.dejanews.com>, mar...@my-dejanews.com
says...

[ ... ]

> Yes, but the newline character is on the same line as the token. You have not
> yet read any characters from the _next_ line. In the second case you have
> actually read a character from the next line before you know you are done.

I hate to say it, but I'm starting to wonder if there's any real point
to this discussion at all. The reality is that the compiler is free
to define what constitutes the end of a line, and it doesn't have to
bear any particularly close relationship with anything else on earth.

About the only limitation (that I can think of) is that preprocessor
lines really have to be treated as LINES, not just text. e.g. a
``#define'' that isn't at the beginning of a line (excluding white
space) isn't treated as a preprocessor directive. In addition, I
suppose the compiler is obliged to follow #line directives.

Other than that, the compiler is free to say that _nothing_
constitutes a new line, and simply not insert (or delete, at its
option) any new-line characters in the input, so all normal code is
treated as one long line.

At the opposite extreme, the compiler would be free to define every
semicolon outside of a string/character constant as being the end of a
"line" and count its lines that way.

As such, it's perfectly legal to have a __LINE__ on what most of us
would think of as the 250th line of a program, and have the compiler
tell us that it's, say, line 5 or line 360. In short, the compiler
can choose nearly ANY value it feels like for nearly any __LINE__, and
there's really no way to say it's legal or illegal. About the only
thing you can say about the values of __LINE__ is that they have to be
assigned in a non-decreasing order in the absence of #line directives.
There are undoubtedly a FEW other restrictions if a __LINE__ is at or
_VERY_ close to the beginning of a translation unit, but under most
circumstances, nearly ANY value can be assigned legally. In short,
it's an arbitrary number, and nearly the ONLY control is quality of
implementation.

sa...@bear.com

Jan 30, 1999, 3:00:00 AM
In article <78t375$c38$1...@shade.twinsun.com>,

egg...@twinsun.com (Paul Eggert) wrote:
> "Clive D.W. Feather" <cl...@on-the-train.demon.co.uk> writes:
>
> >Yuk. What's wrong with maximal munch in the forward direction ?
>
> Nothing's _incorrect_ about maximal munch. It's just harder to write,
> and slower, that's all.
>

Sorry, you should first think of correctness, and only then of speed.

> >In article <78rnqq$bor$1...@shade.twinsun.com>, Paul Eggert
> >> The preprocessor
> >>normally needn't do anything special about `<' when expanding macros,
> >>but it must do so in the presence of digraphs, and it's more efficient
> >>for it to worry about this only when a potential digraph is discovered
> >>than to worry about it whenever `<' is discovered.
>
> >I don't even understand this comment. Can you please give an example ?
>
> Normally, when the preprocessor is processing a macro, it just copies
> the definiens, looking for identifiers and `#'; the other characters
> can just be copied through as-is, without tokenization. I'm omitting
> details about whitespace, backslash-newline, strings, trigraphs,
> multibyte characters, and so on; but the basic idea is that a
> character-oriented preprocessor needn't worry about tokenizing when
> it's analyzing the definiens of, say, `#define f(a,b) (a--<<--b)';
> it can just copy the `--<<--' through without worrying about token
> boundaries. This simplifies the writing of the preprocessor, and
> makes it a tad faster.
>

I think that a preprocessor *should* tokenize for correct parsing. Of course,
it can delay tokenizing until needed. That means you tokenize up to '#define
f(' to recognize that it is a function macro, but can keep the rest of the
line as a string. I also think that a compiler should be integrated with the
preprocessor for speed (instead of reading the output of preprocessor). Then
it can convert a pp-token to a token without tokenizing again, and the first
tokenization is not wasted. I know that there is a place for a stand-alone
preprocessor, but then it must first parse *correctly*.

> The same basic idea can be used even in the presence of digraphs like
> `%:', but it gets trickier. When you're analyzing
> `#define f(a,b) (a<<<<<<<<<<<<%:b)', you must count the number of
> `<'s before the `%:' to see whether the `%:' is really a `#'. Ugh.
>

Here the problem is that you are searching for a token (%:) without
tokenizing!! These kinds of short-cuts (so-called optimizations) make the
maintenance of a lexer/parser impossible. If a language evolves, then you
have to visit the entire list of such short-cuts to see if they are still
valid. For example, C may add a new operator <<< (by analogy with Java's >>>)
or an alternate spelling for a punctuator character (did anybody say
digraph?). I know you are a seasoned programmer and I do not want to sound
condescending, but I feel strongly about this from my experience.

> I have the distinct impression that the people who proposed digraphs
> never implemented a character-oriented preprocessor for them, and I
> have the sneaking suspicion that they didn't build a token-based
> preprocessor either.

I do not agree with you! Whether digraphs are useful to anybody is a
separate topic, but I do think that a well-written lexer should be capable
of handling a digraph (just another token).
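
To make the "just another token" view concrete, here is a small sketch of forward maximal munch over punctuators. The table is a deliberately partial, illustrative subset of the C punctuators (an assumption of this example, not a complete list), and `punctuator_length` is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative subset of punctuators; digraphs are ordinary entries. */
static const char *const punctuators[] = {
    "%:%:", "<<=", ">>=", "<<", ">>", "<=", ">=",
    "<:", ":>", "<%", "%>", "%:", "%=", "##",
    "<", ">", "%", ":", "#"
};

/* Return the length of the longest punctuator that starts at s,
   or 0 if none of the table entries matches (maximal munch). */
static size_t punctuator_length(const char *s)
{
    size_t best = 0;

    for (size_t i = 0; i < sizeof punctuators / sizeof punctuators[0]; i++) {
        size_t len = strlen(punctuators[i]);
        if (len > best && strncmp(s, punctuators[i], len) == 0)
            best = len;
    }
    return best;
}
```

Applied repeatedly from left to right, this splits `<<<%:` into `<<`, `<%`, `:` and `<<%:` into `<<`, `%:`, which is the same answer the count-the-`<`s rule gives, obtained without any backward scan.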

> I'm afraid that digraphs were an example of
> specify now, implement later -- which is backwards from what the C
> standard ought to be.
>

-- Saroj Mahapatra

Pete Becker

Jan 30, 1999, 3:00:00 AM

Thank you for the vacuous advice. Since I have actively participated in
this thread since its beginning, I don't see that it applies here.

Paul Eggert

Jan 30, 1999, 3:00:00 AM
sa...@bear.com writes:

>I also think that a compiler should be integrated with the
>preprocessor for speed (instead of reading the output of preprocessor).

GCC currently has a build-time option that will let you substitute an
integrated preprocessor that does tokenization. Unfortunately, if you
select that option, GCC becomes buggier and slower. (So much for theory. :-)

If you'd like to help rectify this situation, I can put you in touch
with the maintainer for the integrated-preprocessor option. He's
gradually making it faster and more reliable. I'm sure that he
would appreciate some help.

>Here the problem is that you are searching for a token (%:) without
>tokenizing!!

I would say that the problem is that the people who added digraphs
didn't understand how a character-based preprocessor works.

Clearly you're in the ``all preprocessors should tokenize'' camp,
so you don't care whether someone changes the standard to render
character-based preprocessors infeasible. However, the standard
should be more catholic -- it should cater to existing practice,
and this includes both kinds of preprocessors.

Douglas A. Gwyn

Jan 31, 1999, 3:00:00 AM
Jerry Coffin wrote:
> As such, it's perfectly legal to have a __LINE__ on what most of us
> would think of as the 250th line of a program, and have the compiler
> tell us that it's, say, line 5 or line 360. In short, the compiler
> can choose nearly ANY value it feels like for nearly any __LINE__, and
> there's really no way to say it's legal or illegal.

But such a compiler clearly does not conform to the requirements of
the C standard, which *does* define what constitutes an input line.

Douglas A. Gwyn

Jan 31, 1999, 3:00:00 AM
Pete Becker wrote:
> Douglas A. Gwyn wrote:
> > Pete Becker wrote:
> > > My copy of the standard defines the meaning of __LINE__ in terms of the
> > > source line on which it occurs, not when an LR parser might recognize
> > > that the end of a token occurs. Does yours say something different?
> > Please don't jump into a conversation that you have not been tracking.
> Thank you for the vacuous advice. Since I have actively participated in
> this thread since its beginning, I don't see that it applies here.

Well, then, please pay attention.

Douglas A. Gwyn

Jan 31, 1999, 3:00:00 AM
Paul Eggert wrote:

> sa...@bear.com writes:
> >Here the problem is that you are searching for a token (%:) without
> >tokenizing!!
> I would say that the problem is that the people who added digraphs
> didn't understand how a character-based preprocessor works.

Sure we did. In fact we debated the very point, and the proponents
of digraphs-as-tokens prevailed. I'm sorry you weren't participating,
as the outcome might then have been more to your liking.

Personally I don't think there was *ever* a need for trigraphs,
digraphs, or \u-escapes in the C standard. How input characters
are coded should never have been a C language issue.

Pete Becker

Jan 31, 1999, 3:00:00 AM
Douglas A. Gwyn wrote:
>
> Well, then, please pay attention.

Thank you for your input. I'll give it the consideration it deserves.

Michael Rubenstein

Jan 31, 1999, 3:00:00 AM
On Sun, 31 Jan 1999 07:08:15 GMT, "Douglas A. Gwyn" <DAG...@null.net>
wrote:

>Jerry Coffin wrote:

Where? I can't find this definition. What I can find is (5.2.1)

In source files, there shall be some way of indicating the
end of each line of text; this International Standard treats
such an end-of-line indicator as if it were a single new-line
character.

In my implementation for the Kludge 9000 super-duper computer, a
line-feed character is an end-of-line indicator if

it occurs on a line that begins with any number of spaces and
tabs followed by a #

or

it is followed by any number of spaces and tabs followed by a
#

In other contexts line-feed in a source file is translated to space.

Thus the program

#include <stdio.h>
int main(void)
{
    printf("%d\n", __LINE__);
    return 0;
}

prints 2.
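
The rule is mechanical enough to sketch in C. This is only an illustration of the hypothetical Kludge 9000 mapping described above; `starts_directive` and `kludge_line_of` are made-up names, the source is assumed to be a NUL-terminated buffer, and backslash-newline is not handled.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if p points at optional spaces and tabs followed by '#'. */
static bool starts_directive(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return *p == '#';
}

/* Line number, under the Kludge 9000 rule, of the character at offset:
   one greater than the number of line-feeds before it that count as
   end-of-line indicators. All other line-feeds act like spaces. */
static int kludge_line_of(const char *src, size_t offset)
{
    int line = 1;
    size_t line_start = 0;      /* start of the current physical line */

    for (size_t i = 0; i < offset && src[i] != '\0'; i++) {
        if (src[i] != '\n')
            continue;
        /* Counts if this physical line is a directive line, or if the
           line-feed is followed by spaces and tabs and then '#'. */
        if (starts_directive(src + line_start) ||
            starts_directive(src + i + 1))
            line++;
        line_start = i + 1;
    }
    return line;
}
```

Fed the six-line program above with `offset` pointing at `__LINE__`, the only line-feed that counts is the one ending the `#include` line, so the function returns 2.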
--
Michael M Rubenstein

Jerry Coffin

Jan 31, 1999, 3:00:00 AM
In article <36B4013A...@null.net>, DAG...@null.net says...

> Jerry Coffin wrote:
> > As such, it's perfectly legal to have a __LINE__ on what most of us
> > would think of as the 250th line of a program, and have the compiler
> > tell us that it's, say, line 5 or line 360. In short, the compiler
> > can choose nearly ANY value it feels like for nearly any __LINE__, and
> > there's really no way to say it's legal or illegal.
>
> But such a compiler clearly does not conform to the requirements of
> the C standard, which *does* define what constitutes an input line.

Please re-read section 5.1.1.2 of the standard, with an emphasis on
phase 1 of translation. Pay close attention to the fact that the
standard says new-line characters will be introduced as a substitution
for "end-of-line indicators", but carefully does NOT define what
constitutes an end-of-line indicator. After doing so, attempt to find
a single part of the standard that is violated by a phase-one mapping
such as the following:

\\$ -> <nothing>
\n[ \t\r\n]*#\(.*\)$ -> \n#\1
\n -> <nothing>

I've looked several times for a limitation on the mapping done in
phase one of translation, and I can't find one. Absent such a
limitation, I believe the mapping given above is legal. It results in
each preprocessor line being a line by itself, and virtually
everything else appearing as one really LONG line.


Philip Lantz

Feb 1, 1999, 3:00:00 AM
Jerry Coffin wrote:
>
> \\$ -> <nothing>
> \n[ \t\r\n]*#\(.*\)$ -> \n#\1
> \n -> <nothing>

I think that third line had better be
\n -> <space>

prl