
__LINE__ and backslash-newline


Daniel Villeneuve

unread,
Jan 19, 1999, 3:00:00 AM1/19/99
to
In the section 6.10.4, the line number is defined as follows: ``The line
number of the current source line is one greater than the number of
new-line characters read or introduced in translation phase 1 (5.1.1.2)
while processing the source file to the current token.''

In order to keep track of this number, the lexer has to identify the end
of the current token (this was the subject of another discussion some
while ago). Given that the line number of the first source line is 1,
what should be the value of i in the following?
int i = __LINE__\
;

The backslash-newline pair being ``non-existent'', could it be
``attached'' to __LINE__, thus producing the value `2'? Obviously, the
value `1' is reasonable as well. Does the Standard make the effect of
such a statement implementation-defined?

--
Daniel Villeneuve
Graduate student in Operations Research
GERAD/Mathématiques Appliquées
École Polytechnique de Montréal

Pete Becker

unread,
Jan 19, 1999, 3:00:00 AM1/19/99
to
Daniel Villeneuve wrote:
>
> In the section 6.10.4, the line number is defined as follows: ``The line
> number of the current source line is one greater than the number of
> new-line characters read or introduced in translation phase 1 (5.1.1.2)
> while processing the source file to the current token.''
>
> In order to keep track of this number, the lexer has to identify the end
> of the current token (this was the subject of another discussion some
> while ago). Given that the line number of the first source line is 1,
> what should be the value of i in the following?
> int i = __LINE__\
> ;
>
> The backslash-newline pair being ``non-existent'', could it be
> ``attached'' to __LINE__, thus producing the value `2'? Obviously, the
> value `1' is reasonable as well. Does the Standard make the effect of
> such a statement implementation-defined?

The standard is quite clear: its value is 1. Read about "phases of
translation", and note that the line number is generated after phase 1,
but backslash newline is not processed until phase 2.

--
Pete Becker
Dinkumware, Ltd.
http://www.dinkumware.com

Clive D.W. Feather

unread,
Jan 19, 1999, 3:00:00 AM1/19/99
to
In article <36A4BA62...@crt.umontreal.ca>, Daniel Villeneuve
<dan...@crt.umontreal.ca> writes

>In order to keep track of this number, the lexer has to identify the end
>of the current token (this was the subject of another discussion some
>while ago). Given that the line number of the first source line is 1,
>what should be the value of i in the following?
>int i = __LINE__\
>;
>
>The backslash-newline pair being ``non-existent'', could it be
>``attached'' to __LINE__, thus producing the value `2'? Obviously, the
>value `1' is reasonable as well. Does the Standard make the effect of
>such a statement implementation-defined?

I would argue (weakly) that it is to the start of the current token, and
thus 1. However, if you want to play safe you should assume it is to an
unspecified point between the start and end of the current token, so in
that case it would be 2.

[This was addressed in a UK Defect Report ages ago, but I forget what
the answer was.]

--
Clive D.W. Feather | Director of | Work: <cl...@demon.net>
Tel: +44 181 371 1138 | Software Development | Home: <cl...@davros.org>
Fax: +44 181 371 1037 | Demon Internet Ltd. | Web: <http://www.davros.org>
Written on my laptop; please observe the Reply-To address

David R Tribble

unread,
Jan 19, 1999, 3:00:00 AM1/19/99
to
Daniel Villeneuve wrote:
>
> In the section 6.10.4, the line number is defined as follows:
> ``The line number of the current source line is one greater than the
> number of new-line characters read or introduced in translation phase
> 1 (5.1.1.2) while processing the source file to the current token.''
>
> In order to keep track of this number, the lexer has to identify the
> end of the current token (this was the subject of another discussion
> some while ago). Given that the line number of the first source line
> is 1, what should be the value of i in the following?
> int i = __LINE__\
> ;
>
> The backslash-newline pair being ``non-existent'', could it be
> ``attached'' to __LINE__, thus producing the value `2'? Obviously,
> the value `1' is reasonable as well. Does the Standard make the
> effect of such a statement implementation-defined?

I'm not sure, but I think the standard doesn't have much to say
about this.

I personally believe it's just as reasonable for a compiler to keep
track of the *beginning* of each token instead. (The lexers I
write do just that, and I haven't had any complaints.) It's
about the same amount of work for the lexer in either case.

With that in mind, what is the result of this code:

1: int i = __LI\
2: NE__;

The standard would probably classify this as implementation-defined,
if it says anything about it at all. (But clearly there are only
two valid answers: 'i' is either 1 or 2.)

-- David R. Tribble, dtri...@technologist.com --

Paul Eggert

unread,
Jan 20, 1999, 3:00:00 AM1/20/99
to
Daniel Villeneuve <dan...@crt.umontreal.ca> writes:

>int i = __LINE__\
>;

>The backslash-newline pair being ``non-existent'', could it be
>``attached'' to __LINE__, thus producing the value `2'?

I don't see why not. The standard doesn't specify exactly what happens
here, so the implementation has some wiggle room. Similarly,

int i = __LINE__\
\
;

might cause i to have the value 1, 2, or 3.

Daniel Villeneuve

unread,
Jan 20, 1999, 3:00:00 AM1/20/99
to
Clive D.W. Feather wrote:
> I would argue (weakly) that it is to the start of the current token, and
> thus 1.

This has the merit of being uniquely defined. However, my question
arose because, in a previous discussion which implied the use of
__LINE__ as an argument of a function-like macro, ...

Clive D.W. Feather wrote:
> In this case, I wasn't proposing a change to make it be standardised. I
> was pointing out that that's how I read the definition of __LINE__. Look
> at 6.8.4 paragraph 2 - the line number is based on the number of
> newlines up to the "current token". When processing a macro expansion,
> this surely is the ) at the end of the invocation.

So sometimes, it is expected that it is the _end_ of a construct that
defines the associated value of __LINE__.

Maybe it's simpler to agree on the start of the current token, and that
for function-like macros, this means the start of the identifier token
that introduces the macro.
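
To make the ambiguity concrete, here is a small illustration (a made-up
fragment for discussion only; ID is a hypothetical macro, not anything
from the posts above):

#define ID(x) x        /* line 1 */
int i = ID(            /* line 2: the identifier that introduces it */
    __LINE__           /* line 3: the __LINE__ token itself         */
    );                 /* line 4: the ) that completes the call     */

Reading ``current token'' as the identifier that introduces the macro
suggests 2; reading it as the __LINE__ token suggests 3; reading it as
the ) at the end of the invocation suggests 4.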

Clive D.W. Feather

unread,
Jan 20, 1999, 3:00:00 AM1/20/99
to
In article <36A4D7EF...@acm.org>, Pete Becker <peteb...@acm.org>
writes

>> int i = __LINE__\
>> ;

>The standard is quite clear: its value is 1. Read about "phases of
>translation", and note that the line number is generated after phase 1,
>but backslash newline is not processed until phase 2.

That's irrelevant. Labelling each character with its source line number
you get:

int i = __LINE__;
11111111111111112

The question is: is the line number of a token the line number of the
first character, the last character, the first character not part of the
token, or an unspecified character within the token ? Or what ?

Pete Becker

unread,
Jan 20, 1999, 3:00:00 AM1/20/99
to
Clive D.W. Feather wrote:
>
> In article <36A4D7EF...@acm.org>, Pete Becker <peteb...@acm.org>
> writes
>
> >> int i = __LINE__\
> >> ;
>
> >The standard is quite clear: its value is 1. Read about "phases of
> >translation", and note that the line number is generated after phase 1,
> >but backslash newline is not processed until phase 2.
>
> That's irrelevant. Labelling each character with its source line number
> you get:
>
> int i = __LINE__;
> 11111111111111112
>
> The question is: is the line number of a token the line number of the
> first character, the last character, the first character not part of the
> token, or an unspecified character within the token ? Or what ?

Since every character in the token occurs in line 1, how can its line
number possibly be anything other than 1?

Peter Seebach

unread,
Jan 21, 1999, 3:00:00 AM1/21/99
to
In article <36A69211...@acm.org>,

Pete Becker <peteb...@acm.org> wrote:
>> int i = __LINE__;
>> 11111111111111112

>Since every character in the token occurs in line 1, how can its line
>number possibly be anything other than 1?

blah blah blah
... \n ...
++line
...
...
end of token
unpush character
...
val = line;

:)

It's a real question. No one has ever adequately explained how __LINE__
works, and I believe the committee has decided it's not worth deciding.
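
To see how that happens, here is a minimal sketch of such a lexer (purely
illustrative, not taken from any real compiler): the reader splices
backslash-newline and bumps the line counter as the newline disappears,
so by the time the scanner notices the identifier has ended, the counter
has already moved on.

#include <ctype.h>
#include <stdio.h>

static int line = 1;            /* current physical line */

/* Phase-2 style reader: splice backslash-newline, counting each
   newline as it is removed.  Plain newlines are returned unchanged
   and are assumed to be counted elsewhere, between tokens.        */
static int getsrc(FILE *f)
{
    int c = getc(f);
    while (c == '\\') {
        int d = getc(f);
        if (d != '\n') { ungetc(d, f); break; }
        line++;                 /* spliced newline counted here    */
        c = getc(f);
    }
    return c;
}

/* Scan one identifier and report what `line' holds at the moment
   the scanner discovers that the identifier has ended.            */
static int scan_ident(FILE *f)
{
    int c;
    while ((c = getsrc(f)) == '_' || isalnum(c))
        ;                       /* collect characters (omitted)    */
    if (c != EOF)
        ungetc(c, f);           /* push back the terminator...     */
    return line;                /* ...but `line' has already moved */
}

Fed ``__LINE__'' followed by backslash-newline and ``;'', scan_ident()
returns 2, which is exactly the behaviour being argued about; fed
``__LINE__'' followed by a plain newline, it returns 1.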

-s
--
Copyright 1999, All rights reserved. Peter Seebach / se...@plethora.net
C/Unix wizard, Pro-commerce radical, Spam fighter. Boycott Spamazon!
Send me money - get cool programs and hardware! No commuting, please.
Visit my new ISP <URL:http://www.plethora.net/> --- More Net, Less Spam!

Clive D.W. Feather

unread,
Jan 21, 1999, 3:00:00 AM1/21/99
to
In article <36A5ED25...@crt.umontreal.ca>, Daniel Villeneuve
<dan...@crt.umontreal.ca> writes
>Clive D.W. Feather wrote:
[something]

>This has the merit of being uniquely defined. However, my question
>arose because, in a previous discussion which implied the use of
>__LINE__ as an argument of a function-like macro, ...
>
>Clive D.W. Feather wrote:

[something inconsistent]

>Maybe it's simpler to agree on the start of the current token, and that
>for function-like macros, this means the start of the identifier token
>that introduces the macro.

I suspect that the best answer is to say that it is the line number of
some unspecified point between the first character of the token and the
character immediately following the token.

Clive D.W. Feather

unread,
Jan 21, 1999, 3:00:00 AM1/21/99
to
In article <dizp2.228$Oi....@ptah.visi.com>, Peter Seebach
<se...@plethora.net> writes

>It's a real question. No one has ever adequately explained how __LINE__
>works, and I believe the committee has decided it's not worth deciding.

It's not even that: the DR that asked the question never got answered.

Pete Becker

unread,
Jan 21, 1999, 3:00:00 AM1/21/99
to
Peter Seebach wrote:
>
> In article <36A69211...@acm.org>,
> Pete Becker <peteb...@acm.org> wrote:
> >> int i = __LINE__;
> >> 11111111111111112
>
> >Since every character in the token occurs in line 1, how can its line
> >number possibly be anything other than 1?
>
> blah blah blah
> ... \n ...
> ++line
> ...
> ...
> end of token
> unpush character
> ...
> val = line;
>
> :)
>
> It's a real question. No one has ever adequately explained how __LINE__
> works, and I believe the committee has decided it's not worth deciding.

I don't understand the point of this example. Yes, if a token begins on
one line and ends on another, there's an ambiguity about which line it's
on. That's not the case in the example that I responded to: every
character in __LINE__ is on the same line. Is this supposed to
illustrate that a naive implementation can get this wrong? If so, that's
not particularly relevant. What is the issue here?

Larry Jones

unread,
Jan 21, 1999, 3:00:00 AM1/21/99
to
Pete Becker wrote:
>
> > >> int i = __LINE__\
> > >> ;

>
> Since every character in the token occurs in line 1, how can its line
> number possibly be anything other than 1?

Because the implementation doesn't know it's at the end of the token
until it's on line 2.

-Larry Jones

I suppose if I had two X chromosomes, I'd feel hostile too. -- Calvin

Peter Seebach

unread,
Jan 21, 1999, 3:00:00 AM1/21/99
to
In article <xkwFdkMc...@romana.davros.org>,

Clive D.W. Feather <cl...@davros.org> wrote:
>In article <dizp2.228$Oi....@ptah.visi.com>, Peter Seebach
><se...@plethora.net> writes
>>It's a real question. No one has ever adequately explained how __LINE__
>>works, and I believe the committee has decided it's not worth deciding.

>It's not even that: the DR that asked the question never got answered.

Yes. The committee decided that it was not worth answering. In a vote.
I think it may even have been a formal vote, but it may have just been
a show of hands.

Peter Seebach

unread,
Jan 21, 1999, 3:00:00 AM1/21/99
to
In article <36A72523...@acm.org>,

Pete Becker <peteb...@acm.org> wrote:
>I don't understand the point of this example. Yes, if a token begins on
>one line and ends on another, there's an ambiguity about which line it's
>on. That's not the case in the example that I responded to: every
>character in __LINE__ is on the same line. Is this supposed to
>illustrate that a naive implementation can get this wrong? If so, that's
>not particularly relevant. What is the issue here?

My point is that there's nothing unlikely about counting the lines first,
and *then* noticing that the next character (which was on the next line)
was really no longer part of this token.

Basically, given
__LINE__\
;
you can't tell, until you've seen the newline, that you're done with __LINE__,
so you're on line 2 when you definitely see that you're done with the token,
so I'd consider that a reasonable answer to the question "what line were you
on when you saw this token".

David R Tribble

unread,
Jan 21, 1999, 3:00:00 AM1/21/99
to
Clive D.W. Feather wrote:
> I suspect that the best answer is to say that it is the line number of
> some unspecified point between the first character of the token and
> the character immediately following the token.

This sounds reasonable, and it probably covers most existing
implementations.

Thus this fragment:

1: int i =\
2: \
3: __L\
4: IN\
5: E__\
6: \
7: ;

which results in these characters and source lines:

int i =__LINE__;
1111111333445557

would result in one of the following acceptable values of 'i':
3, 4, 5, 6, or 7.
I consider any other values as broken.

Clive D.W. Feather

unread,
Jan 21, 1999, 3:00:00 AM1/21/99
to
In article <eULp2.16$NN....@ptah.visi.com>, Peter Seebach
<se...@plethora.net> writes

>>It's not even that: the DR that asked the question never got answered.
>Yes. The committee decided that it was not worth answering. In a vote.
>I think it may even have been a formal vote, but it may have just been
>a show of hands.

You might be right, though I don't remember it.

On the other hand, I note that none of DRs 173 to 178 ever got answered
offically, though it looks like the other 5 have all been solved in C9X.

James Kuyper

unread,
Jan 21, 1999, 3:00:00 AM1/21/99
to
Pete Becker wrote:
>
> Peter Seebach wrote:
> >
> > In article <36A69211...@acm.org>,

> > Pete Becker <peteb...@acm.org> wrote:
> > >> int i = __LINE__;
> > >> 11111111111111112
> >
> > >Since every character in the token occurs in line 1, how can its line
> > >number possibly be anything other than 1?
> >
> > blah blah blah
> > ... \n ...
> > ++line
> > ...
> > ...
> > end of token
> > unpush character
> > ...
> > val = line;
> >
> > :)
> >
> > It's a real question. No one has ever adequately explained how __LINE__
> > works, and I believe the committee has decided it's not worth deciding.
>
> I don't understand the point of this example. Yes, if a token begins on
> one line and ends on another, there's an ambiguity about which line it's
> on. That's not the case in the example that I responded to: every
> character in __LINE__ is on the same line. Is this supposed to
> illustrate that a naive implementation can get this wrong? If so, that's
> not particularly relevant. What is the issue here?

issue 1: does the standard really specify that this naive implementation
is wrong? I'm agnostic on that issue.

issue 2: should the standard be so specific that it does prohibit this
naive implementation? I can't come up with a good argument why __LINE__
needs to be that well defined. AFAIK, its main use is as a debugging
aid, intended to point the programmer at the right piece of code. Anyone
who writes code such that __LINE__'s value leaves them unclear about the
place at which it was evaluated deserves what they get.

Douglas A. Gwyn

unread,
Jan 22, 1999, 3:00:00 AM1/22/99
to
"Clive D.W. Feather" wrote:
> [This was addressed in a UK Defect Report ages ago, but I forget what
> the answer was.]

When the __LINE__ token occurs entirely *on* one line,
that is obviously the appropriate line number.
When __LINE__ spans multiple lines (via \new-line etc.),
as I recall we didn't care to specify which of the
possibilities it had to be, since it doesn't matter in practice.

Dennis Ritchie

unread,
Jan 22, 1999, 3:00:00 AM1/22/99
to
For what it's worth, my cpp (the one packaged with lcc,
and used on Plan 9 when Ken's built-in and deliberately
enfeebled version won't do) treats Tribble's example

>
> 1: int i =\
> 2: \
> 3: __L\
> 4: IN\
> 5: E__\
> 6: \
> 7: ;
>
as expanding to
int i =1;

It is resolute in considering the line number of
\-pasted things as belonging to the first line with
the \. (In trying the example, I trimmed the initial
[0-9]: and white-space, of course.)

Dennis

Andreas Schwab

unread,
Jan 22, 1999, 3:00:00 AM1/22/99
to

In which way is this:

int i = __LINE__\
;

different from this:

int i =\
\
__L\
IN\
E__\
\
;

(GCC emits `int i = 2;' for the first and `int i = 7;' for the second).
Although in the first example __LINE__ occurs entirely on one line the
compiler still has to look ahead beyond \ to find the end of the token.

--
Andreas Schwab "And now for something
sch...@issan.cs.uni-dortmund.de completely different"
sch...@gnu.org

Paul Eggert

unread,
Jan 22, 1999, 3:00:00 AM1/22/99
to
Andreas Schwab <sch...@issan.cs.uni-dortmund.de> writes:
>"Douglas A. Gwyn" <DAG...@null.net> writes:
>|> When the __LINE__ token occurs entirely *on* one line,
>|> that is obviously the appropriate line number.

>int i = __LINE__\
>;

>... GCC emits `int i = 2'

Conversely, for

int i = \
__LINE__;

Ritchie's preprocessor presumably emits `int i = 1;'.
So we have two practical counterexamples to Gwyn's suggestion.

I think both implementations conform to the standard
in this amusingly trivial matter.

John Hauser

unread,
Jan 22, 1999, 3:00:00 AM1/22/99
to
Dennis Ritchie wrote:
> For what it's worth, my cpp [...]

> is resolute in considering the line number of
> \-pasted things as belonging to the first line with
> the \.

That's what the preprocessor I wrote does, too. It's become clear that
I'm going to have to change it, and the change will involve keeping
additional line number information around just to satisfy this detail
for `__LINE__'. I hope it's going to be worth the trouble.

- John Hauser

John Hauser

unread,
Jan 22, 1999, 3:00:00 AM1/22/99
to
Paul Eggert wrote:

>
> Andreas Schwab wrote:
> >
> >int i = __LINE__\
> >;
> >
> >... GCC emits `int i = 2'
>
> Conversely, for
>
> int i = \
> __LINE__;
>
> Ritchie's preprocessor presumably emits `int i = 1;'.
>
> I think both implementations conform to the standard
> in this amusingly trivial matter.

I don't think so. Section 6.8.8 states that `__LINE__' expands to
``the line number of the current source line'', and Section 6.8.4
defines the line number of the current line as ``one greater than the
number of new-line characters read or introduced in translation phase 1
while processing the source file to the current token''. There is
clearly 1 new-line before the `__LINE__' token in the second example
above, so `__LINE__' cannot validly expand to `1'; it has to be
at least `2', and in this case exactly `2', surely. So Ritchie's
preprocessor (and my own, too, as I've noted) is not conforming in this
regard.

- John Hauser

Douglas A. Gwyn

unread,
Jan 23, 1999, 3:00:00 AM1/23/99
to
Andreas Schwab wrote:
> int i = __LINE__\
> ;
> (GCC emits `int i = 2' ...)

That demonstrates that lookahead needs to track line number
*and so does pushback of an unused lookahead*. GCC has it wrong.

Pete Becker

unread,
Jan 23, 1999, 3:00:00 AM1/23/99
to
Peter Seebach wrote:
>
>
> My point is that there's nothing unlikely about counting the lines first,
> and *then* noticing that the next character (which was on the next line)
> was really no longer part of this token.

That's a legitimate rationalization, but I don't see any words in the
standard that support it.

>
> Basically, given
> __LINE__\
> ;
> you can't tell, until you've seen the newline, that you're done with __LINE__,
> so you're on line 2 when you definitely see that you're done with the token,
> so I'd consider that a reasonable answer to the question "what line were you
> on when you saw this token".

My copy of the standard says that __LINE__ is "the line number of the
current source line", not "the line number of the current source line,
or maybe one more, depending on whether it's followed by an escaped
newline." <g>

I agree that this isn't an earth-shattering issue from the perspective
of standards conformance, but it is distressing to read what look like
clear words in the standard, and to be told that they don't mean what
they say, and get such a vague explanation.

Peter Seebach

unread,
Jan 23, 1999, 3:00:00 AM1/23/99
to
In article <36A962FD...@cs.berkeley.edu>,

John Hauser <jha...@cs.berkeley.edu> wrote:
>That's what the preprocessor I wrote does, too. It's become clear that
>I'm going to have to change it, and the change will involve keeping
>additional line number information around just to satisfy this detail
>for `__LINE__'. I hope it's going to be worth the trouble.

I'm not convinced that your decision is wrong... But on the other hand,
it seems to me that in

/* line 1 */
int \
j; /* line 3 */
int i = __LINE__; /* line 4 */

i has to be 4. So, you have to count those backslash-newlines *somewhere*.

On the other hand, if you said
/* line 1 */
i\
nt i = __LINE__\
; int j = __LINE__;

I would think anything from 2-4 would be okay for i, and j would be 4.

Peter Seebach

unread,
Jan 23, 1999, 3:00:00 AM1/23/99
to
In article <36A9CE8E...@acm.org>,

Pete Becker <peteb...@acm.org> wrote:
>Peter Seebach wrote:
>> My point is that there's nothing unlikely about counting the lines first,
>> and *then* noticing that the next character (which was on the next line)
>> was really no longer part of this token.

>That's a legitimate rationalization, but I don't see any words in the
>standard that support it.

I also don't see anything contradicting it.

Ugh.

I just realized,

[someone quoted]
>one greater than the
>number of new-line characters read or introduced in translation phase 1
>while processing the source file to the current token

But since you're allowed to do all of TP1, then all of TP2, etcetera...

You could make the case that either *every* newline in the file was "read or
introduced in translation phase 1" while processing the source, or *none*
were. More likely "all". Because you processed them all in TP1, then you
started TP2, and thus, to process to the current token, you read 'em all.

>My copy of the standard says that __LINE__ is "the line number of the
>current source line", not "the line number of the current source line,
>or maybe one more, depending on whether its followed by an escaped
>newline." <g>

The problem is "what's current"? Is current where you were "inside" the
token, or where you were when you got confirmation that you had a token?

>I agree that this isn't an earth-shattering issue from the perspective
>of standards conformance, but it is distressing to read what look like
>clear words in the standard, and to be told that they don't mean what
>they say, and get such a vague explanation.

Here's my thinking:

int i = __LINE_\
_;

it's fairly clear that either of these lines could be considered "current".
My claim is that, when the 'cursor' is just *past* a token, that's also
"current". Or could be, legitimately.

Pete Becker

unread,
Jan 24, 1999, 3:00:00 AM1/24/99
to
Peter Seebach wrote:
>
> Here's my thinking:
>
> int i = __LINE_\
> _;
>
> it's fairly clear that either of these lines could be considered "current".
> My claim is that, when the 'cursor' is just *past* a token, that's also
> "current". Or could be, legitimately.

Well, I think that's a stretch, influenced by knowing how compilers are
often written. I don't think a programmer who hasn't worked with the
innards of compilers would read it that way.

David R Tribble

unread,
Jan 25, 1999, 3:00:00 AM1/25/99
to
John Hauser wrote:
>
> Dennis Ritchie wrote:
> > For what it's worth, my cpp [...]
> > is resolute in considering the line number of
> > \-pasted things as belonging to the first line with
> > the \.
>
> That's what the preprocessor I wrote does, too. It's become clear
> that I'm going to have to change it, and the change will involve
> keeping additional line number information around just to satisfy this
> detail for `__LINE__'. I hope it's going to be worth the trouble.

FWIW, the preprocessors I write keep track of the beginning of each
token. They do this by calling a low-level "getchar" function that
returns the character code along with its (physical) line number
(and column position and include-file index as well). I stick the
position info of the first character of a new token into the token
info before I collect characters for the rest of the token.

Obviously, this is only one approach; it's perfectly reasonable
for a token's line number to be the position of its last character.
I would argue, though, that it's misguided to associate a line
number with a token in which none of the characters of the token
actually appear on that line. But then there's the special case
of splicing lines together separated by \-newlines in a separate
pass, which would seem to make it okay to treat such meta-lines
as a single source line (just as long as the next line resumes
with the correct line number, such as by generating extra newlines
in the preprocessed output).
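
A rough sketch of that first arrangement (my own reconstruction for
illustration; the names are invented, and backslash-newline splicing and
pushback of the terminating character are omitted):

#include <ctype.h>
#include <stdio.h>

struct srcchar { int ch; int line; };   /* character plus the physical
                                           line it was read from       */
static int cur_line = 1;

/* Low-level reader: every character comes back tagged with its own
   physical line number.                                               */
static struct srcchar getsrc(FILE *f)
{
    struct srcchar sc;
    sc.ch = getc(f);
    sc.line = cur_line;
    if (sc.ch == '\n')
        cur_line++;
    return sc;
}

struct token { char text[64]; int line; };

/* Stamp the token with the position of its FIRST character, before any
   further characters are consumed; nothing read later (spliced
   newlines included) can change t->line afterwards.                   */
static void scan_ident(FILE *f, struct token *t)
{
    struct srcchar sc = getsrc(f);
    int n = 0;
    t->line = sc.line;
    while (sc.ch == '_' || isalnum(sc.ch)) {
        if (n < 63)
            t->text[n++] = (char)sc.ch;
        sc = getsrc(f);
    }
    t->text[n] = '\0';
}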

But I wouldn't sweat it; it's a minor issue anyway. It's something
that's best designed in from the beginning, before you write your
lexer, rather than added later.

And don't forget the other issues that our lexers must deal with
if they are going to properly handle C9X-compliant (and C++
compliant) source. Things like trigraphs, digraphs, UCNs,
alternate punctuation keywords, wide characters, and hex float
literals, to name a few.

Douglas A. Gwyn

unread,
Jan 26, 1999, 3:00:00 AM1/26/99
to
David R Tribble wrote:
> I would argue, though, that's it's misguided to associate a line
> number with a token in which none of the characters of the token
> actually appear on that line.

I agree. The arguments about lookahead etc. could just as well
be applied to
__LINE__
as to
__LINE__\
We sure didn't want the first case to expand to a line number
other than that of the line that the token is embedded within.

Larry Jones

unread,
Jan 26, 1999, 3:00:00 AM1/26/99
to
Douglas A. Gwyn wrote:
>
> I agree. The arguments about lookahead etc. could just as well
> be applied to
> __LINE__
> as to
> __LINE__\

I disagree. In the first case, the token clearly ends before the
newline. In the second case, the backslash-newline does *not* end
the token, you have to look at the first character on the next line
before you know whether you've got an entire token or not, and thus
the token ends on the next line even if no characters from that line
are actually part of it.

> We sure didn't want the first case to expand to a line number
> other than that of the line that the token is embedded within.

That, I'll certainly agree with.

-Larry Jones

I don't want to be THIS good! -- Calvin

Douglas A. Gwyn

unread,
Jan 28, 1999, 3:00:00 AM1/28/99
to
Larry Jones wrote:
> Douglas A. Gwyn wrote:
> > ... The arguments about lookahead etc. could just as well
> > be applied to
> > __LINE__
> > as to
> > __LINE__\
> I disagree. In the first case, the token clearly ends before the
> newline. In the second case, the backslash-newline does *not* end
> the token, you have to look at the first character on the next line
> before you know whether you've got an entire token or not, and thus
> the token ends on the next line even if no characters from that line
> are actually part of it.

You (the LR parser) don't know that the first token has ended *until
after you have read the newline*.

Pete Becker

unread,
Jan 28, 1999, 3:00:00 AM1/28/99
to

My copy of the standard defines the meaning of __LINE__ in terms of the
source line on which it occurs, not when an LR parser might recognize
that the end of a token occurs. Does yours say something different?

Larry Jones

unread,
Jan 28, 1999, 3:00:00 AM1/28/99
to
Douglas A. Gwyn wrote:
>
> Larry Jones wrote:
> > Douglas A. Gwyn wrote:
> > > ... The arguments about lookahead etc. could just as well
> > > be applied to
> > > __LINE__
> > > as to
> > > __LINE__\
> > I disagree. In the first case, the token clearly ends before the
> > newline. In the second case, the backslash-newline does *not* end
> > the token, you have to look at the first character on the next line
> > before you know whether you've got an entire token or not, and thus
> > the token ends on the next line even if no characters from that line
> > are actually part of it.
>
> You (the LR parser) don't know that the first token has ended *until
> after you have read the newline*.

Yes, but it's easy enough to delay processing that newline until after
you've associated the current line number with the token. In the other
case, you have to read the backslash, the newline, and at least one
following character (more if that character is also a backslash), and
it *isn't* easy to delay processing an arbitrarily long sequence of
characters.

-Larry Jones

OK, what's the NEXT amendment say? I know it's in here someplace. --
Calvin

Pete Becker

unread,
Jan 28, 1999, 3:00:00 AM1/28/99
to

But there is no "arbitrarily long sequence of characters" here. There's
only backslash followed by an escaped character. Once it's determined
that this can't be part of __LINE__ you know you're at the end of the
token.

--

Clive D.W. Feather

unread,
Jan 28, 1999, 3:00:00 AM1/28/99
to
In article <36ACC850...@technologist.com>, David R Tribble
<dtri...@technologist.com> writes

>And don't forget the other issues that our lexers must deal with
>if they are going to properly handle C9X-compliant (and C++
>compliant) source. Things like trigraphs,

were in C89

>digraphs,

just new tokens for the list in the lexer's source code

>UCNs,

hmm

>alternate punctuation keywords,

what ?

>wide characters,

were in C89 (though "xxx" L"yyy" is new, I'll admit)

>and hex float
>literals,

New, I agree.

Clive D.W. Feather

unread,
Jan 28, 1999, 3:00:00 AM1/28/99
to
In article <36B0849B...@sdrc.com>, Larry Jones
<larry...@sdrc.com> writes

[...]
>> > > __LINE__
and
>> > > __LINE__\

>> You (the LR parser) don't know that the first token has ended *until
>> after you have read the newline*.
>
>Yes, but it's easy enough to delay processing that newline until after
>you've associated the current line number with the token. In the other
>case, you have to read the backslash, the newline, and at least one
>following character (more if that character is also a backslash), and
>it *isn't* easy to delay processing an arbitrarily long sequence of
>characters.

Furthermore, in the first case the newline has survived to the relevant
phase of translation. In the second it's already disappeared.

Clive D.W. Feather

unread,
Jan 28, 1999, 3:00:00 AM1/28/99
to
In article <36B0A199...@acm.org>, Pete Becker <peteb...@acm.org>
writes

>But there is no "arbitrarily long sequence of characters" here. There's
>only backslash followed by an escaped character. Once it's determined
>that this can't be part of __LINE__ you know you're at the end of the
>token.

__LINE__\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
TWO__

[How long is your lexer input buffer ?]

David R Tribble

unread,
Jan 28, 1999, 3:00:00 AM1/28/99
to
"Clive D.W. Feather" wrote:
>
> David R Tribble <dtri...@technologist.com> writes
> >And don't forget the other issues that our lexers must deal with
> >if they are going to properly handle C9X-compliant (and C++
> >compliant) source. Things like trigraphs,
...

> >alternate punctuation keywords,
>
> what ?

These are the alternate spellings found in <iso646.h> (such as 'and'
and 'or'). They aren't a problem in C (C89 or C9X) since they're
just macros, but they are reserved/predefined keywords in C++
(which are meaningful in the preprocessor phase and beyond).
Which means that they must be dealt with (at some level) if your
lexer is used for both C and C++.

David R Tribble

unread,
Jan 28, 1999, 3:00:00 AM1/28/99
to
"Douglas A. Gwyn" wrote:
>
> Larry Jones wrote:
> > Douglas A. Gwyn wrote:
> > > ... The arguments about lookahead etc. could just as well
> > > be applied to
> > > __LINE__
> > > as to
> > > __LINE__\
> > I disagree. In the first case, the token clearly ends before the
> > newline. In the second case, the backslash-newline does *not* end
> > the token, you have to look at the first character on the next line
> > before you know whether you've got an entire token or not, and thus
> > the token ends on the next line even if no characters from that line
> > are actually part of it.
>
> You (the LR parser) don't know that the first token has ended *until
> after you have read the newline*.

But is there a difficulty in saving the source line number for the
last non-white (and non-backslash-newline) source character that was
read? Once you're past the backslash-newline(s), you know the line
number of the last (non-white) character of the token, right?

Pete Becker

unread,
Jan 28, 1999, 3:00:00 AM1/28/99
to
Clive D.W. Feather wrote:
>
> In article <36B0A199...@acm.org>, Pete Becker <peteb...@acm.org>
> writes
> >But there is no "arbitrarily long sequence of characters" here. There's
> >only backslash followed by an escaped character. Once it's determined
> >that this can't be part of __LINE__ you know you're at the end of the
> >token.
>
> __LINE__\
> \
> \
> \
> \

You're right, of course. But it's still a false issue: it's simple to
keep track of the line number where the backslash-newline sequence
began. The length of the sequence doesn't matter.

Paul Eggert

unread,
Jan 28, 1999, 3:00:00 AM1/28/99
to
"Clive D.W. Feather" <cl...@on-the-train.demon.co.uk> writes:

>>digraphs,

>just new tokens for the list in the lexer's source code

That depends on the implementation. For a character-based
preprocessor, digraphs can take quite a bit more work than that.

For example, when I added digraph support to GCC's preprocessor, I used
the fact that `%:' is not a digraph if preceded by an odd number of
`<'s, because the code scans backwards at that point! The preprocessor
normally needn't do anything special about `<' when expanding macros,
but it must do so in the presence of digraphs, and it's more efficient
for it to worry about this only when a potential digraph is discovered
than to worry about it whenever `<' is discovered. Therefore GCC's
preprocessor scans backwards through the input when %: is discovered,
counting `<'s as it goes, to see whether the %: is really a #.

This is not the only bit of hairy digraph code that appears in the GCC
preprocessor. I admit that my thoughts at the time were less than kind
about the people on the standardization committee who foisted digraphs
on the rest of us. I wouldn't mind so much if digraphs were actually
used in practice, but they aren't.

mar...@my-dejanews.com

unread,
Jan 29, 1999, 3:00:00 AM1/29/99
to
In article <36B06692...@null.net>,

"Douglas A. Gwyn" <DAG...@null.net> wrote:
> Larry Jones wrote:
> > Douglas A. Gwyn wrote:
> > > ... The arguments about lookahead etc. could just as well
> > > be applied to
> > > __LINE__
> > > as to
> > > __LINE__\
> > I disagree. In the first case, the token clearly ends before the
> > newline. In the second case, the backslash-newline does *not* end
> > the token, you have to look at the first character on the next line
> > before you know whether you've got an entire token or not, and thus
> > the token ends on the next line even if no characters from that line
> > are actually part of it.
>
> You (the LR parser) don't know that the first token has ended *until
> after you have read the newline*.

Yes, but the newline character is on the same line as the token. You have not
yet read any characters from the _next_ line. In the second case you have
actually read a character from the next line before you know you are done.

Mark Williams

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own

mar...@my-dejanews.com

unread,
Jan 29, 1999, 3:00:00 AM1/29/99
to
In article <36B0A199...@acm.org>,

Pete Becker <peteb...@acm.org> wrote:
> Larry Jones wrote:
> >
> > Douglas A. Gwyn wrote:
> > >
> > > Larry Jones wrote:
> > > > Douglas A. Gwyn wrote:
> > > > > ... The arguments about lookahead etc. could just as well
> > > > > be applied to
> > > > > __LINE__
> > > > > as to
> > > > > __LINE__\
> > > > I disagree. In the first case, the token clearly ends before the
> > > > newline. In the second case, the backslash-newline does *not* end
> > > > the token, you have to look at the first character on the next line
> > > > before you know whether you've got an entire token or not, and thus
> > > > the token ends on the next line even if no characters from that line
> > > > are actually part of it.
> > >
> > > You (the LR parser) don't know that the first token has ended *until
> > > after you have read the newline*.
> >
> > Yes, but it's easy enough to delay processing that newline until after
> > you've associated the current line number with the token. In the other
> > case, you have to read the backslash, the newline, and at least one
> > following character (more if that character is also a backslash), and
> > it *isn't* easy to delay processing an arbitrarily long sequence of
> > characters.
>
> But there is no "arbitrarily long sequence of characters" here. There's
> only backslash followed by an escaped character. Once it's determined
> that this can't be part of __LINE__ you know you're at the end of the
> token.

__LINE__\
\
\
\
\
\
\
\
\
\
\


I could go on :-)

Clive D.W. Feather

unread,
Jan 29, 1999, 3:00:00 AM1/29/99
to
In article <36B1090F...@technologist.com>, David R Tribble
<dtri...@technologist.com> writes

>> You (the LR parser) don't know that the first token has ended *until
>> after you have read the newline*.
>
>But is there a difficulty in saving the source line number for the
>last non-white (and non-backslash-newline) source character that was
>read? Once you're past the backslash-newline(s), you know the line
>number of the last (non-white) character of the token, right?

There's not a particular difficulty (I think), but the question is what
behaviours should and should not be allowed.

Douglas A. Gwyn

unread,
Jan 29, 1999, 3:00:00 AM1/29/99
to
Pete Becker wrote:
> My copy of the standard defines the meaning of __LINE__ in terms of the
> source line on which it occurs, not when an LR parser might recognize
> that the end of a token occurs. Does yours say something different?

Please don't jump into a conversation that you have not been tracking.

Clive D.W. Feather

unread,
Jan 29, 1999, 3:00:00 AM1/29/99
to
In article <36B1083D...@technologist.com>, David R Tribble
<dtri...@technologist.com> writes

>>>alternate punctuation keywords,
>> what ?
>These are the alternate spellings found in <iso646.h> (such as 'and'
>and 'or'). They aren't a problem in C (C89 or C9X) since they're
>just macros, but they are reserved/predefined keywords in C++

Ow. I forgot that particular C++ gratuitous change.

Clive D.W. Feather

unread,
Jan 29, 1999, 3:00:00 AM1/29/99
to
In article <78rnqq$bor$1...@shade.twinsun.com>, Paul Eggert
<egg...@twinsun.com> writes

>For example, when I added digraph support to GCC's preprocessor, I used
>the fact that `%:' is not a digraph if preceded by an odd number of
>`<'s, because the code scans backwards at that point!

Yuk. What's wrong with maximal munch in the forward direction ?

> The preprocessor
>normally needn't do anything special about `<' when expanding macros,
>but it must do so in the presence of digraphs, and it's more efficent
>for it to worry about this only when a potential digraph is discovered
>than to worry about it whenever `<' is discovered.

I don't even understand this comment. Can you please give an example ?

>This is not the only bit of hairy digraph code that appears in the GCC
>preprocessor. I admit that my thoughts at the time were less than kind
>about the people on the standardization committee who foisted digraphs
>on the rest of us.

You aren't the only one. [My personal take on their history is: "Denmark
forced them on us, USA refused to fight".]

Douglas A. Gwyn

unread,
Jan 29, 1999, 3:00:00 AM1/29/99
to
David R Tribble wrote:
> But is there a difficulty in saving the source line number for the
> last non-white (and non-backslash-newline) source character that was
> read? Once you're past the backslash-newline(s), you know the line
> number of the last (non-white) character of the token, right?

In effect, that's what the lexer has to do to get it right.

...
    if ( alphanumeric( c = getnext() ) )
        token_line_number = line_number;
...

instead of just

c = getnext();

somewhere deep inside the parser.

Douglas A. Gwyn

unread,
Jan 29, 1999, 3:00:00 AM1/29/99
to
Paul Eggert wrote:
> ... I admit that my thoughts at the time were less than kind
> about the people on the standardization committee who foisted digraphs
> on the rest of us. I wouldn't mind so much if digraphs were actually
> used in practice, but they aren't.

How do you know that they aren't?
Both trigraphs and digraphs were forced on us to obtain the support
of the Danish delegation to WG14, who insisted that they had a lot of
keyboards that didn't provide any convenient way to enter certain C
source characters.

Paul Eggert

unread,
Jan 29, 1999, 3:00:00 AM1/29/99
to
"Clive D.W. Feather" <cl...@on-the-train.demon.co.uk> writes:

>Yuk. What's wrong with maximal munch in the forward direction ?

Nothing's _incorrect_ about maximal munch. It's just harder to write,
and slower, that's all.

>In article <78rnqq$bor$1...@shade.twinsun.com>, Paul Eggert

>> The preprocessor
>>normally needn't do anything special about `<' when expanding macros,
>>but it must do so in the presence of digraphs, and it's more efficent
>>for it to worry about this only when a potential digraph is discovered
>>than to worry about it whenever `<' is discovered.

>I don't even understand this comment. Can you please give an example ?

Normally, when the preprocessor is processing a macro, it just copies
the definiens, looking for identifiers and `#'; the other characters
can just be copied through as-is, without tokenization. I'm omitting
details about whitespace, backslash-newline, strings, trigraphs,
multibyte characters, and so on; but the basic idea is that a
character-oriented preprocessor needn't worry about tokenizing when
it's analyzing the definiens of, say, `#define f(a,b) (a--<<--b)';
it can just copy the `--<<--' through without worrying about token
boundaries. This simplifies the writing of the preprocessor, and
makes it a tad faster.

The same basic idea can be used even in the presence of digraphs like
`%:', but it gets trickier. When you're analyzing
`#define f(a,b) (a<<<<<<<<<<<<%:b)', you must count the number of
`<'s before the `%:' to see whether the `%:' is really a `#'. Ugh.
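
A sketch of the backward-scanning check being described (my own
reconstruction of the idea, not the actual GCC code):

#include <stddef.h>

/* Given the text and the index of a '%' known to be followed by ':',
   decide whether the "%:" is really a '#' under maximal munch.  If an
   odd number of '<'s immediately precedes it, the last '<' pairs with
   the '%' to form the "<%" digraph, so the "%:" is not a token here. */
static int percent_colon_is_hash(const char *text, size_t i)
{
    size_t n = 0;
    while (i > 0 && text[i - 1] == '<') {
        n++;
        i--;
    }
    return n % 2 == 0;          /* even run of '<'s: "%:" really is # */
}

For `(a<<<<%:b)' the run of '<'s has even length, so the `%:' is a `#';
drop one `<' and the last `<' pairs off as the `<%' digraph instead.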

I have the distinct impression that the people who proposed digraphs
never implemented a character-oriented preprocessor for them, and I
have the sneaking suspicion that they didn't build a token-based
preprocessor either. I'm afraid that digraphs were an example of
specify now, implement later -- which is backwards from what the C
standard ought to be.

Paul Eggert

unread,
Jan 29, 1999, 3:00:00 AM1/29/99
to
"Douglas A. Gwyn" <DAG...@null.net> writes:

>Paul Eggert wrote:
>> I wouldn't mind so much if digraphs were actually
>> used in practice, but they aren't.

>How do you know that they aren't?

I know because I get the GCC bug reports, and nobody ever complains
about digraphs not working. :-)

We've had discussions before like this.

I recall your claiming that trigraphs don't break existing programs,
except for perhaps a ``handful'' of chess programs.
I showed several instances of breakage in widely used programs,
including GDB, f2c, and the JPEG library.
You said that I didn't provide enough examples,
and anyway the breakages weren't all that big a deal.

I also recall claiming that `long long' system types could break a lot
of existing code. You expressed skepticism, and said ``show me''.
So I showed you, with several examples of widely used code, including Apache.
I distinctly recall your pooh-poohing the evidence,
and saying that it was no big deal.

And this was for discussions where I was proving a positive,
and could show hard evidence. Now you're asking me to prove a negative!

I doubt whether you would be convinced by any evidence that I can supply.
I could inspect all the source code at my site for digraphs and
come up empty (as I'm sure that I would, except for the GCC test cases),
and you'd still say that perhaps some Dane somewhere
might be using digraphs in the corner of his garage.

Jerry Coffin

unread,
Jan 30, 1999, 3:00:00 AM1/30/99
to
In article <78sfam$fso$1...@nnrp1.dejanews.com>, mar...@my-dejanews.com
says...

[ ... ]

> Yes, but the newline character is on the same line as the token. You have not
> yet read any characters from the _next_ line. In the second case you have
> actually read a character from the next line before you know you are done.

I hate to say it, but I'm starting to wonder if there's any real point
to this discussion at all. The reality is that the compiler is free
to define what constitutes the end of a line, and it doesn't have to
bear any particularly close relationship with anything else on earth.

About the only limitation (that I can think of) is that preprocessor
lines really have to be treated as LINES, not just text. e.g. a
``#define'' that isn't at the beginning of a line (excluding white
space) isn't treated as a preprocessor directive. In addition, I
suppose the compiler is obliged to follow #line directives.

Other than that, the compiler is free to say that _nothing_
constitutes a new line, and simply not insert (or delete, at its
option) any new-line characters in the input, so all normal code is
treated as one long line.

At the opposite extreme, the compiler would be free to define every
semicolon outside of a string/character constant as being the end of a
"line" and count its lines that way.

As such, it's perfectly legal to have a __LINE__ on what most of us
would think of as the 250th line of a program, and have the compiler
tell us that it's, say, line 5 or line 360. In short, the compiler
can choose nearly ANY value it feels like for nearly any __LINE__, and
there's really no way to say it's legal or illegal. About the only
thing you can say about the value of __LINE__ is that they have to be
assigned in a non-decreasing order in the absence of #line directives.
There are undoubtedly a FEW other restrictions if a __LINE__ is at or
_VERY_ close to the beginning of a translation unit, but under most
circumstances, nearly ANY value can be assigned legally. In short,
it's an arbitrary number, and nearly the ONLY control is quality of
implementation.

sa...@bear.com

unread,
Jan 30, 1999, 3:00:00 AM1/30/99
to
In article <78t375$c38$1...@shade.twinsun.com>,

egg...@twinsun.com (Paul Eggert) wrote:
> "Clive D.W. Feather" <cl...@on-the-train.demon.co.uk> writes:
>
> >Yuk. What's wrong with maximal munch in the forward direction ?
>
> Nothing's _incorrect_ about maximal munch. It's just harder to write,
> and slower, that's all.
>

Sorry, but you should think of correctness first, and only then of speed.

> >In article <78rnqq$bor$1...@shade.twinsun.com>, Paul Eggert
> >> The preprocessor
> >>normally needn't do anything special about `<' when expanding macros,
> >>but it must do so in the presence of digraphs, and it's more efficent
> >>for it to worry about this only when a potential digraph is discovered
> >>than to worry about it whenever `<' is discovered.
>
> >I don't even understand this comment. Can you please give an example ?
>
> Normally, when the preprocessor is processing a macro, it just copies
> the definiens, looking for identifiers and `#'; the other characters
> can just be copied through as-is, without tokenization. I'm omitting
> details about whitespace, backslash-newline, strings, trigraphs,
> multibyte characters, and so on; but the basic idea is that a
> character-oriented preprocessor needn't worry about tokenizing when
> it's analyzing the definiens of, say, `#define f(a,b) (a--<<--b)';
> it can just copy the `--<<--' through without worrying about token
> boundaries. This simplifies the writing of the preprocessor, and
> makes it a tad faster.
>

I think that a preprocessor *should* tokenize for correct parsing. Of course,
it can delay tokenizing until needed. That means you tokenize up to '#define
f(' to recognize that it is a function-like macro, but can keep the rest of
the line as a string. I also think that a compiler should be integrated with
the preprocessor for speed (instead of reading the output of the
preprocessor). Then it can convert a pp-token to a token without tokenizing
again, and the first-time tokenization is not wasted. I know that there is a
place for a stand-alone preprocessor, but then it must first parse *correctly*.

> The same basic idea can be used even in the presence of digraphs like
> `%:', but it gets trickier. When you're analyzing
> `#define f(a,b) (a<<<<<<<<<<<<%:b)', you must count the number of
> `<'s before the `%:' to see whether the `%:' is really a `#'. Ugh.
>

Here the problem is that you are searching for a token (%:) without
tokenizing!! This kind of short-cut (a so-called optimization) makes the
maintenance of a lexer/parser impossible. If the language evolves, then you
have to visit the entire list of such short-cuts to see if they are still
valid. For example, C may add a new operator <<< (like Java) or an alternate
spelling for a punctuator (did anybody say digraph?). I know you are a
seasoned programmer and I do not want to sound condescending, but I feel
strongly about this from my experience.

> I have the distinct impression that the people who proposed digraphs
> never implemented a character-oriented preprocessor for them, and I
> have the sneaking suspicion that they didn't build a token-based
> preprocessor either.

I do not agree with you! Whether digraphs are useful to anybody is a
separate topic, but I do think that a well-written lexer should be capable
of handling a digraph (just another token).

> I'm afraid that digraphs were an example of
> specify now, implement later -- which is backwards from what the C
> standard ought to be.
>

-- Saroj Mahapatra

Pete Becker

unread,
Jan 30, 1999, 3:00:00 AM1/30/99
to

Thank you for the vacuous advice. Since I have actively participated in
this thread since its beginning, I don't see that it applies here.

Paul Eggert

unread,
Jan 30, 1999, 3:00:00 AM1/30/99
to
sa...@bear.com writes:

>I also think that a compiler should be integrated with the
>preprocessor for speed (instead of reading the output of preprocessor).

GCC currently has a build-time option that will let you substitute an
integrated preprocessor that does tokenization. Unfortunately, if you
select that option, GCC becomes buggier and slower. (So much for theory. :-)

If you'd like to help rectify this situation, I can put you in touch
with the maintainer for the integrated-preprocessor option. He's
gradually making it faster and more reliable. I'm sure that he
would appreciate some help.

>Here the problem is that you are searching for a token (%:) without
>tokenizing!!

I would say that the problem is that the people who added digraphs
didn't understand how a character-based preprocessor works.

Clearly you're in the ``all preprocessors should tokenize'' camp,
so you don't care whether someone changes the standard to render
character-based preprocessors infeasible. However, the standard
should be more catholic -- it should cater to existing practice,
and this includes both kinds of preprocessors.

Douglas A. Gwyn

unread,
Jan 31, 1999, 3:00:00 AM1/31/99
to
Jerry Coffin wrote:
> As such, it's perfectly legal to have a __LINE__ on what most of us
> would think of as the 250th line of a program, and have the compiler
> tell us that it's, say, line 5 or line 360. In short, the compiler
> can choose nearly ANY value it feels like for nearly any __LINE__, and
> there's really no way to say it's legal or illegal.

But such a compiler clearly does not conform to the requirements of
the C standard, which *does* define what constitutes an input line.

Douglas A. Gwyn

unread,
Jan 31, 1999, 3:00:00 AM1/31/99
to
Pete Becker wrote:
> Douglas A. Gwyn wrote:
> > Pete Becker wrote:
> > > My copy of the standard defines the meaning of __LINE__ in terms of the
> > > source line on which it occurs, not when an LR parser might recognize
> > > that the end of a token occurs. Does yours say something different?
> > Please don't jump into a conversation that you have not been tracking.
> Thank you for the vacuous advice. Since I have actively participated in
> this thread since its beginning, I don't see that it applies here.

Well, then, please pay attention.

Douglas A. Gwyn

unread,
Jan 31, 1999, 3:00:00 AM1/31/99
to
Paul Eggert wrote:

> sa...@bear.com writes:
> >Here the problem is that you are searching for a token (%:) without
> >tokenizing!!
> I would say that the problem is that the people who added digraphs
> didn't understand how a character-based preprocessor works.

Sure we did. In fact we debated the very point, and the proponents
of digraphs-as-tokens prevailed. I'm sorry you weren't participating,
as the outcome might then have been more to your liking.

Personally I don't think there was *ever* a need for trigraphs,
digraphs, or \u-escapes in the C standard. How input characters
are coded should never have been a C language issue.

Pete Becker

unread,
Jan 31, 1999, 3:00:00 AM1/31/99
to
Douglas A. Gwyn wrote:
>
> Well, then, please pay attention.

Thank you for your input. I'll give it the consideration it deserves.

Michael Rubenstein

unread,
Jan 31, 1999, 3:00:00 AM1/31/99
to
On Sun, 31 Jan 1999 07:08:15 GMT, "Douglas A. Gwyn" <DAG...@null.net>
wrote:

>Jerry Coffin wrote:
>> [...]
>But such a compiler clearly does not conform to the requirements of
>the C standard, which *does* define what constitutes an input line.

Where? I can't find this definition. What I can find is (5.2.1)

In source files, there shall be some way of indicating the
end of each line of text; this International Standard treats
such an end-of-line indicator as if it were a single new-line
character.

My implementation for the Kludge 9000 super-duper computer is that the
end of line character is a line-feed character if

it occurs on a line that begins with any number of spaces and
tabs followed by a #

or

it is followed by any number of spaces and tabs followed by a #

In other contexts a line-feed in a source file is translated to a space.

Thus the program

#include <stdio.h>
int main(void)
{
    printf("%d\n", __LINE__);
    return 0;
}

prints 2.
--
Michael M Rubenstein

Jerry Coffin

unread,
Jan 31, 1999, 3:00:00 AM1/31/99
to
In article <36B4013A...@null.net>, DAG...@null.net says...

> Jerry Coffin wrote:
> > As such, it's perfectly legal to have a __LINE__ on what most of us
> > would think of as the 250th line of a program, and have the compiler
> > tell us that it's, say, line 5 or line 360. In short, the compiler
> > can choose nearly ANY value it feels like for nearly any __LINE__, and
> > there's really no way to say it's legal or illegal.
>
> But such a compiler clearly does not conform to the requirements of
> the C standard, which *does* define what constitutes an input line.

Please re-read section 5.1.1.2 of the standard, with an emphasis on
phase 1 of translation. Pay close attention to the fact that the
standard says new-line characters will be introduced as a substitution
for "end-of-line indicators", but carefully does NOT define what
constitutes an end-of-line indicator. After doing so, attempt to find
a single part of the standard that is violated by a phase-one mapping
such as the following:

\\$ -> <nothing>
\n[ \t\r\n]*#\(.*\)$ -> \n#\1
\n -> <nothing>

I've looked several times for a limitation on the mapping done in
phase one of translation, and I can't find one. Absent such a
limitation, I believe the mapping given above is legal. It results in
each preprocessor line being a line by itself, and virtually
everything else appearing as one really LONG line.


Philip Lantz

unread,
Feb 1, 1999, 3:00:00 AM2/1/99
to
Jerry Coffin wrote:
>
> \\$ -> <nothing>
> \n[ \t\r\n]*#\(.*\)$ -> \n#\1
> \n -> <nothing>

I think that third line had better be
\n -> <space>

prl

John Hauser

unread,
Feb 1, 1999, 3:00:00 AM2/1/99
to
Jerry Coffin wrote:
> Pay close attention to the fact that the
> standard says new-line characters will be introduced as a substitution
> for "end-of-line indicators", but carefully does NOT define what
> constitutes an end-of-line indicator. After doing so, attempt to find
> a single part of the standard that is violated by a phase-one mapping
> such as the following:
>
> \\$ -> <nothing>
> \n[ \t\r\n]*#\(.*\)$ -> \n#\1
> \n -> <nothing>

If I understand all this syntax, I think you've made it harder to
continue a preprocessor directive than you likely intended. Consider
the following:

/* First \n follows this comment. */
#define MAIN \
int main()
MAIN { return 0; }

As I understand it, this becomes logically equivalent to a single macro
definition:

#define MAIN int main() MAIN { return 0; }

I don't claim that that invalidates your mapping, but I do expect it
would be infuriating to actual programmers, which is not, I think,
what you wanted.

(Also, there's a minor problem with preprocessor directives appearing
on the first source line, which I leave to you to discover.)

- John Hauser

James Kuyper

unread,
Feb 1, 1999, 3:00:00 AM2/1/99
to
Jerry Coffin wrote:
...

> Please re-read section 5.1.1.2 of the standard, with an emphasis on
> phase 1 of translation. Pay close attention to the fact that the

> standard says new-line characters will be introduced as a substitution
> for "end-of-line indicators", but carefully does NOT define what
> constitutes an end-of-line indicator. After doing so, attempt to find
> a single part of the standard that is violated by a phase-one mapping
> such as the following:
>
> \\$ -> <nothing>
> \n[ \t\r\n]*#\(.*\)$ -> \n#\1
> \n -> <nothing>
>
> I've looked several times for a limitation on the mapping done in
> phase one of translation, and I can't find one. Absent such a
> limitation, I believe the mapping given above is legal. It results in
> each preprocessor line being a line by itself, and virtually
> everything else appearing as one really LONG line.

Phase 1 translation is pretty open-ended. It's so open-ended that as a
practical matter, it's not useful to talk about what the standard
requires in terms of what goes into phase 1, except for purposes of
discussing phase 1. The compiler can choose to translate just about
anything it wants into the raw characters used for later phases. For all
later phases, such as the phase 4 expansion of __LINE__, we should be
talking in terms of text after it's been processed through phase 1.
Thus, when I say that my program contains the following text:

int x = __LI\
NE__;

what I mean is that the newlines in my displayed text should be
considered to represent whatever sequence of real characters my
compiler's phase 1 requires me to insert, so that it can translate them
into newlines.

David R Tribble

unread,
Feb 1, 1999, 3:00:00 AM2/1/99
to
"Douglas A. Gwyn" wrote:
> Personally I don't think there was *ever* a need for trigraphs,
> digraphs, or \u-escapes in the C standard. How input characters
> are coded should never have been a C language issue.

It would be nice if C allowed ISO-8859-1 characters as its source
character set. This would doubtless make many Europeans happy, and
would have the side effect of making their SBCS source code portable
to other countries like the USA.

But C doesn't do this. So the only portable way to allow source
code containing identifiers like grün and número is to use some
encoding that is limited to the 94 or so ISO-646 characters, i.e.,
UCNs.

Personally, I'd like for all text-based systems (including C
compilers) to adopt Unicode files as their input/output norm.
Of course, I'd also like to see European keyboards have the '['
and ']' keys on them, too (not because I use a Danish keyboard, but
because I'd like to see these arguments over trigraphs and digraphs
become unnecessary). But I'm not holding my breath for either of
these.

-- David R. Tribble, dtri...@technologist.com --
Old headlines: British Left Waffles on Falklands

Jerry Coffin

unread,
Feb 2, 1999, 3:00:00 AM2/2/99
to
In article <36B62420...@cs.berkeley.edu>,
jha...@cs.berkeley.edu says...

> Jerry Coffin wrote:
> > Pay close attention to the fact that the
> > standard says new-line characters will be introduced as a substitution
> > for "end-of-line indicators", but carefully does NOT define what
> > constitutes an end-of-line indicator. After doing so, attempt to find
> > a single part of the standard that is violated by a phase-one mapping
> > such as the following:
> >
> > \\$ -> <nothing>
> > \n[ \t\r\n]*#\(.*\)$ -> \n#\1
> > \n -> <nothing>
>
> If I understand all this syntax, I think you've made it harder to
> continue a preprocessor directive than you likely intended. Consider
> the following:

I probably didn't state things as well as I should have -- I spent a
total of maybe five minutes putting together an obvious mapping that
would demonstrate a general idea: that the vast majority of a file can
end up as a single "line" and still be in accordance with the
standard.

After a bit more thought, it looks like there would probably have to
be about five rules instead of the three I used. The third should map
to a space instead of nothing, and one would need to be added to deal
with a preprocessor directive at the beginning of the file. I'm not
sure, but I _think_ it might also be necessary to deal specially with
an include file that doesn't end in an end-of-line indicator as well,
though I haven't looked up whether that gives defined results or not.

With careful examination of the standard, it would probably be
possible to come up with a couple of other things that might require a
bit more work to process as well. However, I'm quite certain that
none of these would change the basic fact that a phase one translator
on the same general order as the one above would be legal. Such a
thing would result in nearly NO relationship between the value of
__LINE__ and what most of us would think of as a line in the input
file.

Jerry Coffin

unread,
Feb 2, 1999, 3:00:00 AM2/2/99
to
In article <36B61E...@ibeam.intel.com>, p...@ibeam.intel.com
says...

> Jerry Coffin wrote:
> >
> > \\$ -> <nothing>
> > \n[ \t\r\n]*#\(.*\)$ -> \n#\1
> > \n -> <nothing>
>
> I think that third line had better be
> \n -> <space>

Yes, you're absolutely right -- we wouldn't want tokens getting
spliced.

Antoine Leca

unread,
Feb 2, 1999, 3:00:00 AM2/2/99
to

> "Douglas A. Gwyn" wrote:
> > Personally I don't think there was *ever* a need for trigraphs,
> > digraphs, or \u-escapes in the C standard. How input characters
> > are coded should never have been a C language issue.

That is one thing.

In article <36B65A82...@technologist.com>, David R Tribble wrote:
>
> It would be nice if C allowed ISO-8859-1 characters as its source
> character set.

That is another thing.

And the current trend is exactly that, PLUS almost any other
encoding, allowing the use of almost any language: French (which
cannot be written correctly using 8859-1), or Russian, or Japanese
(no preference order implied here).

The problem Doug expressed (as I see it) is that the solution
found to cope with this (and with the keyboards as well) is/was to
overspecify the syntax of C sources, down to a level that he
believed should not belong in a language standard.
Doug, please correct me if I am misrepresenting your view.


> But C doesn't do this.

Huh?
Are you kidding? Or what am I missing in your point?

C9X (I assume you are not speaking about C90) tries to find a broader
solution, so it is not *restricted* to 8859-1 alone but is open to any
character set (even if it is not entirely compatible with Unicode).

That said, you are right that obtuse implementations can refuse to
handle a source written using ISO/IEC 8859-1.

That makes sense if you are, e.g., behind an IBM 3270 screen, or
perhaps if you are addicted to Microsoft and willing to use code
page 1252 idiosyncrasies or automatic locale adaptation...
It also makes sense if you are building an international library
intended to be used by others who may have difficulties with
8859-1-encoded identifiers.

OTOH, I doubt Unix vendors with paying customers in Europe will refuse
to provide an option to accept sources written in ISO/IEC 8859-1.
Likewise for generic tools like gcc (which in fact will not care
about the encoding at all, as long as it is an extension of US-ASCII:
the current specifications are intended to allow this behaviour).


Of course, sources written in ISO/IEC 8859-1 cannot claim to be
fully portable. Neither can sources written in US-ASCII, because
these are just encodings, not "*the* source character set" in their own right.


> So the only portable way to allow source code containing
> identifiers like grün and número is to use some encoding that is
> limited to the 94 or so ISO-646 characters, i.e., UCNs.

You, as a programmer, are certainly not required to use these forms:
they are here to allow correct exchange of files between platforms.
Like the original *intent* of the trigraphs.
I have even seen an editor that does on-the-fly conversion between
strict US-ASCII with UCNs in the file and multinational text on
the screen. Cool! <URL:http://www.sharmahd.com/unipad>

The trigraphs failed, but then they did not bring any functionality
to the vast majority of users (note I did not write programmers).

OTOH, UCNs do provide new functionality (although perhaps awkward,
at least in the beginning, before whole toolchains are updated to
handle this name mangling).


> Personally, I'd like for all text-based systems (including C
> compilers) to adopt Unicode files as their input/output norm.

And do you want little-endian, or big-endian?
UCS-2, or UTF-16, or UCS-4?

Or perhaps UTF-8...
I am sure the move from US-ASCII to UTF-8 is a very simple one, and
it would help if a handful of programmers made this very move on key
tools like the GCC preprocessor (no offence intended ;-)).

> Of course, I'd also like to see European keyboards have the '['
> and ']' keys on them, too

No chance.
The keyboards have standardized on ~47 alphanumeric keys, and I feel
that is more than enough. The EC is talking about adding a new key,
specifically for the euro sign, and I really do not want ~6 more keys
to accommodate all the strange signs like [] {} ~ # that I never use
except for typing C sources. OTOH, I love having all the unnecessary
(to you) keys with the accented characters that make my language
much more readable to me...
And that is not going to change.


> But I'm not holding my breath for either of these.

So you are not Obelix ;-)


Antoine

Paul Eggert

unread,
Feb 2, 1999, 3:00:00 AM2/2/99
to
Antoine Leca <Antoin...@renault.fr> writes:

>I am sure the move from US-ASCII to UTF-8 is a very simple one, and
>if a handful of programmers is doing this very move on key tools like
>the GCC preprocessor (no offence intended ;-))

The biggest problem with UTF-8 from the GCC point of view is external
identifiers, as most assemblers and linkers won't allow UTF-8 identifiers.
The GNU assembler does allow UTF-8 (indeed, it allows any string of
non-null bytes, which is important for non-UTF-8 encodings), but GCC
must port to other assemblers.

We're still discussing how to mangle identifiers into an A-Za-z0-9_
form that is acceptable to traditional assemblers and linkers, without
usurping the traditional name space. I have a proposal that

* allows arbitrary byte strings in identifiers,
* leaves traditional (and Standard C90) identifiers alone, and
* uses at most 3 bytes per UTF-8 Unicode character,
plus 2 overhead bytes for the whole identifier;

which I can forward to this forum if there's interest.
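
For illustration only, here is a toy mangling of my own, NOT the proposal
described above (it does not meet the 3-bytes-per-character bound, and the
"Ux_" prefix is purely hypothetical): identifiers made solely of
[A-Za-z0-9_] pass through untouched, anything else is hex-encoded behind
the prefix. A real scheme would also have to keep the prefix itself out of
the traditional name space, which this sketch does not attempt.

#include <stdio.h>
#include <string.h>

static int is_traditional(const char *id)
{
    for (; *id; id++) {
        if (!((*id >= 'A' && *id <= 'Z') || (*id >= 'a' && *id <= 'z') ||
              (*id >= '0' && *id <= '9') || *id == '_'))
            return 0;
    }
    return 1;
}

/* out must hold at least 3 + 2*strlen(id) + 1 characters */
static void mangle(const char *id, char *out)
{
    if (is_traditional(id)) {
        strcpy(out, id);          /* traditional identifiers unchanged */
        return;
    }
    out += sprintf(out, "Ux_");
    for (; *id; id++)             /* two hex digits per byte */
        out += sprintf(out, "%02x", (unsigned char)*id);
}

int main(void)
{
    char buf[256];

    mangle("gr\xc3\xbcn", buf);   /* the UTF-8 bytes of the identifier grün */
    printf("%s\n", buf);          /* prints Ux_6772c3bc6e */
    return 0;
}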

>gcc (which in fact will not care
>about the encoding at all, as long as it is an extension of US-ASCII:
>the current specifications are intended to allow this behaviour).

Unfortunately, some common encodings are not an extension of US-ASCII.
It's not just the Danes; it's also Shift-JIS. If GCC is to support
Shift-JIS (which is the most popular encoding for Japanese text, alas)
it will have to care about encodings. The latest EGCS snapshots have
limited Shift-JIS support, but it is in a hacky form that will probably
be revised, once a more general scheme is in place.

Douglas A. Gwyn wrote:

> Personally I don't think there was *ever* a need for trigraphs,
> digraphs, or \u-escapes in the C standard. How input characters
> are coded should never have been a C language issue.

I tend to agree, particularly after having to implement digraphs
and \u-escapes (thankfully, someone else did trigraphs).

Ross Ridge

unread,
Feb 4, 1999, 3:00:00 AM2/4/99
to
Paul Eggert <egg...@twinsun.com> wrote:
>We're still discussing how to mangle identifiers into an A-Za-z0-9_
>form that is acceptable to traditional assemblers and linkers, without
>usurping the traditional name space. I have a proposal that
>
>* allows arbitrary byte strings in identifiers,
>* leaves traditional (and Standard C90) identifiers alone, and
>* uses at most 3 bytes per UTF-8 Unicode character,
> plus 2 overhead bytes for the whole identifier;

Unicode's surrogate characters might be a problem. Two of them form one
UTF-8 character. They effectively extend the range of UTF-8 characters
to more than 16 bits, though I can't remember by how much. (You can
also use UTF-8 to encode ISO 10646-1's 31-bit characters...)

>>gcc (which in fact will not care
>>about the encoding at all, as long as it is an extension of US-ASCII:
>>the current specifications are intended to allow this behaviour).
>
>Unfortunately, some common encodings are not an extension of US-ASCII.
>It's not just the Danes; it's also Shift-JIS. If GCC is to support
>Shift-JIS (which is the most popular encoding for Japanese text, alas)
>it will have to care about encodings.

I'm not sure what you mean here. No Japanese encoding is a strict
extension of US-ASCII, as they all replace the backslash with the yen
sign character. However, this is something most applications should
ignore, especially compilers. Shift-JIS is a multi-byte character set,
but so are all the other Japanese encodings. Do you mean the
problem where the second byte of a Shift-JIS character can have the
value of a printable ASCII character? (A big problem, but it could be
worse...)

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rri...@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/u/rridge/
db //

Colin Plumb

unread,
Feb 5, 1999, 3:00:00 AM2/5/99
to
While it's certainly possible to do all kinds of tricky definitions
of "what is a line" to fight the standard over the correct
substitution value for __LINE__, let's step back and look at the
reason that anyone cares about line numbers in the source code in the
first place.

Other than IOCCC entries, the purpose is debugging. That's why lexers
kept track of the data before __LINE__ was even considered; they wanted
to report the error so you can put your editor near the right place.

Ideally, on the right line, but it may be a little bit off in the presence
of contrived input.

Almost any definition is certainly implementable, but I think that
tight definitions in pathological cases just complicate the lexer, which
decreases speed, increases bugs and shortens compiler writers' tempers.
And I fail to see how it improves C programmers' lives.

So I'd like to argue for a loose definition. In particular, any
one of the lines in a \-continued stream will do. The rule adopted by
dmr's compiler seems the most logical and intuitive to me,
so I certainly don't want that ruled out.
--
-Colin


/* Line 1 */
i\
n\
t\
\
i\
=\
_\
_\
L\
I\
N\
E\
_\
_\
,\
j\
;\

/* Line 20 */
I think that any value for i from 2 to 19 is acceptable.

Antoine Leca

unread,
Feb 5, 1999, 3:00:00 AM2/5/99
to
In article <F6nJB...@undergrad.math.uwaterloo.ca>,

rri...@calum.csclub.uwaterloo.ca (Ross Ridge) wrote:
> Paul Eggert <egg...@twinsun.com> wrote:
> >We're still discussing how to mangle identifiers into an A-Za-z0-9_
> >form that is acceptable to traditional assemblers and linkers, without
> >usurping the traditional name space. I have a proposal that
> >
> >* allows arbitrary byte strings in identifiers,
> >* leaves traditional (and Standard C90) identifiers alone, and
> >* uses at most 3 bytes per UTF-8 Unicode character,
> > plus 2 overhead bytes for the whole identifier;
>
> Unicode's surrogate characters might be a problem. Two of them form one
> UTF-8 character.

Correct.
Also to be noted is that none of these characters are currently defined
(except for private use).

> They effectively extend the range of UTF-8 characters
> to more than 16-bits, though I can't remember by how much.

Up to U-0010FFFF (that is a little more than 20 bits).
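
A quick arithmetic check of that figure, as a hypothetical helper (the
function name is mine): a surrogate pair contributes 10 bits from each
half on top of an offset of 0x10000.

/* hi must be in 0xD800..0xDBFF, lo in 0xDC00..0xDFFF */
unsigned long code_point_from_surrogates(unsigned hi, unsigned lo)
{
    return 0x10000UL + (((unsigned long)(hi - 0xD800) << 10) | (lo - 0xDC00));
}
/* maximum: hi = 0xDBFF, lo = 0xDFFF  ->  0x10000 + 0xFFFFF = 0x10FFFF */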


> (You can also use UTF-8 to encode ISO 10646-1's 31-bit characters...)

Also correct.
However, except for the private use planes and groups, I do not
expect any character to be encoded outside the space described above
(up to U-0010FFFF).
I may be wrong in a couple of years, if new scripts with thousands
of characters are discovered.


> >>gcc (which in fact will not care
> >>about the encoding at all, as long as it is an extension of US-ASCII:
> >>the current specifications are intended to allow this behaviour).
> >
> >Unfortunately, some common encodings are not an extension of US-ASCII.
> >It's not just the Danes; it's also Shift-JIS. If GCC is to support
> >Shift-JIS (which is the most popular encoding for Japanese text, alas)
> >it will have to care about encodings.
>
> I'm not sure what you mean here. No Japanese encoding is a strict
> extension of US-ASCII, as they all replace the backslash with the yen
> sign character. However, this is something most applications should
> ignore, especially compilers.

Yes, but I believe this is not the point.

> Shift-JIS is a multi-byte character set,
> but so are all the other Japanese encodings. Do you mean the
> problem where the second byte of a Shift-JIS character can have the
> value of a printable ASCII character? (A big problem, but it could be
> worse...)

This one is very delicate if the compiler does *not* know that it is given
a Shift-JIS source (and I believe this is the problem Paul had in mind).

For example, see
#if defined ENGLISH
#define MSG_ERROR error message in English
#elif defined JAPANESE
#define MSG_ERROR ...untransmittable in US-ASCII, ending with: <DE>\
#endif

<DE>\ is (AFAIK) a valid character in Shift-JIS.
If the parser is not aware that the source is written in Shift-JIS,
for example if it is compiling with ENGLISH defined ;-),
I fear it will have big problems handling the above snippet...

BTW, this is not an example of good i18n programming style!

Paul Eggert

unread,
Feb 5, 1999, 3:00:00 AM2/5/99
to
rri...@calum.csclub.uwaterloo.ca (Ross Ridge) writes:

>Unicode's surrogate characters might be a problem. Two of them form one
>UTF-8 character. They effectively extend the range of UTF-8 characters
>to more than 16 bits, though I can't remember by how much. (You can
>also use UTF-8 to encode ISO 10646-1's 31-bit characters...)

My proposal encodes any UTF-8 sequence in an identifier (actually, it's
more general: it encodes any byte sequence) so these shouldn't be a
problem. E.g. a 6-byte UTF-8 sequence (which encodes an ISO 10646
character in the currently-unassigned range 04000000-7FFFFFFF) is
represented by 7 ASCII bytes; the first byte is `Y' and the other 6
bytes are each in the set `A-Za-z0-9_'.
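
For context, here is a sketch (mine, not part of the proposal) of the
original UTF-8 length rule for a 31-bit ISO 10646 value, in which the
6-byte case mentioned above is the top range:

int utf8_len(unsigned long c)         /* c is an ISO 10646 value */
{
    if (c < 0x80UL)      return 1;    /* US-ASCII                */
    if (c < 0x800UL)     return 2;
    if (c < 0x10000UL)   return 3;    /* rest of the BMP         */
    if (c < 0x200000UL)  return 4;
    if (c < 0x4000000UL) return 5;
    return 6;                         /* 04000000-7FFFFFFF       */
}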

>>If GCC is to support Shift-JIS ... it will have to care about encodings.

>Do you mean the the problem where the second byte of a ShiftJIS character
>can have the value of a printable ASCII character?
>(A big problem, but it could be worse...)

Yes, that's what I was talking about. Mainly it is a problem in
strings, in Japanese-only programs. It can also cause comments
to be botched. E.g.

// [Some Shift-JIS text whose last byte looks like an ASCII backslash]
i++;

silently comments out the `i++;', unless the compiler knows how to
handle Shift-JIS correctly.
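
To make the failure mode concrete, here is a rough sketch of my own (not
GCC's actual handling, and using only the common Shift-JIS lead-byte
ranges 0x81-0x9F and 0xE0-0xEF): it scans a line of Shift-JIS bytes and
reports whether a final 0x5C is merely the trailing byte of a double-byte
character, such as the katakana letter SO (0x83 0x5C), rather than a real
backslash.

#include <stdio.h>

/* Returns 1 if the final byte is 0x5C only because it is the trailing
   byte of a double-byte Shift-JIS character; 0 if it is a genuine
   backslash, or if the line does not end in 0x5C at all. */
static int ends_in_false_backslash(const unsigned char *s, size_t n)
{
    size_t i = 0;

    if (n == 0 || s[n - 1] != 0x5C)
        return 0;
    while (i < n - 1) {
        unsigned char c = s[i];
        if ((c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xEF))
            i += 2;               /* lead byte: skip the whole character */
        else
            i += 1;               /* single-byte character */
    }
    /* If the scan lands exactly on the last byte, that 0x5C starts a new
       character and really is a backslash; otherwise it is a trail byte. */
    return i != n - 1;
}

int main(void)
{
    /* "// " followed by katakana SO, whose Shift-JIS bytes are 0x83 0x5C */
    const unsigned char line[] = { '/', '/', ' ', 0x83, 0x5C };

    printf("%d\n", ends_in_false_backslash(line, sizeof line));   /* 1 */
    return 0;
}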
