Ignore break line sometimes

Geovani de Souza

Feb 11, 2012, 9:56:17 AM

Hi all!

I'm trying to write a parser for my compiler, and I'm interested in ignoring the line break (\n) in some places. E.g.:

if true then [\n]
foo(); [\n]
end; [\n]

So, in the first line, the '\n' after 'then' isn't important, but in the second line the '\n' after "foo();" could replace the semicolon as the statement terminator, and the same goes for the line with 'end'.

Also, '\n' should be ignored on blank lines.

How can I do this?

Hans-Peter Diettrich

Feb 11, 2012, 11:28:36 AM

Geovani de Souza schrieb:

> I'm trying to write a parser for my compiler, and I'm interested in
> ignoring the line break (\n) in some places. E.g.:
>
> if true then [\n]
> foo(); [\n]
> end; [\n]
>
> So, in the first line, the '\n' after 'then' isn't important, but in
> the second line the '\n' after "foo();" could replace the semicolon as
> the statement terminator, and the same goes for the line with 'end'.

That's why many (compiled) languages ignore line ends and other
whitespace, and require explicit statement termination, e.g. by a
semicolon. Interpreters instead often prefer the "one statement per
line" approach, with the option to concatenate statements by e.g. a colon.

IMO you should make a decision about the meaning of whitespace in
general, and of line endings in detail, in your language.

Please give an example that would compile differently when linefeeds are
removed, and then ask yourself whether this really makes sense.

DoDi

George Neuner

Feb 11, 2012, 12:59:57 PM

IMO making the newlines significant is a really bad idea ... but
leaving that aside I believe the most effective way would be to have
your lexer return a special "end-of-line" code for either semicolon or
newline and make the end-of-line code optional wherever the grammar
does not require it.
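
A minimal sketch of the lexer half of this idea in Python, assuming a toy language with only words, parentheses, semicolons and newlines. The names `TOKEN_RE` and `tokenize` are illustrative and not from the thread.

```python
import re

# Toy token pattern (an assumption for this sketch, not from the thread).
TOKEN_RE = re.compile(
    r"(?P<EOL>[;\n])"          # semicolon or newline: the same token kind
    r"|(?P<WORD>\w+)"
    r"|(?P<PUNCT>[()])"
    r"|(?P<SKIP>[ \t\r]+)"     # other whitespace is dropped
)


def tokenize(src):
    """Return (kind, text) pairs; ';' and '\\n' both come back as 'EOL'."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r}")
        pos = m.end()
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

The parser then only ever sees one terminator kind, and the grammar decides where an `EOL` is required (after a statement) and where it is optional (after 'then', on blank lines).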

You don't say whether your parser is handwritten or tool generated (or
which tools) ... so I can't really give an example.

George

Karsten Nyblad

Feb 12, 2012, 3:21:48 AM

> I'm trying to write a parser for my compiler, and I'm interested in
> ignoring the line break (\n) in some places. E.g.:
>
> if true then [\n]
> foo(); [\n]
> end; [\n]

One option is to write a recursive descent parser, and have two ways
of calling the lexer: one that returns line ends and one that does not.
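
The first option might look roughly like this in Python. It is only a sketch for a toy grammar (an `if ... then ... end` block and `name()` calls); the `Lexer` class, its `skip_newlines` flag and the helper names are assumptions made for illustration.

```python
import re

TOKEN_RE = re.compile(r"(?P<NL>\n)|(?P<TOK>\w+|[();])|(?P<WS>[ \t\r]+)")
EOF, NL = "<EOF>", "<NL>"


class Lexer:
    def __init__(self, src):
        self.src, self.pos = src, 0

    def next(self, skip_newlines):
        """Return the next token; newlines are returned only when asked for."""
        while self.pos < len(self.src):
            m = TOKEN_RE.match(self.src, self.pos)
            if m is None:
                raise SyntaxError(f"unexpected character {self.src[self.pos]!r}")
            self.pos = m.end()
            if m.lastgroup == "TOK":
                return m.group()
            if m.lastgroup == "NL" and not skip_newlines:
                return NL
        return EOF

    def peek(self, skip_newlines):
        saved = self.pos
        tok = self.next(skip_newlines)
        self.pos = saved            # rewind: peeking consumes nothing
        return tok


def expect(lx, text):
    tok = lx.next(skip_newlines=True)
    if tok != text:
        raise SyntaxError(f"expected {text!r}, got {tok!r}")


def parse_stmt(lx):
    tok = lx.next(skip_newlines=True)       # blank lines before a statement
    if tok == "if":
        cond = lx.next(skip_newlines=True)
        expect(lx, "then")
        body = []
        # The newline after 'then' is never seen by the parser.
        while lx.peek(skip_newlines=True) not in ("end", EOF):
            body.append(parse_stmt(lx))
        expect(lx, "end")
        node = ("if", cond, body)
    elif tok.isidentifier():
        expect(lx, "(")
        expect(lx, ")")
        node = ("call", tok)
    else:
        raise SyntaxError(f"unexpected {tok!r}")
    # Only here does a newline matter: it may stand in for the semicolon.
    tok = lx.peek(skip_newlines=False)
    if tok in (";", NL):
        lx.next(skip_newlines=False)
    elif tok not in ("end", EOF):
        raise SyntaxError(f"expected ';' or newline, got {tok!r}")
    return node


def parse_program(src):
    lx, stmts = Lexer(src), []
    while lx.peek(skip_newlines=True) != EOF:
        stmts.append(parse_stmt(lx))
    return stmts
```

Because the lexer here is purely position-based and `peek` rewinds, switching between the two ways of calling it never leaves a stale lookahead token behind.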

Another option is to base your parsing on a parser generator like
bison, and modify the code that drives the automaton. That code is
modified such that when the lexer returns a line feed token, you copy
the stack of states, and on the copy you simulate the actions that the
parser would have taken. When the simulation stacks the line feed, you
throw away the copy and resume parsing on the real stack with the line
feed in the window. When the simulation encounters an error, you throw
away the simulation AND the line feed and call the lexer again.
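
To illustrate only the control flow of that driver, here is a Python sketch. `ToyParser` is a hand-written stand-in, not bison output, and all names are hypothetical. In a real LR driver the copy matters because reductions can run before the error is detected; here `feed` is a single step, so the copy is kept just to mirror the description.

```python
import copy

NL = "<NL>"
KEYWORDS = {"if", "then", "end", NL}


class ToyParser:
    """Stand-in for a generated parser: feed() raises SyntaxError on a bad token."""

    def __init__(self):
        self.state = "stmt_start"
        self.depth = 0              # number of open 'if' blocks

    def feed(self, tok):
        if self.state == "stmt_start" and tok == "if":
            self.state = "cond"
        elif self.state == "cond" and tok not in KEYWORDS:
            self.state = "then"
        elif self.state == "then" and tok == "then":
            self.depth += 1
            self.state = "stmt_start"
        elif self.state in ("stmt_start", "after_stmt") and tok == "end" and self.depth > 0:
            self.depth -= 1
            self.state = "after_stmt"
        elif self.state == "stmt_start" and tok not in KEYWORDS:
            self.state = "after_stmt"       # a one-word statement
        elif self.state == "after_stmt" and tok == NL:
            self.state = "stmt_start"       # newline terminates the statement
        else:
            raise SyntaxError(f"unexpected {tok!r} in state {self.state}")


def parse(src):
    """Feed tokens to the parser, dropping newlines the grammar cannot accept."""
    parser, accepted = ToyParser(), []
    for tok in src.replace("\n", f" {NL} ").split():
        if tok == NL:
            trial = copy.deepcopy(parser)   # simulate on a copy of the state
            try:
                trial.feed(NL)
            except SyntaxError:
                continue                    # discard the copy AND the newline
        parser.feed(tok)                    # resume on the real parser
        accepted.append(tok)
    if parser.depth:
        raise SyntaxError("missing 'end'")
    return accepted
```

With this toy grammar, the newline after 'then' and the newline on a blank line are discarded, while the newline after a statement is kept as its terminator.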

If you choose the second option, it is important that you choose the right
parser generator, because some parser generators already generate code
that can help you. Many LR parser generators, e.g., bison, include
facilities for generalised LR parsing, and many LL parser generators
include facilities for backtracking. That might help you.

Karsten Nyblad

Stefan Monnier

Feb 12, 2012, 10:48:20 AM

> So, in the first line, the '\n' after 'then' isn't important, but in the
> second line the '\n' after "foo();" could replace the semicolon as the
> statement terminator, and the same goes for the line with 'end'.

A simple approach is to treat every newline as a semi-colon, and then to
adapt your grammar so as to accept (and ignore) extra semi-colons.
I.e. accept "if true then; foo(); ; end; ;"
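
A small Python sketch of this approach, assuming a toy grammar with an `if ... then ... end` block and `name()` calls. The function names are illustrative, and `re.findall` silently skips characters it does not recognise, which a real lexer should not do.

```python
import re


def tokenize(src):
    # Every newline is rewritten to ';' before tokenizing.
    return re.findall(r"\w+|[();]", src.replace("\n", ";"))


def parse_stmt(toks, i):
    if toks[i] == "if":
        if toks[i + 2:i + 3] != ["then"]:
            raise SyntaxError("expected 'then'")
        cond = toks[i + 1]
        body, i = parse_block(toks, i + 3, closers=("end",))
        if i >= len(toks):
            raise SyntaxError("missing 'end'")
        return ("if", cond, body), i + 1        # skip 'end'
    if toks[i + 1:i + 3] != ["(", ")"]:
        raise SyntaxError(f"bad statement at {toks[i]!r}")
    return ("call", toks[i]), i + 3


def parse_block(toks, i, closers):
    stmts = []
    while i < len(toks) and toks[i] not in closers:
        if toks[i] == ";":                      # empty statement: accept and ignore
            i += 1
            continue
        stmt, i = parse_stmt(toks, i)
        stmts.append(stmt)
        # A real statement still needs a ';' unless the block ends here.
        if i < len(toks) and toks[i] != ";" and toks[i] not in closers:
            raise SyntaxError(f"expected ';' before {toks[i]!r}")
    return stmts, i


def parse(src):
    stmts, _ = parse_block(tokenize(src), 0, closers=())
    return stmts
```

Under these assumptions, the string "if true then; foo(); ; end; ;" and the three-line version without any semicolons produce the same tree.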

Stefan

Joshua Cranmer

Feb 12, 2012, 1:03:13 PM

On 2/11/2012 8:56 AM, Geovani de Souza wrote:
> Hi all!
>
> I'm trying to write a parser for my compiler, and I'm interested in
> ignoring the line break (\n) in some places. E.g.:
>
> if true then [\n]
> foo(); [\n]
> end; [\n]
>
> So, in the first line, the '\n' after 'then' isn't important, but in
> the second line the '\n' after "foo();" could replace the semicolon as
> the statement terminator, and the same goes for the line with 'end'.

It sounds like you want something like ECMAScript's magic
you-don't-always-need-a-semicolon feature.
<http://bclary.com/2004/11/07/#a-7.9> describes how it works in detail.
The thrust of it is that "if you see an invalid token, but you saw a
newline before, automatically insert a semicolon to fix things."
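
A rough Python sketch of that rule for a toy language made only of `name()` statements. It is loosely modelled on the idea quoted above and is not the full ECMAScript algorithm; `tokenize`, `end_of_statement` and `parse_calls` are hypothetical names.

```python
import re


def tokenize(src):
    """Return (text, newline_before) pairs; newlines themselves are not tokens."""
    tokens, newline_before = [], False
    for m in re.finditer(r"(\n)|(\w+|[();])", src):
        if m.group(1):
            newline_before = True
        else:
            tokens.append((m.group(2), newline_before))
            newline_before = False
    return tokens


def end_of_statement(tokens, i):
    """Consume a ';' at index i, or act as if one had been inserted."""
    if i == len(tokens):
        return i                    # end of input closes the statement
    text, newline_before = tokens[i]
    if text == ";":
        return i + 1                # explicit semicolon
    if newline_before:
        return i                    # offending token follows a newline: insert ';'
    raise SyntaxError(f"expected ';' before {text!r}")


def parse_calls(src):
    """Parse a sequence of `name()` statements and return the names."""
    tokens, names, i = tokenize(src), [], 0
    while i < len(tokens):
        name = tokens[i][0]
        if not name.isidentifier() or [t for t, _ in tokens[i + 1:i + 3]] != ["(", ")"]:
            raise SyntaxError(f"bad statement at {name!r}")
        names.append(name)
        i = end_of_statement(tokens, i + 3)
    return names
```

The lexer never emits a newline token; it only records whether one preceded each token, and the parser consults that flag at the single point where a semicolon is missing.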

There are more than a few people who believe that this feature should
not have been implemented.

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth

Kaz Kylheku

Feb 12, 2012, 7:16:57 PM

On 2012-02-12, Karsten Nyblad <uu3kw...@snkmail.com> wrote:
> that can help you. Many LR parser generators, e.g., bison, include
> facilities for generalised LR parsing, and many LL parser generators
> include facilities for backtracking. That might help you.

General LR and backtracking, just to make semicolons optional when
there are newlines? LOL.

BartC

Feb 13, 2012, 7:25:04 PM

"Geovani de Souza" <geovani...@gmail.com> wrote

> I'm trying to write a parser for my compiler, and I'm interested in
> ignoring the line break (\n) in some places. E.g.:
>
> if true then [\n]
> foo(); [\n]
> end; [\n]
>
> So, in the first line, the '\n' after 'then' isn't important, but in the
> second line the '\n' after "foo();" could replace the semicolon as the
> statement terminator, and the same goes for the line with 'end'.
>
> Also, '\n' should be ignored on blank lines.

I've tried a few schemes. One just converts a newline to a semicolon,
*unless* the last symbol was (for example) a comma.

This requires some sort of continuation symbol for when a semicolon would be
inappropriate.

And it helps if the grammar is tolerant of extra semicolons, otherwise the
source code could be full of continuation symbols! (After 'then' for
example.)
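
As a sketch of the first scheme in Python: the set of tokens that suppress the semicolon, the backslash as continuation symbol, and the skipping of blank lines are all assumptions chosen for illustration.

```python
import re

# Tokens after which a newline does NOT end the statement (hypothetical choice).
SUPPRESS_AFTER = {",", "("}
CONTINUATION = "\\"     # explicit continuation symbol, also an assumption


def newlines_to_semicolons(src):
    """Return the token list with each significant newline turned into ';'."""
    out = []
    for m in re.finditer(r"\n|\w+|[^\w\s]", src):
        tok = m.group()
        if tok != "\n":
            out.append(tok)
        elif out and out[-1] == CONTINUATION:
            out.pop()                   # drop the marker and join the lines
        elif out and out[-1] not in SUPPRESS_AFTER and out[-1] != ";":
            out.append(";")             # blank lines add no extra ';'
    return out
```

A grammar that tolerates empty statements would let the last test (`out[-1] != ";"`) be dropped, and would also cope with the ';' this sketch emits after 'then'.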

Whatever scheme you choose, you'll know it works well when you have
thousands of lines of code without a single semicolon, and hardly any
continuations. And that is perfectly clear to read.

--
Bartc

Gene Wirchenko

Feb 19, 2012, 11:57:51 PM

On Sun, 12 Feb 2012 12:03:13 -0600, Joshua Cranmer
<Pidg...@verizon.invalid> wrote:

[snip]

>It sounds like you want something like ECMAScript's magic
>you-don't-always-need-a-semicolon feature.

But please do not go there.

><http://bclary.com/2004/11/07/#a-7.9> describes how it works in detail.
>The thrust of it is that "if you see an invalid token, but you saw a
>newline before, automatically insert a semicolon to fix things."
>
>There are more than a few people who believe that this feature should
>not have been implemented.

There is a bit more to this. As a result of this kludge, newlines
are not permitted at certain points in some statements. For
example:

    return
    <expression which I decided to put all on its own line>;

is not parsed as one statement. A newline is not permitted immediately
after "return", so a semicolon is inserted there and the expression on
the next line is never returned.

Sincerely,

Gene Wirchenko

glen herrmannsfeldt

Feb 20, 2012, 3:09:07 AM

Gene Wirchenko <ge...@ocis.net> wrote:
(snip, someone wrote)

>> There are more than a few people who believe that this
>> feature should not have been implemented.

> There is a bit more to this. As a result of this kludge, it is
> illegal to have newlines at certain points in some statements.
> For example:

> return
> <expression which I decided to put all on its own line>;
> is not legal. It is not permitted to have a newline immediately after
> "return".

Sounds about like the way IBM's JCL from OS/360 and successors works.

You can split a statement after a comma in most cases, and continue
it on the next line, after the usual // and some spaces.

I believe the original (early) versions had a more usual system
with a continuation character in column 72, and then start the
next statement in column 16. I presume it was found hard to get
right so they changed it.

I believe that there are a few other languages with a similar
continuation method. That is, if you end a statement in a legal
end, no continuation is needed.

-- glen

Aharon Robbins

Feb 23, 2012, 4:51:46 PM

glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
>I believe that there are a few other languages with a similar
>continuation method. That is, if you end a statement in a legal
>end, no continuation is needed.

Awk is like this. You can continue after a comma, && or ||. Possibly
in other places too. You can supply semicolons to separate statements
on the same line, if you want.

It tends to work fairly naturally in awk, I rarely use \ to continue
onto the next line. :-)
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL

Jonathan Thornburg

Feb 26, 2012, 10:49:16 PM

Aharon Robbins <arn...@skeeve.com> wrote:
> Awk is like this. You can continue after a comma, && or ||. Possibly
> in other places too. You can supply semicolons to separate statements
> on the same line, if you want.
>
> It tends to work fairly naturally in awk, I rarely use \ to continue
> onto the next line. :-)

On the other hand pic (Kernighan's picture-drawing "little language")
is very finicky about where it accepts \ line-continuations, allowing
them in some places but forbidding them in others. For example, the
pic code

for j = 2 to 6 by 2 do { \
for i = 3 to 7 by 2 do { \
fine_space_interp_point at grid_point(j,i) } }

does NOT allow a \ line-continuation between either "for" and the
following "{". (Or more precisely, all my attempts to make such
produced the usual unhelpful pic syntax-error messages.) :(

--
-- "Jonathan Thornburg
Dept of Astronomy & IUCSS, Indiana University, Bloomington, Indiana, USA
"Washing one's hands of the conflict between the powerful and the
powerless means to side with the powerful, not to be neutral."
-- quote by Freire / poster by Oxfam