
Parsing Expression Grammar


skj...@gmail.com

Aug 31, 2005, 00:37:15
Does anybody have working examples of a Parsing Expression Grammar
(http://en.wikipedia.org/wiki/Parsing_expression_grammar) for a Java-like
language? I know the Rats! package generates parsers based on
PEG, but since Rats! is GPL'd I can't use it for commercial use. It
would help to actually see the set of needed production rules for the
Java language as a base for designing a new language.

The reason for this is that I have a working VM implementation that I
need to feed with bytecode samples for testing, so it makes sense to
start bringing in the compiler step. As opposed to LL and LR parsers,
PEG parsers seem both more efficient AND unambiguous, and they handle
context-free languages. I was wondering whether Java, apart from the
usual dangling-else conflicts, has already been framed as a PEG?

Cheers!

Laurence Finston

Sep 2, 2005, 14:19:31
On Wed, 31 Aug 2005 skj...@gmail.com wrote:

> I know the Rats! package generates parsers based on
> PEG, but since Rats! is GPL'd I can't use it for commercial use.

No, the GNU General Public License does allow commercial use, under
specified conditions. If you modify software licensed under the GNU
GPL, you may not be able to use it in the way you want, but you can
certainly charge money for it.

This is an excerpt from the FAQ for the GPL at http://www.gnu.org:

If I use a piece of software that has been obtained under
the GNU GPL, am I allowed to modify the original code into a
new program, then distribute and sell that new program
commercially?

You are allowed to sell copies of the
modified program commercially, but only under the terms of
the GNU GPL. Thus, for instance, you must make the source
code available to the users of the program as described in
the GPL, and they must be allowed to redistribute and modify
it as described in the GPL.

These requirements are the condition for including the
GPL-covered code you received in a program of your own.

This refers to modifying GPL'd software. Merely using it in
combination with your own software is an entirely different matter,
and in most cases should not cause problems.

Problems can arise when a GPL'd program produces output files, and
these output files contain a license. GNU packages do not usually do
this. Clearly, it is possible to use GCC to compile commercial
software, GNU Emacs to write files used for commercial purposes,
etc. It is certainly not my intention that all drawings made with GNU
3DLDF should be free in the sense of the GNU GPL. My own are not,
unless I decide to make them so.

> It would help to actually see the set of needed production rules for
> the Java language as a base for designing a new language.

I don't see any reason why you can't study Free Software. That's what
it's there for (among other things). However, if you use any of the
code, then you must respect the terms of the license, or you will be
violating the rights of the copyright holder.

Of course, studying software and then writing one's own implementation
is a tricky issue. Both the _Numerical Recipes_ books and _The NURBS
Book_ contain non-free code which can be studied but not used, unless
one obtains a license from the copyright holders. I've been very
hesitant to read such code. How is it possible not to be influenced
by it, when you go to write your own? What if you see a great
programming idiom that never would have occurred to you?

Laurence Finston

Russ Cox

Sep 2, 2005, 14:20:00
> Does anybody have working examples of a Parsing Expression Grammar
> (http://en.wikipedia.org/wiki/Parsing_expression_grammar) for a Java-like
> language? I know the Rats! package generates parsers based on
> PEG, but since Rats! is GPL'd I can't use it for commercial use. It
> would help to actually see the set of needed production rules for the
> Java language as a base for designing a new language.

The PEG for Java that Bryan Ford used in his thesis is at
http://pdos.csail.mit.edu/~baford/packrat/thesis/java/Java.pappy.

> The reason for this is that I have a working VM implementation that I
> need to feed with bytecode samples for testing, so it makes sense to
> start bringing in the compiler step. As opposed to LL and LR parsers,
> PEG parsers seem both more efficient AND unambiguous, and they handle
> context-free languages. I was wondering whether Java, apart from the
> usual dangling-else conflicts, has already been framed as a PEG?

I am unconvinced that PEG parsers are much better than LALR(1) parsers
on efficiency grounds. They're linear-time, but the constant is
non-trivial. They're unambiguous, but I think it's just as difficult
to write what you mean as it is in grammars with ambiguity (and at
least the latter can tell you that there's a problem!). One thing
PEGs definitely do well (and this is at least one reason why Bryan
worked on them) is that they are composable: you can take any terminal
symbol in a PEG, replace it with a non-terminal with its own PEG
grammar, and you get a bigger, but still valid, PEG grammar. This is
not true of LALR(1) and (most of?) its ilk.
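
To make the composability point concrete (an illustrative sketch, not
taken from any of Bryan's grammars): given a PEG in which NUMBER is a
terminal symbol,

   Expr <- Term (('+' / '-') Term)*
   Term <- NUMBER

you can replace the terminal NUMBER with a non-terminal carrying its
own PEG,

   Term   <- Number
   Number <- [0-9]+

and the result is still a well-formed PEG, with no new conflicts to
resolve. Performing the same substitution on an LALR(1) grammar can
introduce conflicts that force you to refactor the grammar.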

Russ

Chris F Clark

Sep 2, 2005, 14:20:42
Almost any grammar suitable for ANTLR (or PCCTS, or JavaCC) will be a
Parsing Expression Grammar, because any LL(1) grammar is also a
Parsing Expression Grammar (modulo some slight syntax issues). So, if
you want a "PEG" for Java, find a Java grammar for ANTLR (or PCCTS),
change the | operators into / operators, and you are done.
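
For example (an illustrative fragment, not taken from any actual ANTLR
Java grammar), an LL(1)-style rule such as

   statement : ifStatement
             | whileStatement
             | block
             ;

becomes the PEG rule

   statement <- ifStatement
              / whileStatement
              / block

Because the alternatives of an LL(1) rule are already distinguished by
their first token, the ordered choice / selects the same alternative the
LL(1) parser would have chosen.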

A pure (unpredicated) LL(1) grammar is simply a PEG in which the
predicate operators are not used (and the alternatives are separated
by | rather than /).

In addition, any recursive descent parser generated from an LL(1)
grammar will be (at least) as efficient as the corresponding packrat
parser, because there is no backtracking involved in a "pure" LL(1)
grammar, and the memoization in a packrat parser serves only to preserve
linear run time in the presence of backtracking. Thus, a packrat
parser and a recursive descent LL(1) parser do exactly the same steps
(omitting the memoization, which has no benefit) for an LL(1) grammar.

Moreover, the benefits listed for PEGs (unambiguity and linear time
requirements) are shared by LL(1) grammars.

Now, if you have an LR grammar for Java, say one for CUP or Yacc++,
there is more to do, because LR grammars are not subsets of PEGs.
However, almost all LL(1) grammars are LR(1) also. Thus, if you have
an LL(1) grammar, you can use your choice of parser generators: LL,
LR, or PEG.

There are also PEGs that are not LL grammars, but you don't need that
to parse Java. Therefore, the PEG technology is not a loss. In fact,
it is a good solution to some problems. Moreover, the algorithm for
processing PEGs is useful and provides some insight that wouldn't have
been seen if the technology hadn't been developed.

However, it is not necessary to rewrite all our grammars into PEG
form; LL(1) is perfectly adequate. The interesting case might come
from languages which don't have an LL(1) grammar (e.g. C++). For
them, it might be useful to have a PEG.

But, as a potential language designer, you need to think long and hard
about why you want to create a language which has no LL(1) grammar.

Hope this helps,
-Chris

*****************************************************************************
Chris Clark                  Internet :  com...@world.std.com
Compiler Resources, Inc.     Web Site :  http://world.std.com/~compres
23 Bailey Rd                 voice    :  (508) 435-5016
Berlin, MA 01503 USA         fax      :  (978) 838-0263  (24 hours)

A Pietu Pohjalainen

Sep 2, 2005, 14:23:34
skj...@gmail.com wrote:
> Does anybody have working examples of a Parsing Expression Grammar
> (http://en.wikipedia.org/wiki/Parsing_expression_grammar) for a Java-like
> language? I know the Rats! package generates parsers based on
> PEG, but since Rats! is GPL'd I can't use it for commercial use. It
> would help to actually see the set of needed production rules for the
> Java language as a base for designing a new language.


Umm, a parser generator's license doesn't have much to do with the
license of the generated code. It's just like GCC: while GCC is licensed
under the GPL, that doesn't affect the copyright of the code that is
compiled using it. The license of linked libraries might affect the
usage possibilities in a commercial setting, however.

The Rats! runtime libraries seem to be licensed under the LGPL. If
that's still too restrictive, I'd guess it wouldn't be much work to
rewrite those as well.


> The reason for this is that I have a working VM implementation that
> I need to feed with bytecode samples for testing, so it makes sense to
> start bringing in the compiler step.

For generating test data for a Java (?) VM, you really don't need a full
compiler. Just grab a JVM assembler, such as Jasmin, write the test
suites in symbolic bytecode, and you're done. The 'javap' tool
distributed with the JDK disassembles existing classes for you.

Or why not just write your test cases in regular Java and compile
them using the regular compiler provided by the JDK?
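
For instance, a trivial test case could look like this (just an
illustration; the class and file names are arbitrary):

   // Add.java -- compile with "javac Add.java", then run "javap -c Add"
   // to see the disassembled bytecode (iload_0, iload_1, iadd, ireturn)
   // that the VM under test is expected to execute.
   public class Add {
       static int add(int a, int b) {
           return a + b;
       }
   }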

> As opposed to LL and LR parsers, PEG parsers seem both more efficient
> AND unambiguous, and they handle context-free languages. I was
> wondering whether Java, apart from the usual dangling-else conflicts,
> has already been framed as a PEG?

Uhm, with context-free languages it is always possible to write
ambiguous grammars, and there are actually context-free languages that
are inherently ambiguous. So you either have to deal with the
ambiguity in the CFGs (e.g. precedence, associativity), or not use
context-free languages.

About the efficiency, I would really like to see how a backtracking
parser with unlimited lookahead beats an LL(k) or LR(k) parser.

Personally, I've been toying with an OO-grammar-based parser
generator, which solves the problems presented in the example
object-oriented parser in Grune et al.'s "Modern Compiler Design" (see
exercise A.1). Using backtracking to find the correct parse
and the GoF bridge pattern to delegate semantic actions to instances
of subclasses nicely solves the problem.

So far I haven't found the motivation to try this toy on a real language
implementation, as OO-grammars are a little bit painful to write.
However, extending it to support PEG structures wouldn't be that big
a job. I guess it's not what you really need, but if I'm wrong, I'd
be happy to have some real data to feed to this toy. It won't beat the
efficiency of LL or LR parsers, but at least it generates readable
code.

br,
Pietu Pohjalainen

Oliver Wong

Sep 3, 2005, 15:23:51
"A Pietu Pohjalainen" <pohj...@cc.helsinki.fi> wrote in message

> Uhm, with context-free languages it is always possible to write
> ambiguous grammars, and there are actually context-free languages that
> are inherently ambiguous. So you either have to deal with the
> ambiguity in the CFGs (e.g. precedence, associativity), or not use
> context-free languages.

Sorry, I only know a bit about theory of computation, but I
thought ambiguity was a property of grammars, and not of
languages. Since a language is just a set of strings, how can a
language itself be ambiguous? I understand that perhaps there exists a
language which cannot be expressed by an unambiguous context free
grammar, but might there not exist another way (i.e. not using CFGs)
to express the language which is unambiguous?

- Oliver

A Pietu Pohjalainen

Sep 7, 2005, 13:11:27


Right.

I was writing this with the mindset that context-free languages = all
languages expressible by context-free grammars, and it is possible to
have languages for which every (CF) grammar is ambiguous - so the
language is inherently ambiguous, because every (CF) grammar for that
language allows more than one parse tree for some strings in the
language.

An example of such a language (taken from P. Linz, 'An Introduction to
Formal Languages and Automata', pp. 144-145) is:
L = {a^n b^n c^m} \union {a^n b^m c^m}

Using context-free grammars, writing a recognizer for this language
always results in an ambiguous grammar. I'd guess that e.g. the
aforementioned parsing expression grammars could be used to describe
this language unambiguously.
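
One standard way to write a CFG for this language (an illustrative
grammar, not the one printed in Linz's book) makes the ambiguity
concrete:

   S -> A C | B
   A -> a A b | epsilon     (generates a^n b^n)
   C -> c C | epsilon       (generates c^m)
   B -> a B | D             (generates a^n, then D)
   D -> b D c | epsilon     (generates b^m c^m)

Every string of the form a^n b^n c^n belongs to both halves of the
union, so it has one parse tree starting with S -> A C and another
starting with S -> B; inherent ambiguity means that every CFG for L has
such duplicate parses on some strings.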

br,
Pietu

Paul Mann

Sep 7, 2005, 13:12:11
"Chris F Clark" <c...@shell01.TheWorld.com> wrote in message

> But, as a potential language designer, you need to think long and hard
> about why you want to create a language which has no LL(1) grammar.


Answer: Readability for human's sake.

Why constrain your language to LL(1) when LALR(1) tools are available?


Paul Mann
http://parsetec.com

wclo...@lanl.gov

Sep 10, 2005, 12:28:54
Paul Mann wrote:
> "Chris F Clark" <c...@shell01.TheWorld.com> wrote in message
> > But, as a potential language designer, you need to think long and hard
> > about why you want to create a language which has no LL(1) grammar.
>
> Answer: Readability for human's sake.

But the LL(1) constraint results in relatively easy-to-parse languages
for both humans and computers. It might not be the most succinct way
of expressing the grammar for some languages, e.g. operator
precedence, but I have trouble thinking of any programming language
construct whose parsing cannot be transformed to LL(1) that does not
also cause problems for the human reader.

> Why constrain your language to LL(1) when LALR(1) tools are available?

Error reporting.

Chris F Clark

Sep 10, 2005, 12:30:51
Paul Mann asked a wonderful question in this thread:

>Why constrain your language to LL(1) when LALR(1) tools are available?

1st, to parrot him:


> Answer: Readability for human's sake.

Now, to explain. I've used dialects of a wide variety of programming
languages for years, and so have many of my peers and co-workers, and
guess what, we generally avoid using the "readability" shortcuts that
most languages afford.

A good example: most languages have precedence rules that mimic
traditional algebraic precedence, i.e. multiplication binds more
tightly than addition. However, when writing conditional expressions,
we almost universally parenthesize them, including parenthesizing &'s
(ands) nested within |'s (ors), despite not knowing a language which
got those precedences wrong (at least not one which imposed any
precedence rules at all; APL is an example of a language with totally
flat precedence rules).

Similarly, I use "Perl" style conditionals, meaning if it takes more
than one line, it goes in a block. I only use a single statement with
a conditional when it fits on 1 line.

Now, how does this apply to LL(1) vs. LR(k) grammars? Simple. The
easiest way to write an LL(1) grammar is to start each rule with a
unique keyword (terminal), just like old BASIC used to with the "let"
statement. Pascal pretty much followed this idea also, with each
declaration type having a unique keyword (and putting the type
expression after the variables being declared). Those languages are
both very easy to read. If I were designing a new language,
especially in an untested or novel area, where there isn't lots of
existing practice to follow, I would keep the syntax simple, so that
novice readers had a chance of understanding the language.

With communities that have existing practice, life is different. For
them, you want to be as close to their default notation as possible.
For example, we kept Yacc++ as close to yacc notation as we could.
Well, we could have stayed a little closer (and that might have been
better). When we added regular expressions to Yacc++, we used the
"standard" notations (*, +, and ?) and precedences. We even added the
[] notation for optional items, because that is often used in our target
community. Again, that was a potentially questionable choice because
much of our community has a LEX background where [] means a character
class. However, very little of our notation is LEX-like, so we hoped
that wouldn't be a major burden.

Now, when parsing an existing notation, you want to try to mimic the
conventions of the community as closely as possible. That may require
sophisticated precedence rules, and that might require the most
powerful parsing notation one can get within your other constraints.
At the time we wrote Yacc++, Yacc++ was "it". I've never needed a
feature that Yacc++ doesn't provide, but if I were to come upon a need
for an adaptive parser, I would either look into Meta-S or USSA. I
could also imagine using a GLR parser like DMS. (To be totally
honest, I would be just as likely to add such features to Yacc++. But
if a potential client were to ask me, I would probably recommend one
of the above tools if I didn't think their problem fit well with
Yacc++ as shipped.)

Even in the case of implementing an LL(1) language, I would probably
choose the "most powerful" tool that was convenient to use to parse
the language. Thus, I would write an LR grammar for an LL language,
simply because the notation is a little more readable. I would not
give up regular expressions (or even predicates) to get LR power.
Thus, if I had to choose between PCCTS and "vanilla" yacc, I would
choose PCCTS most of the time. The exception would be the case where I
had a non-trivial expression syntax. I wouldn't give up the
precedence and associativity declarations for an expression grammar
in exchange for regular expressions. Of course, I don't have to, since
Yacc++ has all of the above (excuse my lack of modesty here).

BTW, for most cases, you shouldn't have to either. We are going to
license a version (roughly 2.1) of Yacc++ for "free" use in
non-commercial projects. The library will be GPL and the generator
will be "gratis" for non-commercial use (but not released in source
for now, next year we will probably have a GPL version of the
generator also). We don't have an official version that meets those
criteria, but if someone writes to us, I can get such a version
finished off "post haste".

George Neuner

Sep 10, 2005, 12:37:50
On 7 Sep 2005 13:12:11 -0400, "Paul Mann" <pa...@highpersoft.com>
wrote:

>"Chris F Clark" <c...@shell01.TheWorld.com> wrote in message
>> But, as a potential language designer, you need to think long and hard
>> about why you want to create a language which has no LL(1) grammar.
>
>Answer: Readability for human's sake.

What's your definition of readable?

>Why constrain your language to LL(1) when LALR(1) tools are available?

I'm just guessing, but perhaps it's because LR tools produce sequence
results backwards; it's easier to position embedded actions in LL
grammars than in LR grammars; and when there is a problem, debugging
recursive descent code is easier than debugging a table-driven PDA.

A better question is "why use LALR(1) when LL(k) is available?"

Chris mentioned three tools - ANTLR, PCCTS and JavaCC - all of which
use predicated LL(k) and produce recursive descent code. ANTLR creates
parsers and lexers in Java, C++, C# or Python. PCCTS creates parsers
only in C or C++. JavaCC creates parsers and lexers in Java [duh!].

Any of these tools can handle reasonable language grammars. They
handle some unreasonable ones as well - for example they all can
handle the C++ grammar.

I no longer see any reason to use LR tools unless the language is
truly bizarre or code density is *very* important because the result
will be deployed on very small machines.

George

Hans-Peter Diettrich

Sep 10, 2005, 12:40:47
Paul Mann wrote:
>
> "Chris F Clark" <c...@shell01.TheWorld.com> wrote in message
> > But, as a potential language designer, you need to think long and hard
> > about why you want to create a language which has no LL(1) grammar.
>
> Answer: Readability for human's sake.

Do you really think that humans prefer harder to "parse" languages?
Lookahead and backtracking affect humans as well as programs.

> Why constrain your language to LL(1) when LALR(1) tools are available?

Why write small and fast programs when users have so much memory and so
fast processors?

Sorry, I don't understand your opinion :-(

DoDi
[Stuff that's easy for computers to parse isn't always easy for people to
read or write. The classic example is Pascal vs. PL/I semicolons, where
Pascal's separator semicolons are syntactically elegant but quite error
prone. -John]

Paul Mann

Sep 11, 2005, 11:11:57
"Paul Mann" <pa...@highpersoft.com> wrote

> "Chris F Clark" <c...@shell01.TheWorld.com> wrote in message
>> But, as a potential language designer, you need to think long and hard
>> about why you want to create a language which has no LL(1) grammar.
>
> Answer: Readability for human's sake.

First of all, readability in the grammar. Perhaps that is what I was
thinking about, rather than readability in a programming language.
But I might be able to find some cases where LALR(1) allows greater
readability in the language being parsed also.

I find these LALR(1) grammar rules to be more readable and more
natural:

StatementList -> Statement
              -> StatementList Statement

Than the LL(1) equivalent:

StatementList -> Statement
              -> Statement StatementList

I know you do it like this:

StatementList : Statement+

But LL(1) is a way of thinking that is limited. With LL, you are
forced to use LL(2) or LL(3) to do things that are very easy for
LALR(1).

----------------------------------------------

Outside of readability in the grammar ...

I'm sure I could find some programming language constructs which
cannot be expressed in LL(1) but are expressible in LALR(1). So
again, why constrain yourself to LL(1)?

Now LL(k) can handle anything in the universe, but the discussion is
about which is both easier to program and easier to read.

For me, LL(k) is not easier to program than LALR(1).


----------------------------------------------

About actions in the middle of a rule with LL(1) parsers ...

I have hardly ever needed to do this. Usually, there is already a need
for a nonterminal at this point in the rule and when reducing the
nonterminal, I can have an action take place easily with LALR.

Most of the action takes place when processing the AST anyway. So
actions in the middle of a rule are not the main issue in compiler
front ends.

BTW, processing the AST can be done with a top-down traversal (for
those who don't like LALR's bottom-up characteristic).


--------------------------------------------

I don't like to see code in a grammar any more than I like to put
assembly language code in my C++ programs.

I don't like to put words like SEMI in a grammar when I can put ';'
instead ... what happens when a language has a keyword SEMI? (Yes,
I know it's rare.) ...

You use 'SEMI', right?

I find

'SEMI' ';'

more readable than

'SEMI' SEMI

Actually I prefer

SEMI ';'


-------------------------------------------


Finally, I don't like the Forth programming language, because of
readability issues. I'm sure you LL(k) people will disagree with me
about this.

Paul Mann
http://parsetec.com

Paul Mann

Sep 11, 2005, 11:12:45
"Hans-Peter Diettrich" <DrDie...@compuserve.de> wrote

> Paul Mann wrote:
>>
>> "Chris F Clark" <c...@shell01.TheWorld.com> wrote in message
>> > But, as a potential language designer, you need to think long and hard
>> > about why you want to create a language which has no LL(1) grammar.
>>
>> Answer: Readability for human's sake.
>
> Do you really think that humans prefer harder to "parse" languages?
> Lookahead and backtracking affect humans as well as programs.


Yes, humans do prefer harder-to-parse languages. Example: English.

And the computer should serve the human, not the other way around.
That's why we have invented LALR(1), LR(1), and GLR parsing.

>> Why constrain your language to LL(1) when LALR(1) tools are available?
>
> Why write small and fast programs when users have so much memory and so
> fast processors?

Why program in more powerful, harder-to-write programming languages
such as C and C++ when assembly language is available????

> [Stuff that's easy for computers to parse isn't always easy for people to
> read or write. The classic example is Pascal vs. PL/I semicolons, where
> Pascal's separator semicolons are syntactically elegant but quite error
> prone. -John]

Yes, and the space used for beginning a comment in OS JCL was the most
common cause of errors according to a study.

Some programmer probably found it easier to program it that way.


Paul Mann
http://parsetec.com

Hans-Peter Diettrich

Sep 14, 2005, 21:18:34
Paul Mann wrote:

> > Do you really think that humans prefer harder to "parse" languages?
> > Lookahead and backtracking affect humans as well as programs.
>
> Yes, humans do prefer harder-to-parse languages. Example: English.

For communication, perhaps, but not necessarily for commands and
statements in data processing. Natural languages are one thing, formal
languages another one. I for my part would kill the person that invented
English <gd&r>

> And the computer should serve the human, not the other way around.

Stupid users can profit from the assistance of a clever program, but
in the case of languages and parsers it's not always a good idea to
allow for bad language design only because it can be handled by
sophisticated techniques. Have a look at the many bad websites
resulting from uneducated use of publishing tools.

DoDi

George Neuner

Sep 14, 2005, 21:19:04
On 11 Sep 2005 11:11:57 -0400, "Paul Mann" <pa...@parsetec.com> wrote:

>Now LL(k) can handle anything in the universe, but the discussion is
>about which is easier to program and easy to read both.
>
>For me, LL(k) is not easier to program than LALR(1).

If you disregard mid-rule actions [more on that later], I think they
are equally easy ... or equally difficult if you prefer.


>About actions in the middle of a rule with LL(1) parsers ...
>
>I have hardly ever needed to do this. Usually, there is already a need
>for a nonterminal at this point in the rule and when reducing the
>nonterminal, I can have an action take place easily with LALR.
>
>Most of the action takes place when processing the AST anyway. So
>actions in the middle of a rule are not the main issue in compiler
>front ends.

The need to embed actions depends greatly on the language ...
resolving context sensitivity requires predication and predication
requires mid-rule computations.

LR can be inefficient for a context-sensitive language because
reductions aren't applied until all the tokens have already been
constructed. The usual LALR workaround is either to defer
everything until later [the approach you seem to favor] or to create
back channels to the lexer and perform as much disambiguation as
possible there. But neither of these approaches handles languages
which must be lexed differently in different contexts.

LL handles such problems more efficiently because the context can be
decided before dependent tokens have been constructed. It minimizes
the work the parser has to do while maximizing the amount of
disambiguated information that can be provided to later stages.

I'm not claiming there are reasonable programming languages that
require such processing, but I can think of a lot of other uses for
parsing tools which would require context dependent lexing.

>I don't like to see code in a grammar any more than I like to put
>assembly language code in my C++ programs.

That's a fine philosophy so long as avoiding it isn't unnecessarily
complicating your life. The main reason for computing in the parser
[or lexer] is to reduce the ambiguities that later processing must
deal with.

George

Detlef Meyer-Eltz

Sep 14, 2005, 21:25:08
> "Paul Mann" <pa...@highpersoft.com> wrote
>> "Chris F Clark" <c...@shell01.TheWorld.com> wrote in message
>>> But, as a potential language designer, you need to think long and hard
>>> about why you want to create a language which has no LL(1) grammar.
>>
>> Answer: Readability for human's sake.

> First of all, readability in the grammar. Perhaps that is what I was
> thinking about, rather than readability in a programming language.
> But I might be able to find some cases where LALR(1) allows greater
> readability in the language being parsed also.

> I find these LALR(1) grammar rules to be more readable and more
> natural:

> StatementList -> Statement
> -> StatementList Statement

> Than the LL(1) equivalent:

> StatementList -> Statement
> -> Statement StatementList

> I know you do it like this:

> StatementList : Statement+

The last version seems to be the clearest of all to me.
There is a third aspect of readability besides the readability of
a programming language and the readability of the grammar: the
readability of the generated code. LL(1) parser generators can produce
extremely elegant code which is very easy to read, to understand, and
to debug:

To take your example, a sketch of the code which e.g. the
TextTransformer creates for the StatementList-production would look
like:

return_type StatementList( parameter1, ... )
{
    do
    {
        return_type t = Statement( p1, ... );
        // consumes the tokens of the Statement

    } while ( /* the next token is the beginning of a Statement */ );
}

Code for actions which shall be executed can be embedded into the
production and appears in the generated code in the appropriate
places.


> Most of the action takes place when processing the AST anyway. So
> actions in the middle of a rule are not the main issue in compiler
> front ends.

> I don't like to see code in a grammar any more than I like to put


> assembly language code in my C++ programs.


The comparison limps (as the German idiom goes), and in my opinion it
reveals a point neglected by many designers of parser generators: a
parser generator produces a complete program only in the rarest cases.
Only in connection with the semantic actions does a parsed text really
get processed. And for this, without doubt, an AST often is useful or
even necessary. But sometimes creating an AST is shooting at
sparrows with cannons (another German idiom). Why create an AST if the
text can be processed directly? The latter is more intuitive and
faster too. The user of a parser generator should be enabled to
understand the generated parser, and he should be free to decide how
he wants to complete the groundwork that is done by the parser
generator. The user should not be forced to do what the parser
generator likes, but should be able to do what he likes.

I am interested to know how the code for the different LR parsers
would look. I only know yacc/bison code, which to my taste is
quite awful.


Regards

Detlef Meyer-Eltz

--
mailto:Meyer...@t-online.de

url: http://www.texttransformer.de
url: http://www.texttransformer.com

Cleo Saulnier

Sep 17, 2005, 13:54:35
Detlef Meyer-Eltz wrote:
>>"Paul Mann" <pa...@highpersoft.com> wrote
>>
>>>"Chris F Clark" <c...@shell01.TheWorld.com> wrote in message
>>>
>>>>But, as a potential language designer, you need to think long and hard
>>>>about why you want to create a language which has no LL(1) grammar.
>>>
>>>Answer: Readability for human's sake.

>
>>First of all, readability in the grammar. Perhaps that is what I was
>>thinking about, rather than readability in a programming language.
>>But I might be able to find some cases where LALR(1) allows greater
>>readability in the language being parsed also.
>
>

[snip]


> I am interested to know, how the code for the different LR-Parsers
> would look like. I only know yacc/bison code which for my feeling is
> quite awful.

First, I think an LR parser generator should always be preferred over
SLR or LALR. The tables are gonna be close to the same size if you
compress them correctly, but with the added advantage that LR can
handle more grammars without rewriting them which will expand the
table anyhow.

And I find that GLR is a bit of overkill and you're most likely not
using your lexer like you should, or you're overly relying on the
parser instead of making simple changes to your lexer that would
greatly speed up the parser anyhow.

Just a reminder:
SLR - no lookahead on reduce.
LALR - all lookaheads merged into one state for reduce.
LR - state splitting for each lookahead if necessary, but shouldn't be
much worse than LALR when state merging is possible, although this can
be difficult to detect and accomplish.

LR will create more states than LALR, but there's usually a good reason
for that. It differentiates between different states depending on the
lookahead. This is important for not getting incorrect parses. If the
states can "ideally" be merged, then most likely the grammar isn't
written in the most optimal fashion.

I find LL the easiest to understand and the easiest to follow the
execution of action routines. Unfortunately, I have found that the
generated code is highly dependent on the productions, and this actually
leads to a bit of bloat and a tendency toward spaghetti code. Not only
that, but the grammar contortions that are often needed tend to be very
ugly. From the two LL parser generators that I wrote, this is what
annoyed me the most. Second came the size of the actual parsing
routines. From a superficial point of view (one who writes grammars...
and beginners), LL is great. From an implementor's point of view, it
lacks finesse and seems to carry along a lot of baggage.

For me, SLR and LALR are a waste of time. LALR jumps through way too
many hoops just to avoid computing all the states that LR does. LR can
simply merge states when it's appropriate and get close to the same
table sizes, with the benefit of handling more special cases. Yes, yes,
LR will create more states, but not necessarily that many more entries
once compressed.

From a superficial (one who writes grammars) point of view, LR takes a
bit of getting used to, especially for action routines as they can
realistically only be done during reduction, but that's not a real
obstacle. One of the main issues I had was that I could write a startup
action routine in the head production to initialise things in LL. With
LR, you can't do that because that's the last production to be reduced.

From an implementor's point of view (someone who's written a canonical
LR parser generator), LR is nothing short of pure elegance. The way the
states collapse. The way the tables can be compressed. The way the
actual parsing routine can take so few lines of code. The ease with
which the error checking code can be implemented. It's simply a pure
pleasure
to be a programmer.

So basically, if you're new to parsers or you don't know the intricacies
of LR parsing, I'm willing to bet you'll much prefer LL. I know I used
to. But once you fully understand LR, I can't imagine why you would use
anything else. I also find that LR code is much more maintainable...
cleaner code, basically. The reliance on productions in LL, even though
you can use a table, and the code to keep track of all that isn't my cup
of tea.

Just wanted to mention that I have written LL, SLR, LALR, and canonical
LR(1) parser generators. I've just released my LR(1) parser generator
on SF, but the docs are slowly coming out.

Cle

Hans-Peter Diettrich

Sep 17, 2005, 13:55:17
Detlef Meyer-Eltz wrote:

> > I find these LALR(1) grammar rules to be more readable and more
> > natural:
>
> > StatementList -> Statement
> > -> StatementList Statement
>
> > Than the LL(1) equivalent:
>
> > StatementList -> Statement
> > -> Statement StatementList
>
> > I know you do it like this:
>
> > StatementList : Statement+
>
> The last version seems to be the clearest of all to me.

The last version has several definite advantages:
- It doesn't imply any left/right recursion.
- It doesn't require writing down the repeated expression multiple
  times.
- It looks like structured code, in contrast to the preceding spaghetti
  code.

In this form the rules can be understood immediately by people favoring
either LL or LR grammars. An expansion into the preceding LALR or LL
equivalent can be automated easily, in contrast to the opposite
direction; the opposite direction requires one first to recognize and
verify that the Statement expression is the same in all lines, and that
StatementList is the name of the rule itself, before one can know
for sure whether the rule describes a sequence of Statements or
something else. Every human reader has to perform these steps as well,
so the last, condensed form certainly is easier to read *and*
understand.

To the writer the last form is preferable as well: it's not only
shorter, but it also doesn't leave a chance for typos in the required
repetitions of the same tokens or expressions - also a simplification
of grammar maintenance.
It's debatable whether the repeated item has to be a single token or
can be an expression. In the given example the Statement rule may be so
complex that it should become an explicit rule. In other cases, with
simpler expressions, an additional rule can be omitted, provided that
this rule is not used in multiple places in the grammar.

There may exist situations (do they?) where the recursion order is
important, but in the given example the Statements certainly have to be
processed in their input order (first to last).

DoDi

Detlef Meyer-Eltz

Sep 18, 2005, 00:44:01
> the grammar contortions that are often needed tend to be very ugly.
> From the two LL parser generators that I wrote, this is what
> annoyed me the most.

I guess by "contortions" you mean the result of factoring tokens
out to make a grammar conform to LL(1). For compiler generators with
lookahead capabilities, such as ANTLR, Coco/R, or the TextTransformer,
this isn't necessary any more.

Cleo Saulnier

Sep 22, 2005, 23:40:50
Detlef Meyer-Eltz wrote:
>>the grammar contortions that are often needed tend to be very ugly.
>> From the two LL parser generators that I wrote, this is what
>> annoyed me the most.
>
>
> I guess with "contortions" you mean the result of factorizing
> tokens out, to make a grammar LL(1) conform. For compiler generators with
> look ahead capabilities as ANTLR, Coco/R or the TextTransformer, this
> isn't necessary any more.

I simply meant that I found myself writing more code in the LL parser
generator (not changing the grammar by hand) to make the grammar
conform. To me, it detracted from the task of generating better code
for the parser.

Cle

Sylvain Schmitz

Sep 22, 2005, 23:42:52
Detlef Meyer-Eltz wrote:
> I am interested to know, how the code for the different LR-Parsers
> would look like. I only know yacc/bison code which for my feeling is
> quite awful.

I think it's more a matter of optimized C style than a real problem
with LR readability. Maybe you would like it better in a functional
version, like the CPS of Essence
[http://www.informatik.uni-freiburg.de/proglang/software/essence/].
And clearly, my current favorite piece of parser on the readability
level is LL-based, namely Parsec
[http://www.cs.uu.nl/~daan/parsec.html]; for instance, this piece of
Haskell code parses TeX environments:

environment :: Parser ()
environment = do{ name <- envBegin
                ; environment
                ; envEnd name
                }
          <|> return ()

envBegin :: Parser String
envBegin = do{ reserved "\\begin"
             ; braces (many1 letter)
             }

envEnd :: String -> Parser ()
envEnd name = do{ reserved "\\end"
               ; braces (string name)
               }

Of course, everything comes with a price, so when Detlef Meyer-Eltz wrote:
> I guess with "contortions" you mean the result of factorizing
> tokens out, to make a grammar LL(1) conform. For compiler generators
> with look ahead capabilities as ANTLR, Coco/R or the TextTransformer,
> this isn't necessary any more.

I couldn't help thinking about the explosive parsing complexity
unbounded lookaheads and backtracking sometimes involve.

Cleo Saulnier wrote:
> SLR - no lookahead on reduce.

No, SLR parsers use the Follow sets as lookaheads.

> For me, SLR and LALR are a waste of time. LALR jumps through way too
> many hoops to not compute all the states that LR does. LR can simply
> merge states when it's appropriate and get close to the same table sizes
> with the benefit of handling more special cases. Yes, yes, LR will
> create more states, but not necessarilly that many more entries once
> compressed.

Yes, you can then compress the LR tables, but for that you need to
compute them first. Even now, this is not the kind of computation for
which one is happy to wait for termination. And if you compress on the
fly, then you are jumping through hoops as well.
That said, I agree LR parsing is extremely nice, and that spending more
time to generate a faster parser is better overall.

--
Regards,

Sylvain

Detlef Meyer-Eltz

Sep 22, 2005, 23:44:12
> Detlef Meyer-Eltz wrote:
> > I am interested to know, how the code for the different LR-Parsers
> > would look like. I only know yacc/bison code which for my feeling
> > is quite awful.

Sylvain Schmitz wrote

> I think it's more a matter of optimized C style than a real problem
> on the LR readability.

Cleo Saulnier wrote


> First, I think an LR parser generator should always be preferred
> over SLR or LALR.

I must apologize that I apparently haven't expressed my question
clearly.

> I wrote: There is a third aspect of the readability: the readability
> of the generated code.

This code, and especially the integration of semantic actions in this
code, is what I am interested in. That means code for parsers which
are generated from parser descriptions, as with yacc. I am interested
in the question of how to do something with a parser.

MY THESIS
---------
is that recursive descent LL compiler-compilers for most purposes
generate the clearest and most flexible code. Parse trees can be
created, but the processed text can also be treated directly.
Descriptions of productions are translated into functions (with
parameters and return types) which directly reflect the description.
Or, said the other way round: the descriptions are written almost like
functions of the corresponding programming language. (In response to
Sylvain Schmitz: as far as I know, Parsec Haskell code cannot be
embedded into other programming languages.)

For some commercial compiler-compilers I either haven't found any
online documentation at all, or these tools only provide the generation
of some kinds of parse trees or ASTs.

There is a discussion appearing again and again in this forum about
whether LL or LR parsers should be preferred. I think the advantages
and disadvantages are reciprocal. But the one decisive argument in
favor of LL parsers generated by recursive descent compiler-compilers
is the simplicity of the code generated by them.

Regards

Detlef Meyer-Eltz

--
mailto:Meyer...@t-online.de

url: http://www.text-konverter.homepage.t-online.de

Sylvain Schmitz

Sep 22, 2005, 23:47:16
All this thread on PEGs has convinced me to give them a better look.
But something isn't quite clear; in his ICFP'02 paper, Bryan Ford
presented an unambiguous CFG that could not be recognized by his packrat
parser:

P -> x P x | x

(this CFG is especially interesting for it has a quadratic parsing time
complexity using an Earley parser).
I have tried "P <- x P x / x" to see what would happen, and indeed,
the packrat parser is only able to parse sequences of 2^n - 1 "x"
symbols, n > 0.

This problem does not seem to be addressed in his POPL'04 paper; the
well-formed definition, if I understand it correctly, is only concerned
about left recursion and does not reject the above grammar.
Now, strictly speaking, the language recognized by "P <- x P x / x"
*is* {x^(2^n - 1) | n > 0} (and that's quite an achievement in itself),
but I find it a bit misleading due to the CFG generative habits.
Looking at this, I also wonder whether there exists a PEG for the CFL
{ww^R | w \in {a,b}^*}.

All this to get to the conclusion that the immediate translation of a
CFG into a PEG by replacing "|" with "/" will not necessarily yield the
expected result. Even the simple grammar "S -> Ac, A -> a | ab" should
not be translated "S <- Ac, A <- a / ab" but "S <- Ac, A <- ab / a";
otherwise input "abc" would not be accepted, because the ordered choice
commits to the successful alternative "a" for A, the "c" expected by S
then fails on "b", and the choice for A is never reconsidered.

Russ Cox wrote:
> One thing PEGs definitely do well (and this is at least one reason why
> Bryan worked on them) is that they are composable: you can take any
> terminal symbol in a PEG, replace it with a non-terminal with
> its own PEG grammar, and you get a bigger, but still valid, PEG grammar.

This is definitely an excellent property.

--
Regards,

Sylvain

Chris F Clark

Sep 22, 2005, 23:51:43
George Neuner <gneu...@comcast.net> writes:

> The need to embed actions depends greatly on the language ...
> resolving context sensitivity requires predication and predication
> requires mid-rule computations.

I think predicates are wonderful. I don't know that predicates
"require" mid-rule computations. Syntactic ones don't. Semantic
predicates "are" mid-rule computations, so yes they require them.

> LR can be inefficient for a context sensitive language because
> reductions aren't applied until all the tokens have already been
> constructed. The usual LALR work around is either to defer
> everything until later [the approach you seem to favor] or to create
> back channels to the lexer and perform as much disambiguation as
> possible there. But neither of these approaches handle languages
> which must be lexed differently in different contexts.

Presuming one has mid-rule computations (and those are not hard to
implement in an LR parser generator), one can also determine context
before the dependent tokens have been constructed. It is a persistent
myth that LL parsers have inherently more context information than LR
parsers. It is only true if one has null productions that are used
in multiple contexts, and then only if one is doing SLR or LALR
parsing, so that states with different lookahead sets are merged. If
one has an LL grammar for the language, an LR parser will also have
nicely disambiguated contexts. The myth persists because an LR
grammar that is not LL will have ambiguous contexts, which is why the
grammar is not LL. Note, when using multiple-lookahead (LL(k))
grammars, there are also places where the tokens have to be
constructed before the context is determined, i.e. for the first k
tokens of rules that need the lookahead for discrimination.

However, these are just minor quibbles. It isn't context that sells
most users on LL over LR parsing; it is recursive descent output.
That is a much more "transparent" output format than the best labelled
table-driven approach. So, if you need to debug your grammar, you
will probably prefer an LL generator that outputs recursive descent
code. (Now, I agree with Cleo Saulnier that I personally prefer LR
table output, but I've had more than a little experience dealing with
it, and the Yacc++ tables reflect my mental habits.)

Russ Cox

Sep 22, 2005, 23:49:16
Forwarded with permission.

From: Bryan Ford <baf...@mit.edu>
Date: Sep 20, 2005 5:43 PM
Subject: Re: Fwd: Parsing Expression Grammar

> All this thread on PEGs has convinced me of giving them a better look.
> But something isn't quite clear; in his ICFP'02 paper, Bryan Ford
> presented an unambiguous CFG that could not be recognized by his packrat
> parser:
>
> P -> x P x | x
>
> (this CFG is especially interesting for it has a quadratic parsing time
> complexity using an Earley parser).
> I have tried "P <- x P x / x" to see what would happen, and indeed,
> the packrat parser is only able to parse sequences of 2^n - 1 "x"
> symbols, n > 0.
>
> This problem does not seem to be addressed in his POPL'04 paper; the
> well-formed definition, if I understand it correctly, is only concerned
> about left recursion and does not reject the above grammar.

The fact that it's possible to transform most CFGs into a syntactically
correct PEG by replacing "->" with "<-" and "|" with "/" does not by any
means imply that the PEG so generated will actually represent the same
language as the CFG did. CFGs and PEGs are different syntactic
meta-languages whose expressions have substantially different
interpretations, and I've never yet even tried to define a "direct" formal
relationship directly between similar-looking PEGs and CFGs. In other words,
the fact that the CFG "P -> x P x | x" looks a lot like the PEG "P <- x P x /
x" means absolutely nothing formally.

It's probably possible to define some such formal correspondence between a
very restrictive subset of CFGs, and a similar very restrictive subset of
PEGs, and prove that the corresponding grammars represent the same language.
The LL(0) CFGs might be a good starting point for building such a
correspondence. I haven't done so yet, however, nor has anyone else that I
know of. Needless to say, the CFG "P -> x P x | x" would probably not be
covered by this correspondence.

> Now, strictly speaking, the language recognized by "P <- x P x / x"
> *is* {x^(2^n - 1) | n > 0} (and that's quite an achievement in itself),

Actually, if you analyze the behavior of the above PEG further I think you'll
find that it represents the language {"x", "xxx"}. In other words, the
PEG-based parser fails to behave in the way the similar-looking CFG does for
input strings of 5 or more x's, because it deterministically makes the
"wrong" parsing choices in the middle of the string and never goes back to
explore different choices as a fully backtracking CFG-based parser might.

It's easy in this case to write a _different_ PEG that recognizes the same
language {x^(2^n - 1) | n > 0}, e.g., "P <- x x P | x", but that fact doesn't
seem to generalize to CFGs in general.

> Looking at this, I also wonder whether there exist a PEG for the CFL
> {ww^R | w \in {a,b}^*}.

My guess is no. The basic problem seems to be the lack of syntactic markers
in the middle of the string from which the PEG-based parser might start
building correct intermediate results.

> All this to get to the conclusion that the immediate translation of a
> CFG into a PEG by replacing "|" with "/" will not necessarily yield the
> expected result. Even the simple grammar "S -> Ac, A -> a| ab" should
> not be translated "S <- Ac, A <- a / ab" but "S <- Ac, A <- ab / a",
> otherwise input "abc" would not be accepted.

Correct.

Cheers,
Bryan

Sylvain Schmitz

Sep 22, 2005, 23:52:20
Bryan Ford wrote:
> The fact that it's possible to transform most CFGs into a syntactically
> correct PEG by replacing "->" with "<-" and "|" with "/" does not by any
> means imply that the PEG so generated will actually represent the same
> language as the CFG did. CFGs and PEGs are different syntactic
> meta-languages whose expressions have substantially different
> interpretations, and I've never yet even tried to define a "direct" formal
> relationship directly between similar-looking PEGs and CFGs. In other words,
> the fact that the CFG "P -> x P x | x" looks a lot like the PEG "P <- x P x /
> x" means absolutely nothing formally.
>
> It's probably possible to define some such formal correspondence between a
> very restrictive subset of CFGs, and a similar very restrictive subset of
> PEGs, and prove that the corresponding grammars represent the same language.
> The LL(0) CFGs might be a good starting point for building such a
> correspondence. I haven't done so yet, however, nor has anyone else that I
> know of. Needless to say, the CFG "P -> x P x | x" would probably not be
> covered by this correspondence.

It seems like a good idea; I would have liked a generative
interpretation of all PEGs, but the undecidability of the emptiness
problem is a sign that it's unlikely to happen. The bottom line here is
that it can be difficult to tell what language a PEG will accept.
Am I wrong if I write that the language L(alpha / beta) recognized by
the PEG "alpha / beta" is

L(alpha / beta) = L(alpha) \cup { xy \in L(beta) | x \not\in L(alpha) }

It looks like a good starting point for understanding PEGs as generative
devices. (At least I find it helpful.)

>> Now, strictly speaking, the language recognized by "P <- x P x / x"
>>*is* {x^(2^n - 1) | n > 0} (and that's quite an achievement in itself),
>
> Actually, if you analyze the behavior of the above PEG further I think you'll
> find that it represents the language {"x", "xxx"}. In other words, the
> PEG-based parser fails to behave in the way the similar-looking CFG does for
> input strings of 5 or more x's, because it deterministically makes the
> "wrong" parsing choices in the middle of the string and never goes back to
> explore different choices as a fully backtracking CFG-based parser might.
>
> It's easy in this case to write a _different_ PEG that recognizes the same
> language {x^(2^n - 1) | n > 0}, e.g., "P <- x x P | x", but that fact doesn't
> seem to generalize to CFGs in general.

It seems like you read the language expression a bit fast; I concur

P -> x P x | x,
P -> x x P | x and
P <- x x P / x

all recognize the same language { x^(2n - 1) | n > 0 }, but

P <- x P x / x

recognizes { x^(2^n - 1) | n > 0 }, a non-CF language, not {x, xxx}. I
have attached the source of a packrat parser for it (and I apologize for
my poor Haskell skills).

>>Looking at this, I also wonder whether there exists a PEG for the CFL


>>{ww^R | w \in {a,b}^*}.
>
> My guess is no. The basic problem seems to be the lack of syntactic markers
> in the middle of the string from which the PEG-based parser might start
> building correct intermediate results.

Isn't this overall problem handled in Birman's Information and Control
paper? I haven't really read it yet, but he introduces multiple-failure
schemes, (l,m)-TS, which look like a way of expressing the need for more
backtracking power in a "bounded" way; the problem with palindromes
being that they need an "unbounded" number of backtracks, hence their
quadratic time complexity in Earley parsing. (This is just an intuition.)

--
Regards,

Sylvain
-- PalPackrat.hs
--
-- Packrat parser for grammar P <- x P x / x.
-- See "4.3 Recognition Limitations" in Bryan Ford, Packrat Parsing: Simple,
-- Powerful, Lazy, Linear Time. ICFP'02.
-- The motivation for the study of this grammar is its possibly inherent
-- quadratic time parsing complexity; see Jay Earley, An Efficient Context-Free
-- Parsing Algorithm. Communications of the ACM, 13(2):94-102, 1970.
--
-- Usage in Hugs98: let's see why "xxxxx" is not accepted:
-- Hugs.Base> :load PalPackrat.hs
-- PalPackrat> dvPalindrom (parse "345")
-- "(P <- '3'(P <- '4')'5')"
-- Remains ""
-- PalPackrat> dvPalindrom (parse "2345")
-- "(P <- '2')"
-- Remains "345"
-- PalPackrat> dvPalindrom (parse "12345")
-- "(P <- '1'(P <- '2')'3')"
-- Remains "45"
-- PalPackrat> map eval [1,3..66]
-- [1,3,0,7,0,0,0,15,0,0,0,0,0,0,0,31,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,63,0]

module PalPackrat where

data Result v = Parsed v Derivs | NoParse

instance Show v => Show (Result v) where
    show (Parsed v (Derivs r p c)) = show v ++ "\nRemains \"" ++ r ++ "\""
    show (NoParse)                 = "No Parse"

data Derivs = Derivs {
      remain      :: String,
      dvPalindrom :: Result String,
      dvChar      :: Result Char
    }

-- Returns `n' if the parser accepts an input of length `n', `0' otherwise.
eval :: Int -> Int
eval n = case dvPalindrom (parse (map (\i -> 'x') [1..n])) of
    Parsed v d -> case dvChar d of          -- check for EOF
        NoParse -> n
        _       -> 0
    _          -> 0


-- Construct a parse result structure for an input string.
parse :: String -> Derivs
parse s = d where
    d    = Derivs r pal char
    r    = s
    pal  = pPalindrom d
    char = case s of
        (c:s') -> Parsed c (parse s')
        []     -> NoParse

-- P
pPalindrom :: Derivs -> Result String
pPalindrom d = alt1 where
    -- P <- x P x
    alt1 = case dvChar d of
        Parsed c1 d' -> case dvPalindrom d' of
            Parsed p d'' -> case dvChar d'' of
                Parsed c2 d''' ->
                    Parsed ("(P <- " ++ (show c1) ++ p ++ (show c2) ++ ")") d'''
                _ -> alt2
            _ -> alt2
        _ -> alt2
    -- P <- x
    alt2 = case dvChar d of
        Parsed c d' -> Parsed ("(P <- " ++ (show c) ++ ")") d'
        _ -> NoParse

-- An equivalent result can be obtained using pappy
-- [http://www.pdos.lcs.mit.edu/~baford/packrat/thesis/]
-- to generate a packrat parser from the following 'Pal.pappy' file:

-- parser Pal:
--
--   top Axiom
--
--   Axiom :: Int =
--       p:Palindrom !Char -> { p }
--
--   Palindrom :: Int =
--       'x' p:Palindrom 'x' -> { p + 2 }
--     / 'x'                 -> { 1 }
--
--   {
--     -- test:
--     -- Pal> map eval [1,3..66]
--     -- [1,3,0,7,0,0,0,15,0,0,0,0,0,0,0,31,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,63,0]
--     --
--     eval :: Int -> Int
--     eval n = case palAxiom (palParse "Palindrom" (map (\i -> 'x') [1..n])) of
--                Parsed v _ _ -> v
--                NoParse e    -> 0
--   }

Paul Mann

neprebran,
22. sep. 2005, 23:52:2922. 9. 05
do
>> the grammar contortions that are often needed tend to be very ugly.
>> From the two LL parser generators that I wrote, this is what
>> annoyed me the most.
>
> I guess with "contortions" you mean the result of factoring
> tokens out to make a grammar conform to LL(1). For compiler generators with
> look-ahead capabilities, such as ANTLR, Coco/R or the TextTransformer, this
> isn't necessary any more.

So now you have another kind of contortion, infinite look-ahead.
Replace one contortion with another for the sake of using LL? Just
proves it is flawed from the start.


Paul Mann
http://parsetec.com

Chris F Clark

neprebran,
23. sep. 2005, 15:56:1923. 9. 05
do
Sylvain Schmitz <sch...@i3s.unice.fr> writes:
> All this thread on PEGs has convinced me of giving them a better look.
...

> All this to get to the conclusion that the immediate translation of a
> CFG into a PEG by replacing "|" with "/" will not necessarily yield the
> expected result. Even the simple grammar "S -> Ac, A -> a | ab" should
> not be translated "S <- Ac, A <- a / ab" but "S <- Ac, A <- ab / a",
> otherwise input "abc" would not be accepted.

Yes, this is "the" problem with PEG's. / implies ordering of the
search (parsing) space. You need to order your / operators so that
special cases (e.g. longer matches), so that they appear first.
Unfortunately, if you don't do this, nothing will tell you you have a
problem with your grammar, it will simply not parse some inputs. To
me this implies that if one wants to use a PEG to parse some input,
then one must exhaustively test the parser.
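
To make that concrete, here is a tiny ordered-choice recognizer in
Haskell (a sketch of my own, in the spirit of the PalPackrat code
posted earlier; the combinator names are made up) showing
"A <- a / ab" silently rejecting "abc" while "A <- ab / a" accepts it:

-- Parsers map an input string to Just <remaining input> on success.
type P = String -> Maybe String

lit :: Char -> P
lit c (x:xs) | x == c = Just xs
lit _ _               = Nothing

(<.>) :: P -> P -> P        -- sequencing
(<.>) p q s = p s >>= q

(</>) :: P -> P -> P        -- ordered choice: q is tried only if p fails
(</>) p q s = case p s of
                Nothing -> q s
                r       -> r

-- S <- A 'c', first with A <- 'a' / 'a' 'b', then with A <- 'a' 'b' / 'a'
sBad, sGood :: P
sBad  = (lit 'a' </> (lit 'a' <.> lit 'b')) <.> lit 'c'
sGood = ((lit 'a' <.> lit 'b') </> lit 'a') <.> lit 'c'

-- sBad  "abc"  ==>  Nothing   (A commits to 'a', leaving "bc", so 'c' fails)
-- sGood "abc"  ==>  Just ""

No warning, no error: the first grammar simply never tries the longer
alternative.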

The exception being, if the grammar is LL(1), then a PEG will
correctly parse it and the above issue won't occur. However, if you
have made your grammar LL(1), it won't have rules like "A -> a | ab"
because you will have had to left-factor them. The correct parsing
of "A -> ab / a" is a feature of PEG's since it obviates
left-factoring. However, one could be just as well off using one of
the LL(k) parsers, which automatically left-factor your grammars and
thus handle "A -> a | ab" without the user worrying about ordering or
having inputs that won't parse.
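
(For reference, the left-factored form that an LL(1) tool would
otherwise force on you is something like

A     -> a Atail
Atail -> b | (empty)

which is exactly the rewriting that the PEG ordering, or an
automatically left-factoring LL(k) tool, lets you avoid.)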

> Russ Cox wrote:
> > One thing PEGs definitely do well (and this is at least one reason why
> > Bryan worked on them) is that they are composable: you can take any
> > terminal symbol in a PEG, replace it with a non-terminal with
> > its own PEG grammar, and you get a bigger, but still valid, PEG grammar.
> This is definitely an excellent property.

Composability is an extremely useful feature. It really helps novice
and naive grammar writers.

Another useful feature is determinism, which PEG's have; by
determinism I mean that there is a unique parse for any legal input,
so you don't have ambiguous specifications.

The final useful property which PEG's have is run-time "linearity":
you can predict parsing time by taking the input length and
multiplying it by some constant factor.

The fourth useful feature, which PEG's don't have, is the one cited
above, a kind of "completeness" or transparency. It is possible to
write a PEG grammar such that some alternative never participates in a
parse, and as a result some inputs which "look" parseable to the naive
grammar writer are actually not.

It is useful to compare this to other popular parsing technologies.

LL(1) and LR(1) parsers both fail composability, but succeed on
determinism, linearity, and transparency.

GLR and Earley parsers are composable and transparent, but are not
deterministic or linear. I think the same properties hold for CYK
parsers, but I'm not well-versed in them, so don't quote me.

I think it is possible to get a parser which nearly succeeds on all
four properties by combining the PEG ordering idea with GLR parsing.
In particular, one runs the GLR parser to get multiple ambiguous
parses, but uses the PEG idea of ordering those parses to
deterministically disambiguate them. Doing so would give one a parser
that is composable, deterministic, and transparent. It would also be
linear on all LL and LR grammars, and just non-linear on ambiguous
grammars. It is even possible to warn on non-LL/LR grammars, so that
the user would know if the grammar is potentially non-linear (and
where the PEG disambiguation rules may be pruning away parses).

By the way, note that the PEG disambiguation rules are not the only
way of disambiguating the parses, but it does have a certain
simplicity to it. Another approach that could be used for
disambiguation is "cost functions" like those used in the Burg tools.
If I recall correctly, iburg has cost functions that impose linear
orders on the result of ambiguous parses.

Chris F Clark

neprebran,
23. sep. 2005, 15:58:2623. 9. 05
do
Detlef Meyer-Eltz <Meyer...@t-online.de> writes:

> This code and especially the integration of semantic actions in this
> code is, what I am interested in. That means code for parsers, which
> are generated from parser descriptions like yacc. I am interested in
> the question, how to do something with a parser.
>
> MY THESIS
> ---------
> is, that recursive descent LL compiler compilers for most purposes
> generate the clearest and most flexible code. Parse trees can be
> created, but the processed text can be treated directly too.

...


> There is a discussion appearing again and again in this forum, whether
> LL or LR parsers should be preferred. I think, the advantages and
> disadvantages are reciprocal. But the one decisive argument in favor
> of the LL parsers, generated by recursive descent compiler compilers,
> is the simplicity of the code generated by them.

Your thesis is not without merit. In fact, few will argue that table
driven parsers (either LL or LR ones) are easier to read than
recursive descent parsers. For most people, your thesis holds true.
They want to be able to read the parser generator output as a series
of functions. In fact, the "OO-parsing" that is discussed in another
thread is just a poor-person's hand-implementation of recursive
descent, which in some sense proves your thesis.

There are advantages to table driven parsing technologies, the main
one being that one can do "things" in the interpreter that aren't
obvious. For example, GLR parsing runs multiple copies of an LR
parser in parallel when it encounters an ambiguity. Similarly, we run
Yacc++ engines in a "call-back" mode designed for being used in an
event driven loop. Many of those things are hard (if not impossible)
to do in a recursive descent implementation, because in recursive
descent, one is limited to the semantics of procedures in the target
programming language. If your target programming language doesn't
support parallel procedure calls (or event driven procedures) and most
don't, you can't use a recursive descent implementation to get such
features. (Note, that is one of the reasons I haven't added a
(modified) recursive descent output format to Yacc++. I'm too
addicted to the features that the table driven engine gives me.)
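
(To make the "call-back mode" remark a bit more concrete -- this is
only a sketch of the general idea, not Yacc++'s actual interface, and
the names are invented -- the point is that a table driven parser's
state is ordinary data, so an event loop can push tokens into it one
at a time instead of the parser owning the read loop:

-- A push ("call-back") parser: the caller feeds tokens, one at a time.
data PushParser tok = PushParser
    { feed :: tok -> Either String (PushParser tok)  -- consume one token
    , done :: Bool                                    -- input so far accepted?
    }

-- The event loop, not the parser, drives the input.
drive :: PushParser tok -> [tok] -> Either String Bool
drive p []     = Right (done p)
drive p (t:ts) = case feed p t of
                   Left err -> Left err
                   Right p' -> drive p' ts

A recursive descent parser cannot easily be used this way, because its
state lives on the host language's call stack.)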

Now, if one doesn't need those "features" then the advantages of a
table driven approach tend to disappear. Thus, since most users don't
need such features, most users see recursive descent as a panacea due
to the readable generated output (and in some ways it is).

======================================================================

However, I would caution about the direct processing of the text as
part of the parsing process. That seems seductively simple. However,
it is a false economy and leads one to create a design that does not
generalize properly. The canonical question that illustrates that
problem is the FAQ of "How do I now handle loops (or conditional
statements) in my input?" Handling loops and conditions are easy if
you build an AST, and often not so easy if your code is ad hoc. The
simplifying assumptions one can make when processing the code linearly
fall apart when you have to handle the more complex cases. Thus, it
is better to avoid such simplifying assumptions in the first place than
to make them and later have to rework all the code that depends upon them.
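
(A small sketch, not tied to any particular tool, of why the AST
matters here: a loop body kept as a tree node can simply be evaluated
again on each iteration, whereas text that was acted on once,
linearly, during the parse is gone:

import qualified Data.Map as Map

data Expr = Num Int | Var String | Add Expr Expr | Less Expr Expr
data Stmt = Assign String Expr | Seq [Stmt] | While Expr Stmt

type Env = Map.Map String Int

evalE :: Env -> Expr -> Int
evalE _   (Num n)    = n
evalE env (Var x)    = Map.findWithDefault 0 x env
evalE env (Add a b)  = evalE env a + evalE env b
evalE env (Less a b) = if evalE env a < evalE env b then 1 else 0

exec :: Env -> Stmt -> Env
exec env (Assign x e) = Map.insert x (evalE env e) env
exec env (Seq ss)     = foldl exec env ss
exec env (While c b)                      -- the body is an AST node,
    | evalE env c /= 0 = exec (exec env b) (While c b)  -- so it can run again
    | otherwise        = env

With ad hoc, during-the-parse actions, the second and subsequent
iterations of the loop have nothing left to work on.)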

In fact, to me the big drawback of recursive descent is that it
allows users to stay in the cocoon of the single threaded top-down
world. In recursive descent, the parser is in control. One starts at
the root and parses according to the grammar, never losing context of
what one has done before. As mentioned, that allows a lot of
simplifying assumptions, and programs written that way tend to be
brittle, because the assumptions are implicit and never challenged
until the program fails. (I have a similar problem with the singleton
design pattern, as it also tends to aid in creating brittle designs.)

Well enough ranting,

Detlef Meyer-Eltz

neprebran,
25. sep. 2005, 12:50:4725. 9. 05
do
Thank you for your long response. I am satisfied that you confirm my
thesis about the advantages of the code created by recursive descent
LL compiler compilers:

>For most people, your thesis holds true

Then you try to show the restrictions of this method. Though I don't
really understand what's meant by

>we run Yacc++ engines in a "call-back" mode designed for being used
>in an event driven loop.

I absolutely agree, that direct processing of the input often is not
sufficient.

I try to take the perspective of a user of a parsing tool, who mostly
isn't an experienced compiler theorist. E.g. he might want to take
some contents from some Internet pages and write programming code
which feeds them into an STL container. For such tasks, not only can a
parser be constructed with the TextTransformer in minutes, but you
even get the result. This is due to the simple insertion of -
interpreted - semantic code into the parsing productions. The
TextTransformer interpreter uses a TextTransformer parser and can

> handle loops (or conditional statements)

Indeed the interpreter uses an AST. Though I concentrate on the
practical application of compiler tools, I am also quite ambitious to
extend the capabilities of the TextTransformer as far as
possible. Until now I have found no real task that I could not in
principle accomplish with this tool. But I confess that, until now, I stayed

>in the cocoon of the single threaded top-down world.

Thanks for your attempt to show me the world outside.

George Neuner

neprebran,
25. sep. 2005, 12:51:3625. 9. 05
do
On 22 Sep 2005 23:51:43 -0400, Chris F Clark
<c...@shell01.TheWorld.com> wrote:

>George Neuner <gneu...@comcast.net> writes:
>
>> The need to embed actions depends greatly on the language ...
>> resolving context sensitivity requires predication and predication
>> requires mid-rule computations.
>
>I think predicates are wonderful. I don't know that predicates
>"require" mid-rule computations. Syntactic ones don't. Semantic
>predicates "are" mid-rule computations, so yes they require them.

That depends on your definitions ... syntactic predicates are
extra-grammatical computation - meaning they are (generally) a nicety
rather than a necessity. Offhand I can't think of a grammar that
could be written using a syntactic predicate but not without one.


>> LR can be inefficient for a context sensitive language because
>> reductions aren't applied until all the tokens have already been
>> constructed. The usual LALR work around is either to defer
>> everything until later [the approach you seem to favor] or to create
>> back channels to the lexer and perform as much disambiguation as
>> possible there. But neither of these approaches handle languages
>> which must be lexed differently in different contexts.
>
>Presuming one has mid-rule computations (and those are not hard to
>implement in an LR parser generator), one can also determine context
>before the dependent tokens have been constructed.

I beg to differ. The tokens may not yet have been *consumed* by the
parser, but they most definitely have already been constructed else
the rule reduction would not have been attempted.

And LR mid-rule actions require complete syntactic predication of the
rule head - if not the entire rule - because the user actions can't be
undone and should not be executed unless the parse is committing to
the rule. Operationally it makes the entire rule head a separately
matched sub-rule, which may be logically incorrect.


>> It is a persistent myth that LL parsers have inheriently more
>> context information than LR parsers. It is only true, if one has
>> null productions that are used in multiple contexts and then only
>> if one is doing SLR or LALR parsing, so that states with different
>> lookahead sets are merged. If one has an LL grammar for the
>> language, an LR parser will also have nicely disambiguated
>> contexts. The myth persists because an LR grammar that is not LL
>> will have ambiguous contexts, which is why the grammar is not LL.
>> Note, when using multiple lookahead (LL(k)) grammars, there are
>> also places where the tokens have to be constructed before the
>> context is determined, i.e. for the first k tokens of rules that
>> need the lookahead for discrimination.

???

AFAICT I never said anything about which method has _more_ context
information available - whatever you think that means.

George

Chris F Clark

neprebran,
25. sep. 2005, 23:44:5025. 9. 05
do
I replied to George Neuner:
c>I think predicates are wonderful. I don't know that predicates
c>"require" mid-rule computations. Syntactic ones don't. Semantic
c>predicates "are" mid-rule computations, so yes they require them.

George Neuner <gneu...@comcast.net> replied back:
g> That depends on your definitions ... syntactic predicates are extra
g> grammatic computation - meaning they are (generally) nicety rather
g> than necessity. Offhand I can't think of a case of a grammar that
g> could be written using a syntactic predicate but not without.

Syntactic predicates are far more than a nicety. There are two well
studied cases that one can parse with predicated grammars that one
can't (deterministically) parse with unpredicated ones: palindromes
{ w w**reverse } and multi-part matching { a**i b**i c**i }. Now,
perhaps those cases are mostly of theoretical importance.
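
(For the record, the multi-part case has a well-known PEG that uses an
and-predicate -- I believe this is essentially the example from Bryan
Ford's PEG paper, using & for the and-predicate, ? for option, + for
repetition, and !. for end of input:

S <- &(A 'c') 'a'+ B !.
A <- 'a' A? 'b'
B <- 'b' B? 'c'

The predicate &(A 'c') checks, without consuming anything, that the
a's and b's balance; B then checks that the b's and c's balance, and
!. requires that nothing is left over.)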

However, syntactic predicates can also be used to resolve conflicts
(and ambiguities), which are a quite practical concern for compiler
writers. For example, If one has a context sensitive special case,
one can usually code the special case and the context up as a
predicate which is then used to parse only the special case part.
that way you tell the parser generator how much context and lookahead
to use to idetify the special case and make the parser generator's job
easier.

Moreover, predicates imply an ordering of the parsing space (just as
parsing expression grammars do), so that one can be certain that the
special case takes precedence over incidental matches with other
sequences. The advantage of predicates is that they apply this
ordering only in limited contexts, not globally over the entire grammar,
so that one knows exactly which clauses to be careful with.

My rule of thumb to Yacc++ users is first write your grammar without
any predicates or precedence rules. Then, look at the conflict
reports, and determine for each of the conflicts how you want the
problem resolved, and from that use either predicates or precedence
directives to guide the generator to parse those conflict situations
the way you want.

By the way, the C++ rule of "if it looks like a declaration it is one,
otherwise it is a statement" can only be handled with a syntactic
predicate.
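
(In a predicated notation -- the exact syntax varies from tool to
tool, so take this only as a sketch -- that rule comes out roughly as

statement : (declaration) >> declaration
          | expression_statement
          ;

that is, trial-parse a declaration as lookahead, and commit to the
declaration alternative only when that trial parse succeeds.)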

Chris F Clark

neprebran,
27. sep. 2005, 09:49:5327. 9. 05
do
On the topic of context sensitive parsing (using mid-rule actions),

I replied to George Neuner:

c>Presuming one has mid-rule computations (and those are not hard to
c>implement in an LR parser generator), one can also determine context
c>before the dependent tokens have been constructed.


And he replied back:
g>I beg to differ. The tokens may not yet have been *consumed* by
g>the parser, but they most definitely have already been constructed
g>else the rule reduction would not have been attempted.
g>
g>And LR mid-rule actions require complete syntactic predication of
g>the rule head - if not the entire rule - because the user actions
g>can't be undone and should not be executed unless the parse is
g>committing to the rule. Operationally it makes the entire rule
g>head a separately matched sub-rule, which may be logically
g>incorrect.


If I understand you (the word "head" in your post implies I may not),
then you misunderstand the LR method. You don't have to wait for the
reduction of the entire rule to apply a mid-rule action. In fact, for
the "correct" semantics, you don't want to wait until the end of the
rule. A mid-rule action in an LR parser should occur at the point
where the lexer has only constructed the tokens preceding the mid-rule
action point. This is the same point the mid-rule action occurs in an
LL parser. If it occurs later, then it is not a mid-rule action; it
is an end-of-rule action written mid-rule.


If by rule head, you mean the part of the rule preceding the point
where the mid-rule action occurs, then yes the tokens preceding the
mid-rule action point have to be constructed before the mid-rule
action occurs, but none of the tokens after the action point.
However, that is exactly the point where you have specified that
action to occur. If it is not the correct point, then move the action
to the correct point.


Perhaps you are trying to suggest that moving the action earlier in
the rule can be problematic (i.e. it can introduce conflicts). That
is generally only true if the grammar is not LL(1). If the grammar is
LL(1), then an LR parser generator can have mid-rule actions inserted
anywhere in the grammar, except before the first item in a rule (and
it can even have them there if the generator is LR and not SLR or
LALR). Before the first position of the rule is a problem for SLR and
LALR generators because they don't handle the context dependence of
nullable rules correctly.


However, since the initial point you made was about advantages of
using LL over LR to do context sensitive lexing, I presume you were
talking about using the parser context to determine how tokens were to
be lexed. My point was that anything one can do with an LL parser
(which requires an LL grammar), one can do with feeding the same LL
grammar to an LR parser generator (with some slight caveats for
trailing context sensitive nullable rules and SLR and LALR
generators).


Moreover, I brought up mid-rule actions because, I presumed you were
suggesting using actions to change how the tokens were lexed and to
implement the context sensitivity. If you are not using actions to
change how the tokens are lexed, please explain further what feature
of LL parsing you are using to do so and why you don't think feeding
that LL grammar to an LR parser generator would achieve the same
result.


Thank you,
-Chris


*****************************************************************************
Chris Clark Internet : com...@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)

------------------------------------------------------------------------------

Chris F Clark

neprebran,
27. sep. 2005, 09:44:3427. 9. 05
do
George wrote this very nice example which illustrates the points about
lookahead and context being discussed well.

> My point was that the addition of the action changes the semantics
> so the grammar written as
>
> rule1 : token1 token2 {action1} token3
>
> will actually be parsed as if it were written:
>
> rule1 : rule2 {action1} token3
> rule2 : token1 token2

Or something equivalent, as in the more likely implementation of:

rule1 : rule2 token3
rule2 : token1 token2 {action1}

> The question then becomes whether the two grammars are equivalent ...
> which is only true if the action is either stateless or reversible.

Or the parsing context is unique at that point. If there is no
conflict at that point, because the parse only has one choice at that
point, then the action can be performed irreversibly.

The context *will be* unique at that point if the grammar is LL(1).
This was my earlier point.

If one has an LL(1) grammar, one can perform the transformation of
splitting the rule to place the action code (any place after the first
item) and still have an LL(1) grammar. Moreover, if the unsplit LL(1)
grammar was SLR (LALR or LR), the split grammar will be too. The
LL(1) condition on first sets guarantees this, because the first item
of any LL(1) rule must be unambiguous. Thus, if you are in the middle
of an LL(1) rule, then you know that there is a unique rule that
applies at that point. You can, of course, make the same argument for
the k-th token of an LL(k) grammar.
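
(Concretely, the LL(1) condition says that for any two alternatives
A -> alpha | beta, FIRST(alpha) and FIRST(beta) must be disjoint, and
if beta can derive the empty string then FIRST(alpha) must also be
disjoint from FOLLOW(A). That is why, in the middle of an LL(1) rule,
there is only one rule that can apply.)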

We will see the importance of this case, when we deal with George's
next grammar.

> Having it anywhere can introduce conflicts. Consider
>
> rule1 : token1 {action1} token2
> rule2 : token1 token3
>
> The rules are not ambiguous, but because the action is not subject
> to look ahead, the parser can't tell which way to go after recognizing
> the common prefix. Logically action1 has to be executed before
> token2 can be recognized ... not parsed, recognized. That means
> the action must be taken *before* look ahead.

Actually the rules are ambiguous, because you cannot tell whether to
execute the action until you have seen the token after the action.
In particular, this grammar is not LL(1). The first token of the rule
does not disambiguate whether to apply rule1 or rule2.

Unfortunately, many parser generators (including I must ruefully admit
Yacc++) do not flag this ambiguity, they simply read one token ahead
and implement the grammar wrong. Of course, there is no correct
implementation of this grammar, as it specifies something which cannot
be done. One cannot know whether to execute the action until the
lookahead has been read to see if the token read is a token2.
However, if the action somehow affects whether a token2 or token3 is
returned from the lexer, one has a circularity.

Effectively, by using the lookahead one is moving the action to a
later point in the grammar, changing the rules to:

rule1 : token1 token2 {action1}
rule2 : token1 token3

Note, because of this issue, I have considered introducing a new
syntax into Yacc++ to indicate actions that should not be "moved"
(i.e. have the tokens after them used as lookahead for
disambiguation). I just haven't figured out what that syntax should
be. That way if the rule was important for getting the token lexed
right and it appeared in a context where the lookahead was important for
knowing whether to apply the action, one would get an error message
indicating the circularity.

> This is where PDA and recursive descent part company. Neither
> can use look ahead alone to decide what to do, but the RD parser
> in most cases can handle the problem more efficiently.
>
> The PDA answer has been to fork the machine, [logically] try all
> branches simultaneously, and prune the branches that fail. However,
> this is generally more costly than a recursive decent approach using
> backtracking. Forking in an RD parser is only necessary if there are
> conflicting actions, as in
>
> rule1 : token1 {action1} token2 token3
> rule2 : token1 {action2} token4 token5
>
> and the scope of the fork in RD is limited to recognition of the first
> token following the conflicting actions.

An LR (including SLR or LALR) parser generator does not need to fork
the PDA to solve this problem. If the conflict can be solved with a
fixed amount of lookahead, the parser simply looks that many tokens
ahead. This is the same as what an LL(k) parser does. (I want to emphasize
this point. For fixed lookahead an LR parser does not need to fork
copies of the PDA. The parallelism can be embedded in the PDA, the
same way that DFA's encode NFA's.)

Effectively, by using the lookahead one is moving the action to a
later point in the grammar, changing the rules to:

rule1 : token1 token2 {action1} token3
rule2 : token1 token4 {action2} token5

As noted above, I consider that solution "wrong" for this case, as it
changes when the action occurs. However, it is in some sense
convenient. Especially if the rules are combined in different
contexts and the action is unambiguous in some and ambiguous in
others, but you want the effect of the action to be the same.

Now if fixed lookahead cannot solve the problem, then one needs
something more powerful than LL or LR parsing. Here is an example of
that case:

rule1 : token1 {action1} rule3 token3
rule2 : token1 {action2} rule3 token5
rule3 : token2 | token4 rule3

In this case, one needs either backtracking or GLR parsing or the
ability to move the action across non-terminals, as in:

rule1 : token1 rule3 token3 {action1}
rule2 : token1 rule3 token5 {action2}
rule3 : token2 | token4 rule3

Now, the above is the actual "correct" grammar for the case, with no
actions that require lookahead. However, I know of no parser
generator that will implicitly make "that" correction. Since the
correction radically changes the point at which the action is
executed, it is perhaps good that no parser generator does that.

The backtracking solution works like one has a predicated grammar. In
fact, I think of predicates as indicating what parts of the grammar
one wishes to backtrack over. Here is a predicated version of the
grammar:

rule1 : token1 (rule3 token3) >> {action1} rule3 token3
rule2 : token1 (rule3 token5) >> {action2} rule3 token5
rule3 : token2 | token4 rule3

The notation used above for predicates, says the text in parenthesis
preceding the >> operator is a predicate. That is the parser parses
the text in the parenthesis as "lookahead" before continuing the
parse. If the text is matched, the parse continues with the predicate
text rewound (placed back in the input, backtracked over) before the
rest of the rule is matched. If the text isn't matched, the parse
does not apply the rule and rewinds the text before attempting some
other rule. Note, in Bryan Ford's notation this is an & predicate.
There is also an inverse predicate << that says continue parsing only
if the predicate does not match. I don't recall Bryan Ford's notation
for that offhand (I believe he writes it as !e, the not-predicate).

> Cloning the PDA also creates bookkeeping problems - like what to do
> with side effects like AST building - which can usually be avoided using
> the RD approach because the side effects can be deferred until the
> correct rule context is known.

As George correctly notes, backtracking is not the only solution. GLR
parsers don't require predicates and simply fork copies of the parser
PDA at these conflict points. His issues with the forking that GLR
does are generally well founded, but are probably overly cautious for
the typical case.

There is yet another option which hasn't been addressed. The GLR
problem with forking is that one ends up with potentially many forks,
since each input symbol can potentially introduce forks for each
different alternative. However, the way that LR parsers handle the
parallel branches of the PDA shows that there can be a low cost
solution. Instead of backtracking or forking parallel parsers, one
can simply compute the lookahead PDA using a variant of the LR
algorithm. If the grammar is unambiguous, this PDA exists as the
limit of a closure. This technique I call LR(infinite) or LR(closed)
parsing. (The only difficulty is that if the grammar is not
unambiguous, the construction method can loop and one cannot tell if
the loop will ever converge. This is a side-effect of the halting
problem.)

To close, it is worth mentioning that for predicated grammars, we
implemented them using backtracking in Yacc++. We have done
experiments with LR(closed) parsing using Yacc++, but the results have
not led us to think that it is a panacea that is ready for general use.

George Neuner

neprebran,
30. sep. 2005, 02:02:2430. 9. 05
do
On 27 Sep 2005 09:44:34 -0400, Chris F Clark <c...@TheWorld.com> wrote:

>George Neuner <gneu...@comcast.net> writes:
>> Having [mid-rule actions] anywhere can introduce conflicts. Consider


>>
>> rule1 : token1 {action1} token2
>> rule2 : token1 token3
>>
>> The rules are not ambiguous, but because the action is not subject
>> to look ahead, the parser can't tell which way to go after recognizing
>> the common prefix. Logically action1 has to be executed before
>> token2 can be recognized ... not parsed, recognized. That means
>> the action must be taken *before* look ahead.
>
>Actually the rules are ambiguous, because you cannot tell whether to
>execute the action until you have seen the token after the actions.
>In particular, this grammar is not LL(1). The first token of the rule
>does not disambiguate whether to apply rule1 or rule2.

I agree that the grammar is not LL(1), but I question the conclusion
of ambiguity (see below). Mid-rule actions are technically meta
information and not part of the grammar. Depending on what they
actually do, there may be no noticeable effect on the parsing.


>Unfortunately, many parser generators (including I must ruefully admit
>Yacc++) do not flag this ambiguity, they simply read one token ahead
>and implement the grammar wrong. Of course, there is no correct
>implementation of this grammar, as it specifies something which cannot
>be done. One cannot know whether to execute the action until the
>lookahead has been read to see if the token read is a token2.
>However, if the action somehow affects whether a token2 or token3 is
>returned from the lexer, one has a circularity.
>
>Effectively, by using the lookahead one is moving the action to a
>later point in the grammar, changing the rules to:
>
> rule1 : token1 token2 {action1}
> rule2 : token1 token3
>
>Note, because of this issue, I have considered introducing a new
>syntax into Yacc++ to indicate actions that should not be "moved"
>(i.e. have the tokens after them used as lookahead for
>disambiguation). I just haven't figured out what that syntax should
>be. That way if the rule was important for getting the token lexed
>right and it appear in a context where the lookahead was important for
>knowing whether to apply the action, one would get an error message
>indicating the circularity.

Your point is well taken. However, given the original grammar above,
I would argue that the likelihood that the action creates circularity
is very small. It seems to me that a reasonable grammar would
position such an action in a prefix common to both rules rather than
in either one of them. As the grammar is written it is far more
likely that the action affects only tokens following it.

Of course the tool must be conservative and issue a warning because it
cannot know the meaning of the action. However, I think it is quite
reasonable to try the rules without embedded actions first and
backtrack on failure, or alternatively, to fork and try the rules in
parallel.


>Now if fixed lookahead cannot solve the problem, then one needs
>something more powerful than LL or LR parsing. Here is an example of
>that case:
>
> rule1 : token1 {action1} rule3 token3
> rule2 : token1 {action2} rule3 token5
> rule3 : token2 | token4 rule3
>
>In this case, one needs either backtracking or GLR parsing or the
>ability to move the action across non-terminals, as in:
>
> rule1 : token1 rule3 token3 {action1}
> rule2 : token1 rule3 token5 {action2}
> rule3 : token2 | token4 rule3
>
>Now, the above is the actual "correct" grammar for the case, with no
>actions that require lookahead.

It's only correct if one assumes the actions have no effect on lexing
- which is probably wrong. We've already established that moving
actions is logically wrong no matter what they do.


>The backtracking solution works like one has a predicated grammar. In
>fact, I think of predicates as indicating what parts of the grammar
>one wishes to backtrack over. Here is a predicated version of the
>grammar:
>
> rule1 : token1 (rule3 token3) >> {action1} rule3 token3
> rule2 : token1 (rule3 token5) >> {action2} rule3 token5
> rule3 : token2 | token4 rule3
>
>The notation used above for predicates, says the text in parenthesis
>preceding the >> operator is a predicate. That is the parser parses
>the text in the parenthesis as "lookahead" before continuing the
>parse. If the text is matched, the parse continues with the predicate
>text rewound (placed back in the input, backtracked over) before the
>rest of the rule is matched. If the text isn't matched, the parse
>does not apply the rule and rewinds the text before attempting some
>other rule.

This is where I was going in my previous post where I said rules with
embedded actions should be handled by predication. The problem lies
in your assumption of circularity being introduced by actions, which is
an issue I had never considered (see above). If one must always
assume circularity, then I am not aware of any existing method that
will work. I suppose a recursing combinative fork method could be
used but I can't imagine such a thing being very efficient or easy to
automatically construct.

George

Chris F Clark

neprebran,
2. okt. 2005, 02:51:042. 10. 05
do
In George's reply, I think we converged on agreement.

> However, given the original grammar above,
> I would argue that the likelihood that the action creates circularity
> is very small.

...


> Of course the tool must be conservative and issue a warning because it
> cannot know the meaning of the action.

Yes, the problem discussed is exceedingly rare. I'm mostly aware of
it, because, as a parser generator author, I see the worst cases. As
in "why did your tool complain when I did this reasonable thing?" and
(at the same time but from a different user) "why didn't your tool
complain as I shot myself in the foot?" The problem is that there are
places where the theory (especially as it meets practice) is really
opaque and one can easily write something that looks obvious, but
doesn't do what you want for reasons that aren't obvious at all.
Fortunately, such cases are generally rare as George mentioned. The
problem is that, as a parser generator writer, one has to balance those
concerns. Except in the simplest cases, one's tool is likely to
generate both false positives and false negatives--places where
problems are indicated where there are none and places where it fails
to indicate a problem which is there.

And, when George started this conversation off with a line about
context sensitive grammars, my oh-this-is-really-difficult red-lights
went off. With a little cleverness one can parse some very tricky
cases. However, when one makes a mistake in such specifications, the
tool isn't going to help you much. A tool that is too restrictive
isn't going to let you write what you need to solve the problem. A
tool that isn't restrictive enough is going to let you write something
that won't work reliably. In the worst case, the tool lets you write
something that looks like it will work (and does so for some trivial
cases), but then falls down when you are using it on real live data,
perhaps after it has become mission critical, because it worked for a
while.

To bring this back around to "parsing expression grammars", that is
one of my concerns with them. The ordering of alternatives in PEG's
makes it really easy to write applications that do just that: look like
they are going to work, but don't. What we don't have data on is how
much of a problem this is in "real life". Perhaps authors don't make
the kind of mistakes that would cause PEG's to be unreliable.

What I do know is that if one writes an LL(1) grammar (and not a
predicated one), then one is safe. One can use any of the major
parsing technologies and either get something that works from that
grammar or get out an error message saying there is a mistake. Well,
a non-LL tool won't tell you if you've wandered outside the safe LL
subset, so maybe you need to check your grammar against an LL tool to
be certain, but even if not you're generally in pretty nice territory.

In any case, thanks to George for the wonderful exchange. I think in
trying to explain what I meant, I made even my own understanding
better.

Quinn Tyler Jackson

neprebran,
2. okt. 2005, 02:55:062. 10. 05
do
Chris F Clark said:

> However, syntactic predicates can also be used to resolve conflicts
> (and ambiguities), which are a quite practical concern for compiler
> writers. For example, If one has a context sensitive special case,
> one can usually code the special case and the context up as a
> predicate which is then used to parse only the special case part.
> that way you tell the parser generator how much context and lookahead
> to use to idetify the special case and make the parser generator's job
> easier.

Indeed, predicated grammars can accept a great number of fairly intricate
programming language constructions, as seen in this paper from Okhotin:

"On the existence of a Boolean grammar for a simple procedural language"

Found here:

http://www.cs.queensu.ca/home/okhotin/boolean/

Except for the behavior of the conjunction operators | and /, conjunctive
grammars and PEGs are pretty much one and the same, formally.

I suppose one might be able to show a direct relation of the implied
predication of the ordering in PEGs and the explicit predicates found in
both formalisms, and this suspicion leads me to believe that one way to
address language equivalency with PEGs is not to examine their relation to
the CFLs, but to the conjunctive/Boolean languages, and work from there.

Now, it could be shown that there exists no algorithm for detecting such
cases as the P <- x P x / x situation noted in other posts, since x can be
replaced with any arbitrary production X, and the crux of finding such cases
would rest in proving that the LHS and RHS expressions of P accept the same
language, where that language might be a CSL. I haven't given this too much
formal thought, and so I freely admit I could be speaking off the top of my
head.

That is to say, one could easily say that P <- X P X / X is not acceptable
(or potentially misleading), but what of P <- X P Y / Z where X, Y, and Z
generate the same strings?

--
Quinn

http://members.shaw.ca/qtj/

Sylvain Schmitz

neprebran,
4. okt. 2005, 01:45:404. 10. 05
do
Quinn Tyler Jackson wrote:
> Indeed, predicated grammars can accept a great number of fairly intricate
> programming language constructions, as seen in this paper from Okhotin:
>
> "On the existence of a Boolean grammar for a simple procedural language"

I would not really consider predicates on the same level as the
operators in boolean grammars: syntactic predicates are additional
run-time checks written by the user to help disambiguate a conflict. As
such, they are highly convenient and utterly dangerous, allowing more or
less arbitrary code to be run during the parsing phase, possibly
unrelated to the language at hand and possibly introducing an
exponential overhead. Basically, you have to trust the grammar writer
not to do anything harmful.

Boolean grammars on the other hand integrate the additional
conjunction and negation operations with clear language semantics; I'm
not sure the difference in philosophy is sensible here, but it is rather
huge: the operations are really part of the grammar, and not ad hoc rules.

> I suppose one might be able to show a direct relation of the implied
> predication of the ordering in PEGs and the explicit predicates found in
> both formalisms, and this suspicion leads me to believe that one way to
> address language equivalency with PEGs is not to examine their relation to
> the CFLs, but to the conjunctive/Boolean languages, and work from there.

That seems the best hope one would have of expressing exactly what the
language of A / B is, given the language of A and that of B. It still seems
very difficult. (The formula I expressed earlier in this thread was
completely false.) Its definition involves the remaining "unmatched"
input, and I don't really see how to get rid of it.
In his "Boolean Grammars" paper, Okhotin presents a grammar for {
a^(2^n) | n >= 0 } (on page 23), which bears little resemblance with
P<-xPx/xx.

--
Regards,

Sylvain

Chris F Clark

neprebran,
5. okt. 2005, 00:00:445. 10. 05
do
Sylvain Schmitz wrote this interesting comment:

> I would not really consider predicates on the same level as the
> operators in boolean grammars: syntactic predicates are additional
> run-time checks written by the user to help disambiguate a conflict. As
> such, they are highly convenient and utterly dangerous, allowing more or
> less arbitrary code to be run during the parsing phase, possibly

^^^^^^^^^


> unrelated to the language at hand and possibly introducing an
> exponential overhead.

I'm not sure what you mean by the underlined word. If you mean that
predicates allow running arbitrary "host language" (e.g. C) code in
the process of the parse, then you are talking about SEMANTIC
predicates. SYNTACTIC predicates only allow one to write grammar
fragments (rules, non-terminals, and regular expressions) that are
used as "lookahead".

Now, part of your contention is potentially true for syntactic
predicates. When implemented via backtracking, they do have a
potential exponential cost. However, so does converting an NFA to a
DFA. In practice, I have not seen that as an issue. Moreover, it is
not required that one implement predicates via backtracking. I
believe it is possible to fold the predicates into the normal parsing
process for many instances and to have guaranteed linear time bounds
on performance with a run-time overhead comparable to standard LR
parsing.

So, if your argument is that predicates are not a panacea and one can
write very poorly structured parsers that use predicates to
bail-themselves out, but which as a result are unreliable, slow, and
otherwise bad, I would agree. However, if you are arguing that
boolean grammars resolve all these problems, I'll need more evidence,
especially that we are talking about the same things. What I've read
of boolean grammars maps loosely onto how I view syntactic
predicates. It is a little more formal. However, that is to be
expected, since it is intended as a formal system more than the ad hoc
way predicates were initially formulated.

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : com...@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)

------------------------------------------------------------------------------

Sylvain Schmitz

neprebran,
6. okt. 2005, 15:02:206. 10. 05
do
Chris F Clark wrote:

> Sylvain Schmitz wrote:
>>they are highly convenient and utterly dangerous, allowing more or
>>less arbitrary code to be run during the parsing phase, possibly
> ^^^^^^^^^
>>unrelated to the language at hand and possibly introducing an
>>exponential overhead.
>
> I'm not sure what you mean by the underlined word.

The choice of word was poor indeed. By it I meant "arbitrary recognition
code", such as regular expressions, and thus syntactic predicates.

We seem to agree; I'll just throw a few more lines in favor of boolean
grammars instead of syntactic predicates. I feel the difference is
indeed in the formal way boolean grammars are treated, which means that
the boolean operations have very clear consequences on the generated
language. We can thus expect them to be used by the language designer,
for instance to express a non-CF but still syntactic requirement, like
naming scopes.
Syntactic predicates on the other hand are tools of the language
implementor, not of the language designer. They are used to solve an
implementation problem. The language implementor might then recognize a
slightly different language than the intended one when he interferes
with the parser by adding syntactic predicates.

--
Regards,

Sylvain

A Pietu Pohjalainen

neprebran,
13. okt. 2005, 20:41:5213. 10. 05
do
Chris F Clark <c...@shell01.theworld.com> wrote:
> Detlef Meyer-Eltz <Meyer...@t-online.de> writes:

>> MY THESIS
>> ---------
>> is, that recursive descent LL compiler compilers for most purposes
>> generate the clearest and most flexible code. Parse trees can be
>> created, but the processed text can be treated directly too.
>

> Your thesis is not without merit. In fact, few will argue that table
> driven parsers (either LL or LR ones) are easier to read than
> recursive descent parsers. For most people, your thesis holds true.
> They want to be able to read the parser generator output as a series
> of functions. In fact, the "OO-parsing" that is discussed in another
> thread is just a poor-person's hand-implementation of recursive
> descent, which in some sense proves your thesis.

Yes. This is exactly the point of writing the parser using function
calls: the next guy, who doesn't know about compiler generator tools
wants to have the parser implemented in terms that he already knows;
maybe stepping in the debugger and not wondering why his debugger
opened a strange file with suffix '.y'.

What 'OO-grammars' give to this picture is added clarity: class
structure of the parser reflects the grammar structure. However,
there's no need to restrict possible implementations only to
hand-implemented static code doing function/method/constructor
calls. But once the implementation moves away from this approach,
its 'steppability' is also reduced.

(Of course, you can always generate beautiful static code, which has
good steppability as well).

br,
Pietu Pohjalainen

Chris F Clark

neprebran,
14. okt. 2005, 17:24:3614. 10. 05
do
I wrote:

> They want to be able to read the parser generator output as a series
> of functions.

Pietu Pohjalainen replied:


> Yes. This is exactly the point of writing the parser using function
> calls: the next guy, who doesn't know about compiler generator tools
> wants to have the parser implemented in terms that he already knows;
> maybe stepping in the debugger and not wondering why his debugger
> opened a strange file with suffix '.y'.
>
> What 'OO-grammars' give to this picture is added clarity: class
> structure of the parser reflects the grammar structure. However,
> there's no need to restrict possible implementations only to
> hand-implemented static code doing function/method/constructor
> calls. However, once the implementation moves away from this approach,
> its 'steppability' is also reduced.

However, this is exactly what I fear. Some subsequent developer
coming along and fiddling with the readable generated code to hack in
some special case, not realizing that their special case breaks some
other call to the same function that just doesn't happen to be
exercised in this debugging session (or by this set of inputs).

Using a tool (and thus having an obscure ".y" file) prevents those
kind of accidents, because the tool can (and should) check consistency
for you. When your debugger steps you into some file which is
different from what you normally work on, one is supposed to realize,
"Oh, maybe something different and special is happening here; maybe I
should be careful!"

I have the same sympathy for a person who finds that too constraining
as I do for people who don't wear seatbelts. Sure, we can write and
debug our recursive descent compilers by hand. We can also program
our machines in binary. What makes one think that doing it that way
makes for more reliable software?

Similarly, are you really convinced that stepping through input
samples one token at a time is the best way to determine the
correctness of a grammar? I can understand that once one finds a flaw
and is trying to understand where it comes from, one may need to step
through a little to see "where" the code is going wrong. However,
that should generally be one of the last tasks one does and shouldn't
have to do so on a regular basis. I can probably count on one hand
the number of times I have stepped through the parsing of Verilog
programs in the parser I wrote. In the last year, I did it just once
when updating the grammar from the 1995 to the 2001 spec and that was
to uncover a typographical error that sent a parse down "the wrong
path". I did make and fix other errors, but none by stepping through
the parsing of sample inputs. Most of my errors were found by the
warnings the tool I was using gave me.

I have nothing against recursive descent as an implementation
technique as the output of a tool (other than it can lead to the
acceptance of hand-written recursive descent compilers). However, I
think it is irresponsible to recommend hand-writing recursive descent
parsers. It leads to shoddy programming.

End of tirade.

My apologies if this attack is too harsh or seems too personal. It's
just something I feel strongly about, so excuse me if my objectivity
is blurred in this case. It's just that the flaws of hand-written
code are so blatant to me and I don't understand why people don't see
the obvious.

-Chris

*****************************************************************************
Chris Clark Internet : com...@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)

------------------------------------------------------------------------------
[If you hadn't said it, I would. Writing code to make it easy for people
who don't understand it to make arbitrary changes does not strike me as a
winning strategy. -John]

Paul Mann

neprebran,
15. okt. 2005, 13:00:4715. 10. 05
do
"Chris F Clark" wrote ...

> I have nothing against recursive descent as an implementation
> technique as the output of a tool (other than it can lead to the
> acceptance of hand-written recursive descent compilers). However, I
> think it is irresponsible to recommend hand-writing recursive
> descent parsers. It leads to shoddy programming.
>

> [If you hadn't said it, I would. Writing code to make it easy for
> people who don't understand it to make arbitrary changes does not
> strike me as a winning strategy. -John]

Thanks Chris and John !

I look forward to the day when we view recursive-descent hand-written
code as we view hand-written applications in assembly language.

Also, doing research with error recovery, partial parsing, and other
things is virtually impossible with hand-written recursive descent
parsers.

When I first discovered table-driven parsers I was relieved to know
that there was something I could depend on to be free of errors.
Finally, something in computer "science" that has a hint of
engineering associated with it.

Paul Mann
http://parsetec.com
