> Is there anything to be cautious of when using LL? I looked in
> Principles of Compiler Design, but I couldn't decipher exactly what
> the pros and cons were.
There is no strict ordering between LL(k) and LALR(1); that is, I know of
one grammar that is LL(k), but not LALR(1). HOWEVER, LALR(1) has stronger
recognition strength in practice. yacc's advantage in this area has
narrowed due to PCCTS's full LL(k>1) parsing (only k=1 was commonly
available before). Strength is not the end of the story because one
rarely wants to merely recognize --> we TRANSLATE. Here, LL parsers are
the clear winner. A few highlights:
o Actions cannot introduce ambiguities into your LL grammar. All yacc
programmers say "arrgggghhh!" here.
o LL rules may inherit attributes; i.e. you can pass stuff to them just
like in a programming language. For example, you can pass a scope to
your rule that recognizes declarations. Future versions of PCCTS
will even let you pass productions to rules-->context-sensitive
parsing.
o If recursive-descent compilers are generated, then local, stack-based
variables are trivial to implement. Bottom-up folks have no
*convenient* way. Local variables are useful because you get a
new one at each rule invocation; e.g. a new 'scope' variable appears
every time you enter rule 'function'.
The disadvantage really only lies in the fact that you have more constraints
when writing LL(k) grammars--no left-recursion and no common prefixes of
tokens >= k tokens in length.
> Basically, I just use yacc (bison whatever) to parse some input
> specification files.
If you can do what you need to do with yacc and/or already have grammars
that do what you want, then there is no reason to change.
Moral of story: Most languages can be described easily with LL(k);
the semantic flexibility of LL(k) makes any fancy
footwork regarding the grammar worth the effort for me.
One other point: The tool's programming interface is also a consideration.
Does the system allow EBNF? Can it build trees for you
automagically? Can you debug the resulting parsers
(tables vs recursive-descent) easily?
Terence Parr, pa...@ecn.purdue.edu
Purdue University Electrical Engineering
--
Send compilers articles to comp...@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
pa...@ecn.purdue.edu (Terence J Parr) writes:
>Strength is not the end of the story because one rarely wants to merely
>recognize --> we TRANSLATE.
This is a good point. We want don't so much want to recognize a
_language_ so much as generate a parse tree of the input for a particular
grammar. The _grammar_ itself is important; if one has to contort the
"natural" grammar (i.e., the one that matches the semantic grouping) of
language in order to accomodate our parsing scheme, this is a serious dis-
advantage. This is why, for example, we don't just convert our ambiguous
grammars into Chomsky normal form and apply the obvious dynamic
programming parsing scheme. We'll see in a minute that both LL(k) and
LR(1) parsing schemes suffer from this problem.
>Here, LL parsers are the clear winner. A few highlights:
>
>o LL rules may inherit attributes; i.e. you can pass stuff to them just
> like in a programming language. For example, you can pass a scope to
> your rule that recognizes declarations. Future versions of PCCTS
> will even let you pass productions to rules-->context-sensitive
> parsing.
People often make the mistakes that (1) yacc <=> LALR parsing and
(2) yacc is limited to synthesized attributes. If you think about what is
actually happening during parsing, you will realize that the left-context
of each production is available, albeit in a somewhat disguised form: the
left context of currect production is the contents of the entire parsing
stack. Accessing the left context of a production is no more painful than
accessing the attributes of the rhs of a production. The problem is that
we've been spoiled by yacc's convenient use of the $n constructs.
Considering your example, where expressions are parsed in a
context that containing declarations, we can do the following: Suppose our
grammar is:
exp -> LET decls IN exp | atexp
exp -> exp + exp | ....
decls -> ...
Then I can access the attribute of the "decls" by the following:
exp : exp '+' exp1 {$$ = parse_in_context($1,$3,$<decls>-1);
since the stack at the time of the reduction of exp '+' exp -> exp looks
like:
------------------------------
| exp | <attribute of exp> | = "$3"
------------------------------
| '+' | <attribute of '+'> | = "$2"
------------------------------
| exp | <attribute of exp> | = "$1"
------------------------------
| IN | <attribute of IN> | = "$0"
------------------------------
| decls | <attribute of decls> | = "$-1"
------------------------------
| LET | <attribute of LET> | = "$-2"
------------------------------
A "hack" you say? Well, a hack is in the eye of the, er, hacker.
Clearly, accessing the synthesized attributes is, by the nature of yacc's
facilities, somewhat easier than accessing the left context. But that is
a feature of yacc, not LR parsing. All I need to do general L-attributed
grammars is an attribute stack, and I would be happy if yacc only allowed
action functions that took a single argument: the top of the attribute
stack. It's possible to write a compiler that uses no variables other
than the attribute stack (along with some locals of action functions).
You keep the symbol table in the attribute stack; this makes exiting
declaration scopes painless.
>The disadvantage really only lies in the fact that you have more constraints
>when writing LL(k) grammars--no left-recursion and no common prefixes of
>tokens >= k tokens in length.
But this is no small restriction! Consider the common expression
grammar:
E -> E + T | E - T | T
T -> T * N | T / N | N
N -> (E) | 1 | 2 | ...
This grammar is not LL(k) for any k (since I can pump with the
parentheses.) But this is the natural grammar for this language. Sure, I
can change it to LL(1) by changing the associativity of the operators, but
that produces a parse tree that doesn't reflect the intended semantics of
the language, and will cause me headaches later on.
I don't see any good solution to this, other than LR parsing. How
_do_ you parse expressions in an LL(k) parser? Sure, I can write a parser
that, when trying to consume an E when looking ahead at a "(", recursively
calls itself, but that isn't an LL(k) parser.
>Moral of story: Most languages can be described easily with LL(k);
> the semantic flexibility of LL(k) makes any fancy
> footwork regarding the grammar worth the effort for me.
Moral of _my_ story: Most languages can be described easily with LR(1);
the grammatical flexibility of LR(1) makes any fancy
footwork regarding the semantics worth the effort for me.
:-) :-) :-) :-) :-) :-) :-) :-) :-)
Real moral of the story: Different tools for different people. Finally,
it's important to distinguish between the _tool_ and the _technique_:
*** YACC IS NOT THE FINAL WORD IN BOTTOM-UP PARSING! ***.
BTW, if anyone wants to see a _really_ slick parser and parser-generator
tool, check out the parser in the SML/NJ compiler, generated by ML-yacc.
A very nice piece of code!
--
Jack Eifrig (eif...@cs.jhu.edu) The Johns Hopkins University, C.S. Dept.
An then there would be the if-then-else problem. If you have a grammar like
...
stmt --> ifstmt | otherstmts
ifstmt --> 'if' expr 'then' stmt elsepart
elsepart --> \epsilon | 'else' statement
...
you have the ambiguity in parsing a statement like
if e1 then if b2 then s1 else s2
not knowing if the else should go with the inner if or the outer one.
You can transform the grammar, keeping it context-free of course, but the
transformed grammar that pairs an else with the innermost if is not LL(k)
for any k (says John Gough in "Syntax Analysis and Software Tools").
However, using explicit delimiters (hooray!) like end or fi nicely avoids
this problem.
--
Debora Weber-Wulff d...@inf.fu-berlin.de
Institut fuer Informatik +49 30 89691 124
Nestorstr. 8-9
D-W-1000 Berlin 31
I agree that "pure" LL is a pain to work with. Your example is unfair
though because you made no attempt to translate it into an equivalent LL
grammar (by left-factoring). I could give you an equally unfair LR set of
productions that had right recursion.
However, using extended BNF for recursive descent parsing removes the
horrors of left recursion and common prefixes. It also makes attachment
of semantic actions very easy. I would rewrite your grammar as:
E -> T { ("+" | "-") T }
T -> N { ("*" | "/") N }
N -> "(" E ")" | 1 | 2 | . . .
The first translator writer strategy I tried was translating the EBNF back
to LL BNF and then traditional LL parsing. This too was a pain. I then
discovered the syntax trees approach on Barrett, Bates, Gustofson, and
Couch "Compiler Construction: Theory and Practice" which builds a
canonical syntax tree for EBNF productions, enables first and follow set
evaluation, and very nicely allows you to produce one procedure for each
EBNF production. The semantic actions are neatly embedded between the
grammar symbols.
A TWS language is not easy by adding specification of attributes
(inherited, synthized, and local = temporary values). This allows the TWS
to automatically generate the local storage declarations as well as the
formal and actual procedure header declarations. Add to this the matching
of a name with each grammar terminal symbol and the produced parser is
even readable.
The above has been implemented in Modula-2 if anyone is interested.
The bottom line is that producing recursive descent translator writer
systems is very straight forward and clean if one uses EBNF. It works for
me much better than LR parsing.
Tom Reid, Director of Computer Science re...@vtopus.cs.vt.edu
Virginia Tech Northern Virginia Graduate Program (703)698-4712
2990 Telestar Court, Falls Church, VA 22042
(1) top-down parsing is "more intuitive" in the sense that it's a
lot easier to teach to neophytes.
(2) "grammar hacking" is easier with a bottom-up parser; certain common
constructs (e.g. expressions) are more "naturally" expressed,
and changes are less likely to produce a grammar that the parser
can't handle.
(3) top-down parsers make it easier to handle semantic routines.
As several people have noted, you have to break your productions
into pieces or introduce new dummy symbols that generate epsilon
in order to get a semantic routine into the middle of a production
in yacc. Fancier bottom-up parser generators automate this process
somewhat, but they *cannot* let you put semantic routines in *arbitrary*
places; you still have to hack your grammar if you want to do
something in the left corner of a production.
(4) If you can't get hold of a parser generator (less of a problem
than it once was, but still possible), you'll want to parse top-down
with recursive descent.
(5) In table-driven parsers, management of space for attributes has
a very different "feel" in the top-down and bottom-up worlds,
and different programmers seem to have different preferences.
The "natural" attribute stack for bottom-up parsers is intuitive
and easy to automate, but inherited attributes are a hack. Automatic
management of space for top-down attributes is also possible,
but forces you to use a *lot* of copy rules. Ad-hoc, explicit
pushing and popping of a semantic stack is probably preferable in
top-down parsers; many programmers find it elegant, once they
get the hang of it.
(6) Bottom-up parsers can handle a larger class of languages, but this
doesn't seem to be a serious problem in practice; most "real"
languages have a grammar that's LL(1), or close enough to handle with
simple disambiguating rules.
(7) Top-down parsers require less table space, but this is seldom a problem
any more.
One clarification from an earlier post: you do NOT have to change the
associativity of your operators (i.e. change the semantics of your
language) to parse expressions top-down. You have to build a parse tree
that does not naturally reflect that associativity. During semantic
analysis, you use inherited attributes to push information across
production boundaries.
E -> E + T | E - T | T
T -> T * N | T / N | N
N -> (E) | 1 | 2 | ...
becomes
E -> T T_tail
T_tail -> - T T_tail | + T T_tail | epsilon
T -> N N_tail
N_tail -> * N N_tail | / N N_tail | epsilon
N -> (E) | 1 | 2 | ...
This is LL parsing at its absolute worst; if you can stomach expressions,
you won't object to anything else.
T_tail and N_tail have inherited attributes that summarize the information
to the left in the expression. It's a little messy, but not really
difficult.
It should probably be pointed out that the (less common) RIGHT-associative
operators (e.g. exponentiation) are more naturally expressed with top-down
grammars.
--
Michael L. Scott
Computer Science Dept. (716) 275-7745, 5671
University of Rochester FAX 461-2018
Rochester, NY 14627-0226 sc...@cs.rochester.edu
Given an LR(1) grammar, to which one now must add semantic
actions/attribute functions, one may wind up with the following two
apparently unrelated productions:
foo -> A B {action1} C D
bar -> A B {action2} C E
But since the LR construction puts the productions into the same state,
there is a conflict in the actions. The translator-writer must revise the
action to handle both situations. With an LL(1) grammar, the
common-prefix will already have been factored; once the grammar is LL,
actions can be added without further trouble. It's all a matter of when
you go through the pain.
> A "hack" you say? Well, a hack is in the eye of the, er, hacker.
Push this thought on the stack for later retrieval.
>>The disadvantage really only lies in the fact that you have more constraints
>>when writing LL(k) grammars--no left-recursion and no common prefixes of
>>tokens >= k tokens in length.
> But this is no small restriction! Consider the common expression
>grammar:
> E -> E + T | E - T | T
> This grammar is not LL(k) for any k
>But this is the natural grammar for this language. Sure, I
>can change it to LL(1) by changing the associativity of the operators,
> I don't see any good solution to this, other than LR parsing. How
>_do_ you parse expressions in an LL(k) parser?
a) Natural is in the eye of the naturalist. Having taught grammars and
parsing to undergraduates, I have to recognize that many people won't find
the LR grammar very natural either. A left-recursive grammar does not
supprt the abstract view of expressions as a list of operands separated by
operators, a view which is suggested by this grammar:
E -> T <more Ts>
<more Ts> -> + T <more Ts>
<more Ts> -> empty
In practice there are a few LL idioms, and once you get used to them,
writing LL grammars is easy. You can then make an informed trade-off
between the BNF and the semantic actions.
b) You parse expressions LL(1) by adjusting where you put the semantic
actions. The above LL grammar produces a right-leaning parse tree, but
does not put an obvious associativity on the operator, since all the
operands are at different levels of the tree. Pass the expression-so-far
as an inherited attribute, and do the important evaluation after seeing
the second operand:
<more Ts> -> + T {eval expr-so-far} <more Ts>
You may prefer to have all of the associations explicit in the parse tree,
but I find it to be no hackier than passing inherited attributes during a
bottom-up parse.
>Real moral of the story: Different tools for different people.
De gustibus non est deterministum.
> Finally, it's important to distinguish between the _tool_ and the
>_technique_: *** YACC IS NOT THE FINAL WORD IN BOTTOM-UP PARSING! ***.
Amen brother! Many people seem to think that there are only two ways to
build a parser: yacc, and hand-constructed recursive descent. But there
are other bottom-up parser generators, and there are a variety of top-down
generators.
d...@inf.fu-berlin.de writes:
>Perhaps I've missed something, but it would seem to me that LL parsers
>need a lot of memory for all the pending production procedure calls, i.e
>you can't reduce <prog> ::= <decl> <stmts> until the very last token has
>been read in. You would normally have as many pending calls as the depth
>of the parse tree at the moment. This could be a problem for some systems.
Recursive-descent parsers commonly convert the tail-recursive productions
into iteration in the parsing procedure, so there is no stack growth.
This is especially easy if the parser generator accepts an extended BNF.
With table-driven parsers, symbols are popped off the stack when expanded
and again there is no serious stack growth.
>An then there would be the if-then-else problem. If you have a grammar like
> ...
The dangling-else is an important theoretical limitation of LL(1) parsing.
You just can't do it. Fortunately the kludge to get around them problem
is simple and painless; you instruct the parser generator to choose in
favor of matching the else -- its a lot like the precedence rules in yacc.
--
Jon Mauney, parsing partisan
Mauney Computer Consulting
mau...@csc.ncsu.edu (919) 828-8053
Comments on the practical effects of this observation, especially from the
LL advocates?
(For starters, I believe this is why at least some RD parsers have
embedded operator precedence parsers to do arithmetic.)
Gene
Conjecture 1: an algorithm exists which, given any unambiguous
(but otherwise unrestricted) grammar for an LL(1) language,
produces in finite time an LL(1) grammar for the same language.
Conjecture 2: an algorithm exists which, given any unambiguous
(but otherwise unrestricted) grammar, determines in finite time
whether the grammar describes an LL(1) language.
I suspect that full left-factorization and left-recursion removal will
satisfy the requirements of (1), and that conjecture (2) is false.
I'm sure these are trivial, but not to me :-). The application of these
conjectures to compilers is immediately obvious :-). Thanks for your
help,
Bart Massey
ba...@cs.uoregon.edu
Just a quick review: We're comparing the relative merits of parsing
expressions; in particular, bottom-up using
LR grammar: E -> E + T | E - T | T
T -> T * N | T / N | N
N -> (E) | 1 | 2 | ...
or top-down using
LL grammar: E -> T E'
E'-> + T E' | - T E' | (empty)
T -> N T'
T'-> * N T' | / N T' | (empty)
N -> (E) | 1 | 2 | ...
I think we've established that these are the two "canonical" grammars for
expressions. In <92-0...@comp.compilers>, Michael Scott
(sc...@cs.rochester.edu) does a very nice job of comparing the two
approaches. In particular, he correctly points out that:
> This is LL parsing at its absolute worst; if you can stomach expressions,
> you won't object to anything else.
In particular, he points out that LR grammar parse tree for an
expression can be generated using the LL grammar in a top-down fashion by
using inherited attributes: if E -> T E', the parse tree for T is handed
to the "parser" for E', which consumes the operator and the other operand,
cons's them up, and passes the result along to the next E'.
An ingenious solution, sort of like "parsing with continuations."
Let's agree to call this "the canonical LL hack." :-)
Michael Scott continues:
> It should probably be pointed out that the (less common) RIGHT-associative
> operators (e.g. exponentiation) are more naturally expressed with top-down
> grammars.
To be fair, it should be mentioned that right-recursion _can_ be
used in an LR parser; it only causes the maximum stack depth to become
unbounded, something that's going to happen anyways (if we allow
parentheses). LL(k) parsing techniques can't handle left-recursion
_at_all_.
Finally, I think what we've established is that there is no
"right" way to think about parsing. Some people will find that annoyance
of the "canonical hack" is more than compensated by the clarity of
top-down parsing, and will opt to use an LL parser-generator. Others
(like myself) have been so warped by using yacc that our brains have
turned inside out and LR parsing seems perfectly clear while LL is murky.
(Either that, or we've been looking at the results of too many CPS
transformations! :-)
BTW, what _I_'d really like to see is a version of yacc that
allows me to specify a special "reduce function" for productions, such
that the production is reduced iff the "reduce function" evaluates to
"true".
If this proves to be a real problem and you're willing to put some work
into it, you can revise the "grammar" (most serious LL people use some
form of extended BNF, not strict LL(1) BNF) to fast-track the common
cases. (Bear in mind that the above is really quite a complicated
expression, and is well beyond the typical case, which is something like
`x' or `5'.) For example, revise `expr' in
expr = term { "+" term }
term = atom { "*" atom }
to
expr = atom [ "*" term ] { "+" term }
which will deal with the (common) single-atom exprs without the dreaded
"recursive plunge". Doing this for a messy zillion-level expression
grammar is a lot of work, and the semantics stuff may need adjustments if
you really care about associativity, but it can be done.
Credit Where Due Dept: I first saw this trick in the SP/k PL/I-subset
compiler done by the Computer Systems Research Group here at U of T, in
project-internal documentation that I read in late 1977. I'm not sure it
ever got written up anywhere more accessible. I don't remember exactly
whose name was on it, although I'd guess the idea came from Ric Holt or
possibly Jim Cordy.
--
Henry Spencer @ U of Toronto Zoology, he...@zoo.toronto.edu utzoo!henry
In fact, I once modified a RD parser to use operator precedence ideas
to reduce recursion in the face of parentheses. It was really bad with
cfront output, where there seem to be more paren's than white space.
It turned out that the OP version was much more concise than the pure
RD version, which had 16 versions of "expr : expr op expr" to manage
the precedence. The key thing to note is that all you get from
precedence is the resolution of a shift-reduce conflict. So, code
a RD parser that uses precedence to resolve the conflict, rather than
simple lookahead:
Expr expr(Token op0) {
Expr e1 = primary();
while (shift_prec[nextToken] > reduce_prec[op0]) {
Token op1 = nextToken;
match(op1);
Expr e2 = expr(op1);
e1 = GenOp(op1, e1, e2);
}
return e1;
}
As I recall, there were a few hacks needed. For example, ?: takes
three operands. This was a few years ago, so it is getting fuzzy.
Bob
You should add substitution of leading nonterminals in right parts of rules
to your list, as this can solve problems with overlapping firstsets.
You should note however that blind application of these transformations
might get you in an infinite loop like the next grammar:
S -> A a
S -> B b
A -> c A
A -> /* epsilon */
B -> c B
B -> /* epsilon */
FIRST(S) = {a,b},
FIRST(A) = {c, epsilon}, FOLLOW(A) = {a},
FIRST(B) = {c, epsilon}, FOLLOW(B) = {b},
The S-rules have overlapping firstsets, so let's substitute A and B
S -> c A a
S -> a
S -> c B b
S -> b
A -> c A
A -> /* epsilon */
B -> c B
B -> /* epsilon */
Now factorize the S-rules on the common leading c:
S -> c C
S -> a
S -> b
A -> c A
A -> /* epsilon */
B -> c B
B -> /* epsilon */
C -> A a
C -> B b
So starting from a collision of A and B in the S-rules, we end up
with a collision of A and B in the C-rules.
There exits however a LL(1) grammar:
S -> c S
S -> a
S -> b
We just have to be smarter.
Let's start again:
S -> A a
S -> B b
A -> c A
A -> /* epsilon */
B -> c B
B -> /* epsilon */
Couple a and b to the A-rules en B-rules:
S -> A
S -> B
A -> c A
A -> a
B -> c B
B -> b
Join A and B into one nonterminal AorB:
S -> AorB
A -> c A
A -> a
B -> c B
B -> b
AorB -> c A
AorB -> a
AorB -> c B
AorB -> b
Factorize on the common c:
S -> AorB
A -> c A
A -> a
B -> c B
B -> b
AorB -> c D
AorB -> a
AorB -> b
D -> AorB {add D->A to D->B giving D->AorB}
Now remove the obsolete A and B.
S -> AorB
AorB -> c D
AorB -> a
AorB -> b
D -> AorB
Finally notice S, AorB and D are identical due to
the productions of the form X -> Y, calling them all S.
S -> c S
S -> a
S -> b
Suggesting this grammar can indeed be found automatically
Jan Schramp (j...@si.hhs.nl)
Hi,
I don't want to advocate top-down parsing at all, and especially not LL(1)
parsing, but I think you're exaggerating a bit. Consider the following
(boring) two grammars:
E -> E + T | T
T -> T * F | F
F -> ( E ) | ident
E -> T E'
E'-> + T E' |
T -> F T'
T'-> * F T' |
F -> ( E ) | ident
both grammars can easily be identified as _the_ grammars (LR(1) vs LL(1))
to do the job.
When LL(1) parsing, the deepest stacklevel is reached here:
((x*x)+(y*y))
^______________ current token
The stack contains the following suffix of the current sentential form:
*, F, T', E', ), T', E', ), T', E'
This is just 10 elements deep, not 30. And it didn't take 70 calls or
table lookup actions to get here. I agree with you that the stack contains
more elements than a LR(1)'s stack would contain at this stage. The
latter contains just 3 elements at most here:
((x*x)+(y*y))
^______________ current token
[5 i], [5 (], [0 (]
BTW if you implement a table driven LL(1) parser, don't push the entire
rhs of the current rule on the stack. That is naive. Simply push a
reference of the current production rule and the position therein on the
stack. In this example, this reduces the logical stack depth to 8, but the
actual amount of bytes needed is far less ...
But there's also a positive thing about this stack, it is very easy to
implement inherited attributes on the fly. The attribute stack simply
mimics the behavior of the parse stack. Every token of the sentential form
generated so far, owns a slot on the attribute stack. IMHO, implementing
inherited attributes in a LR(1) parser is a hack. When I come to think of
it: I've implemented a lot of LL(1) parsers and I simply used a stack of
no more than, say, 200 elements. I never noticed any stack overflow. And
who cares about a stack with just 200 elements?
Concluding: I really _hate_ writing an LL(1) grammar, but when the job is
done, sticking in the semantic actions (or `output tokens' if you prefer)
is very easy. A lot easier compared to LR(1) parsers when these inherited
attributes come in. I think it's just a matter of `give a bit, take a
bit'. LL(1) grammars are a mess to work with. LR(1) grammars are a lot
nicer. LL(1) parsers can do some more for you than LR(1) parsers. BTW did
you ever have a look at the table sizes for both types of generated
parsers?
Let me stress this again: I`m not advocating LL(1) grammars and parsers at
all. I simply think that they can be nice tools when appropriate. The same
reasoning applies to LR(1) parsers and grammars too ...
kind regards,
Jos aka j...@and.nl
Explicit pushing and popping of the semantic stack is especially elegant
in Forth, where the language's data stack can be used as the semantic
stack. I have implemented this approach in Gray, a parser generator that
accepts an extended BNF and produces recursive descent parsers (Gray is
freely available, mail me to get it).
With this approach it is easy and natural to handle left-associative
operators. The equivalent of yacc's
Expr ->
Expr K_MINUS Term
{ $$ = MakeNode(MINUS_TAG, $1, $3); }
is
(( Term
(( "-" Term {{ MINUS_TAG MakeNode }} )) **
))
<- Expr ( -- node )
A few words of explanation are probably needed here: You find the defined
nonterminal 'Expr' near the end (Everything in Forth is reversed :-).
Gray's delimiters are double, because the single versions are already used
in Forth. The curly braces delimit actions. The '**' works like the '*'
in lex, meaning zero or more occurences of the preceding item. 'Term'
parses a term and pushes a node. '"-"' just parses a '-'. 'MakeNode' pops
two nodes and a tag and pushes a node. So the overall stack effect of
'Expr' is to push a node; this is documented by the comment '( -- node )'.
Right-associativity is slightly harder; it requires recursion:
(( Base "**" Exponent {{ POWER_TAG MakeNode }} ))
<- Exponent ( -- node )
Some advantages over yacc: It is easy to keep information local to a rule
and to pass information around. When you add syntactic sugar, changes in
the actions (renumbering) are unnecessary.
--
M. Anton Ertl an...@mips.complang.tuwien.ac.at
|I suspect that full left-factorization and left-recursion removal will
|satisfy the requirements of (1), [ ... ]
Ok, I'll give it a try:
Consider a Post Correspondence Problem {v , v , ... v } {w , w , ... w }
1 2 n 1 2 n
over an alphabet A.
Let the terminal alphabet T be A+{x ,x , ... x }
1 2 n
Let the non-terminal alphabet N be { 1, 2, ... n, S, U }
Consider the following production rules of a context sensitive grammar:
_ _
S -> i v S w for i in [1 .. n] and w represents the word w reversed
i i
S -> U
a U a -> U for t in A
a i -> i a for a in A and i in [1 .. n]
i -> x for i in [1 .. n]
i
This grammar generates a language L = { x x ... x } iff the embedded PCP
i1 i2 im
has a solution. On the other hand, if the PCP doesn't have a solution,
this grammar generates the empty language. The sentences generated by
this grammar represent the indexes needed to solve the PCP. This language
is a regular language.
Proof: consider a string of indexes x x ... x such that they form
i1 i2 im
a solution to the PCP and there is no proper prefix of this string that
is a solution to this PCP too. Call such a string a `simple' solution.
There are a finite number of `simple' solutions to a PCP. No simple
solution has another simple solution as a proper prefix (by definition.)
Name these solutions s , s , ... s
1 2 m
There exists a type three grammar with the following production rules:
S -> s S | s for i in [1 .. n]
i i
that generates all the solutions of the PCP. The language generated by
this regular grammar is the same langauge as the one generated by our
first grammar. This is clearly an LL(1) language. Because of the simple
fact that no simple solution contains another simple solution as a proper
prefix, the language and the grammar are both non-ambiguous.
Now, suppose that by sheer madness, I can find a solution to any PCP, (I'm
not a Turing Machine, so don't tell me I can't do it.) Your algorithm,
proposed in the first conjecture, on the other hand, _is_ equivalent to a
Turing Machine. Suppose, I give you the first grammar described above, and
I solemnly swear that the embedded PCP has a solution, then still, your
algorithm won't find the equivalent LL(1) grammar for every input grammar
I feed it ...
Concluding: I think the first conjecture is false too.
kind regards,
Jos aka j...@and.nl
ps. I realize that my reasoning in the last paragraph is close to the edge.
I'm still not completely sure myself ... correct me if I'm wrong.
It works best in recursive descent parsers because it is so easy to pass
additional parameters:
1) Have exactly one function for parsing operator expressions
2) Pass it the precedence at which to stop accumulating operators
Has this technique been known for a long time and/or reinvented
repeatedly?
Here is the (rough) algorithm for this "parse_operator_expression"
function implemented in Ada:
function Parse_Operator_Expression(
Left_Precedence : Operator_Precedence := Operator_Precedence'FIRST)
return Expression_Tree is
-- Parse operands and operators.
-- Quit parsing if next operator has precedence <= Left_Precedence
Op : Token_Type;
Result_So_Far : Expression_Tree;
-- Next_Token is a global, initialized by Advance_To_Next_Token
begin
if Is_Unary_Operator(Next_Token) then
Op := Next_Token;
Advance_To_Next_Token; -- Advance past operator
-- Leftmost operand is unary-op tree
Result_So_Far := Make_Unary_Op_Tree(
Operator => Op,
Right_Operand => Parse_Operator_Expression(
Left_Precedence => Precedence(Op)));
-- Recurse with precedence of unary operator to gets
-- its operand
else
-- Leftmost operand is atom
Result_So_Far := Parse_Atom; -- Note: No "plunge" needed here
end if;
-- We have leftmost operand, now accumulate more operators
-- and operands
while Is_Binary_Operator(Next_Token) and then
Precedence(Next_Token) > Left_Precedence loop
Op := Next_Token;
Advance_To_Next_Token; -- Advance past operator
-- Assuming Left Associativity for ops at same precedence
Result_So_Far := Make_Binary_Op_Tree(
Left_Operand => Result_So_Far,
Operator => Op,
Right_Operand => Parse_Operator_Expression(
Left_Precedence => Precedence(Op)));
-- Recurse with precedence of binary operator
-- to get its right operand
end loop;
-- Return now, since we can't accumulate any more operators
return Result_So_Far;
end Parse_Operator_Expression;
To parse a whole expression, call "Parse_Operator_Expression" without
parameters, so it defaults to the lowest operator precedence (actually, 1
lower than the precedence of the lowest precedence operator).
-S. Tucker Taft s...@inmet.com
Intermetrics, Inc.
Cambridge, MA 02138
There are numerous ways of doing this. I once wrote a one routine
expression parser. I don't remember exact details (it was a while back)
but from memory it worked something like this .. I had a table of operator
precedences. The routine built a parse tree by re-ordering the parse tree
if the just seen operator was of higher precedence than (I think) the root
of the existing parse tree.
If I remember rightly the re-ordering was very efficient as it always
involved doing the following steps :-
b) new_node =
latest_operator
/ \
/ \
/ \
old_root.right_hand_side rhs of latest_operator
b) new_root =
old_root.operator
/ \
old_root.left_hand_side new_node
So the whole rotuine worked like
(initial parser tree = first operand)
a) read an operator
b) read an operand
c) re-order tree
d) goto a)
So for example say we have the expression
a + b * c
The routine builds the tree
+
/ \
a b
Then the rotuine sees '* c' and re-orders the tree as follows,
new_node = *
/ \
b c
new_root = +
/ \
a new_node
which is the required tree.
As I remember it there were a few complications to the scheme (eg:
unary operators and ternary operators) but not many. New operators
could be added by simply adding to the operator precedence table.
graham
--
Graham Matthews
Pure Math, Uni.Sydney, Oz
gra...@maths.su.oz.au
The Whitesmiths C compilers use something very similar. The original
routine was originally coded in Asm (I think) back in the 70's. Sort of as
"how compact I can write this routine" kind of contest. The C version of
the tree rewriting routine is only about 6 lines or so, a conditional
recursive call embedded in a for loop. Many times I have wished for
comments though ;-).
--
- Richard F. Man (m...@labrea.zko.dec.com)