
Recursive Descent vs. LALR


John Resler

Jun 20, 2003, 12:17:22 AM

I got halfway through a compiler theory course a few years back and
finances required dropping out of school. Since then I've been messing
around with Parsing tools and the like and have been using JavaCC. It is
a recursive descent parser and I understand a bit of the way a Recursive
Descent Parser works versus bottom up parsing. I seem to recall a
theorem that said any LALR(K) grammar could be rewritten to an LALR(1)
grammar and another theorem that said Recursive Descent versus LALR(1)
were equally capable. Can anyone shed some light on this? I am also
curious if anyone knows anything about the JJTree tool? I subscribe to
the JavaCC mailing list but even there, few individuals are aware of
JJTree's capabilities and thus its uses. I'd appreciate any information
anybody could give me. Thanks in advance.

-John

be...@sonic.net

Jun 25, 2003, 1:21:56 AM

Any LALR(K) grammar can be rewritten as an LALR(1) grammar. The
number of nonterminals in the new grammar will be at most K! times the
number of nonterminals in the old grammar (usually considerably
less). Therefore, all of these languages are considered to be "level
3" languages (linear languages) in the Chomsky hierarchy. The process
of transforming an LALR(K) grammar into an LALR(1) grammar is called
"factoring". If you own the Dragon Book, it will show you several
examples of factoring.

A recursive descent parser for a level 3 language can be implemented
without backtracking. However, there's a special case somewhere
between Chomsky's level 2 and level 3 (well, actually I guess it's a
special class of level 2 languages) that recursive descent without
backtracking can read, but which cannot be expressed as an LALR(K)
language for any constant K.

If you want to read level 2 languages (context-free languages)
generally, you will need to go to a formalism that has at least the power
of recursive descent with backtracking. Note that you can write a
recursive descent parser with backtracking for level 3 languages, and
it's easier to write than a "good" recursive-descent parser; but in
efficiency terms, that's always a mistake. People do it because they
don't want to bother calculating the "first" and "follow" sets from
their grammar, or don't know how. You don't actually *need*
backtracking unless you're reading a language for which first and
follow sets aren't defined. And that would be a level 2 language.
First and follow sets aren't defined for level 1 or level 0 either,
but recursive descent won't parse those anyway regardless of
backtracking, so recursive descent with backtracking is really only
warranted for level-2 languages.
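
To make the first/follow point concrete, here is a rough sketch in
Python of computing FIRST sets for a toy grammar. The grammar and
names are made up for illustration, and nullable prefixes are only
handled for the first symbol of each production:

GRAMMAR = {                     # toy grammar, purely illustrative
    "S": [["a", "A"], ["b"]],   # S -> a A | b
    "A": [["c"], []],           # A -> c | (empty)
}

def first_sets(grammar):
    """Simplified FIRST computation: only the first symbol of each
    production is inspected, which suffices for this toy grammar."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                if not prod:                    # empty production
                    add = {""}                  # "" stands for epsilon
                elif prod[0] in grammar:        # leading nonterminal
                    add = first[prod[0]]
                else:                           # leading terminal
                    add = {prod[0]}
                if not add <= first[nt]:
                    first[nt] |= add
                    changed = True
    return first

print(first_sets(GRAMMAR))
# e.g. {'S': {'a', 'b'}, 'A': {'c', ''}}  (set order may vary)

With the FIRST (and, for empty rules, FOLLOW) sets in hand, a
recursive descent parser never has to back up: at each alternative it
just checks whether the next input token is in the right set.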

Level-1 languages are context-sensitive; level-1 parsers are generally
considered to be beyond the scope of compiler construction, but have
been implemented for natural-language efforts. The usual formalism is
a "Recursive Transition Network" (Google is your friend).

There are also level-0 languages, but there aren't really any specific
rules about what kind of parser is needed for them other than that it
can be implemented as a Turing machine. Which isn't saying much. A
broad class of level-0 languages (again sometimes used by us
Natural-Language geeks) can be parsed by an "Augmented Transition
Network" parser. (Google is your friend again if you're interested.)

HTH,

Bear

Chris F Clark

Jul 2, 2003, 12:45:05 AM

Note, this question has been asked before, and I've answered it before
(as have others). In some ways, I'm a little surprised it
isn't in the FAQ, but I guess it isn't asked that frequently, only
about once per year or two. Here is more than you wanted to know.

There is a hierarchy of languages, and a separate hierarchy of
grammars. The difference (which is important) is that when we say a
(set of) language(s) is in a certain class, we mean there exists a
grammar that parses that language and is in the appropriate grammar
class. However, we may not know what the grammar is, and more
importantly there may be no mechanical way of finding that grammar.

Thus, just because a certain language is LR, does not mean that a
grammar one might have for that language is LR. In fact, "you" might
never be able to find an LR grammar for the language.

In addition to the classes listed in this hierarchy, there are many
other grammar/language classes (in fact, infinitely many).

1) Regular (type 3) languages

These are things one can express with just a regular expression.

Also, text one can parse with an FSA (finite state automaton) or FSM
(finite state machine), two names for the same thing, with no auxiliary
storage. It does not matter whether the FSA is deterministic or
non-deterministic; both can parse exactly the same languages.

Equivalently, text one can parse with a "linear" grammar. A linear
grammar has only one type of recursion, either left or right, but not
both in the same grammar, and no "central" recursion--the kind of
recursion one uses to express parentheses.

Note, given any one of the above representations of a type 3
language, it is possible to apply a mechanical algorithm that
produces any of the other forms.

Most "lexer" generators use regular expressions and handle type 3
languages. The "regeps" of editors are type 3, unless they support
back-references (which most do), which require a more powerful
parser.
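
As a concrete sketch: the regular expression (a|b)*abb and the kind
of table-driven DFA a lexer generator might build for it. The table
here is written out by hand, purely for illustration:

# Hand-written transition table of a DFA for (a|b)*abb.
DFA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
ACCEPTING = {3}

def matches(s):
    state = 0
    for ch in s:
        state = DFA[(state, ch)]   # no stack, no backtracking
    return state in ACCEPTING

print(matches("aababb"))   # True
print(matches("abab"))     # False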

2) Dyck grammars

These are grammars that handle parentheses (and only parentheses).
They have only central recursion.

There is little overlap between Dyck grammars and regular grammars.
What one can express in one, one cannot express in the other. The
exception is the empty language (i.e. parsing only an empty
file, a file with no tokens).
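
For example, the Dyck grammar S: "(" S ")" S | ; is pure central
recursion. A minimal recursive matcher for it might look like this
(illustrative Python, not any particular tool's output):

def parse_dyck(s, i=0):
    """Return the position reached after matching a (possibly empty)
    balanced prefix of s starting at position i."""
    while i < len(s) and s[i] == "(":
        j = parse_dyck(s, i + 1)        # the S inside the parentheses
        if j >= len(s) or s[j] != ")":
            return i                    # no matching ")": stop before "("
        i = j + 1                       # the S after the closing ")"
    return i

def balanced(s):
    return parse_dyck(s) == len(s)

print(balanced("(()())()"))   # True
print(balanced("(()"))        # False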

2) LL(k) (top-down or predictive) grammars

The k refers to how many tokens of lookahead one needs to
examine at the start of a rule to determine which rule (or
alternative of a rule) to apply.

If at the beginning of any non-terminal in the grammar, one can
always look at just one token and decide whether the rule should be
applied or not (to a syntactically legal input), the grammar is
LL(1). This is the most common case.

All regular languages are LL(1), in particular all right linear
grammars (grammars that use only right recursion) are LL(1).

LL(1) grammars also allow some forms of central recursion, such as
parenthesization. All Dyck grammars are also LL(1).

LL grammars do not allow left recursion.

Also, some forms of central recursion mixed with right recursion
are not LL(1). In particular, the common if-then/if-then-else rule
of most commonly used programming languages is not LL(k) for any k.
This is not as big a problem as it would seem, as there is a simple
easy way (hack) to extend LL parsers to handle the if-then/if-then-
else conflict by having it simply choose the else version if and
only if the else is present. This binds the else to the closest
if, which is the normally desired semantics. Most LL parser
generators will automatically do this for you with at most a
warning.

Most recursive descent parsers are "almost" LL(k), where k is the
number of tokens they need to look at to figure out what to do in
the worst case. The reason they are almost LL(k) is that they
generally implicitly do the if-then-else hack without being aware
that they are.
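
Here is a sketch of the hack in a hand-written recursive descent
parser. The token handling is deliberately simplistic and the names
are made up; it only shows how the else gets grabbed greedily:

def parse_stmt(toks, i):
    """Parse one statement starting at toks[i]; return (tree, next i)."""
    if toks[i] == "if":
        # if "(" cond ")" stmt [ "else" stmt ]
        cond = toks[i + 2]                        # assumes "(" cond ")"
        then_part, i = parse_stmt(toks, i + 4)
        else_part = None
        if i < len(toks) and toks[i] == "else":   # the hack: take the else if present
            else_part, i = parse_stmt(toks, i + 1)
        return ("if", cond, then_part, else_part), i
    return ("stmt", toks[i]), i + 1               # any other one-token statement

toks = "if ( c1 ) if ( c2 ) s1 else s2".split()
print(parse_stmt(toks, 0)[0])
# ('if', 'c1', ('if', 'c2', ('stmt', 's1'), ('stmt', 's2')), None)
# i.e. the else binds to the nearest (inner) if.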

Strictly speaking, for each k, there exists a grammar which is
LL(k+1) but not LL(k).

There are mechanical manipulations one can apply, removing left
recursion, and factoring to convert a non-LL grammar into an LL
one. However, I believe there are LL languages that have CFGs
(described later) which cannot be converted mechanically into an
equivalent LL grammar. (However, I could be wrong on this, as I
don't keep that rule in my head securely.) Equally importantly,
the mechanical transformations to make a grammar LL can sometimes
destroy the "character" of the grammar, making it impossible to
refer to the way the tokens were grouped into non-terminals in the
original grammar, for semantic purposes.

Also, LL grammars need expression precedence to be expressed
explicitly in the grammar, rather than in a separate formalism
(i.e. not in a precedence table nor in precedence declarations).
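
For instance (a made-up but typical illustration), the left-recursive
expression grammar

   E: E "+" T | T;   T: T "*" F | F;   F: num;

is not LL, but after removing the left recursion (E: T ("+" T)*, and
likewise for T) it is LL(1), and the precedence of "*" over "+" is
now hard-wired into which procedure calls which:

def parse_expr(toks, i):          # E -> T ("+" T)*
    val, i = parse_term(toks, i)
    while i < len(toks) and toks[i] == "+":
        rhs, i = parse_term(toks, i + 1)
        val = ("+", val, rhs)
    return val, i

def parse_term(toks, i):          # T -> F ("*" F)*
    val, i = parse_factor(toks, i)
    while i < len(toks) and toks[i] == "*":
        rhs, i = parse_factor(toks, i + 1)
        val = ("*", val, rhs)
    return val, i

def parse_factor(toks, i):        # F -> num
    return toks[i], i + 1

print(parse_expr("1 + 2 * 3".split(), 0)[0])
# ('+', '1', ('*', '2', '3'))  -- "*" grouped below "+", as intended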

However, all of those "negatives" aside, almost every programming
language is either LL(k) + if-then-else hack for some very small k (~
5 or less with most being LL(1)) or is not parsable with any type 2
grammar (defined later). The only difficulty is that if one has
an LR grammar for the language, it may not be easy to get an
equivalent LL grammar.

Moreover, certain extensions to LL grammars: backtracking,
predicates, and adaptive grammars allow extended LL grammars to
parse type 1 and even type 0 languages (again defined later).

3) LR(k) (bottom up or shift-reduce) grammars

In LR(k) grammars, the k refers to how many tokens one needs at the
end of the rule to decide which rule to apply.

If at the end of a rule, one can always determine with one token of
lookahead which rule to apply, the grammar is LR(1). If at the end
of the rule, one doesn't need any tokens of lookahead to decide
which rule to apply (i.e. the ambiguities have all been sorted out
before coming to the end of the rule) the grammar is LR(0).

LR(0) is important because two important sub-classes of LR grammars
arise from it, SLR(k) and LALR(k). The key feature of LR(0)
grammars is that they throw away all left (preceding) context in
the lookahead calculations. The throwing away of the left context
occurs because the LR(0) machine "merges" states without regard to
their lookahead information. Thus, if in one context the lookahead
for one rule is three tokens: a, b, or c, and in another the
lookahead for the same rule is d or e, the LR(0) machine merges
those two states, confusing the lookaheads, so that at the end of
the rule any of a, b, c, d, or e should result in a reduction (note
if the language is LR(0), meaning we don't need to look ahead at all, it
doesn't matter, since only the correct token will appear anyway in
a syntactically legal program). Most "LR" parser generation
algorithms first build the LR(0) machine and then resolve the
conflicts that occur in the reduce states.

A grammar is SLR(k) if at the end of all uses of a rule, the same k
tokens always determine whether to apply the rule or not. A
grammar is LALR(k) if at the end of each use of a rule, there are k
tokens which decide whether to apply the rule or not and those k
tokens do not conflict with the lookahead tokens for some other
rule that is applied in the same LR(0) reduce state. The
difference between the two grammar classes is subtle, and all
SLR(k) grammars are LALR(k) (and all LALR(k) grammars are LR(k)).
Essentially, LALR(k) gives some context sensitivity to the
lookahead tokens and different sets of conflicting rules can have
different lookahead tokens to distinguish them.
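
The textbook example of the difference is the grammar

   S: L "=" R | R;   L: "*" R | id;   R: L;

In the LR(0) state containing the items "S: L . = R" and "R: L .",
SLR(1) consults FOLLOW(R), which contains "=", and so reports a
shift/reduce conflict. LALR(1) computes the lookahead for that
particular use of "R: L" and finds that "=" cannot actually follow
it there, so there is no conflict: the grammar is LALR(1) but not
SLR(1).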

That brings us back to "full" or canonical LR(k), where k is >= 1.
LR(k) parsing preserves the needed left context to disambiguate
certain situations where LALR(k) finds the rules to conflict. The
canonical way of doing this is to not merge states with different
lookaheads in the first place. That causes an explosion in the
table size as many states are kept distinct simply because they
have different lookaheads, when in reality the lookahead for those
states will never be consulted. A more modern method for solving
the problem involves splitting the states only when a conflict is
detected. Then, if the grammar is LALR, no splitting will occur
and one has the same size machine as the LR(0), but as conflicts
arise that require left context to resolve, the tables slowly grow.

Back now, to the inclusion rules.

Every LL(k) grammar is an LR(k) grammar, but not necessarily an
SLR(k) or LALR(k) grammar. The troublesome cases occur with empty
rules (null productions). An LL(k) grammar uses left context to
decide which lookahead to apply to each use of an empty rule.
However, the LR(0) machine throws that context away and thus the
resulting SLR or LALR machine may have conflicts due to ignoring the
left context. If there are no rules shorter than k-1, an LL(k)
grammar is always an SLR(k) or LALR(k) grammar.

And to the languages, every LR(k) language is actually also an
LR(1) language, meaning that there exists an LR(1) grammar for the
language. However, there is no mechanical method to find that
LR(1) grammar.

The LR(k) languages also correspond to what you can parse with a
deterministic FSA and an auxiliary stack (also known as a lifo) with
k tokens of lookahead, sometimes called a (D)PDA ((deterministic)
push down automaton). These are also called the deterministic
context free languages. Since every LR(k) grammar is also LR(1),
one can show that one needs only one token of lookahead in the PDA.

By the way, there are many other grammar classes in the area such
as the various precedence grammars and the bounded context
grammars. Many of these have their own parsing methodologies.
Moreover, the grammars form a fairly complex hierarchy with some
nesting inside others and other sets not nesting. However, all of
these languages are a subset of the LR(1) languages, meaning that
there is always an LR(1) grammar for anything that any one of these
grammars can parse, but not necessarily (and often not) a method
of finding that LR(1) grammar.

4) type 2 grammars (cfgs, NPDAs)

There are also non-deterministic PDA's. These can parse some
languages that deterministic ones can't, notably palindromes.

Backtracking is equivalent to non-determinism.
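
As a sketch of what the non-determinism buys: a recognizer for
even-length palindromes over {a, b} (grammar S: "a" S "a" | "b" S "b"
| ;) has to try every possible midpoint, which is exactly what a
deterministic PDA cannot do. Illustrative Python, with the
backtracking expressed as "collect all parses":

def S(s, i):
    """Return every position j such that s[i:j] can be derived from S."""
    ends = {i}                                  # S -> (empty)
    for c in "ab":                              # S -> c S c
        if i < len(s) and s[i] == c:
            for mid in S(s, i + 1):
                if mid < len(s) and s[mid] == c:
                    ends.add(mid + 1)
    return ends

def accepts(s):
    return len(s) in S(s, 0)

print(accepts("abba"))   # True
print(accepts("ab"))     # False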

Any grammar with only one [non-]terminal on the left-hand-side
(LHS) of a rule (e.g. "a: b c d;" but not "a b: c d e;") is called
a context free grammar (cfg) and the resulting language a context
free language (cfl). Context free, because the additional
[non-]terminals on the LHS of a rule are considered the context
where the (non-context) non-terminal applies. These are the same
languages that one can parse with an NPDA.

The places where a DPDA gets confused and an NPDA doesn't are
called ambiguities. (The places where a parser generator gets
confused are called conflicts. All ambiguities cause conflicts.
Not all conflicts are true ambiguities.)

Of course, any language which can be parsed deterministically can
also be parsed non-deterministically, by simply having the
non-deterministic parser choose the deterministic "choices".

5) type 1 grammars (LBAs, context sensitive grammars)

Any grammar where there can be more than one token on the LHS of a
rule, but the RHS of a rule cannot be shorter than the LHS of the
rule, is called a "context sensitive grammar" (csg) (such as "a b:
b a;" but not "a b: c;"). All cfgs are trivially csgs.

These are the grammars that can be parsed by a Turing Machine (TM)
with only a fixed amount of tape that is a linear multiple of the
input size, also known as a linear bounded automaton.

It is worth noting that the intersection of two cfgs need not be a
cfg, but the intersection of two csgs is a csg. Predicates
(mentioned above) allow one to write grammars that
are the intersection of context free grammars, which allows
predicated parsers (either LL or LR) to parse context-sensitive
languages.
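
The standard example: { a^n b^n c^m } and { a^m b^n c^n } are both
context free, e.g.

   S: X C;   X: "a" X "b" | ;   C: "c" C | ;
   T: A Y;   A: "a" A | ;       Y: "b" Y "c" | ;

but their intersection { a^n b^n c^n } is the classic language that
is context sensitive and not context free.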

6) type 0 languages (Turing Machines)

At the other end of the spectrum are Turing Machines (TMs), which
are equivalent to grammars with arbitrary numbers of symbols on
both the left and right hand sides. They are also equivalent to
machines with a queue rather than a stack or machines with two
stacks, or numerous other models of computation.

Everything previously discussed can be parsed by a TM (or one of
the equivalent forms) and non-determinism does not add anything to
a TM. Moreover, most of the TM equivalents have a mechanical
method for converting a TM into (or out of) their representation,
which is how TM equivalence is generally proven.

BTW, adaptive grammars are equivalent in power to TMs.

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : com...@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
19 Bronte Way #33M voice : (508) 435-5016
Marlboro, MA 01752 USA fax : (508) 251-2347 (24 hours)

Kamal R. Prasad

Jul 3, 2003, 11:31:48 PM

John Resler <JohnM...@netscape.net> wrote in message news:<03-0...@comp.compilers>...

> I got halfway through a compiler theory course a few years back and
> finances required dropping out of school. Since then I've been messing
> around with Parsing tools and the like and have been using JavaCC. It is
> a recursive descent parser and I understand a bit of the way a Recursive
> Descent Parser works versus bottom up parsing. I seem to recall a
> theorem that said any LALR(K) grammar could be rewritten to an LALR(1)
> grammar and another theorem that said Recursive Descent versus LALR(1)
> were equally capable.

Recursive descent parsing cannot resolve some conflicts that LALR(1)
can. In general, a rightmost derivation has been found to be more
useful in resolving ambiguities than a leftmost one. (Someone can
correct me if I'm wrong).

regards
-kamal

Boeing

Jul 4, 2003, 12:10:54 AM

Thanks everybody for the responses,
For some reason the news server at home was not working and I just
got the one at work up so I have not been able to obtain new
postings. I appreciate the feedback. Nice to know I didn't forget
everything but I'm certain I don't completely follow this. Thanks
though.

-John


"Chris F Clark" <c...@shell01.TheWorld.com> wrote in message


> Note, this question has been asked before, and I've answered it before
> (as have others). In some ways, I'm a little surprised it
> isn't in the FAQ, but I guess it isn't asked that frequently, only

> about once per year or two. Here is more than you wanted to know. ...

be...@sonic.net

Jul 4, 2003, 12:11:57 AM

Chris F Clark wrote:
>
> Note, this question has been asked before, and I've answered it before
> (as have others). In some ways, I'm a little surprised it
> isn't in the FAQ, but I guess it isn't asked that frequently, only
> about once per year or two. Here is more than you wanted to know.

I thank you for responding, especially in such detail, but I must
admit that several points of your response elicit confusion or refer
to formalisms or parser models that aren't immediately clear.

> 2) Dyck grammars
>
> These are grammars that handle parentheses (and only parentheses).
> They have only central recursion.

I had not heard of this classification before. They have some
interesting properties, such as being able to parse left-to-right as
easily as right-to-left. But they don't seem very expressive.


> 2) LL(k) (top-down or predictive) grammars
>
> The k refers to how many tokens of lookahead one needs to
> examine at the start of a rule to determine which rule (or
> alternative of a rule) to apply.
>
> If at the beginning of any non-terminal in the grammar, one can
> always look at just one token and decide whether the rule should be
> applied or not (to a syntactically legal input), the grammar is
> LL(1). This is the most common case.

Nonterminals are just symbols. They don't have beginnings, in a
way that's useful to think about. What I think you mean is that,
given the *first* nonterminal in the expectation sequence of the
parser, and the *first* unshifted token of input, you should be
able to uniquely determine which reduction rule to use.

In terms of the grammar, this means that the 'before' sequence
of every reduction rule consists of a single nonterminal and the
'after' sequence of every rule starts with a single terminal token,
and that no two different grammar rules can have both identical
'before' sequences and 'after' sequences that start with the same
terminal token.

In parsing, when the reduction rule is applied to the expectation
sequence, it replaces the nonterminal token at the beginning of the
expectation sequence with the 'after' sequence of the rule, so now
the expectation sequence starts with a terminal token that matches
the first unshifted token of input. So a 'Left shift' happens,
simultaneously removing that terminal from the expectation sequence
and from the unshifted input.

(note that there are also right-linear grammars, where a 'Right Shift'
is used exposing new nonterminals to match at the end of the
expectation sequence rather than the beginning, but I have not seen
them used in programming languages.)


> Also, some forms of central recursion mixed with right recursion
> are not LL(1). In particular, the common if-then/if-then-else rule
> of most commonly used programming languages is not LL(k) for any k.

> This is not as big a problem as it would seem, as there is a simple
> easy way (hack) to extend LL parsers to handle the if-then/if-then-
> else conflict by having it simply choose the else version if and
> only if the else is present. This binds the else to the closest
> if, which is the normally desired semantics. Most LL parser
> generators will automatically do this for you with at most a
> warning.

What happens is something like this:

('Statement) --> ("if" "(" 'Boolean-Expr ")" 'Statement 'Optional-Else )
('Optional-Else) --> ( "else" 'Statement )
('Optional-Else) --> ()

The third reduction rule shown here, with an empty 'after' sequence,
violates the requirement to have a terminal at the beginning of
each 'after' sequence. In theory, we can look at "else" waiting for
us in input and still use either rule to reduce the 'Optional-Else
nonterminal, because we can always match an empty sequence at the
beginning of the unshifted input and use the third rule, even if
there's an "else" sitting there waiting for us. Moreover, if there
is another 'Optional-else following the current one, it can eat the
"else" token, attaching it to a different "if", and the parse will
succeed (although the semantics will be ambiguous).

In practice, it's no trouble to parse because you can just pick the
second rule if you see an "else" in the input and the third rule
if you see any other thing in the input.

> 3) LR(k) (bottom up or shift-reduce) grammars
>
> In LR(k) grammars, the k refers to how many tokens one needs at the
> end of the rule to decide which rule to apply.

Okay, here you lost me. "end of a rule?" A grammar rule is a
'before' sequence and an 'after' sequence. For all type-2
languages (including LR languages) the 'before' sequence must
consist of a single nonterminal constrained to match at the
start of the expectation sequence, and the 'after' sequence can
be an arbitrary sequence of terminals and nonterminals. A
parsing rule for an LR(k) grammar, in turn, is an inputmatch
sequence of up to K terminal tokens to match in input, a
nonterminal to match at the beginning of the expectation
sequence, and a pointer to the grammar rule to apply as a
reduction.

Is there also a shifted-input index to say where the input
matching needs to start relative to the first unshifted input?
You talk about left context or shifted input below, but
I thought the input matching index was assumed to be zero
(never looking at shifted input) for LR(k) parser rules?

I think I recognize your "tokens at the end of a rule" as
being the input sequence of a parsing rule, but don't
understand what the word "end" means in this context.

I recognize the hierarchy of languages parsable by three different
LR parsing algorithms: simple LR (SLR), lookahead LR (LALR) and
canonical LR (LR languages). I recognize LR languages as a subset
of context-free or type-2 languages. But I don't know exactly
what you're saying about them here.

> LR(0) is important because two important sub-classes of LR grammars
> arise from it, SLR(k) and LALR(k). The key feature of LR(0)
> grammars is that they throw away all left (preceding) context in
> the lookahead calculations. The throwing away of the left context
> occurs because the LR(0) machine "merges" states without regard to
> their lookahead information. Thus, if in one context the lookahead
> for one rule is three tokens: a, b, or c, and in another the
> lookahead for the same rule is d or e, the LR(0) machine merges
> those two states, confusing the lookaheads, so that at the end of
> the rule any of a, b, c, d, or e should result in a reduction (note
> if the language is LR(0), meaning we don't need to look ahead at all, it
> doesn't matter, since only the correct token will appear anyway in
> a syntactically legal program). Most "LR" parser generation
> algorithms first build the LR(0) machine and then resolve the
> conflicts that occur in the reduce states.

I don't know what "states" you're talking about merging. And
although I *think* you're talking about parsing rules, I can't tell
for sure whether you mean grammar rules or parsing rules. Are you
talking about the "states" of the DFSA that controls the SLR parser?
In the Dragon Book (section 4.7, page 226-227 in my edition) it says
that an SLR parser can use a single state machine as its control
program. This implies to me that the inputmatch indices of all its
parser rules must be equal.

The DFSA is made from the parser rules' inputmatch sequences, and must
recognize all legal subsequences of k tokens of input, identifying
unique accepting states for each. During an SLR parse, the lookahead
tokens are fed into this DFSA. The accepting state returned from it
is then used as an index (along with the first nonterminal of the
current sequence) to look up which grammar rule to apply as a
reduction.

Given that kind of structure, there are "states" that are important,
but I'm still not sure what it means to "merge states".

Your "left context" appears to be equivalent to what I've learned
as "shifted input"; normally we are trying to reach a production
that will cause the beginning of the expectation sequence to match
input starting at the first unshifted input token. The first token
of the expectation sequence, of course, is always a nonterminal. If it
were a terminal token matching the first unshifted input, it would
immediately cause a "shift" (which will result in an "accept" if it
shifts the last token of input and simultaneously empties the
expectation sequence). If it were a terminal token not matching the
first unshifted input, it would immediately result in an "error". But
if the inputmatch index is -1, a parsing rule can start input matching
against the most recently shifted input token instead of the first
unshifted input token, giving your "left context". And if K is 2 or
greater, the inputmatch index can be -2, to start matching two tokens
before the first unmatched token. And so on.

> A grammar is SLR(k) if at the end of all uses of a rule, the same k
> tokens always determine whether to apply the rule or not. A
> grammar is LALR(k) if at the end of each use of a rule, there are k
> tokens which decide whether to apply the rule or not and those k
> tokens do not conflict with the lookahead tokens for some other
> rule that is applied in the same LR(0) reduce state.

I do not know what you mean by "the end of all uses of a rule."
I believe that you must be talking about parsing rules rather
than grammar rules, since you mention the input matching sequence
("lookahead tokens"). Does "the same k tokens" mean the sequence
of k tokens starting at a particular point in the input? Or does
it mean identical sequences of k tokens in the input matching
sequences of two different parsing rules? I think it must be the
first - to match what I think an SLR(k) language means, all parsing
rules must have the same inputmatch index (and, although I'm not
sure whether it's by convention or definition, its value is zero.)

I'm still not sure I grasp the difference between SLR(k) and LALR(K)
however. I know that for full LR parsing, you need a different DFSA
for each value of inputmatch index to make sure you find the correct
indexes to look up the applicable grammar rule for the next reduction,
and for a SLR(k) language you only need one. But I'm not sure how to
characterize the inputmatching needs for an LALR(K) language.


> The
> difference between the two grammar classes is subtle, and all
> SLR(k) grammars are LALR(k) (and all LALR(k) grammars are LR(k)).
> Essentially, LALR(k) gives some context sensitivity to the
> lookahead tokens and different sets of conflicting rules can have
> different lookahead tokens to distinguish them.

Obviously, if you have two different rules in the parser that point to
different grammar rules with the exact same 'before' sequence, and also
have the exact same inputmatch sequence and inputmatch index, they
conflict. Just as obviously, some difference in those values is
preserved by all LALR parsers (to the extent that ambiguous grammars are
avoided) so conflict is avoided. SLR(k) parsers have parsing rules
differentiated by the 'before' sequence (one nonterminal token, since
it's a type 2 language) and the inputmatch sequence (up to k
terminals).
LR(k) parsers have parsing rules which may also be differentiated by
inputmatch index (ie, they may use "left context" to determine which
grammar rule to apply). I'm not sure where the middle ground is for
LALR to occupy.


> That brings us back to "full" or canonical LR(k), where k is >= 1.
> LR(k) parsing preserves the needed left context to disambiguate
> certain situations where LALR(k) finds the rules to conflict. The
> canonical way of doing this is to not merge states with different
> lookaheads in the first place. That causes an explosion in the
> table size as many states are kept distinct simply because they
> have different lookaheads, when in reality the lookahead for those
> states will never be consulted. A more modern method for solving
> the problem involves splitting the states only when a conflict is
> detected. Then, if the grammar is LALR, no splitting will occur
> and one has the same size machine as the LR(0), but as conflicts
> arise that require left context to resolve, the tables slowly grow.

"Not merge states with different lookaheads" -- after scrambling a
little bit with that one, I'm assuming you mean that in ordinary
cases you use the same DFSA output state to report the same sequence
of upcoming input tokens. I was having trouble understanding why
you'd ever _NOT_ do this, and then I realized that you must be
treating shift as optional, so that parser rules don't necessarily
need an inputmatch index at all. That would mean that the "correct"
reduction rule might not match if the input were shifted before it
expected, or that the "correct" reduction rule might not match until
something (what? Another parser rule?) explicitly caused a shift.

In my universe, rules that need left context use an inputmatch index
other than zero to explicitly match against already-shifted input.
With different DFA's for different inputmatch indices, there is
absolutely never a possibility of needing to avoid merging states
within a DFA. Usually you can tell from the first token of the
expectation sequence which DFA you need to use, and you can always
tell the optimal order to try them in - but once in a great while
you do need to check them all, which for an LR(K) language, in the
worst case, can cause it to take up to K times as long to find the
correct reduction rule.

Are you assuming shift *isn't* automatic whenever there's a matching
terminal token at the start of the expectation sequence, and thus
you may have an expectation sequence that starts with _unshifted_
_terminal_ _tokens_ (this violates a major invariant in my parser)
for the parsing rules to match their input sequences against?

Doesn't that cause an explosion in the number of parser rules? And
doesn't it mean you need special rules just to decide when it's okay
to shift? Hmmm. You are skipping the one-DFSA-per-different
inputmatch index so you probably find rules slightly faster in the
average case, but you're also actually taking the time of a rule
lookup just to decide when to shift. Isn't that an efficiency hit?
Or have I misunderstood completely?

What is clear though, is that now we're talking about different
implementations of parsers rather than about different language
families. We ought to be able to talk about different language
families in ways that don't require people to figure out from
context exactly what our particular parser implementation does or
doesn't do.


Bear

Quinn Tyler Jackson

Jul 4, 2003, 12:13:46 AM

Chris F. Clark said:

> It is worth noting that the intersection of two cfgs need not be a
> cfg, but the intersection of two csgs is a csg. Predicates
> (mentioned above) allow one to write grammars that
> are the intersection of context free grammars, which allows
> predicated parsers (either LL or LR) to parse context-sensitive
> languages.

That turns out to be a nifty observation in theory, so I thought I
would demonstrate it in practice.

grammar AnBnCn
{
S ::=

($a(X S.x<*>) $b(Y S.y<*>))<a b=anbn> $c(Z<S.x!=S.z> S.z<*>)<b
c=bncn>;

anbn ::= S.x [anbn] S.y;
bncn ::= S.y [bncn] S.z;

X ::= $S.x<-("(" '[a-z][a-z]+' ")");

Y ::= $S.y<-("(" '[a-z][a-z]+' ")");

Z ::= $S.z<-("(" '[a-z][a-z]+' ")");
}; // grammar AnBnCn

The above $-grammar accepts strings in the form:

(a)^n(b)^n(c)^n where |a| > 1, |b| > 1, |c| > 1, and a != c

Thus:

(cat)(cat)(dog)(dog)(mouse)(mouse)

but not

(cat)(cat)(dog)(mouse)(mouse)

and not

(cat)(cat)(mouse)(mouse)(cat)(cat)

How it actually goes about its magic is found in the back referencing
and predicates.

Here's an English-ed version of the first production, S:

$a(X S.x<*>))

"Match an X followed by zero to many of whatever X matched, and place
the result into a."

$b(Y S.y<*>))<a b=anbn>

"Match a Y followed by zero to many of whatever Y matched, and place
the result into b. Concatenate a and b, and parse the resulting string
with anbn. Fail if a+b is not accepted by anbn."

$c(Z<S.x!=S.z> S.z<*>)<b c=bncn>

"Match a Z. Fail if whatever matched X is equal to whatever matched Z,
otherwise, match zero to many of whatever matched Z, and place the
final result into c. Concatenate b and c, and parse the resulting
string with bncn. Fail if b+c is not accepted by bncn."

The cumulative result of the predicate intersections and the back
referencing is a grammar that accepts the language described.

> BTW, adaptive grammars are equivalent in power to TMs.

Right. Strictly speaking, the notations in the above grammar are
sufficient for Turing Power (give or take a notation), but I leave
that as an exercise for the reader.

--
Quinn Tyler Jackson

Hongbo Rong

Jul 15, 2003, 2:49:54 PM

Hello, I have a question regarding gcc: if I want to reuse the gcc
peephole optimizer in another compiler as an independent phase, what
is the main work to do?

My current understanding is that since the peephole rules are defined in
this way:
(define_peephole
[insn-pattern-1
insn-pattern-2
...]
"condition"
"template"
"optional insn-attributes"),

we need first to translate our own code into RTL, then port the related
compilation component to use the above rules, and after that translate
the resulting RTL back to our code.

Any comment please?

Thanks.

Rong

Barak Zalstein

Jul 17, 2003, 12:37:02 AM

> Hello, I have a question regarding gcc: If I reuse gcc
> peephole opt in
> another compiler as an independent phase, what is the main work to do?

ebo_special is starting to depress me as well, I just can't find the time to
rewrite it :)
(GNU comment: this is *still* freedom-compliant, right?)

> My current understanding is that since the peephole rules are
> defined in
> this way:
> (define_peephole
> [insn-pattern-1
> insn-pattern-2
> ...]
> "condition"
> "template"
> "optional insn-attributes"),
>
> we need first to translate our own code into RTL, then port
> the related
> compilation component to use the above rule, after that, translate the
> resulting RTL to our code again.
>
> Any comment please?

Since define_peephole generates assembly strings and runs as a final
compilation action, you might want to take a look at define_peephole2
(RTL to RTL transformation). I just skimmed gcc-3.3 and it seems to
be performed by peephole2_optimize in recog.c, but I'm afraid that it
will never work with OP * due to the recursive nature of rtx data
type. If you have a machine description file, you would just need to
define your peephole patterns there (otherwise the compiler will
probably die due to unrecognized insn error). Translating from one IR
to another is never an easy task (especially when it is discouraged by
the FSF) because even if you can map your compiler instruction to GCC
pattern and vice versa, you still have to deal with uninitialized
variables that could have been properly set only by a complete GCC
running from main(). In addition, an rtx pattern usually has
multiple alternatives, so accurate mapping of OP * to rtx and back can
be done only after register allocation.

Barak.
