Operator precedence parsing with modgrammar

49 views

Skip to first unread message

mbuhot

unread,

Dec 26, 2011, 3:56:09 AM12/26/11

to modgrammar

Hello and thankyou for modgrammar.

Is there an easy/recommended way to parse expression grammars with
infix operators?
Ideally I'd like to just declare the operators with a precedence
number.

My current thinking is to parse expressions as a list of terms
separated by operators, then post-process the result with an operator
precedence function to recover the proper parse tree...

class Operator(Grammar):
grammar = WORD('-+!@#$%^&*')

class Term(Grammar):
grammar = Number | Identifier | ParensSubExpr

class Expression(Grammar):
grammar = LIST_OF(Term, sep=Operator, min=1)

Alex Stewart

unread,

Dec 27, 2011, 7:03:34 PM12/27/11

to modgrammar

Hey there.. sorry about the delay in responding (been a bit hectic
with the holidays and all)..

Heh.. you happen to have stumbled upon one of the hairier issues in
syntax parsing, actually. There are a couple of different ways to do
this sort of thing. Your approach would certainly work, and there's
nothing wrong with doing things that way (and in many cases it's often
the easiest overall). If you'd prefer to have modgrammar do the work
for you instead, though, that is possible too, but it requires a bit
of careful design.

Generally, the way to do this would be to take advantage of the fact
that the OR operator in modgrammar always tries to match in a left-to-
right order, which means you can tell it to try to match certain (more
desirable) expressions before trying (less desirable) others, so, for
example, a simple grammar implementing addition, multiplication, and
division (with standard precedences) might look something like the
following:

from modgrammar import *

class Number (Grammar):
grammar = (OPTIONAL('-'), WORD('0-9'), OPTIONAL('.', WORD('0-9')))

class ParenExpr (Grammar):
grammar = (L('('), REF('Expr'), L(')'))

class P0Term (Grammar):
grammar = (ParenExpr | Number)

class P0Expr (Grammar):
grammar = (P0Term, L('/'), P0Term)

class P1Term (Grammar):
grammar = (P0Expr | ParenExpr | Number)

class P1Expr (Grammar):
grammar = (P1Term, L('*'), P1Term)

class P2Term (Grammar):
grammar = (P0Expr | P1Expr | ParenExpr | Number)

class P2Expr (Grammar):
grammar = (P2Term, L('+') | L('-'), P2Term)

class Expr (Grammar):
grammar = (P2Expr | P1Expr | P0Expr | ParenExpr)

As you can see, for each precedence "tier", we define an expression
grammar where its terms can either be a number or a higher-precedence
expression, and we always try to match the highest-precedence
expression first.

(Note that the exception is the final "Expr" grammar, where we reverse
the order and check in lowest-to-highest precedence, because we want
to match the largest possible expression, thus for, say, "1*2+3", it
will try to match a P2Expr (and find "1*2" "+" "3") before looking for
a P1Expr (which would just match "1" "*" "2" and leave a remainder of
"+3").

There is one problem with all of this, however: This does correctly
order operators of different precedence, but it doesn't deal well with
strings of operators that have the same precedence (for example,
"1+2+3" gets parsed as "1+2" with a remainder of "+3"). The obvious
way to fix this would be to simply make the grammars recursive (that
is, a P2Term could itself be another P2Expr), but the problem with
this is that it would lead to what's known as left-recursion (where
P2Expr could start with a P2Expr, which could start with a P2Expr,
which could...), which pretty obviously results in infinite parsing
loops pretty quickly. We could limit it to only allowing recursion on
the right-hand side of the expression, which would fix the left-
recursion problem, but has an unfortunate side-effect of making the
resulting math expressions seem to evaluate in a right-to-left order,
which is not what is usually desired (that is, for example, "1-2+3"
would be evaluated as "1-(2+3)" (=-4) instead of the expected
"(1-2)+3" (=2).

Luckily, there is another way: We can just define each type of
expression as being a term followed by "one or more" operator+term
pairs, so then if you say (for example) "1-2+3", it would give you
back an expression object which actually has three terms ("1", "2",
and "3", separated by two operators ("-" and "+"), but the operators
are guaranteed to be of the same precedence, so when calculating the
value of the expression, you can just iterate through them in a left-
to-right order and it'll give you the right answer). To illustrate
this, the following code does this, and also provides .value() methods
for each of the types of grammar that will do the appropriate
calculations, so if 'result' is the result returned from parsing, you
can just say 'result.value()' and it will return the calculated value
of the whole expression (give it a try!):

#!/usr/bin/python3

import sys
from modgrammar import *

class Number (Grammar):
grammar = (OPTIONAL('-'), WORD('0-9'), OPTIONAL('.', WORD('0-9')))

def value(self):
return float(self.string)

class ParenExpr (Grammar):
grammar = (L('('), REF('Expr'), L(')'))

def value(self):
return self[1].value()

class P0Term (Grammar):
grammar = (ParenExpr | Number)

def value(self):
return self[0].value()

class P0Expr (Grammar):
grammar = (P0Term, ONE_OR_MORE(L('/'), P0Term))

def value(self):
value = self[0].value()
for e in self[1]:
value /= e[1].value()
return value

class P1Term (Grammar):
grammar = (P0Expr | ParenExpr | Number)

def value(self):
return self[0].value()

class P1Expr (Grammar):
grammar = (P1Term, ONE_OR_MORE(L('*'), P1Term))

def value(self):
value = self[0].value()
for e in self[1]:
value *= e[1].value()
return value

class P2Term (Grammar):
grammar = (P0Expr | P1Expr | ParenExpr | Number)

def value(self):
return self[0].value()

class P2Expr (Grammar):
grammar = (P2Term, ONE_OR_MORE(L('+') | L('-'), P2Term))

def value(self):
value = self[0].value()
for e in self[1]:
if e[0].string == '+':
value += e[1].value()
else:
value -= e[1].value()
return value

class Expr (Grammar):
grammar = (P2Expr | P1Expr | P0Expr | ParenExpr)

def value(self):
return self[0].value()

parser = Expr.parser()
result = parser.parse_string(sys.argv[1], eof=True)
remainder = parser.remainder()
print("Parsed Text: {}".format(result))
print("Value: {}".format(result.value()))

I hope this helps (and wasn't too long-winded or confusing).. if
anything's not clear, please let me know.

--Alex

Reply all

Reply to author

Forward

0 new messages