On Jan 16, 9:20 am, James Harris <james.harri...@gmail.com> wrote:
...
Here is some documentation to go with the previously-posted algorithm.
This post will likely only be of interest to someone else who wants to
write an expression parser.
I found that when an expression mixes precedences and operator
positions (prefix, postfix and infix) its structure can be
surprisingly complex and hard to understand and parse. I will explain
the relevant concepts I came up with. Hopefully this will make the
approach easier to see and easier to verify for correctness.
As may be expected, the higher-precedence operators have to be carried
out first. Equal-precedence operators are normally applied
left-to-right but provision is also made in the design so that
right-to-left operators are handled automatically when they are
needed.
I needed a way to say that some operators cannot be combined (because
they have no precedence relationship defined between them) and so
allowance is made for detecting invalid operator combinations.
The correct application of precedences to the three types of operators
was, for me, very hard to grasp, and I tried many examples to work out
how to boil the approach down to the essentials. So that you can
understand the approach and verify it for yourself I will explain the
key principles - and try not to bore you in the process!
Prefix Operators
================
As a rule, a prefix operator (such as might be used in most familiar
programming languages) is not applied to the following *operand* but
to the following *expression*. That expression extends over any
operators which have a higher precedence than the prefix operator. To
illustrate, take the example code
not A * B
In this case the idea is that the boolean "not" has to be applied to
the subexpression A * B to its right, not just to the operand A.
Conversely, if to the right there are operators of lower or equal
precedence they do not have that prefix operator applied to them. For
example,
- A or B
Here the unary minus is applied only to A. The parser has to stop
moving right when it sees that "or" has lower precedence than unary
minus. It can only consider "or" once it has built a node for (-A).
Of course, these may be just parts of bigger expressions. So
not A ** B > C and D
Here, the parser has to apply "not" to just the following
subexpression A ** B > C. The "and D" part of the expression has to be
applied later.
Your language may not use these specific precedences but they
illustrate the idea.
The principle for unary operators is recursive. The expression that
follows a unary operator is itself an expression and may or may not
start with another unary operator. For example,
- ~ abs A
Here, "abs" applies to A, "~" applies to "abs A" and "-" applies to "~
abs A".
Therefore, prefix operators are applied right-to-left.
The precedence of each unary operator, however, effectively determines
how far to the right the subexpression it applies to extends.
Basically it applies rightwards only over higher-precedence operators.
Looked at another way, the parser has to accumulate an expression
until it gets to an operator (either postfix or infix) which has lower
or equal precedence. An example is given in the next section.
Because the concept of applying prefix operators is recursive it makes
sense to have the program call itself to evaluate the expression to
the right of a prefix operator and then apply the prefix operator to
the result. For example, if a unary minus is seen the parser calls
itself to evaluate the expression to the right of the minus sign then
builds a node to apply the unary minus to the returned expression.
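This recursive step alone, ignoring precedence limits for the moment,
might be sketched as follows (the operator set here is a hypothetical
one, not the language's real operators):

```python
# Hypothetical set of prefix operators.
PREFIX = {'-', '~', 'abs'}

def parse_prefix(tokens, pos):
    """On seeing a prefix operator, recurse to parse the expression
    to its right, then build a node applying the operator to the
    returned result. Return (tree, new_pos)."""
    tok = tokens[pos]
    if tok in PREFIX:
        operand, pos = parse_prefix(tokens, pos + 1)
        return (tok, operand), pos
    return tok, pos + 1              # a plain operand

tree, _ = parse_prefix('- ~ abs a'.split(), 0)
# Prefix operators come out applied right-to-left:
# tree == ('-', ('~', ('abs', 'a')))
```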
Postfix Operators
=================
These are unary operators that appear after the expression to which
they apply. Not all languages have them but you might want to read on
as the principles herein apply to infix too.
In an earlier version of the algorithm I had postfix operators
applying to only the *value* they followed. That was sufficient at the
time as the language to be compiled had only equal-highest-precedence
postfix operators, but it dawned on me that it was not sufficient for
general use. To be used more generally, postfix operators have to
apply to the preceding *expression*. Of course, if they have highest
precedence they will still apply to just the preceding value, so the
algorithm can be applied in either case.
Given that a postfix operator applies to the expression to its left
how far left does that expression extend? Essentially it applies to
any expression which is made up of values tied together by operators
which have higher or equal precedence than the postfix operator.
So if there were a mid-precedence postfix operator it would apply to
(i.e. be used after) the high- and mid-precedence operators to its
left and would not apply to (i.e. would be used before) a
low-precedence operation to its left. To illustrate, if # were a
postfix operator of lower precedence than ** it would apply after it
in
A ** B #
This would resolve to (A ** B)# because of ** having precedence
greater than or equal to that of #. Conversely, consider
A + B #
This would resolve to A + (B#) because of # having a higher precedence
than +.
Multiple postfix operators must be applied left-to-right regardless of
what their precedences are. So if the entire expression was
A ~ # ^
then the three operators ~, # and ^ have to be applied in order
regardless of their precedences relative to each other.
Things get more difficult when looking at prefix and postfix operators
either side of the same value. I tried loads of real-life examples to
work out the order in which the prefix and postfix operators had to be
applied. In the end I found it helped to represent a subexpression as
a letter and to represent prefix and postfix operators with the
numbers of their precedences. So, for example, 6v4 would indicate a
prefix operator of precedence 6, a subexpression v, and a postfix
operator of precedence 4.
Here are some examples using that notation to show how the precedences
determine the limits of an expression.
6v4 becomes (6v)4
I.e. apply the 6 first as its precedence is higher than or equal to 4.
2v3 becomes 2(v3)
I.e. apply the 3 first because it has higher precedence than 2.
What if the precedences on either side are the same? Because left
associativity is the default (but see below for how to specify right
associativity) we apply the left-hand one first. So,
4v4 becomes (4v)4
I.e. the algorithm has to apply the prefix 4 before it applies the
postfix 4.
Given the above, consider the following, which has two prefix
operators and one postfix operator.
37v4 becomes 3(7v)4 then becomes 3((7v)4)
Here the prefix 7 is applied first as its precedence is higher than or
equal to that of the postfix 4. Once that has been done we are left
with 3(7v)4 and the postfix 4 has higher precedence than the prefix 3
so it is applied next.
This latter example shows that the direction of resolution can vary.
It can, in fact, shuttle back and forth until all prefix and postfix
operators have been used.
In terms of the limits of a subexpression consider
2323v19
In this case, even though there is a very high-precedence operator to
the right (the 9), all the left-hand prefix operators have to be
applied first because of the 1 immediately to the right of v. The 9
cannot be applied until the 1 has been applied, and the 1 cannot be
applied until after the higher precedences on the left have been dealt
with.
Similarly, in
91v2323
all the RHS operators need to be applied first (left-to-right) because
of the low-precedence operator to the left of v.
From this the rules are:
* Only consider one precedence on either side of v
* If the LHS >= the RHS use the left operator
* If the LHS < the RHS use the right operator
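These rules can be sketched directly in code over the numeric notation
used above. This is only a model of the notation, not the parser
itself:

```python
def resolve(s):
    """Resolve a string such as '37v4' - prefix precedences, a value
    'v', then postfix precedences - into a parenthesised form by
    repeatedly comparing one precedence on either side of the value."""
    i = s.index('v')
    pre = list(s[:i])          # outermost prefix operator first
    post = list(s[i + 1:])     # innermost postfix operator first
    expr = 'v'
    while pre or post:
        # If the LHS >= the RHS use the left operator, else the right.
        if pre and (not post or int(pre[-1]) >= int(post[0])):
            expr = '(' + pre.pop() + expr + ')'
        else:
            expr = '(' + expr + post.pop(0) + ')'
    # Drop the outermost parentheses to match the notation above.
    return expr[1:-1] if expr.startswith('(') else expr

# resolve('6v4')  -> '(6v)4'
# resolve('2v3')  -> '2(v3)'
# resolve('37v4') -> '3((7v)4)'
```

Note how resolve('37v4') shuttles back and forth, just as described
below: the 7 wins against the 4, then the 4 wins against the 3.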
That's the concept but how to parse it? Fortunately it turned out that
it could be done simply and efficiently. On each prefix operator the
parser has to call itself recursively with the precedence of that
operator as a parameter. If, for example, there is a unary minus
prefix we have to evaluate the expression to the right and then apply
the minus to it.
How far to the right, though, does the expression extend? On
encountering a postfix operator all that's needed is to compare the
precedence of that postfix operator with the precedence this
activation of the parser was called with. If the postfix operator has
a higher precedence then it gets applied to the expression as it
exists at the time and we repeat (see below). Alternatively, if the
postfix operator has a lower or equal precedence then it cannot be
applied yet and we have to return the expression as it stands leaving
the postfix operator to be dealt with later.
Returning the expression as it stands can be illustrated in the
processing of
3v1
Here, the parser has initially seen the 3 and called itself
recursively to process what remains. The called activation of the
parser begins at the v. It remembers the v and then sees the postfix
1. Because the postfix 1 does not have a higher precedence than the 3
it was called with, the activation returns the v to the caller. The
caller, having had v returned to it, applies the prefix 3 to get (3v).
It then continues with the parse and again sees the postfix 1. As this
is the only remaining operator it applies the 1 to the (3v) to get
(3v)1.
Given the above, the expression parser has to iterate over operators
which have a higher precedence than that with which it was called. It
only returns a subtree once it has resolved all higher precedence
operators. Therefore it can be started simply by calling it with a
pseudo precedence of zero. As long as all operators have a precedence
which is higher than zero the parser will only return once all
operators have been resolved and the entire expression has been built
into a tree.
Infix Operators
===============
Once prefix and postfix handling has been grasped the handling of
infix operators is easier. Parsing infix operators includes aspects of
parsing prefix and of parsing postfix operators.
When the parse reaches an infix operator it doesn't process that infix
operator until it has resolved the expression to its left which has
higher or equal precedence, just as it would do for a postfix
operator. In other words, if it comes across an infix operator which
has lower or equal precedence than it was called with it returns to
its caller leaving the infix operator to be dealt with later.
Similarly, once the expression to the left has been converted to a
tree then the infix operator is remembered and the parser calls itself
(as it did for a prefix operator) to parse the expression to the
right. The called activation continues until it comes to an operator
of lower or equal precedence and then it returns. The calling
activation combines both sides of the infix operator into a dyadic
operation subtree and then repeats (see below).
Note that because prefix operators are recognised in their own
distinct context - i.e. the start of an expression or subexpression -
the symbols used for prefix operators can overlap those used for
postfix and infix operators. This automatically distinguishes a unary
minus from an infix one. The same can be said for other operator
symbols.
Invalid Operator Combinations
=============================
Some operators may not be allowed to combine. For example,
A + B & C
In this expression there is a question over whether + takes precedence
over & or vice versa. I decided to make them incompatible - at least
in the initial version of the language - so the above has to be
expressed as one of these
A + (B & C)
(A + B) & C
Either of these is acceptable.
To manage this in the parser I found it easiest to have a separate
routine to return true or false to identify compatibility and include
the routine in the postfix and infix processing. The test for
compatibility is made only after we know that the precedence of the
current operator is greater. If it is not greater then the source is
not saying to combine the operators.
Because of the need to compare operators and not just precedences the
algorithm passes both into recursive calls of itself. If your language
does not need to check for invalid operator combinations you won't
need to pass the operators in - though you might want to keep them if
they help with error reports.
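Such a check might be sketched as below; the table of incompatible
pairs is a hypothetical stand-in for whatever your language defines:

```python
# Hypothetical pairs of operators with no defined precedence
# relationship; they may not be combined without parentheses.
INCOMPATIBLE = {('+', '&'), ('&', '+')}

def compatible(op1, op2):
    """Return True when the two operators may legally be combined."""
    return (op1, op2) not in INCOMPATIBLE

# compatible('+', '*') -> True
# compatible('+', '&') -> False, so the parser reports an error
```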
Right-to-Left Operations
========================
While the above handles operators of equal precedence left-to-right I
wanted to allow for some of them to be processed right-to-left. I had
read somewhere that this could be done by masking the lowest bit on
one side of the comparison and chose to apply that idea as follows.
The precedence of most operators is an even number - i.e. its lowest
bit is zero. That bit is defined as 1 on the few operators, if any,
which need right-to-left processing. Then all that is needed is to
clear that bit when calling the expression parser for the RHS of an
infix expression. If the same operator occurs next it will be seen,
effectively, as of higher precedence than itself. For example,
A ** B ** C ** D
If ** has precedence 9 (an odd number indicating right-to-left
associativity) the parser (because it clears the lowest bit for all
infix operators when calling itself) will feed 8 into each recursive
call. The subsequent ** operator with precedence 9 will then be seen
as of higher precedence than 8 and will be applied first. Thus this
operator will be applied right-to-left and there is no arbitrary limit
as to how far right this can extend. The above will build a node for C
** D first and then work left.
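The bit trick can be shown in isolation; the precedence value of 9 for
** is an assumption for illustration only:

```python
def rhs_precedence(prec):
    """Precedence to pass into the recursive call for the RHS of an
    infix operator: clearing the lowest bit makes an odd
    (right-to-left) precedence outrank itself on the next comparison."""
    return prec & ~1

# rhs_precedence(9) -> 8, and 9 > 8, so a following ** binds first
# (right-to-left). rhs_precedence(6) -> 6, and 6 > 6 is false, so an
# even-precedence operator such as * stays left-to-right.
```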
The principle of bit masking on the recursive call could be extended
to mask out a number of the bits of precedence (not necessarily all
adjacent) to achieve various effects. I don't have a use for that as
yet but because the language I have to compile defines operators in
groups or families I am thinking of defining precedences in
hexadecimal as two digits where the leftmost is the family and the
rightmost the precedence of a specific operator within that family.
Repetition
==========
It took me a while to work out how the algorithm should proceed after
processing postfix and infix operators which had higher precedence
than that which the parser was called with. In the end I realised that
it had to get an expression as the first action and then repeat the
second part of the parse. So the essential structure of the parser
became very simple:
  Build a node for an operand
  While there is an operator of higher precedence (postfix or infix)
    Build a node combining the existing node and that operator
  Return the expression node
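Fleshed out in Python, that structure might look like the sketch
below. The operator tables are hypothetical stand-ins, not the
language's real precedences; the odd precedence on ** marks it
right-to-left as described earlier:

```python
# Hypothetical precedence tables for illustration only; an odd
# precedence on an infix operator (** here) means right-to-left.
PREFIX  = {'-': 8, 'not': 2}
POSTFIX = {'!': 9}
INFIX   = {'+': 4, '*': 6, '**': 9}

def parse(tokens, pos=0, min_prec=0):
    """Parse tokens[pos:] and return (tree, new_pos). The loop stops
    at any postfix or infix operator whose precedence is not higher
    than min_prec, leaving that operator for the caller."""
    tok = tokens[pos]
    if tok in PREFIX:
        # Prefix: recurse with this operator's precedence, then wrap.
        node, pos = parse(tokens, pos + 1, PREFIX[tok])
        node = (tok, node)
    else:
        node, pos = tok, pos + 1                 # a plain operand
    while pos < len(tokens):
        op = tokens[pos]
        if op in POSTFIX and POSTFIX[op] > min_prec:
            node, pos = (op, node), pos + 1      # apply postfix now
        elif op in INFIX and INFIX[op] > min_prec:
            # Clear the lowest bit so a right-to-left operator is
            # seen as outranking itself in the recursive call.
            rhs, pos = parse(tokens, pos + 1, INFIX[op] & ~1)
            node = (op, node, rhs)
        else:
            break                                # caller deals with op
    return node, pos

tree, _ = parse('not a * b'.split())
# tree == ('not', ('*', 'a', 'b'))
```

Calling it with the pseudo precedence of zero, as the default
parameter does above, means it returns only once the whole expression
has been built into a tree.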
With this scheme the parser gets an operand (which could be a
subexpression) then repeatedly applies itself to pairs of
(<infix operator>, <subexpression>)
until it hits an infix or postfix operator which has a precedence
which is not higher than the precedence it was called with or reaches
something that marks the end of the expression. The end of the
expression effectively has precedence 0 so that anything to the left
is resolved first.
In terms of building a tree the algorithm structure allows the initial
operand or subexpression to be resolved to a node and then, on each
iteration of the loop, applies either a postfix or an infix operator
to that node, making a new node.
Since the new node thus created includes references to its subnodes
the algorithm has no need to build any stacks or other data structures
to represent unparsed parts of the expression. The only structure it
builds is the resolved expression subtree.
Caveats
=======
There are points of caution I should mention.
I have not shown any real operator precedences but some specific ones
are key to correct operation of the algorithm. I will mention them
here.
As stated, all operators must have precedences which are greater than
zero. This is so that the initial call to the routine only returns
once the whole expression has been parsed.
Some choices were made so that as well as handling mathematical
expressions a single algorithm could be suitable for the arguments to
function calls. Consider
f(a, b || c, d)
The algorithm as it stands can parse the arguments automatically if
comma is defined as an infix operator (normally of very low
precedence) but this does mean that a comma will also be accepted in a
mathematical expression. You may not want that.
C uses a comma operator this way and it may be fine for your language
but if not then you will need to recognise the presence of a comma at
some point. You could do this in the tree returned by the parser (it
will be an infix node), or by having a separate, possibly nested,
parser, or by adding a boolean parameter to each call to say whether a
comma is acceptable or not.
Similarly, consider these function calls.
f()
f(,2)
In either of these cases there is a not-present subexpression. So that
there is just one expression parser the algorithm accepts such
not-present subexpressions. For general use you will also need to test
for them. For example, in contexts where not-present expressions are
not allowed you could check for empty expressions when walking the
tree. There are alternatives, such as having multiple parsers or
passing in an argument as suggested for comma, above.
To combine common processing and keep the code short a left
parenthesis following an expression is treated as an infix operator.
It must therefore be defined in the infix operators table. You would
normally define it as having a high precedence so that it binds
tightly to what is on its left but you can choose how to prioritise
it. Contrast these
a + b(c)
a.b(c)
In the first case a tree for "b(c)" would be built first. In the
second you might want the "a.b" to be converted to a tree first and
then the "(c)" applied. You can choose what to do by defining the
precedence of the left parenthesis.
Finally, there are some things you might want to add, depending on the
source language to be compiled. All I can think of at the moment is:
if your language allows assignments to be embedded in expressions,
check that they are handled properly; and if you use square brackets
for array references, add them in alongside the parentheses.
Summary
=======
Unless I have missed something the parser in the original post seems
to be a good solution for when a hand-written expression parser is
wanted. It is fast, short, reasonably understandable, flexible and
should be fairly easy to write in any language. The precedences are
defined in data and therefore easy to modify. Lastly, as the parser
has two main contexts I have been able to use it to generate some
helpful error messages telling the programmer what was expected when
an error is encountered.
James