Generalized operator-precedence parsing

Joachim Durchholz

unread,

Jun 28, 2001, 11:41:31 PM6/28/01

to

Hi all,

I've been thinking about a generalized approach to operator-precedence
parsing.

The basic idea is that an operator consists of a sequence of operator
tokens and operand places. Indicating operand places by dots, an
operator name might look like this:

Standard infix operator:
. + .
Prefix operators:
- .
sin .
The most basic "circumfix" operator:
( . )
A somewhat more complicated circumfix:
quadratic solution ( . , . , . )
Postfix operators:
. !
. ^
. *
Smalltalk-style operators (e.g. substring search):
find . in .

I'd like to see any pointers to literature (preferrably online) with
algorithms, experience reports, limits of the approach.

I already know some restrictions. For example, the grammar will become
ambiguous if a prefix and a postfix operator live at the same
precedence level, or if operator tokens are allowed to live in
operators of more than one precedence level. There's also the "core"
of an operator (everything except any initial or trailing operand
place). The operator core is its "circumfix" part, it's prefix if it
has a trailing operand, postfix with an initial operand, and infix if
it has an initial and a trailing operand.

Other things aren't so clear at all.
I'm pretty sure that a prefix and a right-associative infix operator on
the same precedence level generate an ambiguity, but I'm not sure why
(and I tend to confuse the association of fixity and associativeness).
I don't know what precedence to associate with the operand places within
the core of the operator, or how much sense precedence makes at all in
this context. The "inside" of a parenthese has lowest precedence, but is
it a good idea to make this a general rule?

Nesting Smalltalk-style operators will require lots of parentheses. Is
this avoidable? Or, more precisely: is there a way to reduce the number
of required parenthese without encouraging unreadable code? (I suspect
there isn't, but I may be wrong.)

Regards,
Joachim

Joachim Durchholz

unread,

Jul 6, 2001, 4:31:53 PM7/6/01

to

Does anybody know about generalizations of operator-precedence parsing?
Any information (preferrably available on the Internet) would be
appreciated.

For example, I know that Prolog uses such a technique, but I haven't
found any texts that explain the ins and outs.

Regards,
Joachim

Gregory Toomey

unread,

Jul 17, 2001, 11:16:29 PM7/17/01

to

"Joachim Durchholz" <joac...@gmx.de> wrote in message

> Does anybody know about generalizations of operator-precedence parsing?
> Any information (preferrably available on the Internet) would be
> appreciated.

You could try
http://www.eng.auburn.edu/users/wenchen/course/6210/week10/7.gif

You need to know the difference between LL(1) (used for recursive descent
compilers); and LALR(1) (used by the unix utilty yacc):
http://www.cs.caltech.edu/~kp/parserlib/lr.html

Most university couses on compilers go explain all this. A useful book is
Compilers: Principles, Techniques and Tools by Aho, Sethi & Ullman.

gtoomey

David Chase

unread,

Jul 17, 2001, 11:21:54 PM7/17/01

to

> Does anybody know about generalizations of operator-precedence parsing?
> Any information (preferrably available on the Internet) would be
> appreciated.

I saw your first post, meant to reply, but never quite got there. I'm
not 100% sure what you mean by "generalization". I thought your
syntax was a nice way to talk about prefix, postfix, infix, and
"surround-fix" expressions. I wrote an operator-precedence parser
(for FP84, in BCPL) that allowed the on-the-fly injection of new
operators, but your syntax for adding new operators is nicer.

I did find the (15 year old) source code listing for an OPP for
Fortran expressions. Sort of an interesting historical artifact,
though it isn't the finished product (in particular, it still has the
"all undeclared identifiers are undimensioned integers" stub in it).

Anyhow, I think your guesses about what to do with "surround-fix" are
roughly correct. The precedence of a closing parentheses is really
only interesting when there is an opening parentheses to match with
it, and at that point, it is as low as necessary to drive the enclosed
expressions to their evaluated form. There's obvious other
constraints that apply -- obviously, "[ ... )" (right here is where I
hit control-E to move to the end-of-line in Emacs, only I was
composing mail in Eudora. If I could disable that key, I would.) is
ill-formed, and should be rejected.

The generalization, something along the lines of

IF exp THEN exp ELSE exp

I think has the the THEN first playing the role of ")", but afterwards
the IF ... THEN acts as an opening parentheses to the trailing ELSE,
followed finally by the expression. If you let users define such
syntax, you have to consider whether you let them also say "IF exp
THEN exp ..." (you can imagine this in a language with "no value" as
an outcome) and then ensure that

IF exp THEN IF exp THEN exp ELSE exp

parses "properly", which (as I recall) is

(IF exp THEN (IF exp THEN exp ELSE exp ) )

but I think this would work that way if you simply followed the
treatment of "IF ... THEN" as an opening parentheses for a trailing
ELSE -- it would bind to the closer one.

(Yes, you could have a real party with an operator precedence parser.)

I think you also have to be a little careful. I am not a parsing
expert, but what I have heard is that OPPs can accept larger-than-
intended languages. I don't recall the example, and (given that
parsing atrocities that we've seen since I didn't learn enough about
parsing) it's not clear that anyone cares. I do have one opinion that
I think is worth listening to, which is that C (and hence, C++, and
Java) have too many prefix operators. In particular, casting should
not be a prefix operator, it should be some sort of a suffix operator,
so that you can read an expression from left to right and not worry
about how the parentheses group or what the relative precedence of
casting and other things is.

I worry a bit, also, that you might find yourself wanting to (ahem)
associate productions and what-not with your operator specifications,
so that you might say something like

. + . = plus(.,.)

so then you'll want to start identifying those non-terminals

<expr_1> + <expr_2> := plus(expr_1,expr_2)

and then you'll probably want to start attaching other information to
them, like whether they are expressions or statements, along the lines
of

WHILE <expr_1> DO <statement_1> OD := choose_your_semantic_poison

so that you can forbid (assuming you think this is a good
idea, which I don't always) things like

x := 1 + DO expression UNTIL done OD

(It would make more sense to forbid this if it is a zero-trip loop).

David Chase

--
David...@NaturalBridge.com

Dmitry Kazakov

unread,

Jul 17, 2001, 11:22:58 PM7/17/01

to

On 28 Jun 2001 23:41:31 -0400, "Joachim Durchholz" <joac...@gmx.de>
wrote:

>I've been thinking about a generalized approach to operator-precedence
>parsing.
>
>The basic idea is that an operator consists of a sequence of operator
>tokens and operand places. Indicating operand places by dots, an
>operator name might look like this:
>
>Standard infix operator:
> . + .
>Prefix operators:
> - .
> sin .
>The most basic "circumfix" operator:
> ( . )
>A somewhat more complicated circumfix:
> quadratic solution ( . , . , . )
>Postfix operators:
> . !
> . ^
> . *
>Smalltalk-style operators (e.g. substring search):
> find . in .

A generalized expression may look like:

<prefix> <operand> <postfix> <infix> <prefix> <operand> ...

where <prefix> is a list of any prefix operators and left braces,
<infix> is either an infix operation or a comma, <postfix> is a list
of any postfix operators and right braces. A generalized expression
can be parsed using the standard twin stack algorithm.

The conventional concept of operator precedence tells that each
operator has a priority controlling the operator's association with
its operands. It can be extended by splitting the single operator's
priority into two parts - the left and the right priorities. The left
priority controls the operator association with the left operand. The
right priority does it for the right operand. In the expression a*b+c,
the result will be (a*b)+c if the left priority of the + is not
greater than the right priority of the *. The advantages of the
extended concept:

(o) It is very easy to have left or right associated operators. For
example, if + has the left priority lower than the right one,
a+b+c will be associated as a+(b+c).

(o) Assignment operator. In C the assignment operator has a lower
priority than +. Therefore, a=b+c means a=(b+c). But a+b=c+d has a
meaningless interpretation as (a+b)=(c+d). Assigning to =
two priorities: very high left and very low right we could achieve a
more natural result: a+(b=(c+d)).

(o) Unary operator association can be more flexible. For example,
-++a-- can be interpreted as -(++(a)--) or even as ++(-(a--)).

There are three general classes of the operator context:

(1) Prefix context. A prefix operator (unary plus) or a left brace may
appear in this context. The parser remains in this context until there
are prefix things recognized. Then it tries to recognize an operand
(either a literal or an identifier). Then it switches to 3

(2) Infix context. A binary operator (+, * and so on), a comma or a
left brace of an indexing operator may be here. The parser recognizes
an infix thing and switches to 1.

(3) Postfix context. A postfix operator (++ in the C) or a right
brace is recognized in this context. The parser remains in this
context until there is a postfix thing. Then it switches to 2.

There are the following kinds of generalized operators:

(1) Prefix operators are recognizedr in the prefix context. They are
unary plus and minus, preincrement and predecrement (in C). Usually,
the left priority of a prefix operator is low. The right one is high.
Therefore, a-++b-c means a-(++b)-c. Although sometimes the right
priority is lower than the left priority of an infix operator. For
example, in -a**b, that should be treated as -(a**b).

(2) Infix operators are most of operators, like plus, minus and
others. Normally, the left and right priorities of an infix operator
are near to each other (for operator association see above). Some
operators may have unbalanced priorities, like the assignment
operator. There could be two kinds of infix operators:

(o) Normal infix operators.
(o) Meta-infix operators that cannot appear outside of braces. In
other things they does not differ from particular operators. An
example of a meta-infix operator is => (keyword parameter association)
in Ada.

(3) Postfix operators are recognized in the postfix context. They are
postincrement and postdecrement operators in C. Usually, the left
priority is high, but lower than the right priorities of prefix
operators. The right operator's priority should be slightly lower than
the left one, but higher than the right priorities of the infix
operators. Under these conditions a-++b--+c will mean
a-(((--(++b))++)--)+c.

(4) Left braces are recognized in prefix and infix contexts.They are
subdivided into two classes:

(o) Simple left braces appearing in the prefix context. They have no
priorities because their order cannot be violated. They are left
grouping and left aggregate braces. For example, ( in (a+b)*c and { in
{1,2,3}.
(o) Index braces appear in the infix context. They are left braces of
function calls and array index. For example, [ in a+b[2]. Here the
index has two operands b and 2. Their left priority is usually high.
Otherwise, a+b[2] were interpreted as (a+b)[2]. The right priority is
absent and assumed to be infinite.

(5) Commas have no priorities because their order cannot be violated.
They appear in the infix context and are subdivided into four classes:

(o) Simple commas like `,' in foo (a,2,b).
(o) Comma-brace separates a parameter from a sublist of parameters
and looks like a comma followed by a left brace. For instance, in
Ada's extension aggregates, the keyword `with' is a comma-brace. So
the extension aggregate (a with 1,2,3) means (a,(1,2,3)). Where "a" is
a value of the base type of the extended type and (1,2,3) is the
aggregate of the base type extension. A comma-brace can only appear
within a parameter list.
(o) Brace-comma closes a parameter sublist opened by a comma-brace and
separates the next parameter of the parent parameter list.
(o) Brace-comma-brace closes a parameter sublist opened by a
comma-brace and opens a new sublist.

(6) Right braces appear in the postfix context. They close a parameter
list opened by a left braces as well as all its sublists.

Instead of priorities braces and commas should have the group and the
type. They are used to separate different classes of braces like ()
and [] which should have different types. They may have the same group
to share the same comma.

Regards,
Dmitry Kazakov

Peter da Silva

unread,

Jul 17, 2001, 11:25:29 PM7/17/01

to

Joachim Durchholz <joachim....@gmx.de> wrote:
> Does anybody know about generalizations of operator-precedence parsing?
> Any information (preferrably available on the Internet) would be
> appreciated.

One of the better explanations of operator precedence parsing that I've found
is in Horowitz and Sahni, "Fundamentals of Data Structures". I don't have my
copy at hand to give you the ISBN, but I'm sure you can find it on the web.

--
`-_-' In hoc signo hack, Peter da Silva.
'U` "A well-rounded geek should be able to geek about anything."
-- nic...@esperi.org
Disclaimer: WWFD?

john slimick

unread,

Jul 18, 2001, 8:00:34 PM7/18/01

to

On 17 Jul 2001 23:16:29 -0400, Gregory Toomey <gto...@usa.net> wrote:
>"Joachim Durchholz" <joac...@gmx.de> wrote in message
>> Does anybody know about generalizations of operator-precedence parsing?
>> Any information (preferrably available on the Internet) would be
>> appreciated.

You might look for papers on extended operator precedence by Dr. Susan
Graham of Cal-Berkeley. In the late 60's -- early 70's she worked on
implementing one of the Wirth "enhanced" Algol's and, as I recall, to
continue the tradition of Dr. Wirth in using, in this case, a more
generalized operator precedence parsing.

john slimick
sli...@pitt.edu

Dennis Mickunas

unread,

Jul 23, 2001, 2:23:14 AM7/23/01

to

"john slimick" <sli...@jcs.upb.pitt.edu> wrote in message

> On 17 Jul 2001 23:16:29 -0400, Gregory Toomey <gto...@usa.net> wrote:
> >"Joachim Durchholz" <joac...@gmx.de> wrote in message
> >> Does anybody know about generalizations of operator-precedence parsing?
> >> Any information (preferrably available on the Internet) would be
> >> appreciated.
>
> You might look for papers on extended operator precedence by Dr. Susan
> Graham of Cal-Berkeley

Yes, indeed, Sue Graham's thesis, "Precedence Languages and Bounded Right
Context Languages" (Stanford, 1971) is a good source. Another is Alain
Colmerauer's "Total Precedence Relations," JACM (January 1970). As for
generalizations of operator precedence, there was an excellent treatment by
Jim Gray and Michael Harrison: "Canonical Precedence Schemes," JACM (April,
1973); from the abstract:
"A general theory of canonical precedence analysis is defined and
studied. The familiar types of precedence analysis such as operator
precedence or simple precedence occur as special cases of this theory. Among
the theoretical results obtained are a characterization of the structure of
precedence relations and the relation of canonical precedence schemes to
operator sets."

Both of the JACM papers are available on the ACM digital library.

There was also an interesting technical report by Vason P. Srini while he
was at Tennessee Tech: "The Classes of Environment T-Operator Precedence
Languages," (1978).