
Compiling expressions


James Harris

Jan 16, 2013, 4:20:24 AM
A number of you will be aware that I have been trying to devise an
efficient expression parser that will handle varying precedences of
prefix, postfix and infix operators. Here is the algorithm I have come
up with. It produces a tree structure describing the expression.

The algorithm should be easy to render in different languages, even
assembly. In particular it does not build temporary data structures
either for the operands or for the operators. All it builds are nodes
for the tree; construction of tree nodes would normally already exist
in the rest of the compiler.

Instead it relies on the subroutine call stack - and doesn't use a lot
of that. The recursive calls are mainly for subexpressions and higher-
precedence operators. Adjacent operators of equal precedence do not
require recursive calls.

The Algorithm
=============

It helps to think of it as having two key sections. First, it gets
the initial subexpression, which could be the A in A+B. Then it
iterates over the (infix operator, expression) pairs that remain (the
+ and the B in A+B) until it gets to an operator of lower precedence
than the one it was called with. The result is returned as a finished
subexpression to the level that called it.

It is started off by being called with a dummy operator and a
precedence of zero as in

expr_parse("", 0)

Here is the algorithm presented as fairly detailed pseudocode.
Comments are preceded by apostrophes.


expr_parse(op1, prec1)
    tok = current token
    if tok is a prefix operator
        e1 = expr_parse(this operator, its precedence)
        e1 = node(PREFIX_OP, e1)
    else if tok is lparen
        e1 = expr_parse("", precedence 0)
        require rparen
    else if tok is an identifier
        e1 = a node for the identifier
    else if tok is the start of a literal
        e1 = a node for the parsed literal
    else if tok is a terminating token        'e.g. comma, rparen
        e1 = an empty node (and consume no input)
    else
        report that an operand was expected here

    'At this point e1 is a node for the opening expression and
    'prec1 still holds the precedence we were called with.

    loop
        op2 = current token
        if op2 is a terminating token
            break        'On rparen, end of statement or end of source
        else if op2 is a postfix operator
            if op2's precedence is <= prec1
                break
            if not compatible(op1, op2)
                report the error
            e1 = node(POSTFIX_OP, op2, e1)
        else if op2 is an infix operator (incl comma and lparen)
            if op2's precedence is <= prec1
                break
            if not compatible(op1, op2)
                report the error
            if op2 is lparen
                e2 = expr_parse(op2, 0)
                require rparen
            else
                e2 = expr_parse(op2, op2's precedence & ~1)    'Note 1
            e1 = node(INFIX_OP, op2, e1, e2)
        else
            report an operator or end of expression expected
    end loop
    return e1

Unless someone can see an error in it, that is the whole algorithm!

Note 1. The precedence of right-associative operators has the lowest
bit set. All other precedences are even numbers. The & ~1 clears the
lowest bit so that right-associative operators, if there are any, will
be handled in the correct order.
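To make the pseudocode concrete, here is a minimal runnable rendering
in Python. The operator tables, the precedence numbers and the
nested-tuple tree nodes are illustrative assumptions, and the
op1/compatible() check, comma and the empty-operand case are omitted
for brevity - a sketch, not a definitive implementation.

```python
# A minimal runnable rendering of the pseudocode above. Operator tables,
# precedences and tuple tree nodes are illustrative assumptions.
import re

PREFIX  = {'-': 8, 'not': 2}          # prefix operator -> precedence
INFIX   = {'+': 4, '*': 6, '**': 9}   # odd precedence (**) = right-assoc
POSTFIX = {'!': 7}                    # postfix operator -> precedence

def tokenize(src):
    return re.findall(r'\*\*|\w+|[^\s\w]', src) + ['<end>']

class ExprParser:
    def __init__(self, src):
        self.toks = tokenize(src)
        self.pos = 0

    def cur(self):
        return self.toks[self.pos]

    def advance(self):
        self.pos += 1

    def expr_parse(self, prec1):
        tok = self.cur()
        if tok in PREFIX:                       # prefix operator
            self.advance()
            e1 = ('prefix', tok, self.expr_parse(PREFIX[tok]))
        elif tok == '(':                        # parenthesised subexpression
            self.advance()
            e1 = self.expr_parse(0)
            assert self.cur() == ')', 'rparen required'
            self.advance()
        elif tok.isalnum():                     # identifier or literal
            e1 = tok
            self.advance()
        else:
            raise SyntaxError('operand expected, got %r' % tok)

        while True:                             # (operator, operand) pairs
            op2 = self.cur()
            if op2 in POSTFIX:
                if POSTFIX[op2] <= prec1:
                    break                       # leave it for our caller
                self.advance()
                e1 = ('postfix', op2, e1)
            elif op2 in INFIX:
                if INFIX[op2] <= prec1:
                    break
                self.advance()
                # Clear the low bit so right-associative operators group
                # right-to-left (Note 1 above).
                e1 = (op2, e1, self.expr_parse(INFIX[op2] & ~1))
            else:
                break                           # terminating token
        return e1

def parse(src):
    return ExprParser(src).expr_parse(0)
```

For example, parse("A + B * C") returns ('+', 'A', ('*', 'B', 'C')),
and parse("A ** B ** C"), with ** given the odd precedence 9, groups
rightwards as ('**', 'A', ('**', 'B', 'C')).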

James

James Harris

Jan 18, 2013, 7:22:55 AM
On Jan 16, 9:20 am, James Harris <james.harri...@gmail.com> wrote:

...

Here is some documentation to go with the previously-posted algorithm.
This post will likely only be of interest to someone else who wants to
write an expression parser.

I found that when an expression has a mix of precedences and operator
positions (prefix, postfix and infix) its structure can be
surprisingly complex and hard to understand and parse. I will explain
the relevant concepts I came up with. Hopefully this will make the
approach easier to see and easier to verify for correctness.

As may be expected, the higher-precedence operators have to be carried
out first. Equal-precedence operators are normally applied left-to-
right but provision is also made in the design so that right-to-left
operators are automatically handled when they are needed.

I needed a way to say that some operators cannot be combined (because
they have no precedence relationship defined between them) and so
allowance is made for detecting invalid operator combinations.

The correct application of precedences to the three types of operator
was, for me, very hard to grasp and I tried many examples to work out
how to boil the approach down to the essentials. So that you can
verify the approach for yourself and understand it I will explain the
key principles - and try not to bore you in the process!


Prefix Operators
================

As a rule, a prefix operator (such as might be used in most familiar
programming languages) is not applied to the following *operand* but
to the following *expression*, one made up of operators which have a
higher precedence than the prefix operator. To illustrate, take the
example code

not A * B

In this case the idea is that the boolean "not" has to be applied to
the subexpression A * B to its right, not just to the operand A.

Conversely, if to the right there are operators of lower or equal
precedence they do not have that prefix operator applied to them. For
example,

- A or B

Here the unary minus is applied only to A. The parser has to stop
moving right when it sees that "or" has lower precedence than unary
minus. It can only consider "or" once it has built a node for (-A).

Of course, these may be just parts of bigger expressions. So

not A ** B > C and D

Here, the parser has to apply "not" to just the following
subexpression A ** B > C. The "and D" part of the expression has to be
applied later.

Your language may not use these specific precedences but they
illustrate the idea.

The principle for unary operators is recursive. The expression that
follows a unary operator is itself an expression and may or may not
start with another unary operator. For example,

- ~ abs A

Here, "abs" applies to A, "~" applies to "abs A" and "-" applies to "~
abs A".

Therefore, prefix operators are applied right-to-left.
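The right-to-left nesting of a prefix chain falls out of the recursion
naturally, as this tiny sketch shows (the operator names and the tuple
tree shape are illustrative assumptions, not part of the algorithm):

```python
# Each prefix operator recurses on what follows it, so a chain such as
# "- ~ abs A" nests with the rightmost operator innermost.
PREFIX = {'-', '~', 'abs'}            # illustrative prefix operators

def parse_prefix_chain(tokens, i=0):
    tok = tokens[i]
    if tok in PREFIX:
        inner, i = parse_prefix_chain(tokens, i + 1)  # recurse on the rest
        return (tok, inner), i
    return tok, i + 1                                 # the operand itself

tree, _ = parse_prefix_chain(['-', '~', 'abs', 'A'])
```

Here tree comes back as ('-', ('~', ('abs', 'A'))): abs applies to A,
then ~, then -, just as described above.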

The precedence of each unary operator, however, effectively determines
how far to the right the following subexpression that it applies to
extends. Basically it applies rightwards only over higher-precedence
operators. Or, looked at another way, the parser has to accumulate an
expression until it gets to an operator (either postfix or infix)
which has lower or equal precedence. An example is given in the next
section.

Because the concept of applying prefix operators is recursive it makes
sense to have the program call itself to evaluate the expression to
the right of a prefix operator and then apply the prefix operator to
the result. For example, if a unary minus is seen the parser calls
itself to evaluate the expression to the right of the minus sign then
builds a node to apply the unary minus to the returned expression.


Postfix Operators
=================

These are unary operators that appear after the expression to which
they apply. Not all languages have them but you might want to read on
as the principles herein apply to infix too.

In an earlier version of the algorithm I had postfix operators
applying to only the *value* they followed. That was sufficient at the
time as the language to be compiled had only equal-highest precedence
postfix operators but it dawned on me that it was not sufficient for
general use. To be used more generally postfix operators have to apply
to the preceding *expression*. Of course, if they have highest
priority they will still apply to just the preceding value so the
algorithm can be applied in either case.

Given that a postfix operator applies to the expression to its left
how far left does that expression extend? Essentially it applies to
any expression which is made up of values tied together by operators
which have higher or equal precedence than the postfix operator.

So if there was a mid-precedence postfix operator it would apply to
(i.e. be used after) the high- and mid-precedence operators to its
left and would not apply to (i.e. would be used before) a low-
precedence operation to its left. To illustrate, if # was a postfix
operator of lower precedence than ** it would apply after it in

A ** B #

This would resolve to (A ** B)# because of ** having precedence
greater than or equal to that of #. Conversely, consider

A + B #

This would resolve to A + (B#) because of # having a higher precedence
than +.

Multiple postfix operators must be applied left-to-right regardless of
what their precedences are. So if the entire expression was

A ~ # ^

then the three operators ~, # and ^ have to be applied in order
regardless of their precedences relative to each other.

Things get more difficult when looking at prefix and postfix operators
either side of the same value. I tried loads of real-life examples to
try to work out the order the prefix and postfix operators had to be
applied. In the end I found it helped to represent a subexpression as
a letter and to represent prefix and postfix operators with the
numbers of their precedences. So, for example, 6v4 would indicate a
prefix operator of precedence 6, a subexpression v, and a postfix
operator of precedence 4.

Here are some examples using that notation to show how the precedences
determine the limits of an expression.

6v4 becomes (6v)4

I.e. apply 6 first as its priority is higher than or equal to 4.

2v3 becomes 2(v3)

I.e. apply the 3 first because it has higher precedence than 2.

What if the precedences on either side are the same? Because left
associativity is the default (but see below for how to specify right
associativity) we apply the left-hand side one first. So,

4v4 becomes (4v)4

I.e. the algorithm has to apply the prefix 4 before it applies the
postfix 4.

Given the above, consider the following which has two prefix operators
and one postfix operator.

37v4 becomes 3(7v)4 then becomes 3((7v)4)

Here the prefix 7 is applied first as its priority is higher than or
equal to that of the postfix 4. Once that has been done we are left
with 3(7v)4 and the postfix 4 has higher priority than the prefix 3 so
it is applied next.

This latter example shows that the direction of resolution can vary.
It can, in fact, shuttle back and forth until all prefix and postfix
operators have been used.

In terms of the limits of a subexpression consider

2323v19

In this case even though there is a very high precedence operator to
the right (the 9) all the left-hand prefix operators have to be
applied first because of the 1 immediately to the right of v. The 9
cannot be applied until the 1 has been applied and the 1 cannot be
applied until after the higher precedences on the left have been dealt
with.

Similarly, in

91v2323

All the RHS operators need to be applied first (left-to-right) because
of the low precedence operator to the left of v.

From this the rules are

* Only consider one precedence on either side of v
* If the LHS >= the RHS use the left operator
* If the LHS < the RHS use the right operator
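These three rules can be captured in a one-line helper (a sketch; the
function name and numeric precedences are illustrative assumptions):

```python
def next_unary_op(prefix_prec, postfix_prec):
    # Given the precedence of the prefix operator immediately left of v
    # (LHS) and the postfix operator immediately right of v (RHS), decide
    # which to apply next. The left one wins ties, matching the default
    # left associativity described above.
    return 'left' if prefix_prec >= postfix_prec else 'right'
```

So next_unary_op(6, 4) picks the left operator, giving (6v)4, while
next_unary_op(2, 3) picks the right one, giving 2(v3), and the tied
case 4v4 resolves leftwards to (4v)4.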

That's the concept but how to parse it? Fortunately it turned out that
it could be done simply and efficiently. On each prefix operator the
parser has to call itself recursively with the precedence of that
operator as a parameter. If, for example, there is a unary minus
prefix we have to evaluate the expression to the right and then apply
the minus to it.

How far to the right, though, does the expression extend? On
encountering a postfix operator all that's needed is to compare the
precedence of that postfix operator with the precedence this
activation of the parser was called with. If the postfix operator has
a higher precedence then it gets applied to the expression as it
exists at the time and we repeat (see below). Alternatively, if the
postfix operator has a lower or equal precedence then it cannot be
applied yet and we have to return the expression as it stands leaving
the postfix operator to be dealt with later.

Returning the expression as it stands can be illustrated in the
processing of

3v1

Here, the parser has initially seen the 3 and called itself
recursively to process what remains. The called activation of the
parser begins at the v. It remembers the v and then sees the postfix
1. Because the postfix 1 does not have a higher precedence than the 3
it was called with, the activation returns the v to the caller. The
caller, having had v returned to it, applies the 3 prefix to get (3v).
It then continues with the parse and again sees the postfix 1. As this
is the only remaining operator it applies the 1 to the (3v) to get
(3v)1.

Given the above, the expression parser has to iterate over operators
which have a higher precedence than that with which it was called. It
only returns a subtree once it has resolved all higher precedence
operators. Therefore it can be started simply by calling it with a
pseudo precedence of zero. As long as all operators have a precedence
which is higher than zero the parser will only return once all
operators have been resolved and the entire expression has been built
into a tree.


Infix Operators
===============

Once prefix and postfix handling has been grasped the handling of
infix operators is easier. Parsing infix operators includes aspects of
parsing prefix and of parsing postfix operators.

When the parse reaches an infix operator it doesn't process that infix
operator until it has resolved the expression to its left which has
higher or equal precedence, just as it would do for a postfix
operator. In other words, if it comes across an infix operator which
has lower or equal precedence than it was called with it returns to
its caller leaving the infix operator to be dealt with later.

Similarly, once the expression to the left has been converted to a
tree then the infix operator is remembered and the parser calls itself
(as it did for a prefix operator) to parse the expression to the
right. The called activation continues until it comes to an operator
of lower or equal precedence and then it returns. The calling
activation combines both sides of the infix operator into a dyadic
operation subtree and then repeats (see below).

Note that because prefix operators are recognised in their own
distinct context - i.e. the start of an expression or subexpression -
the symbols used for prefix operators can overlap those used for
postfix and infix operators. This automatically distinguishes a unary
minus from an infix one. The same can be said for other operator
symbols.


Invalid Operator Combinations
=============================

Some operators may not be allowed to combine. For example,

A + B & C

In this expression there is a question over whether + takes precedence
over & or vice versa. I decided to make them incompatible - at least
in the initial version of the language - so the above has to be
expressed as one of these

A + (B & C)
(A + B) & C

Either of these is acceptable.

To manage this in the parser I found it easiest to have a separate
routine to return true or false to identify compatibility and include
the routine in the postfix and infix processing. The test for
compatibility is made only after we know that the precedence of the
current operator is greater. If it is not greater then the source is
not saying to combine the operators.

Because of the need to compare operators and not just precedences the
algorithm passes both into recursive calls of itself. If your language
does not need to check for invalid operator combinations you won't
need to pass the operators in - though you might want to keep them if
they help with error reports.


Right-to-Left Operations
========================

While the above handles operators of equal precedence left-to-right I
wanted to allow for some of them to be processed right-to-left. I had
read somewhere that this could be done by masking the lowest bit on
one side of the comparison and chose to apply that idea as follows.

The precedence of most operators is an even number - i.e. its lowest
bit is zero. That bit gets defined as 1 on the few, if any, which need
right-to-left processing. Then all that is needed is to clear that bit
when calling the expression parser for the RHS of an infix expression.
If the same operator occurs next it will be seen effectively as of
higher precedence than itself. For example,

A ** B ** C ** D

If ** has precedence 9 (an odd number indicating right-to-left
associativity) the parser (because it clears the lowest bit for all
infix operators when calling itself) will feed 8 into each recursive
call. The subsequent ** operator with precedence 9 will then be seen
as of higher precedence than 8 and will be applied first. Thus this
operator will be applied right-to-left and there is no arbitrary limit
as to how far right this can extend. The above will build a node for C
** D first and then work left.
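The masking itself is one line. A sketch of the convention (the
precedence values and names here are illustrative assumptions):

```python
def rhs_prec(prec):
    # Right-associative operators carry an odd precedence. Clearing the
    # low bit for the recursive RHS call makes a following occurrence of
    # the same operator compare as strictly higher, so it binds first.
    return prec & ~1

POW = 9    # '**', odd: right-associative
ADD = 4    # '+', even: left-associative
```

With these values a second ** (precedence 9) beats the 8 passed into
the recursive call, so it groups rightwards, while a second + compares
4 against 4 and is left for the caller, giving the usual left-to-right
grouping.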

The principle of bit masking on the recursive call could be extended
to mask out a number of the bits of precedence (not necessarily all
adjacent) to achieve various effects. I don't have a use for that as
yet, but because the language I have to compile defines operators in
groups or families I am thinking of defining precedences in hexadecimal
as two digits where the leftmost is the family and the rightmost the
precedence of a specific operator within that family.


Repetition
==========

It took me a while to work out how the algorithm should proceed after
processing postfix and infix operators which had higher precedence
than the one the parser was called with. In the end I realised that
it had to get an expression as the first action and then to repeat the
second part of the parse. So the essential structure of the parser
became very simple:

Build a node for an operand
While there is an operator of higher precedence (postfix or infix)
    Build a node combining the existing node and that operator
Return the expression node

With this scheme the parser gets an operand (which could be a
subexpression) then repeatedly applies itself to pairs of

(<infix operator>, <subexpression>)

until it hits an infix or postfix operator which has a precedence
which is not higher than the precedence it was called with or reaches
something that marks the end of the expression. The end of the
expression effectively has precedence 0 so that anything to the left
is resolved first.

In terms of building a tree the algorithm structure allows the initial
operand or subexpression to be resolved to a node and then, on each
iteration of the loop, applies either a postfix or an infix operator
to that node, making a new node.

Since the new node thus created includes references to its subnodes
the algorithm has no need to build any stacks or other data structures
to represent unparsed parts of the expression. The only structure it
builds is the resolved expression subtree.


Caveats
=======

There are points of caution I should mention.

I have not shown any real operator precedences but some specific ones
are key to correct operation of the algorithm. I will mention them
here.

As stated, all operators must have precedences which are greater than
zero. This is so that the initial call to the routine only returns
once the whole expression has been parsed.

Some choices were made so that as well as handling mathematical
expressions a single algorithm could be suitable for the arguments to
function calls. Consider

f(a, b || c, d)

The algorithm as it stands can parse the arguments automatically if
comma is defined as an infix operator (normally of very low
precedence) but this does mean that a comma will also be accepted in a
mathematical expression. You may not want that.

C uses a comma operator this way and it may be fine for your language
but if not then you will need to recognise the presence of a comma at
some point. You could do this in the tree returned by the parser (it
will be an infix node) or by having a separate, possibly nested,
parser, or by adding a boolean parameter to each call to say whether a
comma was acceptable or not.
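For instance, if comma is parsed as a low-precedence, left-associative
infix operator, an argument list comes back as a left-leaning tree of
comma nodes, which is easy to flatten after parsing (a hypothetical
sketch; the tuple tree shape is an assumption):

```python
def flatten_args(tree):
    # "a, b, c" parsed left-to-right yields (',', (',', 'a', 'b'), 'c');
    # walking the left spine recovers the arguments in source order.
    if isinstance(tree, tuple) and tree[0] == ',':
        return flatten_args(tree[1]) + [tree[2]]
    return [tree]
```

A single argument is simply an expression node, so flatten_args returns
it as a one-element list.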

Similarly, consider these function calls.

f()
f(,2)

In both of these cases a subexpression is absent. So that there is
just one expression parser, the algorithm accepts such not-present
subexpressions. For general use you will also need to test for them.
For example, in contexts where not-present expressions are not
allowed, you could check for empty nodes when walking the tree.

There are alternatives such as having multiple parsers or passing in
an argument such as suggested for comma, above.

To combine common processing and keep the code short a left
parenthesis following an expression is treated as an infix operator.
It must therefore be defined in the infix operators table. You would
normally define it as having a high precedence so that it binds
tightly to what is on its left but you can choose how to prioritise
it. Contrast these

a + b(c)
a.b(c)

In the first case a tree for "b(c)" would be built first. In the
second you might want the "a.b" to be converted to a tree first, then
the "(c)" applied. You can choose what to do by defining the precedence
of the left parenthesis.

Finally, there are some things you might want to add, depending on the
source language to be compiled. All I can think of at the moment is:
if your language allows assignments to be embedded in expressions,
check that they are handled properly and, if you use square brackets
for array references, add them in alongside the parentheses.


Summary
=======

Unless I have missed something the parser in the original post seems
to be a good solution for when a hand-written expression parser is
wanted. It is fast, short, reasonably understandable, flexible and
should be fairly easy to write in any language. The precedences are
defined in data and therefore easy to modify. Lastly, as the parser
has two main contexts I have been able to use it to generate some
helpful error messages telling the programmer what was expected when
an error is encountered.

James

BartC

Jan 18, 2013, 9:22:28 PM
"James Harris" <james.h...@gmail.com> wrote in message
news:ef3b1364-0379-48c9...@g6g2000vbk.googlegroups.com...

> Things get more difficult when looking at prefix and postfix operators
> either side of the same value. I tried loads of real-life examples to
> try to work out the order the prefix and postfix operators had to be
> applied. In the end I found it helped to represent a subexpression as
> a letter and to represent prefix and postfix operators with the
> numbers of their precedences. So, for example, 6v4 would indicate a
> prefix operator of precedence 6, a subexpression v, and a postfix
> operator of precedence 4.

...

> Given the above, consider the following which has two prefix operators
> and one postfix operator.
>
> 37v4 becomes 3(7v)4 then becomes 3((7v)4)
>
> Here the prefix 7 is applied first as its priority is higher than or
> equal to that of the postfix 4. Once that has been done we are left
> with 3(7v)4 and the postfix 4 has higher priority than the prefix 3 so
> it is applied next.

I'm sure your algorithms for parsing such operators are fine.

But, in terms of language design, the problem might be having to keep these
algorithms in mind whenever anyone is writing expressions in the language. I
know I would just have to use parentheses everywhere because I would never
be able to remember.

Actually I've never come across varying precedences for operators that have
a single operand, especially when one might have a lower precedence than a
binary operator much later in the expression (or even, apparently, much
earlier).

(This is not to say the natural precedence order is that obvious: I've had
to do some tests to determine what I use. So if a, b, c etc represent unary
operators, and X is a closed sub-expression**, then:

a b c X d e

is executed as (a(b(c((X d)e)))). In other words, from the inside out, but
doing the ones on the right first. Even that is complex enough to have to
use parentheses from time to time!

(** By closed sub-expression I mean one with elements not connected with
ordinary binary operators. So X+Y is two sub-expressions, but (X+Y) is one.
Constructs such as X[Y] or X.Y are not considered binary ops so are
closed.))

--
Bartc

Dmitry A. Kazakov

Jan 19, 2013, 3:53:41 AM
On Sat, 19 Jan 2013 02:22:28 -0000, BartC wrote:

> Actually I've never come across varying precedences for operators that have
> a single operand, especially when one might have a lower precedence than a
>binary operator much later in the expression (or even, apparently, much
> earlier).

- 2**3

Employee.Salary++

> (** By closed sub-expression I mean one with elements not connected with
> ordinary binary operators. So X+Y is two sub-expressions, but (X+Y) is one.

A trivial observation: + has two arguments, () has one. Both are
operations; when parsed, their arguments become associated with them. "Entities
must not be multiplied beyond necessity."

> Constructs such as X[Y] or X.Y are not considered binary ops so are
> closed.))

See how much simpler association priorities are than this mess.

Why is + an ordinary infix operation while . is not? Anyway, your concept
leaks because of

(X).Y

Furthermore, brackets have association priorities when in the infix
context, e.g. X(Y). Because:

X+Y(Z) vs. X.Y(Z)

There are:

1. ordering and aggregate brackets:

(12) {1,3,4}

2. index and function calls brackets:

f(12) a[5]

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

James Harris

Jan 19, 2013, 8:05:10 AM
On Jan 19, 2:22 am, "BartC" <b...@freeuk.com> wrote:
> "James Harris" <james.harri...@gmail.com> wrote in message
>
> news:ef3b1364-0379-48c9...@g6g2000vbk.googlegroups.com...
>
> > Things get more difficult when looking at prefix and postfix operators
> > either side of the same value. I tried loads of real-life examples to
> > try to work out the order the prefix and postfix operators had to be
> > applied. In the end I found it helped to represent a subexpression as
> > a letter and to represent prefix and postfix operators with the
> > numbers of their precedences. So, for example, 6v4 would indicate a
> > prefix operator of precedence 6, a subexpression v, and a postfix
> > operator of precedence 4.
>
> ...
>
> > Given the above, consider the following which has two prefix operators
> > and one postfix operator.
>
> >  37v4 becomes 3(7v)4 then becomes 3((7v)4)
>
> > Here the prefix 7 is applied first as its priority is higher than or
> > equal to that of the postfix 4. Once that has been done we are left
> > with 3(7v)4 and the postfix 4 has higher priority than the prefix 3 so
> > it is applied next.
>
> I'm sure your algorithms for parsing such operators are fine.
>
> But, in terms of language design, the problem might be having to keep these
> algorithms in mind whenever anyone is writing expressions in the language. I
> know I would just have to use parentheses everywhere because I would never
> be able to remember.

It's a good point. No one wants aspects of a language to be difficult
to understand or remember. However, a uniform algorithm does not
require the language be complex.

I believe it to be generally a good technique to separate mechanism
from policy. The algorithm is intended to supply the mechanism. The
language designer should then be able to use precedences to set the
policy. So just having a flexible mechanism available doesn't require
that the language using it be complex.

> Actually I've never come across varying precedences for operators that have
> a single operand, especially when one might have a lower precedence than a
> binary operator much later in the expression (or even, apparently, much
> earlier).

Languages vary but I think of a boolean "not" as having quite low
precedence so:

not a * a > 20

Here the unary prefix "not" (in a hypothetical language) is a low-
precedence boolean and applies to what follows including the ">" which
has higher precedence.

Like you I cannot think of an example of a low-precedence postfix
operator (but that doesn't mean the algorithm shouldn't cope with it).
There may be a related example. If you accept dot as an operator (I
know you resolve it in the syntax) and @ as a postfix operator:

a.b.c@

Here, the infix dots have to resolve before the postfix @. This can be
done naturally if the precedence of @ is less than or equal to the
precedence of dot. (Currently, at least, I have them equal and both
have the highest precedence.)

> (This is not to say the natural precedence order is that obvious: I've had
> to do some tests to determine what I use. So if a, b, c etc represent unary
> operators, and X is a closed sub-expression**, then:
>
>  a b c X d e
>
> is executed as (a(b(c((X d)e)))). In other words, from the inside out, but
> doing the ones on the right first. Even that is complex enough to have to
> use parentheses from time to time!

Agreed. I only have one postfix operator and it has equal-highest
precedence so only applies after such constructs as the dotted
expression above.

Although few languages have postfix operators I think the exact-same
principle applies also to the left-hand-side of infix operators so it
is relevant. Although there are no low- or mid-precedence postfix
operators there are various precedences of infix operators and we seem
to cope with them without thinking about it. For example,

a - b * c ** d + e

> (** By closed sub-expression I mean one with elements not connected with
> ordinary binary operators. So X+Y is two sub-expressions, but (X+Y) is one.
> Constructs such as X[Y] or X.Y are not considered binary ops so are
> closed.))

Understood. If you had a parser which treated a dot between two
operands as an operator would you consider defining the infix dot as
such or is there a particular advantage to resolving a dot separately?

James

BartC

Jan 19, 2013, 8:07:56 AM


"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
news:1uucscucas6ab$.17jhioudt01bn.dlg@40tude.net...
> On Sat, 19 Jan 2013 02:22:28 -0000, BartC wrote:
>
>> Actually I've never come across varying precedences for operators that
>> have
>> a single operand, especially when one might have a lower precedence than
>> a
>>binary operator much later in the expression (or even, apparently, much
>> earlier).
>
> - 2**3

Yes, that's been mentioned recently. I think I'll leave it as a 'quirk'.

> Employee.Salary++

Employee.Salary forms a term by itself.

>> (** By closed sub-expression I mean one with elements not connected with
>> ordinary binary operators. So X+Y is two sub-expressions, but (X+Y) is
>> one.
>
> A trivial observation: + has two arguments, () has one. Both are
> operations,

I don't consider () an operation at all (how would you implement such an
operation, other than just returning the value of its operand unchanged?).
() is syntax you use so that whatever mess is inside is treated as a single
term.

> when parsed they become their arguments associated. "Entities
> must not be multiplied beyond necessity."
>
>> Constructs such as X[Y] or X.Y are not considered binary ops so are
>> closed.))
>
> See, how simpler association priorities are than this mess.
>
> Why + is an ordinary infix operation and . is not?

Again, how would you write a function to do "."? If you try and do DOT(X,Y),
you quickly run into problems defining exactly what X is, and what Y is, and
whether you should allow DOT(4,5). So I don't consider it an operation like
ADD(X,Y) might be.

Where it *can* get tricky is in -X^.Y, because ^ is an ordinary postfix
operator, and there might be undue pressure to evaluate it as (-(X^)).Y,
which is usually meaningless. (However I treat this as would be expected,
as -((X^).Y); I can do this because "." isn't a binary op.)

> Anyway your concept
> leaks because of
>
> (X).Y

No problem; "." binds to any preceding term, in this case (X).

> Furthermore, brackets have association priorities when in the infix
> context, e.g. X(Y). Because:
>
> X+Y(Z) vs. X.Y(Z)

Y(Z) usually binds tightly. But so does X.Y. However, since X.(Y(Z)) is
generally meaningless (depending on the language, of course), the latter is
parsed as (X.Y)(Z).

The thing is, whatever scheme is chosen, some expressions will always be a
bit of a mess. There will be some examples that you have to think twice
about, and that you will need to wrap in parentheses.

James Harris

unread,
Jan 19, 2013, 8:13:28 AM1/19/13
to
On Jan 19, 8:53 am, "Dmitry A. Kazakov" <mail...@dmitry-kazakov.de>
wrote:

...

> A trivial observation: + has two arguments, () has one.

In this context I took left paren as an infix operator. It binds
function and the like to args. Right paren here is just a delimiter.
Are you saying that is wrong? Left and right paren may be separated, so
they do not form a single operator.

> Both are
> operations, when parsed they become their arguments associated. "Entities
> must not be multiplied beyond necessity."
>
> > Constructs such as X[Y] or X.Y are not considered binary ops so are
> > closed.))
>
> See, how simpler association priorities are than this mess.

A harsh remark!

James

BartC

unread,
Jan 19, 2013, 9:02:30 AM1/19/13
to
"James Harris" <james.h...@gmail.com> wrote in message
news:d14b6efd-a28b-4abf...@f25g2000vby.googlegroups.com...
> On Jan 19, 2:22 am, "BartC" <b...@freeuk.com> wrote:

>> I'm sure your algorithms for parsing such operators are fine.

(I've just looked at mine and it's a mess! It doesn't do right-to-left
precedences properly, except for :=, while op:= operators (such as+:=) need
parentheses. I'll have to rewrite it I think. But not now..)

> Languages vary but I think of a boolean "not" as having quite low
> precedence so:
>
> not a * a > 20
>
> Here the unary prefix "not" (in a hypothetical language) is a low-
> precedence boolean and applies to what follows including the ">" which
> has higher precedence.

'Not' is a good example. I already treat it, with logical 'and' and 'or',
specially in some contexts (in a conditional expression used for program
flow). But they can also be inside ordinary expressions; so it's not so easy
to just treat 'not' as syntax for example, so that I don't need to fit it
into the priority scheme.
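
A low-precedence prefix operator such as 'not' falls out naturally from the precedence-climbing scheme discussed in this thread: when the parser sees a prefix operator, it recurses with that operator's own (low) precedence. A minimal Python sketch; the token set and precedence numbers here are invented purely for illustration:

```python
# Minimal precedence climbing with a low-precedence prefix "not":
# in "not a * a > 20" the whole comparison is parsed into the operand
# of "not", because both * and > outrank it.
import re

PREFIX = {'not': 3, '-': 9}            # prefix operator -> its precedence
INFIX  = {'>': 4, '+': 5, '*': 6}      # infix operator -> its precedence

def tokenize(s):
    return re.findall(r'\w+|[><+*()-]', s)

def parse(tokens, min_prec=0):
    tok = tokens.pop(0)
    if tok in PREFIX:
        lhs = ('prefix', tok, parse(tokens, PREFIX[tok]))
    elif tok == '(':
        lhs = parse(tokens, 0)
        tokens.pop(0)                  # consume ')'
    else:
        lhs = tok                      # identifier or literal
    while tokens and tokens[0] in INFIX and INFIX[tokens[0]] > min_prec:
        op = tokens.pop(0)
        lhs = (op, lhs, parse(tokens, INFIX[op]))
    return lhs

print(parse(tokenize('not a * a > 20')))
# -> ('prefix', 'not', ('>', ('*', 'a', 'a'), '20'))
```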

>> (** By closed sub-expression I mean one with elements not connected with
>> ordinary binary operators. So X+Y is two sub-expressions, but (X+Y) is
>> one.
>> Constructs such as X[Y] or X.Y are not considered binary ops so are
>> closed.))
>
> Understood. If you had a parser which treated a dot between two
> operands as an operator would you consider defining the infix dot as
> such or is there a particular advantage to resolving a dot separately?

You mean just for parsing purposes? It's probably workable. But I must have
found it easier to have dedicated parsing code for some constructs, rather
than trying to fit them in to a generalised expression parser.

(In my case "." is heavily used for many things: runtime properties such as
x.len; compile-time attributes such as x.minval; special indexing such as
x.[i] (to index non-array objects). These all map to the own syntax nodes
and could need special handling. It would be messy to fit them into a
generic expression parser. Although I suppose the parser can just handle the
general 'shape' and the specialisation can be moved to a later pass.)

--
Bartc

Dmitry A. Kazakov

unread,
Jan 19, 2013, 9:05:45 AM1/19/13
to
On Sat, 19 Jan 2013 13:07:56 -0000, BartC wrote:

> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
> news:1uucscucas6ab$.17jhioudt01bn.dlg@40tude.net...
>> On Sat, 19 Jan 2013 02:22:28 -0000, BartC wrote:
>>
>>> Actually I've never come across varying precedences for operators that have
>>> a single operand, especially when one might have a lower precedence than a
>>>binary operator much later in the expression (or even, apparently, much
>>> earlier).
>>
>> - 2**3
>
> Yes, that's been mentioned recently. I think I'll leave it as a 'quirk'.

https://www.youtube.com/watch?v=EYPapE-3FRw

The first minute summarizes it all, but it is worth listening to in full,
of course.

>> Employee.Salary++
>
> Employee.Salary forms a term by itself.

Get_Employee (DB, "Mike").Salary++

>>> (** By closed sub-expression I mean one with elements not connected with
>>> ordinary binary operators. So X+Y is two sub-expressions, but (X+Y) is
>>> one.
>>
>> A trivial observation: + has two arguments, () has one. Both are
>> operations,
>
> I don't consider () an operation at all (how would you implement such an
> operation, other than just returning the value of its operand unchanged?).

From mathematics:

[1.3] may mean 1.0 (truncation)

From Ada

(1,2,3) means array of three elements

> () is syntax you use so that whatever mess is inside is treated as a single
> term.

An idempotent operation is still an operation. X+Y is parsed as

"+"
|_ X
|_ Y

(X+Y) parses as

"()"
|_"+"
|_ X
|_ Y

It is up to the language semantics to decide what "()" means.

>> when parsed they become their arguments associated. "Entities
>> must not be multiplied beyond necessity."
>>
>>> Constructs such as X[Y] or X.Y are not considered binary ops so are
>>> closed.))
>>
>> See, how simpler association priorities are than this mess.
>>
>> Why + is an ordinary infix operation and . is not?
>
> Again, how would you write a function to do "."?

Where is the problem?

You should not mix syntax with semantics. Whatever X.Y may mean, that is
not the parser's business. It should simply spit out "."(X,Y) and let the
semantic analysis interpret that.

Practically any OO language lets user-defined "." in the form of "method"
or "virtual" functions.

The technique of table-driven parsers allows complete separation of parsing
from later analysis. You can have it reentrant, parsing absolutely
unrelated languages, languages as ugly as C++, all without parser code
modifications.

> If you try and do DOT(X,Y),
> you quickly run into problems defining exactly what X is, and what Y is, and
> whether you should allow DOT(4,5).

Why not? Actually, I once wrote a language where 4 . 5 would mean a
bit-substring of the binary representation of 4 starting from the 5th bit.
[I am not proud of it, but anyway.]

>> Anyway your concept
>> leaks because of
>>
>> (X).Y
>
> No problem; "." binds to any preceding term, in this case (X).

But (X) is not a term. Terms are "(", "X", ")".

>> Furthermore, brackets have association priorities when in the infix
>> context, e.g. X(Y). Because:
>>
>> X+Y(Z) vs. X.Y(Z)
>
> Y(Z) usually bind tightly. But so do X.Y. However, since X.(Y(Z)) is
> generally meaningless (depending on language of course) then the latter is
> parsed as (X.Y)(Z)

X(Y).Z

> The thing is, whatever scheme is chosen, some expressions will always be a
> bit of a mess.

Nope; as Feynman said, if the theory is wrong, it is a wrong theory.

> There will be some examples that you have to think twice
> about, and that you will need to wrap in parentheses.

That would be a language design fault to me. Either the meaning of the
expression should be obvious to almost everyone or else it should be
illegal. The worst examples of permissive language design are C and PL/1.

Dmitry A. Kazakov

unread,
Jan 19, 2013, 9:20:30 AM1/19/13
to
On Sat, 19 Jan 2013 05:13:28 -0800 (PST), James Harris wrote:

> On Jan 19, 8:53 am, "Dmitry A. Kazakov" <mail...@dmitry-kazakov.de>
> wrote:
>
> ...
>
>> A trivial observation: + has two arguments, () has one.
>
> In this context I took left paren as an infix operator. It binds
> function and the like to args. Right paren here is just a delimiter.
> Are you saying that is wrong?

He meant the grouping (), whose "(" is in the prefix context. The "(" of a
function call is in the infix context.

> Left and right paren may be separated so
> not form a single operator.

Yes, "()" is an "operator." "(]" could be another "operator," e.g. if the
language allowed interval notations like (1.0,6.55].

BTW, the beauty of the approach is that left and right brackets may be
same. We could have |-3|, with "||" meaning absolute value, as in
mathematics.

BartC

unread,
Jan 19, 2013, 11:33:57 AM1/19/13
to
"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
news:covytol1fgze.17p2i3jg2kfhv$.dlg@40tude.net...
> On Sat, 19 Jan 2013 13:07:56 -0000, BartC wrote:
>> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message

>>> Employee.Salary++
>>
>> Employee.Salary forms a term by itself.
>
> Get_Employee (DB, "Mike").Salary++

So? This is still a term. Except that Get_Employee needs to return an lvalue
to make it work.

>> Again, how would you write a function to do "."?
>
> Where is the problem?
>
> You should not mix syntax with semantics. Whatever X.Y may mean, that is
> not the parser's business. He should simply spit out "."(X,Y) and let the
> semantic analysis to interpret that.

As I suggested in another post, perhaps the parser can just concentrate on
the 'shape' of the input, and leave it until a later pass to decide what it
all means.

But you can take this too far: you can perhaps create an entire parser in a
few dozen lines of code based on prioritised terms and operators, but all
that happens is that you end up with a new pass which now has to do all the
work! You've just shifted the problem elsewhere.

If you want to first obtain an abstract syntax tree, and you use a dedicated
node for "." in that tree which is different for the one for "binary
operator", then it's natural to make this distinction in the parser.

> The technique of table-driven parsers allows complete separation of
> parsing
> from later analysis. You can have it reentrant, parsing absolutely
> unrelated languages, languages as ugly as C++, all without parser code
> modifications.

Since parsers are not that difficult to write (even if a few bits are
tricky), I use a dedicated parser for each language I use. Since they are
all based on the previous one, I'd have to go back to around 1980 to find
out where I went wrong..

>> If you try and do DOT(X,Y),
>> you quickly run into problems defining exactly what X is, and what Y is,
>> and
>> whether you should allow DOT(4,5).
>
> Why not?

OK, it's not impossible, but dot-selection is different enough to deserve
special treatment. And that includes treating it differently in the syntax
too, otherwise you're going to have to sort out input such as
x++.++y later instead of nipping it in the bud.

> Actually, I wrote once a language where 4 . 5 would mean a
> bit-substring of the binary representation of 4 starting from the 5th bit.
> [I am not proud of it, but anyway.]

(I can write that now, except that it needs to be written as (4).[5], which
means bit 5 of the representation of 4, so is really an indexing operation.
The dot is used to emphasise this is indexing a single object rather than an
array, but it otherwise optional. So it's 4[5] (which looks bad). However
that's another use for 'dot' which doesn't fit into the concept of
operator.)
>>> Anyway your concept
>>> leaks because of
>>>
>>> (X).Y
>>
>> No problem; "." binds to any preceding term, in this case (X).
>
> But (X) is not a term. Terms are "(", "X", ")".
>
>>> Furthermore, brackets have association priorities when in the infix
>>> context, e.g. X(Y). Because:
>>>
>>> X+Y(Z) vs. X.Y(Z)
>>
>> Y(Z) usually bind tightly. But so do X.Y. However, since X.(Y(Z)) is
>> generally meaningless (depending on language of course) then the latter
>> is
>> parsed as (X.Y)(Z)
>
> X(Y).Z

None of these is a problem: you need some harder examples! This code gives
the following syntax tree after the first pass:

type r = record (x,y,z)
external function y(var)

x+y(z)
x.y(z)
x(y).z

-1 Sunit
-----1 Binopc add (+)
---------1 Name 'x'
---------2 Call
-------------1 Name 'y'
-------------2 Name 'z'
-----1 Call
---------1 Dot
-------------1 Name 'x'
-------------2 Name 'y'
---------2 Name 'z'
-----1 Dot
---------1 Call
-------------1 Name 'x'
-------------2 Name 'y'
---------2 Name 'z'

>> There will be some examples that you have to think twice
>> about, and that you will need to wrap in parentheses.

> That would be a language design fault to me. Either the meaning of the
> expression should be obvious to almost everyone or else it should be
> illegal. The worst examples of permissive language design are C and PL/1.

You can write indecipherable expressions in any language, completely
legally.

--
bartc

Dmitry A. Kazakov

unread,
Jan 19, 2013, 12:36:02 PM1/19/13
to
On Sat, 19 Jan 2013 16:33:57 -0000, BartC wrote:

> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
> news:covytol1fgze.17p2i3jg2kfhv$.dlg@40tude.net...
>> On Sat, 19 Jan 2013 13:07:56 -0000, BartC wrote:
>>> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
>
>>>> Employee.Salary++
>>>
>>> Employee.Salary forms a term by itself.
>>
>> Get_Employee (DB, "Mike").Salary++
>
> So? This is still a term.

It does not make sense. A term is something which cannot be split into
smaller parts, which is clearly not the case here.

> Except that Get_Employee needs to return an lvalue
> to make it work.

Get_Employee returns an object of record type with the component Salary
from which the operation "." extracts the component object to which "++" is
applied:

"++"
|_ "."
|_Get_Employee
| |_ DB
| |_ "Mike"
|_Salary

How is the whole subtree a term?

> But you can take this too far: you can perhaps create an entire parser in a
> few dozen lines of code based on prioritised terms and operators, but all
> that happens is that you end up with a new pass which now has to do all the
> work! You've just shifted the problem elsewhere.

The parser must postpone it because it cannot resolve "." in the example
with Get_Employee, since it does not know the type of the result before
semantic analysis. The parser knows nothing about types.

> If you want to first obtain an abstract syntax tree, and you use a dedicated
> node for "." in that tree which is different for the one for "binary
> operator", then it's natural to make this distinction in the parser.

No, I don't want it because there is no difference in the way these
operations are handled.

>> The technique of table-driven parsers allows complete separation of parsing
>> from later analysis. You can have it reentrant, parsing absolutely
>> unrelated languages, languages as ugly as C++, all without parser code
>> modifications.
>
> Since parsers are not that difficult to write (even if a few bits are
> tricky),

Why write something new (which needs tests, debugging, deployment,
maintenance, etc.) when something existing can be reused instead?

And, again, table-driven parsers are reentrant. In an automation system
there are typically a dozen very ugly domain-specific languages (like
ASAP2) to translate in the same system. A shared table-driven parser is a
big advantage there. I presume it would be for GCC as well.

>>> If you try and do DOT(X,Y),
>>> you quickly run into problems defining exactly what X is, and what Y is,
>>> and whether you should allow DOT(4,5).
>>
>> Why not?
>
> OK, it's not impossible, but dot-selection is different enough to deserve
> special treatment. And that includes treating it differently in the syntax
> too, otherwise you're going to have to sort out input such as
> x++.++y later instead of nipping it in the bud.

Provided "." means "member selection", then the second argument must be an
identifier. That is precisely the job of semantics analysis to tell such
things.

>>>> Furthermore, brackets have association priorities when in the infix
>>>> context, e.g. X(Y). Because:
>>>>
>>>> X+Y(Z) vs. X.Y(Z)
>>>
>>> Y(Z) usually bind tightly. But so do X.Y. However, since X.(Y(Z)) is
>>> generally meaningless (depending on language of course) then the latter
>>> is parsed as (X.Y)(Z)
>>
>> X(Y).Z
>
> None of these is a problem: you need some harder examples! This code gives
> the following syntax tree after the first pass:

The example debunks the point you made about X.Y(Z), that "." should
precede "()". This is wrong for X(Y).Z where "()" precedes ".". The proper
association rules are

"." > "("
"(" < "."

Left and right priorities are different.
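
This point about distinct left and right priorities is exactly how Pratt-style parsers encode associativity: each operator carries a left binding power (tested against the current minimum) and a right binding power (recursed with). A small sketch; the binding-power numbers are made up for illustration:

```python
# Distinct left/right binding powers: a right-associative operator like
# ** recurses with a right power *lower* than its left power, so
# 2 ** 3 ** 2 groups as 2 ** (3 ** 2), while + groups left to right.
BP = {'+': (1, 2), '**': (5, 4)}    # op -> (left bp, right bp)

def parse(tokens, min_bp=0):
    lhs = tokens.pop(0)
    while tokens and BP[tokens[0]][0] > min_bp:
        op = tokens.pop(0)
        lhs = (op, lhs, parse(tokens, BP[op][1]))
    return lhs

print(parse('2 ** 3 ** 2'.split()))   # ('**', '2', ('**', '3', '2'))
print(parse('1 + 2 + 3'.split()))     # ('+', ('+', '1', '2'), '3')
```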

> type r = record (x,y,z)
> external function y(var)
>
> x+y(z)
> x.y(z)
> x(y).z
>
> -1 Sunit
> -----1 Binopc add (+)
> ---------1 Name 'x'
> ---------2 Call
> -------------1 Name 'y'
> -------------2 Name 'z'
> -----1 Call
> ---------1 Dot
> -------------1 Name 'x'
> -------------2 Name 'y'
> ---------2 Name 'z'
> -----1 Dot
> ---------1 Call
> -------------1 Name 'x'
> -------------2 Name 'y'
> ---------2 Name 'z'

See, you have Dot in the syntax tree, so all the special treatment in the
parser was in vain: the dot leaked into the semantic analysis. A
parser that considers "." a plain binary operation would produce the same
tree without any extra coding.

>>> There will be some examples that you have to think twice
>>> about, and that you will need to wrap in parentheses.
>
>> That would be a language design fault to me. Either the meaning of the
>> expression should be obvious to almost everyone or else it should be
>> illegal. The worst examples of permissive language design are C and PL/1.
>
> You can write indecipherable expressions in any language, completely
> legally.

Not quite. Of course you can still use unreadable identifiers, obscure
function calls, etc. The point is that in a properly designed language
adding parentheses could not make an indecipherable expression more
decipherable.

Dmitry A. Kazakov

unread,
Jan 19, 2013, 12:38:32 PM1/19/13
to
On Sat, 19 Jan 2013 18:36:02 +0100, Dmitry A. Kazakov wrote:

> "." > "("
> "(" < "."

Rather:

"." > "("
"(" > "."

BartC

unread,
Jan 19, 2013, 1:27:43 PM1/19/13
to


"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
news:mspcvq1vanez.c...@40tude.net...
> On Sat, 19 Jan 2013 16:33:57 -0000, BartC wrote:
>
>> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
>> news:covytol1fgze.17p2i3jg2kfhv$.dlg@40tude.net...
>>> On Sat, 19 Jan 2013 13:07:56 -0000, BartC wrote:
>>>> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
>>
>>>>> Employee.Salary++
>>>>
>>>> Employee.Salary forms a term by itself.
>>>
>>> Get_Employee (DB, "Mike").Salary++
>>
>> So? This is still a term.
>
> It does not make sense. A term is something which cannot be split into
> smaller parts, which is clearly not the case here.

Perhaps I use Term to mean something different from what you do. Maybe
Operand then, but that doesn't quite express the fact that the above is
parsed as:

(Get_Employee (DB, "Mike").Salary)++

>> Except that Get_Employee needs to return an lvalue
>> to make it work.
>
> Get_Employee returns an object of record type with the component Salary
> from which the operation "." extracts the component object to which "++"
> is
> applied:

(I'm not sure that will work; what happens to the employee record after the
salary has been incremented? It can't return the actual record, only some
sort of reference.)

> "++"
> |_ "."
> |_Get_Employee
> | |_ DB
> | |_ "Mike"
> |_Salary
>
> How is the whole subtree a term?

By being slightly more indented than the "++". Perhaps 'subtree' is the term
I'm looking for ...


>> OK, it's not impossible, but dot-selection is different enough to deserve
>> special treatment. And that includes treating it differently in the
>> syntax
>> too, otherwise you're going to be having to sorting out input such as
>> x++.++y later instead of nipping it in the bud.
>
> Provided "." means "member selection", then the second argument must be an
> identifier.

"." can mean all sorts of things (in my syntax). You can lump it in with the
mathematical operators, but it's probably not a good idea.

> The example debunks the point you made about X.Y(Z), that "." should
> precede "()". This is wrong for X(Y).Z where "()" precedes ".". The proper
> association rules are

> "." > "("
> "(" < "."

Actually both those examples I just parse from left to right; table-driven
operator priorities don't come into it:

X.Y(Z) Dot first then Call
X(Y).Z Call first then Dot

Seems to work!
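
The left-to-right treatment described here can be sketched as a small postfix loop (illustrative code, not Bart's actual implementation): after parsing a primary term, keep consuming "." and "(" in whatever order they appear, wrapping the node each time.

```python
# After a primary term, '.' (member select) and '(' (call) are consumed
# strictly left to right, so X.Y(Z) -> Call(Dot(X,Y), Z) and
# X(Y).Z -> Dot(Call(X,Y), Z). No precedence table needed.
import re

def toks(s):
    return re.findall(r'\w+|[().]', s)

def parse_term(tokens):
    node = tokens.pop(0)                       # identifier
    while tokens and tokens[0] in ('.', '('):
        if tokens.pop(0) == '.':
            node = ('Dot', node, tokens.pop(0))
        else:
            node = ('Call', node, parse_term(tokens))
            tokens.pop(0)                      # consume ')'
    return node

print(parse_term(toks('X.Y(Z)')))   # ('Call', ('Dot', 'X', 'Y'), 'Z')
print(parse_term(toks('X(Y).Z')))   # ('Dot', ('Call', 'X', 'Y'), 'Z')
```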

>> -----1 Dot
>> ---------1 Call
>> -------------1 Name 'x'
>> -------------2 Name 'y'
>> ---------2 Name 'z'
>
> See, you have Dot in the syntax tree, so all special treatment in the
> parser was just in vain, the dot leaked into the semantic analysis. A
> parser that considers "." a plain binary operation would produce the same
> tree without any extra coding.

Yeah, maybe. But since Dot is very different from Binopc, I then have to
make the distinction, less conveniently, in the next pass.

(My next pass after parsing is to do with name resolution, where Dot figures
heavily, as it's also used to qualify names in different namespaces, as well
as for member selection of a record. Other uses of Dot were separated out in
the parser. If, at this point, all those different guises were hidden away
amongst the dozens of ordinary operators, then it would get very untidy
indeed!

Also there are subtle differences in the way priority is handled between Dot
and other binary ops, with the purpose of making things work, for the most
part, as might be expected.)

--
Bartc

Dmitry A. Kazakov

unread,
Jan 19, 2013, 3:33:02 PM1/19/13
to
On Sat, 19 Jan 2013 18:27:43 -0000, BartC wrote:

> Perhaps I use Term to mean something different from what you do. Maybe
> Operand then, but that doesn't quite express the fact that the above is
> parsed as:
>
> (Get_Employee (DB, "Mike").Salary)++

Of course it does not. It must be parsed the way the language requires it
to be parsed.

>>> Except that Get_Employee needs to return an lvalue
>>> to make it work.
>>
>> Get_Employee returns an object of record type with the component Salary
>> from which the operation "." extracts the component object to which "++"
>> is applied:
>
> (I'm not sure that will work; what happens to the employee record after the
> salary has been incremented? It can't return the actual record, but some
> sort of reference.)

It can return Salary by reference, or it can deploy a more complex scheme
where a fat reference to the Employee record is returned with an index
referring to Salary. It can even start a transaction attached to that
reference, committed when the reference is destructed. And all this is not
the parser's business.

>> "++"
>>|_ "."
>> |_Get_Employee
>> | |_ DB
>> | |_ "Mike"
>> |_Salary
>>
>> How is the whole subtree a term?
>
> By being slightly more indented than the "++". Perhaps 'subtree' is the term
> I'm looking for ...

Rather a [sub]expression

A [sub]tree is a representation of the expression useful for semantic
analysis.

>>> OK, it's not impossible, but dot-selection is different enough to deserve
>>> special treatment. And that includes treating it differently in the syntax
>>> too, otherwise you're going to be having to sorting out input such as
>>> x++.++y later instead of nipping it in the bud.
>>
>> Provided "." means "member selection", then the second argument must be an
>> identifier.
>
> "." can mean all sorts of things (in my syntax). You can lump it in with the
> mathemetical operators, but it's probably not a good idea.

Though in some languages "." is indeed overloaded. E.g. Ada uses "." for
the purpose C++ uses "::" for (namespaces). Not a big deal.

>> The example debunks the point you made about X.Y(Z), that "." should
>> precede "()". This is wrong for X(Y).Z where "()" precedes ".". The proper
>> association rules are
>
>> "." > "("
>> "(" < "."
>
> Actually both those examples I just parse from left to right;

Hmm, right-to-left parsers are not very common. I read somewhere about how
to convert left-to-right to right-to-left; I vaguely remember that it is
not trivial.

> table-driven
> operator prioritities don't come into it:
>
> X.Y(Z) Dot first then Call
> X(Y).Z Call first then Dot
>
> Seems to work!

I never doubted that! A hand-written parser can parse anything that can be
parsed. (:-))

>>> -----1 Dot
>>> ---------1 Call
>>> -------------1 Name 'x'
>>> -------------2 Name 'y'
>>> ---------2 Name 'z'
>>
>> See, you have Dot in the syntax tree, so all special treatment in the
>> parser was just in vain, the dot leaked into the semantic analysis. A
>> parser that considers "." a plain binary operation would produce the same
>> tree without any extra coding.
>
> Yeah, maybe. But since Dot is very different from Binopc,

What is different?

(The point is that syntactically it is not different)

> I then have to make the distinction, less conveniently, in the next pass.

Since my parser treats it as an operation, I can declare it
pseudo-commutative, which has the effect of fusing nodes in the tree:

A.B.C.D

generates straight

"."
|_ A
|_ B
|_ C
|_ D

rather than

"."
|_"."
| |_ "."
| | |_ A
| | |_ B
| |_ C
|_ D
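
The fusion described above needs only a small tweak in the loop that builds the tree: if the left-hand node is already a "." node, append to it instead of nesting. An illustrative Python fragment (names invented):

```python
# "Pseudo-commutative" fusion: instead of nesting, append to an existing
# "." node, so A.B.C.D becomes one n-ary node ('.', 'A', 'B', 'C', 'D').
def parse_dots(tokens):
    node = tokens.pop(0)
    while tokens and tokens[0] == '.':
        tokens.pop(0)
        rhs = tokens.pop(0)
        if isinstance(node, tuple) and node[0] == '.':
            node = node + (rhs,)       # fuse into the existing "." node
        else:
            node = ('.', node, rhs)
    return node

print(parse_dots('A . B . C . D'.split()))   # ('.', 'A', 'B', 'C', 'D')
```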

> (My next pass after parsing is to do with name resolution, where Dot figures
> heavily, as it's also used to qualify names in different namespaces, as well
> as for member selection of a record. Other uses of Dot were separated out in
> the parser. If, at this point, all those different guises were hidden away
> amongst the dozens of ordinary operators, then it would get very untidy
> indeed!

No difference. The parser places lexical tokens at the tree branches. "."
is such a token. It is not an identifier. So if the language does not
overload the operator "." with any operation other than "member selection",
there is nothing to resolve about it.

Again, because the trees are just same there cannot be anything special in
".".

BartC

unread,
Jan 19, 2013, 4:35:40 PM1/19/13
to
"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
news:1o72c17frijid.4pmf1qrxjhtf$.dlg@40tude.net...
> On Sat, 19 Jan 2013 18:27:43 -0000, BartC wrote:

>> Yeah, maybe. But since Dot is very different from Binopc,
>
> What is different?
>
> (The point is that syntactically it is not different)

But in what way is it the same? In that it separates one part of an
expression from another? So does a comma, or semicolon; you can probably
treat these as operators too. (And yes I can use semicolons inside an
expression.)

If the purpose is just to verify the syntax of some source code, then you
might just do that. But without a bit of special treatment for certain
combinations of tokens, more errors can make it through to a later pass.
(For example, if "." can only be followed by an identifier.)

While pretending that opening brackets are operators glosses over the fact
that they have to be balanced in a way that proper operators don't have to
be.

>> I then have to make the distinction, less conveniently, in the next pass.
>
> Since in my parser treats it an operation, I can declare it
> pseudo-commutative, which has the effect of fusion on the tree:
>
> A.B.C.D
>
> generates straight
>
> "."
> |_ A
> |_ B
> |_ C
> |_ D
>
> rather than
>
> "."
> |_"."
> | |_ "."
> | | |_ A
> | | |_ B
> | |_ C
> |_ D


If I wrote A.B.C.D in my syntax, it could be that each of those dots means
something different! I don't know if linearising the subtree would help.

(Eg. A.B.C.D.E could mean E(A::B.C'D) using a selection of alternative
forms. Attempting a linear representation such as (Binop_dot A B C D E)
is probably not too helpful.)

--
Bartc

Dmitry A. Kazakov

unread,
Jan 19, 2013, 5:21:49 PM1/19/13
to
On Sat, 19 Jan 2013 21:35:40 -0000, BartC wrote:

> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
> news:1o72c17frijid.4pmf1qrxjhtf$.dlg@40tude.net...
>> On Sat, 19 Jan 2013 18:27:43 -0000, BartC wrote:
>
>>> Yeah, maybe. But since Dot is very different from Binopc,
>>
>> What is different?
>>
>> (The point is that syntactically it is not different)
>
> But in what way is it the same? In that it separates one part of an
> expression from another?

It does not separate, it binds to the left and right arguments as any infix
operator does.

> So does a comma, or semicolon; you can probably
> treat these as operators too. (And yes I can use semicolons inside an
> expression.)

No, a comma is a delimiter, not an operator (except in broken C).

> If the purpose is just to verify the syntax of some source code, then you
> might just do that. But without a bit of special treatment for certain
> combinations of tokens, more errors can make it through to a later pass.
> (For example, if "." can only be followed by an identifier.)

Which is a desired effect, because otherwise you could stop syntax analysis
prematurely, not reporting really serious syntax errors. E.g.

A.+B

I prefer it reporting "missing operand" rather than the totally misleading
"record field cannot be an expression."

A."what?

"Missing right quote"

"Identifier" is a constraint whose violation should not be a syntax
error. It would make the language syntax less regular and harder to
understand. Technically you can stuff a lot of semantics into the syntax,
but that is a wrong way to do it, IMO.

> While pretending that opening brackets are operators glosses over the fact
> that they have to be balanced in a way that proper operators don't have to
> be.

Brackets are not treated as operators. But they do have association
priorities.

>>> I then have to make the distinction, less conveniently, in the next pass.
>>
>> Since in my parser treats it an operation, I can declare it
>> pseudo-commutative, which has the effect of fusion on the tree:
>>
>> A.B.C.D
>>
>> generates straight
>>
>> "."
>> |_ A
>> |_ B
>> |_ C
>> |_ D
>>
>> rather than
>>
>> "."
>> |_"."
>> | |_ "."
>> | | |_ A
>> | | |_ B
>> | |_ C
>> |_ D
>
>
> If I wrote A.B.C.D in my syntax, it could be that each of those dots means
> something different!

You claimed that "." means something special. Now it turns out to be
overloaded. Fine, just another reason to treat it as a plain operator!

> I don't know if linearising the subtree would help.

It does when matching of the sequence is left to right, which is the case
for any of the dot meanings you wrote about. First you look up A, then B
in A, and so on.

BartC

unread,
Jan 19, 2013, 6:12:18 PM1/19/13
to
"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
news:6zgm1tw0xfc$.16vjstn2jmw1u.dlg@40tude.net...
It was special *because* it was overloaded in so many ways!

--
Bartc

James Harris

unread,
Jan 19, 2013, 6:37:59 PM1/19/13
to
On Jan 19, 9:35 pm, "BartC" <b...@freeuk.com> wrote:

...

> If I wrote A.B.C.D in my syntax, it could be that each of those dots means
> something different! I don't know if linearising the subtree would help.
>
> (Eg. A.B.C.D.E could mean E(A::B.C'D) using a selection of alternative
> forms.

On your point about the dots meaning different things I presume you
mean that their specific meanings depend on the operand types - so one
might be a namespace dereference, another a field selection etc.

Whether that's the case or not could you not regard them all as
indicating either "subordinacy" or perhaps "directed binding" - i.e.
they simply bind their left and right operands/expressions? Then they
could be all parsed in the same way and you could assign more specific
meanings in your semantic analysis. That would be natural anyway
because that's where the types are determined.

Compare the different potential meanings of the plus sign in

A+B+C+D+E

Here, like the dot above, + might mean a number of things. Probably
number addition or string concatenation or maybe something else
depending on its operands. The point is that it's the same operator
symbol but has different meanings and we seem to cope with that
easily. Why not do the same for dot?
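
The analogy can be made concrete: the parser emits one generic node, and a later pass (or the runtime) picks the meaning from the operand types. A toy evaluator, purely illustrative:

```python
# One '+' node, several meanings: the choice is made from operand
# types during a later pass, not by the parser.
def eval_plus(a, b):
    if isinstance(a, str) and isinstance(b, str):
        return a + b                      # string concatenation
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a + b                      # numeric addition
    raise TypeError("no meaning for '+' on these operand types")

assert eval_plus(2, 3) == 5
assert eval_plus('ab', 'cd') == 'abcd'
```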

James

BartC

unread,
Jan 19, 2013, 8:13:28 PM1/19/13
to


"James Harris" <james.h...@gmail.com> wrote in message
news:aeb944d3-4726-4e2e...@i1g2000vbp.googlegroups.com...
> On Jan 19, 9:35 pm, "BartC" <b...@freeuk.com> wrote:

>> (Eg. A.B.C.D.E could mean E(A::B.C'D) using a selection of alternative
>> forms.
>
> On your point about the dots meaning different things I presume you
> mean that their specific meanings depend on the operand types - so one
> might be a namespace dereference, another a field selection etc.
>
> Whether that's the case or not could you not regard them all as
> indicating either "subordinacy" or perhaps "directed binding" - i.e.
> they simply bind their left and right operands/expressions? Then they
> could be all parsed in the same way and you could assign more specific
> meanings in your semantic analysis. That would be natural anyway
> because that's where the types are determined.

Possibly. But one small problem is I don't have a semantic analysis pass!
It's sort of shared between the passes that do exist (for my main language,
that would be parsing, name-resolution, very basic type-analysis, and code
generation).

The thing is, there are lots of ways to approach this; they will all no
doubt work. I just happen to find that introducing the extra
structure/variation early on is helpful.

> Compare the different potential meanings of the plus sign in
>
> A+B+C+D+E
>
> Here, like the dot above, + might mean a number of things. Probably
> number addition or string concatenation or maybe something else
> depending on its operands. The point is that it's the same operator
> symbol but has different meanings and we seem to cope with that
> easily. Why not do the same for dot?

With +, what it means doesn't need to be resolved until much later (in fact,
not until runtime with my dynamic language). So writing a program consisting
solely of:

a+b+c+d+e

compiles (but gives a runtime error). But writing:

a.b.c.d.e

gives a compile error in the name-resolving pass ('no field called b').

The differences between the various + symbols are largely to do with type.
Between the various dot symbols, it can be structural as well as type. I
suppose if Dot only means field selection, then it would be closer to a
binary operator (but the handling is still somewhat different IMO).
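The timing difference Bart describes can be made concrete with a small
sketch (the dispatch function and the record-layout table below are
invented for illustration): '+' need not be resolved until the operand
values exist at run time, while a dot must resolve against a known
layout during the name-resolution pass.

```python
# Runtime dispatch for '+': nothing can be checked until the values exist,
# so 'a+b+c+d+e' compiles and only fails when evaluated on bad operands.
def plus(x, y):
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return x + y          # numeric addition
    if isinstance(x, str) and isinstance(y, str):
        return x + y          # string concatenation
    raise TypeError("no '+' for these operand types")   # a *runtime* error

# Compile-time check for '.': the field name must resolve against a
# known record layout, so 'a.b' can be rejected before the program runs.
RECORD_FIELDS = {"point": {"x", "y"}}   # made-up layout table

def resolve_field(record_type, field):
    if field not in RECORD_FIELDS[record_type]:
        raise NameError("no field called " + repr(field))  # a *compile-time* error
    return field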

--
Bartc

Dmitry A. Kazakov

unread,
Jan 20, 2013, 4:09:01 AM1/20/13
to
So are all other operators. The alphabet is limited, which puts pressure
on both the character set and the human ability to memorize symbols.