Here's a (much) reduced grammar demonstrating the issue (the constant
and identifier non-terminals are omitted, but I can provide the
complete grammar and lexer input if that would help):
%right UNARYOP;
%left BINARYOP1;
%left BINARYOP2;
tuple
-> expression
| '(' expression (',' expression)* ')'
;
expression
-> %prec(BINARYOP1) expression '*' expression
| %prec(BINARYOP1) expression '/' expression
| %prec(BINARYOP2) expression '+' expression
| %prec(BINARYOP2) expression '-' expression
| %prec(BINARYOP2) expression 'CONCAT' expression
| %prec(UNARYOP) '+' expression
| %prec(UNARYOP) '-' expression
| '(' expression ')'
| constant
| identifier
;
Naturally, this results in an ambiguity between a tuple with a single
expression, and a parenthesized expression:
tuple -> '(' expression ')'
expression -> '(' expression ')'
(PyGgy reports two ambiguities, both shift/reduce)
Unfortunately, such a construct appears in several places throughout
the dialect, so I figured before going any further I'd better learn how
to deal with it. One example of the use of the construct is:
INSERT INTO <table> VALUES <tuple>
I can't seem to solve the ambiguity using explicit precedence or
preference. For example, the following gives the same two ambiguities
(looking at the debug output):
%pref TUPLE GROUP;
%right UNARYOP;
%left BINARYOP1;
%left BINARYOP2;
tuple
-> expression
| %prec(TUPLE) '(' expression (',' expression)* ')'
;
expression
-> %prec(BINARYOP1) expression '*' expression
| %prec(BINARYOP1) expression '/' expression
| %prec(BINARYOP2) expression '+' expression
| %prec(BINARYOP2) expression '-' expression
| %prec(BINARYOP2) expression 'CONCAT' expression
| %prec(UNARYOP) '+' expression
| %prec(UNARYOP) '-' expression
| %prec(GROUP) '(' expression ')'
| constant
| identifier
;
I also tried using %nonassoc instead of %pref, which gives the same
ambiguity (though I don't think %nonassoc would be correct in this case
anyway?)
I managed to change the debug output to report a single ambiguity
(reduce/reduce) by altering the tuple production to:
tuple
-> expression
| '(' expression ')'
| '(' expression (',' expression)+ ')'
;
But that still doesn't help much and making the precedence explicit (in
the same way as above) doesn't seem to help either (same reduce/reduce
ambiguity). I'm sure there must be a way to make the parser prefer to
reduce a tuple instead of an expression in the case of parentheses at
the top level of the input, but I haven't figured it out yet ... any
ideas?
Thanks,
Dave.
As you describe it, the grammar is ambiguous. How do you tell the
difference between a singleton tuple and a parenthesized expression? As a
human, I cant tell the difference, so I can't really tell you how to make
a machine tell the difference. Can you provide an example of a case where
it is clearly a tuple and another case where it is clearly a parenthesized
expression? Once we sort out what the difference is maybe we can come
up with a grammar that isnt ambiguous.
> Dave.
Tim Newsham
http://www.lava.net/~newsham/
Thanks for the quick response. I think it is possible to tell the
difference between a singleton tuple and a parenthesized expression,
but only within the context of some other operation. I'll try and
expand on my prior example to demonstrate:
The VALUES statement in DB2 allows one to generate a temporary table on
the fly. For example, to generate a table with two rows and three
columns, one could specify:
VALUES (1, 2, 3), (4, 5, 6)
Alternatively, to generate a table with one row and one column
(essentially a single value, although in relational theory this is
simply a "special case" of a table):
VALUES (1)
Hence, in the context of a VALUES statement a parenthesized expression
reduces to a tuple, not an expression. This syntax alone would be
simple; I could specify the VALUES statement unambiguously as:
values
-> VALUES values_row (',' values_row)*
;
values_row
-> '(' expression (',' expression)* ')'
;
In other words, the top-level parentheses in a VALUES statement always
indicate a tuple, never an expression.
Unfortunately, DB2 (for some unfathomable reason, unless it's to drive
grammar writers up the wall :-) also permits the following syntax to
generate a table with one row and one column:
VALUES 1
Furthermore, one can specify a table of three rows and one column with:
VALUES 1, 2, 3
The official syntax diagram for the VALUES statement (from the DB2 Info
Center at http://tinyurl.com/zv9pj) is as follows:
values-clause:
.-,--------------.
V |
|--VALUES----| values-row |-+-----------------------------------|
values-row:
|--+-+-expression-+-----------+---------------------------------|
| '-NULL-------' |
| .-,--------------. |
| V | |
'-(----+-expression-+-+--)-'
'-NULL-------'
(you'll need to switch to a fixed-width font view to render it
correctly)
I naively translated the diagram above (ignoring the NULL alternatives)
into the following grammar specification:
values_clause
-> VALUES values_row (',' values_row)*
;
values_row
-> expression
| '(' expression (',' expression)* ')'
;
Resulting in the ambiguity. I think it's possible to see how a human
can tell the difference in that within the context of a VALUES
statement an expression within *top-level* parentheses is a tuple; it's
only parentheses *below* the top-level that reduce to an expression.
Hence the following parentheses reduce to a tuple:
VALUES (1)
Whereas in the following statement, the top-level parentheses reduce to
a tuple, while the internal parentheses reduce to an expression:
VALUES ((1))
This is why I originally stated that I think it's possible to resolve
the ambiguity, but only within the context of something else.
Unfortunately, I'm at a loss as to how this could be represented in a
PyGgy grammar.
Anyway, many thanks for your assistance in this problem (and indeed for
writing PyGgy in the first place - it's the "cleanest" Python
implementation I've seen thus far). If there's any further information
I can provide, please let me know.
Thanks,
Dave.
Sorry for the delay in replying. Rewriting your grammar in this
way should remove the ambiguity from your grammar. Its only the
case where something could be either a tuple (starting with
a parenthesis) or an expression (potentially starting with
a parenthesis) that should give rise to ambiguity.
> I naively translated the diagram above (ignoring the NULL alternatives)
> into the following grammar specification:
>
> values_clause
> -> VALUES values_row (',' values_row)*
> ;
>
> values_row
> -> expression
> | '(' expression (',' expression)* ')'
> ;
Here the parser cannot disambiguate look at the lookahead
character alone (the parenthesis). Its not until you hit
the comma that it becomes aparent that you're parsing a tuple
rather than an expression. And there is still an ambiguity
in the case of a singleton tuple or a parenthesized expression.
One thing you can do here is to force all tuples to have at
least one comma. That gets rid of the ambiguity. The parsing
engine will still not be able to tell which way its going
when it hits a parenthesis. What will happen is it will parse
both alternatives and drop one of them when it is no longer
viable. Ie. when it hits a comma, the expression parser will
not be able to proceed. If it hits the matching closing parenthesis
without hitting a comma, the tuple parser will not be able to
proceed.
The only minor detail here is that you cant tell the difference
between singleton tuples and parenthesized expressions (what
is "(1)"?). You can fix this in your grammar by picking one
or the other, or you can leave it as ambiguous and pick later
from the multiple parse trees you get back.
> Resulting in the ambiguity. I think it's possible to see how a human
> can tell the difference in that within the context of a VALUES
> statement an expression within *top-level* parentheses is a tuple; it's
> only parentheses *below* the top-level that reduce to an expression.
> Hence the following parentheses reduce to a tuple:
>
> VALUES (1)
This could be a tuple or an expression, really.
> Whereas in the following statement, the top-level parentheses reduce to
> a tuple, while the internal parentheses reduce to an expression:
>
> VALUES ((1))
This could also be a tuple or an expression :) I agree, as a human,
I'm inclined to call it a tuple, but I don't completely rule out
it being a tuple. Therein lies the problem.
> This is why I originally stated that I think it's possible to resolve
> the ambiguity, but only within the context of something else.
> Unfortunately, I'm at a loss as to how this could be represented in a
> PyGgy grammar.
Disallow singletons (you can treat all expressions as singletons
in the context of VALUES if you want) or defer the decision till
after parsing.