Grammar feedback for query language.

Andrew Walker

unread,

Jun 4, 2020, 1:59:16 PM6/4/20

to antlr-discussion

Hello, Antlrites! I have been working on formalizing a query language and it's reached the point where I have questions and would like to solicit feedback.

So, first, a brief description. Queries are made up of (optionally groupled) clauses that take the form of

<property>:<operator><term> where the operator can potentially be omitted for a simple equality

or

<property>:[<terms>, <in>, <a>, <set>] where the query is assumed to be an "in" query.

And booleans are symbolized by a '+', and or queries with a ','

The requirement is to allow terms to be bare, providing they do not contain spaces (in which case quoted strings are supported), and that level of inclusivity has led me to parse both properties and terms as "identifier". I will present the grammar as I have it here, and then ask some questions.

grammar QL;

statement
    : statement op=AND statement
    | statement op=OR statement
    | L_PAREN statement R_PAREN
    | comparison
    ;

comparison
    : property=IDENTIFIER SEPARATOR L_BRACKET value=m_value (',' value=m_value )+ R_BRACKET #MultiComp
    | property=IDENTIFIER SEPARATOR op=NE? value=c_value #EqualityComp
    | property=IDENTIFIER SEPARATOR op=( GT | GTE | LT | LTE) value=NUMERIC_LITERAL #NumericComp
    | property=IDENTIFIER SEPARATOR op=( MT | NMT) value=m_value #MatchComp
    ;

c_value
    : STRING_LITERAL
    | BOOLEAN
    | IDENTIFIER
    | NUMERIC_LITERAL
    ;

m_value
    : STRING_LITERAL
    | NUMERIC_LITERAL
    | IDENTIFIER
    ;

WS : [ \t]+ -> skip ;

L_PAREN: '(';
R_PAREN: ')';
SEPARATOR: ':';
L_BRACKET: '[';
R_BRACKET: ']';
NE: '!';
GT: '>';
GTE: '>=';
LT: '<';
LTE: '<=';
MT: '~';
NMT: '!~';
AND: '+';
OR: ',';

BOOLEAN : 'true' | 'false';

NUMERIC_LITERAL
    : [+-]? ( DIGIT | (DIGIT_1_9 DIGIT+))('.' DIGIT+)? ;

STRING_LITERAL
 : '\'' ( ~'\'' | '\\\'' )* '\''
 | '"' ( ~'"' | '\\"')* '"'
 ;

IDENTIFIER
 : ( LITERAL_ESCAPE | ~[ :\r\n'"+,()><=[\]] )+
 ;

fragment LITERAL_ESCAPE
    : '\\'
    (
        '\\'
        '\''
        | '"'
        | '+'
        | ','
        | '('
        | ')'
        | '>'
        | '<'
        | '='
        | '['
        | ']'
        | ':'
    )
    ;

fragment DIGIT : [0-9];
fragment DIGIT_1_9 : [1-9];

Questions:

- I have made the distinction between c_value and m_value to exclude booleans from "in" queries. This seems arbitrary especially since I will already need to check all of the terms in the set to verify they're the same type and might as well ensure that booleans aren't present then. Should I remove this distinction?

- I am unsure if my toplevel rule is correct. Ideally, I would combine multiple and clauses into one for rendering into queries that support it, but currently what I get for something like "a:a + b:b + c:c" is:

+----+

|AND |

+----+

/\

/ \

+----+ \

|AND | \

+----+ \

/\ \

/ \ \

+----+ +----+ +----+

|a:a | |b:b | |c:c |

+----+ +----+ +----+

Rather than

+----+

|AND |

+----+

/|\

/ | \

+----+ +----+ +----+

|a:a | |b:b | |c:c |

+----+ +----+ +----+

Like I might want. Is this something I have to live with or is there a better way?

- Rather than attempt to assert what can and can not be a string escape sequence in the grammar, I have opted to just eat whatever comes between quotes and handle the unescaping later. Where is the best place to do this step? Ideally I would hook the tokens as they emerge to remove their enclosing quotes and unescape them rather than handle them separately in each comparison rule.

- Speaking of the comparison rule, can it/should it be simplified? Really any feedback is welcome, I would love to learn more.

Federico Tomassetti

unread,

Jun 10, 2020, 12:38:31 AM6/10/20

to antlr-discussion

> I have made the distinction between c_value and m_value to exclude booleans from "in" queries

Yes, I would drop that distinction. In general, one thing I am seeing when I review grammars is people trying to do too much in the parsing. I think there is no benefit in the distinction here: it leads to duplicate code and more confusing error messages, while you can easily check that later, after parsing

> Like I might want. Is this something I have to live with or is there a better way?

You can change that here or in a second step, in which you transform the tree. If you want to do it here you can do:

statement (op=AND statement)+

However, what about OR and precedence?

> Rather than attempt to assert what can and can not be a string escape sequence in the grammar, I have opted to just eat whatever comes between quotes and handle the unescaping later. Where is the best place to do this step?

The way I do it is by translating the parse-tree into an AST and then perform validation there

> Speaking of the comparison rule, can it/should it be simplified? Really any feedback is welcome, I would love to learn more.

I would add the expression rule and replace m_value, c_value, and direct references to literals

Andrew Walker

unread,

Jun 10, 2020, 9:09:39 AM6/10/20

to antlr-di...@googlegroups.com

On Wednesday, June 10, 2020 at 12:38:31 AM UTC-4, Federico Tomassetti wrote:

> I have made the distinction between c_value and m_value to exclude booleans from "in" queries

Yes, I would drop that distinction. In general, one thing I am seeing when I review grammars is people trying to do too much in the parsing. I think there is no benefit in the distinction here: it leads to duplicate code and more confusing error messages, while you can easily check that later, after parsing

> Like I might want. Is this something I have to live with or is there a better way?

You can change that here or in a second step, in which you transform the tree. If you want to do it here you can do:

statement (op=AND statement)+

However, what about OR and precedence?

Yes, OR precedence is important, so I'm guessing this is a step best taken care of post-parsing? Is that the general manner of such things?

> Rather than attempt to assert what can and can not be a string escape sequence in the grammar, I have opted to just eat whatever comes between quotes and handle the unescaping later. Where is the best place to do this step?
The way I do it is by translating the parse-tree into an AST and then perform validation there

> Speaking of the comparison rule, can it/should it be simplified? Really any feedback is welcome, I would love to learn more.

I would add the expression rule and replace m_value, c_value, and direct references to literals

Thanks for your feedback! The only things I'm unclear on are the "expression rule" and the "direct references to literals". Do you mean removing the NUMERIC_LITERAL token from the "comparison" rule or something else?

Mike Cargal

unread,

Jun 11, 2020, 8:18:34 AM6/11/20

to antlr-discussion

Not to speak for Federico... but...

I would suggest you definitely handle precedence in the grammar. This ensures that your parse tree accurate reflects how to evaluate your expressions.

So, re: precedence, assuming that you want AND to have higher precedence than OR (the norm), then I think Federico is just suggesting that you don't forget about the OR operator. Your grammar already has AND as a higher priority than OR, so just use the same technique for OR's.

I'm not quite sure what "add the expression rule" is intended to mean unless it's a slip because you have AND, OR, and parentheses as part of your statement rule (normally, these would be considered part of an expression rule.

"direct references to literals", I believe would be something like the ',' in the first line of your comparison rule. They're not really wrong (they work fine), but ANTLR will be forced to create token names for them that won't be easy to use in your code. However, if you define the token, and then reference the token in the parser rule, you'll get nicely named tokens. That comma literal token is the only example of that I see.

It actually conflicts with your OR token definition, both tokens are ','. I suspect ANTLR would just consider it an OR token (but I'm not sure about that). Keep in mind that ANTLR first tokenizes you input stream and then matches your parse rules against the token stream that results. By the time the parser sees anything, tokenization has already been decided.

(As an aside, I'd probably also define tokens for TRUE and FALSE and then a parser rule for boolean. It'll be easier to deal with in your code. As written, you'll have a BOOLEAN token, that you'll have to interrogate the value of to know whether you're dealing with true or false.

Federico Tomassetti

unread,

Jun 11, 2020, 8:57:06 AM6/11/20

to antlr-di...@googlegroups.com

Hi,
regarding the expression rule, I was suggesting to create a rule named
"expression" which could be defined as c_value is defined.
Then in comparison replace all usages of c_value, m_value, and
NUMERIC_LITERAL with expression

Cheers,
Federico

> --
> You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/f134f8ea-7bf6-43a5-a2b8-87ebf6a4ceaeo%40googlegroups.com.

--
Website at https://tomassetti.me
GitHub https://github.com/ftomassetti

Reply all

Reply to author

Forward