Quoted and unquoted IDs doubt


Andres Solenzal

Mar 20, 2019, 5:12:29 PM
to antlr-discussion
The title may not do justice to what I'm posting, but let's give this a shot. I have created a grammar for a custom query language at my company. Everything was going well until now. I had two term definitions, one for quoted terms and one for safe unquoted terms, like this:

QUOTED_STRING: '"' (ESC | ~["\\])+ '"';


fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;

PARAM: (LOWERCASE | UPPERCASE | DIGITS)+;

Now I was asked to support any character inside PARAM, so I did this:

PARAM: (ESC | ~(' ' | '\t' | '\n'))+;

Since the lexer picks the token that makes the longest match, PARAM will consume everything as long as it contains no whitespace. The most important thing here is that my grammar also has definitions for function-like terms such as this one:

FUNCTION_ONE: 'function_one:';

And that will also be consumed by PARAM.

I have searched for a way to exclude terms like 'function_one:' from the PARAM definition, but without luck.

Is there a way to give function definitions precedence over PARAM?

John B Brodie

Mar 20, 2019, 10:30:01 PM
to antlr-di...@googlegroups.com, Andres Solenzal
Greetings!

I have a short answer, but I also have many worries and doubts regarding
your grammar...

First the short answer:

Put the PARAM rule as the last lexer rule.

ANTLR lexers are greedy: the rule matching the longest input sequence
wins. But when two rules match exactly the same input sequence, the
lexer rule appearing first in the grammar file wins.
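
To make the rule above concrete, here is a minimal Python sketch that imitates it with regular expressions. It is not ANTLR itself: the `next_token` helper and the `RULES` table are hypothetical stand-ins, and the PARAM pattern is simplified (no ESC handling). It shows that rule order only matters when two rules match the same amount of input.

```python
import re

def next_token(rules, text, pos=0):
    """Return (name, lexeme) for the rule with the longest match at pos.

    Ties are broken by rule order: the rule listed first wins.
    """
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        # A strictly longer match replaces the current best; an equal-length
        # match does not, so the earlier rule keeps priority.
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# Hypothetical stand-ins for the lexer rules discussed in this thread.
RULES = [
    ("FUNCTION_ONE", r"function_one:"),
    ("PARAM", r"[^ \t\n]+"),  # simplified: any run of non-whitespace
]

# Same length for both rules: the rule listed first wins.
print(next_token(RULES, "function_one:"))
# PARAM matches more input, so it wins regardless of rule order.
print(next_token(RULES, "function_one:abc"))
```

So reordering helps for an input that is exactly `function_one:`, but not when more non-whitespace characters follow the colon.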

And now my worries:

How should the input sequence `function_one:function_one: ` be interpreted
by your lexer? Is it 2 FUNCTION_ONE tokens and a blank, or is it 1 PARAM
token and a blank?

My question begins to get at the issue of punctuation characters as
operators.

Should the `:` be considered a postfix operator indicating a keyword,
and thus be excluded from consumption by PARAM?

And do you have other operators (+ * % ... whatever) that need to be
excluded from the PARAM rule?

How should the input `one+"two"+three` be interpreted? 5 tokens or 3?

Hope this helps...
-jbb

Andres Solenzal

Mar 21, 2019, 11:08:10 AM
to antlr-discussion
Hi John, 

Thanks for your answer. Putting PARAM at the bottom will not change anything, because no matter what, it will always win with the longest match. To give more context to this question: I have a grammar rule called expression, defined as follows:

expression: FUNCTION argument;

argument: PARAM | QUOTED_STRING;  // Param is the super greedy rule.

FUNCTION: 'calc:'; // note the usage of colon(:) as a delimiter for function name


This way, if I feed it calc:somevalue, the lexer treats the whole statement as a PARAM. The only way it works now is if I exclude the colon (:) from the PARAM definition. That way we get a tree where the function is calc: and the param is somevalue.

But you know bosses: they want somevalue to allow any character without quotation, e.g. some:value, which would make expressions like calc:some:value valid. The most important part is that allowing the colon in the PARAM definition makes PARAM consume the whole statement again. I tried playing with lexical modes, but with no luck, because no matter what lexical mode I'm in, the PARAM token continues with its gluttony.
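
For illustration only, here is a small Python simulation of one way lexical modes could separate the two tokens. Nobody in this thread has confirmed that it works. The assumption is that the greedy PARAM rule exists only in an argument mode entered after a FUNCTION token (roughly what ANTLR 4's `pushMode`/`popMode` lexer commands allow in a lexer grammar). The `MODES` table and the `tokenize` helper are hypothetical, and ESC handling is simplified.

```python
import re

# Hypothetical per-mode rule tables: (token name, pattern, mode to switch to).
# PARAM exists only in the ARG mode, so it cannot swallow the function
# name while the lexer is in the DEFAULT mode.
MODES = {
    "DEFAULT": [
        ("FUNCTION", r"calc:", "ARG"),        # matching FUNCTION enters ARG mode
        ("WS", r"[ \t\n]+", None),            # skipped, stays in DEFAULT
    ],
    "ARG": [
        ("QUOTED_STRING", r'"(?:\\.|[^"\\])+"', "DEFAULT"),
        ("PARAM", r"[^ \t\n]+", "DEFAULT"),   # greedy, colons allowed
    ],
}

def tokenize(text):
    """Longest-match tokenizer that only tries the rules of the current mode."""
    mode, pos, tokens = "DEFAULT", 0, []
    while pos < len(text):
        best = None
        for name, pattern, next_mode in MODES[mode]:
            m = re.compile(pattern).match(text, pos)
            # Keep the longest match; on ties the earlier rule stays.
            if m and m.end() > pos and (best is None or m.end() > best[0]):
                best = (m.end(), name, m.group(), next_mode)
        if best is None:
            raise ValueError(f"no rule matches at position {pos} in mode {mode}")
        end, name, lexeme, next_mode = best
        if name != "WS":
            tokens.append((name, lexeme))
        if next_mode is not None:
            mode = next_mode
        pos = end
    return tokens

print(tokenize('calc:some:value calc:"a b"'))
```

In this simulation, calc:some:value splits into a FUNCTION token `calc:` followed by a PARAM token `some:value`, because only the FUNCTION rule is tried at the start of the input. The sketch assumes the argument follows the function name immediately, with no whitespace in between.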

Hope this makes sense and thanks again.

Andres.
