ANTLR grammar to map a token to multiple lexer rules


Anil Dasari

Jan 22, 2020, 3:28:49 PM
to antlr-discussion
Hi,

Good morning.

Below is my grammar:

grammar Filter;
// Tokens.
AND: 'AND';
OR: 'OR';
LITERAL: (('a'..'z'| 'A'..'Z'|'\\'|'/'|'_'| '0'..'9' |'.')+);
OBJECT: (('a' ..'z' | 'A' ..'Z' | '\\' | '/' | '_')+);
OPERATOR: '=' | '!=';
WHITESPACE: [ \r\n\t]+ -> skip;
OPEN_CURLY: '(';
CLOSE_CURLY: ')';
fragment TRUE: 'true';
fragment FALSE: 'false';
fragment STRING: '"' (ESC | .)*? '"';
fragment NUMBER: [0-9.]+;
fragment UNICODE: ('\u0000' ..'\u00FF')+;
fragment ESC: '\\"' | '\\\\';
VALUE: (STRING | NUMBER | TRUE | FALSE);

// Rules.
start: expression EOF;
expression
    : LITERAL                             # literal
    | OBJECT OPERATOR VALUE               # comparison
    | expression AND expression           # expAND
    | expression OR expression            # expOR
    | OPEN_CURLY expression CLOSE_CURLY   # nestedComparisons
    ;


Basically, the input text matches both the LITERAL and the OBJECT token. ANTLR takes only the first one, which is causing the problem in my case.

Is there any way to make the lexer match an input against multiple tokens and then apply one of them based on the parser rule?

E.g.:

1. input: textstring1234 -> it should match the LITERAL token, since it matches the parser rule #literal
2. input: test=1234 -> test should match the OBJECT token, since it better matches the #comparison parser rule

I am trying to build a grammar to support the following queries:

1. testing
2. testing AND/OR test=1345
3. testing AND/OR test1="dafsa"
4. test=1234 AND/OR test1="dafsa"
5. nested expressions 

The above grammar splits testing AND test=1234 into the following tokens:

LITERAL ("testing")
AND ("AND")
LITERAL ("test")
OPERATOR ("=")
LITERAL ("1234") 

The expectation is that test should be an OBJECT and 1234 should be a VALUE.


Can you share your thoughts on fixing the grammar, please?

Thanks


John B Brodie

Jan 22, 2020, 4:47:47 PM
to antlr-di...@googlegroups.com, Anil Dasari

Greetings!

Recall that ANTLR lexers are greedy, matching the longest possible input sequence for each token recognized. Further, when two (or more) lexer rules match exactly the same input sequence, ANTLR disambiguates this collision by selecting the lexer rule that appears first in the lexer grammar.
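
As a minimal illustration of those two rules (longest match first, then rule order on a tie), consider this cut-down lexer. The grammar name TieBreakDemo is hypothetical and the character sets are simplified from the original post. The token comments are what the two rules predict, not captured tool output.

```antlr
lexer grammar TieBreakDemo;

AND     : 'AND' ;
// LITERAL can match everything OBJECT can, and it is listed first,
// so on a tie LITERAL wins and OBJECT is never emitted.
LITERAL : [a-zA-Z0-9_./\\]+ ;
OBJECT  : [a-zA-Z_/\\]+ ;
WS      : [ \t\r\n]+ -> skip ;

// "test"    -> LITERAL  (LITERAL and OBJECT both match 4 chars; LITERAL is listed first)
// "AND"     -> AND      (AND and LITERAL both match 3 chars; AND is listed first)
// "ANDROID" -> LITERAL  (7 chars beats the 3 chars AND can match)
```

This is why the lexer cannot consult the parser to choose between LITERAL and OBJECT: the token type is fixed by length and rule order alone.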

Move LITERAL to the end of the lexer grammar (I usually have the WHITESPACE rule at the end as well, but that probably doesn't matter; it is just something I do).

Delete OBJECT as a token. Have all Parser rules recognize LITERAL and then give a semantic constraint upon those instances in the Parse where the extra characters in LITERAL are not permitted (in my opinion, doing this also has the benefit of possibly providing a more meaningful error message in this case).
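
One way the two suggestions above could look, as an untested sketch that keeps the token names from the original post. Assumptions: LITERAL is placed after VALUE so that numbers, strings and true/false win ties, and the check that a field name contains no digits or '.' is left to a listener or visitor on the comparison alternative rather than written here.

```antlr
grammar Filter;

start : expression EOF ;

// LITERAL now serves both as a bare search term and as a field name.
// A semantic check on the field name can run after parsing.
expression
    : LITERAL OPERATOR VALUE                # comparison
    | LITERAL                               # literal
    | expression AND expression             # expAND
    | expression OR expression              # expOR
    | OPEN_CURLY expression CLOSE_CURLY     # nestedComparisons
    ;

AND         : 'AND' ;
OR          : 'OR' ;
OPERATOR    : '=' | '!=' ;
OPEN_CURLY  : '(' ;
CLOSE_CURLY : ')' ;
VALUE       : STRING | NUMBER | TRUE | FALSE ;
LITERAL     : [a-zA-Z0-9_./\\]+ ;          // listed after the rules it overlaps with
WHITESPACE  : [ \t\r\n]+ -> skip ;

fragment TRUE   : 'true' ;
fragment FALSE  : 'false' ;
fragment STRING : '"' (ESC | .)*? '"' ;
fragment NUMBER : [0-9.]+ ;
fragment ESC    : '\\"' | '\\\\' ;
```

Under the longest-match and first-rule tie-break described earlier, testing AND test=1234 should lex as LITERAL, AND, LITERAL, OPERATOR, VALUE. One consequence of this ordering is that a bare search term consisting only of digits, or the word true or false, lexes as VALUE and would not match the literal alternative.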

Note that all of the above is UNTESTED, just my experience with ANTLR.

Hope this helps...

   -jbb

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/5af8e44d-92b6-4d99-a2b5-178c1d4fc53b%40googlegroups.com.

Anil Dasari

Jan 23, 2020, 11:08:44 AM
to antlr-discussion
Hi John and all,

I have modified the grammar to keep it simple.

grammar Filter;

fragment DIGIT : [0-9];
fragment LETTER : [a-zA-Z];
fragment ESC : '\\"' | '\\\\';
fragment TRUE: 'true';
fragment FALSE: 'false';
fragment UNICODE: ('\u0000' ..'\u00FF')+;

//WHITESPACE: ' ' -> skip;
WS: [ \r\n\t]+ -> skip;

AND: 'AND';
OR: 'OR';
OPEN_CURLY: '(';
CLOSE_CURLY: ')';
OPERATOR: '=' | '!=';
STRING: '"' (ESC|.)*? '"';
NUMBER: DIGIT+ ([.,] DIGIT+)?;
FIELD: LETTER+ ([._] LETTER+)?;
VALUE: (STRING | NUMBER | TRUE | FALSE);

TEXT: ~[\\\r\n"];
//ANY: .;


start: expression EOF ;

expression
    : OPEN_CURLY expression CLOSE_CURLY   # nestedExpression
    | expression AND expression           # comparison
    | FIELD OPERATOR VALUE                # simple
    | VALUE                               # literal
    | FIELD                               # literal
    | TEXT                                # literal
    ;

I see that tokens are identified correctly.

The input testing12324 AND test=1234 returns the following tokens:

FIELD ("testing")
NUMBER ("12324")
AND ("AND")
FIELD ("test")
OPERATOR ("=")
NUMBER ("1234")

I am not sure why testing12324 is split into FIELD and NUMBER tokens; it is supposed to be a TEXT token. Shouldn't the lexer split tokens on whitespace? Is there anything I am missing?

Can you point out the issue in the above grammar, please?


Thanks




John B Brodie

Jan 23, 2020, 3:27:32 PM
to antlr-di...@googlegroups.com, Anil Dasari

TEXT matches just 1 input character at a time. Since FIELD and/or NUMBER match multiple characters, the greedy lexer matches those over TEXT. Hope this helps.


Anil Dasari

Jan 23, 2020, 3:50:09 PM
to antlr-discussion
Thanks John, understood.

I changed TEXT to TEXT: ~[ \r\t\n]+; and the tokens are now:

TEXT ("testing12324")
AND ("AND")
TEXT ("test=1234")

I am not sure why test=1234 is treated as a TEXT token; it should split into FIELD, OPERATOR and NUMBER tokens as per the rule order. Is this because of the longest-token rule, if I understand it correctly?

Is there any way to treat a string as a TEXT token only when none of the other listed token rules apply?



John B Brodie

Jan 23, 2020, 4:13:34 PM
to antlr-di...@googlegroups.com, Anil Dasari


On 1/23/20 5:50 PM, Anil Dasari wrote:
> Thanks John, understood.
>
> I changed TEXT to TEXT: ~[ \r\t\n]+; and the tokens are now:
>
> TEXT ("testing12324")
> AND ("AND")
> TEXT ("test=1234")
>
> I am not sure why test=1234 is treated as a TEXT token; it should split into FIELD, OPERATOR and NUMBER tokens as per the rule order. Is this because of the longest-token rule, if I understand it correctly?

The sequence matched by TEXT in this case, test=1234, is longer than the sequence matched by any of the other tokens individually. So TEXT is the longest match, and that is what the greedy lexer reports.

> Is there any way to treat a string as a TEXT token only when none of the other listed token rules apply?

Not that I know of. You need to be very specific about what TEXT should match rather than making it a catch-all rule, in my opinion.
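
One way to make TEXT specific, sketched against the second grammar in this thread and untested: give TEXT an explicit character set that leaves out the operator characters, and list it after FIELD and NUMBER so that ties go to the more specific rule. The set [a-zA-Z0-9_.] is an assumption about which characters a search term may contain.

```antlr
// Only the affected lexer rules are shown; the rest of the grammar stays as posted.
FIELD  : LETTER+ ([._] LETTER+)? ;
NUMBER : DIGIT+ ([.,] DIGIT+)? ;
TEXT   : [a-zA-Z0-9_.]+ ;          // must be listed after FIELD and NUMBER

// Expected tokens for: testing12324 AND test=1234
//   TEXT     ("testing12324")   TEXT matches 12 chars, FIELD only 7
//   AND      ("AND")            AND is listed before FIELD and TEXT
//   FIELD    ("test")           tie with TEXT at 4 chars; FIELD is listed first
//   OPERATOR ("=")
//   NUMBER   ("1234")           tie with TEXT at 4 chars; NUMBER is listed first
```

Note that in the grammar as posted, NUMBER and STRING are listed before VALUE and therefore win every tie against it, so the parser alternative FIELD OPERATOR VALUE would probably need to accept NUMBER and STRING directly (or VALUE would need to become a parser rule).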






Eric Vergnaud

Jan 23, 2020, 7:41:31 PM
to antlr-di...@googlegroups.com
Your TEXT rule only consumes 1 character

Sent from my iPhone

On Jan 24, 2020, at 00:08, Anil Dasari <dasaria...@gmail.com> wrote:



Anil Dasari

Jan 23, 2020, 9:56:11 PM
to antlr-discussion
Hi Eric,

Yes, I changed the TEXT rule, but I am now seeing another problem.

With TEXT: ~[ \r\t\n]+; the tokens are:

TEXT ("testing12324")
AND ("AND")
TEXT ("test=1234")

I expect test=1234 to return FIELD, OPERATOR and NUMBER tokens, but ANTLR picks the longest match, as John mentioned in the thread.

I couldn't use '!=' in a ~ group. TEXT: ~['\r' | '\t' | '\n' | '=' ] (~['\r' | '\t' | '\n' | '=' ])* works for test=1234 but fails for test!=12234. ~['\r' | '\t' | '\n' | '='| '!=' ]+ is an invalid rule.

Do you have any suggestions? Thanks.
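
On the negated set: inside [...] each character is a set member in its own right, so as far as I know the quotes, spaces and | characters in ~['\r' | '\t' | ...] are themselves being excluded, and a set matches one character at a time, so it cannot exclude the two-character sequence != as a unit. A sketch that excludes = and ! individually instead (untested; also excluding the parentheses and the double quote is an assumption, made so that those tokens still split off):

```antlr
// Listed after FIELD and NUMBER so that ties go to the more specific rules.
TEXT : ~[ \t\r\n=!()"]+ ;

// Expected tokens for: test!=12234
//   FIELD    ("test")    TEXT also stops at '!', tie at 4 chars; FIELD is listed first
//   OPERATOR ("!=")
//   NUMBER   ("12234")   tie with TEXT at 5 chars; NUMBER is listed first
```

The trade-off is that a TEXT token can then never contain = or !, which fits the earlier advice to keep TEXT specific rather than a catch-all.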




