[Java] Infinite loop in parser

85 views
Skip to first unread message

Ori Doolman

unread,
Apr 30, 2020, 5:58:03 AM4/30/20
to antlr-discussion

Hi,
I'm using the attached grammar.

The parser is entering an infinite loop and eventually I get java.lang.OutOfMemoryError.

(I also created this as a github issue - https://github.com/antlr/antlr4/issues/2815 ).


When trying to parse the text:
releaseDate GE 2020-04-28T00:00:00
(without double quotes surrounding the date), the parser enters infinite loop.


When I parse the text:
releaseDate "GE 2020-04-28T00:00:00" 
(with double quotes surrounding the date), everything is OK.


I'm using latest antlr-4.8 and Java 11.


The infinite loop is in DefaultErrorStrategy.consumeUntil:


		while (ttype != Token.EOF && !set.contains(ttype) ) {
            //System.out.println("consume during recover LA(1)="+getTokenNames()[input.LA(1)]);
//			recognizer.getInputStream().consume();
            recognizer.consume();
            ttype = recognizer.getInputStream().LA(1);
        }


I'm quite new to Antlr, so it's hard for me to locate the root cause. It seems like a bug,

Can anyone assist in resolving that?


Thank you,

Ori.



Query.txt

John B. Brodie

unread,
Apr 30, 2020, 10:48:43 PM4/30/20
to antlr-di...@googlegroups.com, Ori Doolman

Greetings!

your "2020-04-28T00:00:00" input string is a mal-formed date-time-string because the trailing 'Z' is missing

i am surprised you did not see errors reported from the lexer when the ':' were encountered...

and so the parser tries to discard tokens in order to attempt to re-synchronize with the input stream

but it can not

because DATE_TIME_STRINGS recognize the EMPTY string. so we try continuously try to discard (possibly) erroneous input characters. And DATE_TIME_STRINGS represent ZERO characters that must be discarded, so we discard them. And then DATE_TIME_STRINGS represent ZERO characters that must be discarded, so we discard them. And then DATE_TIME_STRINGS represent ZERO characters ------ and so the infinite loop. I see the loop as a stack overflow rather than out of memory, again maybe check your error reporting.


I modified your DATE_TIME_STRING stuff alot. Making it Parser rules, mostly, rather than Lexer rules.


Also while I was playing with your grammar, I thought that maybe ENCODED_STRING token was being toooo agresive in consuming characters. So I added the LineTerminator and \t characters to its exclusion.


Attached please find the grammar file I used in my tests.

Hope this helps...

   -jbb

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/b59c1d9c-749d-4808-979b-067aafe5f876%40googlegroups.com.
Query.g4

Mike Lischke

unread,
May 1, 2020, 5:02:20 AM5/1/20
to antlr-discussion, Ori Doolman
I wish such questions would be asked on Stackoverflow. I would immediately upvote your answer John!

Ori Doolman

unread,
May 1, 2020, 11:42:07 AM5/1/20
to antlr-discussion
Hi John,
Thank you so much for the detailed answer. I couldn't expect better - both thorough explanation and a provided solution :)
I'm quite new to Antlr (I wasn't the one writing the original grammar and code). 

I have additional questions (hope it is ok)::
  1. After applying your grammar, it still fails to parse list of dates: "releaseDate IN 2020-04-29T20:59:44Z,2020-04-29T20:59:41Z
    line 1:28 token recognition error at: ':'
    line 1:31 token recognition error at: ':'
    line 1:29 mismatched input '59' expecting {<EOF>, AND, OR}

    The grammar looks OK to me .

  2. Missing the 'Z' is just one possibility of providing an illegal input. How can I guarantee that no input would cause my server to hang?
    Can I limit the count of the parser reattempts somehow? I prefer throwing an exception rather than infinite loop, of course.

Some more background: I'm trying to write a URL filter expression for my REST API. Therefore, possible filters can be:
  1. /products?filter=(title EQ car AND name EQ ford)
  2. /products?filter=(releaseDate GE 2020-05-01T03:00:00Z AND title EQ car) OR (type IN bike,boat,airplane)
  3. /products?filter=releaseDate IN 2020-04-29T20:59:44Z,2020-04-29T20:59:41Z


Regards,
Ori.


To unsubscribe from this group and stop receiving emails from it, send an email to antlr-di...@googlegroups.com.

John B. Brodie

unread,
May 1, 2020, 2:06:18 PM5/1/20
to antlr-di...@googlegroups.com

the second date is scanned as an ENCODED_STRING token because the comma is not in its exclusion set.

it is usually a really good idea to dump the token stream in order to be really sure of the Lexer's activity...

To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/8fb8b487-691f-4d2b-a1e0-afad7677fec7%40googlegroups.com.

Ori Doolman

unread,
May 1, 2020, 3:20:11 PM5/1/20
to antlr-discussion
Thanks again, John. It works !!

What about my second question? Any way to limit the parser and avoid infinite loops on illegal syntax?

Regards,
Ori.

John B. Brodie

unread,
May 1, 2020, 6:15:55 PM5/1/20
to antlr-di...@googlegroups.com

the only infinite loop i am aware of is the one you tripped upon --- a mal-formed lexer that wants to match the zero-length input string.

so i guess your best protection against loops is to have a well-formed grammar.

not much help, i suspect...

   -jbb

To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/d2b199a9-12e9-473d-886d-29e8ee4cb193%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages