UnbufferedCharStream IllegalArgumentException: invalid interval help


Ryan Sommers

Jul 14, 2017, 5:35:59 PM
to antlr-discussion
Greetings,

I'm working on a parser that needs to take in very large files. I'm having trouble with one file that throws the following exception after about 2 GB of input has been read (between 62M and 63M lines):

Exception in thread "main" java.lang.IllegalArgumentException: invalid interval
at org.antlr.v4.runtime.UnbufferedCharStream.getText(UnbufferedCharStream.java:325)
at org.antlr.v4.runtime.CommonTokenFactory.create(CommonTokenFactory.java:77)
at org.antlr.v4.runtime.CommonTokenFactory.create(CommonTokenFactory.java:16)
at org.antlr.v4.runtime.Lexer.emit(Lexer.java:245)
at org.antlr.v4.runtime.Lexer.nextToken(Lexer.java:156)
at org.antlr.v4.runtime.UnbufferedTokenStream.fill(UnbufferedTokenStream.java:179)
at org.antlr.v4.runtime.UnbufferedTokenStream.sync(UnbufferedTokenStream.java:164)
at org.antlr.v4.runtime.UnbufferedTokenStream.consume(UnbufferedTokenStream.java:154)
at org.antlr.v4.runtime.Parser.consume(Parser.java:571)
at org.antlr.v4.runtime.Parser.match(Parser.java:203)
at ZoneParseParser.init(ZoneParseParser.java:171)
at MyParse.main(MyParse.java:38)

If I pull off the last million lines leading up to where the error occurs and run them separately, they parse fine. The parser also completes on other test files of about 40M lines, so the problem seems related to the size of the file rather than its content. The error also doesn't happen near the end of the file, which is about 305M lines in total.
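
For reference, the driver is essentially the standard unbuffered-stream setup from the ANTLR 4 reference book. Roughly (simplified, and the lexer class name is assumed to match the parser's):

    import java.io.FileInputStream;
    import org.antlr.v4.runtime.*;

    public class MyParse {
        public static void main(String[] args) throws Exception {
            CharStream input = new UnbufferedCharStream(new FileInputStream(args[0]));
            ZoneParseLexer lexer = new ZoneParseLexer(input);    // generated lexer (name assumed)
            lexer.setTokenFactory(new CommonTokenFactory(true)); // copy token text eagerly; required for unbuffered streams
            TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lexer);
            ZoneParseParser parser = new ZoneParseParser(tokens);
            parser.setBuildParseTree(false);                     // don't keep the whole parse tree in memory
            parser.init();                                       // start rule, as in the trace above
        }
    }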

I'm not much of a Java person, so I was wondering what steps might help me diagnose what's causing this issue in the Java runtime.

Thanks,
Ryan

Jan Mikkelsen

Jul 14, 2017, 8:12:13 PM
to antlr-di...@googlegroups.com
Hi,

It looks like _tokenStartCharIndex in Lexer has overflowed. It is of type “int”, which is consistent with the failure at about 2 GB of input.
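
You can see the kind of wraparound I suspect with a tiny demo:

    public class OverflowDemo {
        public static void main(String[] args) {
            // Integer.MAX_VALUE is 2147483647, i.e. roughly your 2G of input.
            int index = Integer.MAX_VALUE;
            index++;                   // wraps around
            System.out.println(index); // prints -2147483648
        }
    }

A negative start index would produce the kind of invalid interval the exception complains about.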

A correct fix for this would involve a lot of “int” to “long” changes throughout the Java runtime.

The C++ runtime uses ssize_t for the same field, which is 64 bits on modern 64-bit platforms. If you’re able to try a C++ version of your parser, I’d be curious whether you see the same problem.

I think there’s an argument for using int64_t/uint64_t instead of ssize_t/size_t in the C++ runtime to avoid a similar problem on 32-bit C++ platforms.

Jan.


Mike Lischke

Jul 14, 2017, 9:35:42 PM
to antlr-di...@googlegroups.com
uint64_t etc. has been considered, but if you intentionally compile a 32-bit target you usually want everything to be 32-bit. (s)size_t is consistent with the STL and adapts to the target architecture at compile time, hence I settled on that in the C++ runtime.

For Java I think you’ve hit the nail on the head; I believe that runtime should be reworked to use 64-bit integers and also be more specific about signedness.

Jan Mikkelsen

Jul 16, 2017, 7:55:38 PM
to antlr-di...@googlegroups.com
Hi Mike,

Yes, that makes sense. On 32-bit platforms you can also start running out of address space, which makes them less attractive for parsing very large files. There are ways around that, but I suspect it would take more changes to the ANTLR runtime than a typedef on the file offset.

Regards,

Jan.

Sam Harwell

Jul 17, 2017, 8:58:30 AM
to antlr-di...@googlegroups.com

64-bit integers in the token would result in tremendous memory overhead for the 99.999% case, just to support a scenario that really isn’t intended to be supported at all.

The C++ runtime would be better equipped to handle this through the use of type traits to configure the token size at compile time. Some other languages like C# could do something similar in theory, but in practice it becomes quite burdensome even for relatively simple cases.

Sam

Sam Harwell

Jul 17, 2017, 9:10:36 AM
to antlr-di...@googlegroups.com

As a quick clarification on “not supported”:

Due to unbounded lookahead as part of the adaptive prediction algorithm, ANTLR makes no guarantees regarding memory usage with unbuffered streams. As far as memory conservation goes, even a fully buffered char stream would be a valid implementation of UnbufferedCharStream. Any manipulation of the grammar and/or input in order to prevent UnbufferedCharStream from reading to the end of the file relies on implementation details of ANTLR which are subject to change between releases. There are only two ways to force ANTLR to conserve memory in order to parse streaming input:

  1. Break the input into chunks prior to passing them to ANTLR (e.g. break a stream of data into lines and parse one line at a time; see the sketch below)
  2. Override ParserATNSimulator’s prediction algorithm to throw an exception if some application-specific lookahead distance is exceeded

Due to the difficulty of implementing the second item correctly, I would only really expect users to go with the first approach. When the first approach is used, many side benefits are realized:

  1. The worst-case complexity goes from O(n^4) in the length of the unbounded input to O(n^4) in the length of a line. The expected runtime complexity, typically between O(n) and O(n^2), is similarly reduced.
  2. The need for UnbufferedCharStream disappears.
  3. The bounded input size during parsing means issues like the one reported in this thread cannot occur.
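
A minimal sketch of the first approach (the lexer class name and the per-line start rule here are illustrative, not from the actual grammar):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.antlr.v4.runtime.*;

    public class LineByLineParse {
        public static void main(String[] args) throws Exception {
            try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Each parse sees a single line, so indexes stay far below Integer.MAX_VALUE.
                    ZoneParseLexer lexer = new ZoneParseLexer(new ANTLRInputStream(line));
                    ZoneParseParser parser = new ZoneParseParser(new CommonTokenStream(lexer));
                    parser.record(); // hypothetical per-line start rule
                }
            }
        }
    }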

Sam

Mike Lischke

Jul 17, 2017, 10:31:34 AM
to antlr-di...@googlegroups.com
What token count are we talking about here, Sam? 10,000 or 100,000 (which would be a lot for a single parse run)? That's not even 3 MB with the current implementation. Not really tremendous, if you ask me. Or do you have something else in mind? Also keep in mind that with smaller field sizes you probably won't save memory, since the compiler will likely align fields at boundaries bigger than the field size.
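
For a rough sense of scale: the fields that would need widening are essentially the token index plus the start and stop character indexes, so going from 32 to 64 bits adds 12 bytes per token before padding. Even a million tokens would then cost only about 12 MB extra.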

Mike Lischke

Jul 17, 2017, 10:36:09 AM
to antlr-di...@googlegroups.com
I wonder if it would be possible to limit the lookahead at the cost of precision (e.g. ambiguities not resolved when that limit is hit are "resolved" by taking the first alt). With the help of ANTLR IDEs it should be possible to detect such cases in the grammar and help the grammar author to use a different approach, so that the limit is never hit.

Mike

Sam Harwell

Jul 17, 2017, 11:37:02 AM
to antlr-di...@googlegroups.com

I believe the regular test cases I use have around 10,000,000 tokens. More to the point, though, ANTLR 4 wasn’t designed to handle multi-gigabyte input streams, so 32-bit indexes should never be out of range.

The Java and C# runtime libraries would have 4-byte alignment for these fields in practice. Other runtimes may use different values.

Sam