I'm going to write a parser for files containing data for industrial milling machines. The special problem with these files is that they can be really large, up to several gigabytes. Thus, buffering the complete character content, or even all tokens, in memory is impossible. On the other hand, these files do not have very much structure; they mainly contain long sequences of literals and arithmetic expressions, albeit with some framing around them, e.g. a sequence of several million statements could be enclosed in a loop.
My current idea is to process these files with ANTLR, but not to build a parse tree. Instead I want to separate the structural information (which is small and surely fits into memory) from the sequential mass data, which will be converted and written to disk files during the parse.
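Independent of ANTLR, the split described here can be sketched in plain Java. The line-oriented format below (with `LOOP`/`ENDLOOP` framing keywords) is purely hypothetical; the point is only that framing stays in a small in-memory structure while mass data is streamed straight to an output file:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Sketch of the structure/mass-data split over a hypothetical line-oriented
// format: framing lines such as "LOOP n" / "ENDLOOP" are kept in memory,
// while everything else is treated as mass data and written to disk at once.
public class StreamingSplit {
    public static List<String> split(Path input, Path massDataOut) throws IOException {
        List<String> framing = new ArrayList<>();          // small, stays in memory
        try (BufferedReader in = Files.newBufferedReader(input);
             BufferedWriter out = Files.newBufferedWriter(massDataOut)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("LOOP") || line.startsWith("ENDLOOP")) {
                    framing.add(line);                     // structural information
                } else {
                    out.write(line);                       // mass data: convert and flush to disk
                    out.newLine();
                }
            }
        }
        return framing;
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("milling", ".dat");
        Path out = Files.createTempFile("massdata", ".dat");
        Files.write(in, List.of("LOOP 3", "X1.5 Y2.0", "X1.6 Y2.1", "X1.7 Y2.2", "ENDLOOP"));
        List<String> framing = split(in, out);
        System.out.println(framing);                        // framing kept in memory
        System.out.println(Files.readAllLines(out).size()); // mass-data lines on disk
    }
}
```

A grammar-driven version would do the same routing from parser actions or a listener, but the memory profile is the same: only the framing is retained.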
I found a post in the old ANTLR 3 mailing list archive about a somewhat similar project: http://antlr.1301665.n2.nabble.com/Parsing-large-files-A-trip-report-td7454456.html Having read this post, I suspect that it might be infeasible or even impossible to process arbitrarily large data streams with ANTLR 4. Has anybody succeeded in a similar undertaking?
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
Thanks a lot for the advice, and sorry for my impatience. Indeed, Terence's excellent book has all the information I was asking for. Having only proceeded to chapter 7.3 so far, I just hadn't noticed. I hope to get my first 'endless file' translator running within a few days now. Fantastic!
kind regards, Martin
--
Hi Martin,
There is no way to guarantee that ANTLR 4 will not examine an entire file when making individual decisions. The ALL(*) parsing algorithm requires that the implementation be able to operate on the entire input as necessary to make accurate decisions. When working with the unbuffered char/token streams, the chance of ANTLR 4 loading the entire contents of a file into memory is reduced, but not eliminated. To ensure that streaming input is successful, you will need to do one of the following:
1. Instead of feeding an entire file to ANTLR, find a way to break the input into smaller segments which do fit in memory, and process those segments with ANTLR one at a time.
2. Create new implementations of CharStream and TokenStream which operate as random-access views of files on disk, with a large in-memory buffer. In essence, this is a buffered char stream that limits the size of the memory buffer used, with the rest of the data located on disk.
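The windowed-buffer mechanism behind option 2 can be sketched in isolation. This is a deliberately simplified illustration, not an implementation of ANTLR's `CharStream` interface: it assumes single-byte characters and only demonstrates reloading a bounded in-memory window when a random access falls outside it:

```java
import java.io.*;
import java.nio.file.*;

// Simplified sketch of option 2: a random-access view of a file on disk backed
// by a bounded in-memory window. A real solution would wrap this logic in
// ANTLR's CharStream/TokenStream interfaces and handle character encoding.
public class WindowedFileView implements Closeable {
    private final RandomAccessFile file;
    private final byte[] window;
    private long windowStart = -1;   // file offset of window[0]; -1 = nothing loaded
    private int windowLen = 0;

    public WindowedFileView(Path path, int windowSize) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "r");
        this.window = new byte[windowSize];
    }

    // Random access by absolute file offset; reloads the window from disk
    // whenever the requested index falls outside the buffered range.
    public char charAt(long index) throws IOException {
        if (windowStart < 0 || index < windowStart || index >= windowStart + windowLen) {
            file.seek(index);
            windowLen = file.read(window);
            windowStart = index;
        }
        return (char) window[(int) (index - windowStart)];
    }

    @Override public void close() throws IOException { file.close(); }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("big", ".txt");
        Files.writeString(p, "0123456789ABCDEF");
        try (WindowedFileView v = new WindowedFileView(p, 4)) { // 4-byte window
            System.out.println(v.charAt(0));   // loads window covering offsets 0..3
            System.out.println(v.charAt(10));  // outside window: reload at offset 10
            System.out.println(v.charAt(2));   // backtracking forces another reload
        }
    }
}
```

Note the worst case the post warns about: if a decision makes the parser look far ahead and then resume near the decision point, a window like this keeps thrashing between disk regions, so the buffer size directly trades memory for repeated I/O.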
The latter sounds like a general solution that could help many people trying to parse very large inputs. I’d be interested in attempting an implementation if I ever find the time, or perhaps evaluating a pull request containing an implementation for inclusion in a future release of the ANTLR 4 runtime.
Thank you,
Sam Harwell
--
The only way to guarantee that ANTLR will not examine every token from the decision token through the EOF symbol is to have a syntax error in the input stream.
The specification for ParserATNSimulator.adaptivePredict does not require that it return as early as possible. Its only requirement is accuracy, which means it is allowed to examine the entire input (which for the streams currently provided with the runtime means the entire input is loaded in memory). In practice, adaptivePredict often returns earlier than that, but that is an implementation detail that you cannot rely on and could freely change in a future release.
Sam
--
Just curious -- are these GCODE files? If so, what dialect?
--
Hi, why is this impossible? Is the computer hardware limited? Can't you get a memory upgrade for the machine without big expense?
Or are you likely to approach terabyte sizes?