Is processing LAAARGE files with ANTLR4 feasible?


Martin Cornelius

Aug 30, 2013, 10:12:30 AM
to antlr-di...@googlegroups.com
Hi there,

I'm going to write a parser for files containing data for industrial milling machines. The special problem with these files is that they can be really large, up to several gigabytes. Thus, buffering the complete character content, or even all tokens, in memory is impossible. On the other hand, these files do not have much structure; they mainly contain long sequences of literals and arithmetic expressions, albeit with some framing around them, e.g. a sequence of several million statements could be enclosed in a loop.

My current idea is to process these files with ANTLR, but not to build a parse tree. Instead I want to separate the structural information (which is small and surely fits into memory) from the sequential mass data, which will be converted and written to disk files during the parse.

I found a post in the old ANTLR 3 mailing list archive about a somewhat similar project: http://antlr.1301665.n2.nabble.com/Parsing-large-files-A-trip-report-td7454456.html Having read this post, I suspect that it might be infeasible or even impossible to process arbitrarily large data streams with ANTLR4. Has anybody succeeded in a similar undertaking?

Thanks in advance for any hints!


 

Mike Lischke

Aug 30, 2013, 10:30:29 AM
to antlr-di...@googlegroups.com

> I'm going to write a parser for files containing data for industrial milling machines. The special problem with these files is that they can be really large, up to several gigabytes. Thus, buffering the complete character content, or even all tokens, in memory is impossible. On the other hand, these files do not have much structure; they mainly contain long sequences of literals and arithmetic expressions, albeit with some framing around them, e.g. a sequence of several million statements could be enclosed in a loop.
>
> My current idea is to process these files with ANTLR, but not to build a parse tree. Instead I want to separate the structural information (which is small and surely fits into memory) from the sequential mass data, which will be converted and written to disk files during the parse.
>
> I found a post in the old ANTLR 3 mailing list archive about a somewhat similar project: http://antlr.1301665.n2.nabble.com/Parsing-large-files-A-trip-report-td7454456.html Having read this post, I suspect that it might be infeasible or even impossible to process arbitrarily large data streams with ANTLR4. Has anybody succeeded in a similar undertaking?


I'm no expert with ANTLR, but from my visits to the lexer code (the C target, for that matter) I don't see why the lexer *has* to load all content upfront. I can imagine that with a different lexer it could be possible to feed the parser on the fly. But maybe I'm overlooking something.

Terence Parr

Aug 30, 2013, 1:07:30 PM
to antlr-di...@googlegroups.com
Not sure it made it into the book, but I expressly allow unbuffered char and token streams.

Ter





--
Dictation in use. Please excuse homophones, malapropisms, and nonsense. 

osca...@gmail.com

Sep 1, 2013, 4:48:35 PM
to antlr-di...@googlegroups.com
Hello

Look at the book "The Definitive ANTLR 4 Reference", sections 13.3 (page 138) and 13.8 (pages 243-246).

Download the file with code examples for the book and look in the directory code/actions for the example of reading a CSV file with unbuffered streams.

Take into account that with unbuffered reading you can't use the normal syntax to access the tokens (note the $a.start.getText() used to access the value, which perhaps you'll need to store).
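
For reference, a minimal driver in the style of the book's CSV example might look roughly like this (CSVLexer, CSVParser and the file start rule are placeholders taken from that example; adjust the names to your own grammar):

import java.io.FileInputStream;

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.TokenStream;
import org.antlr.v4.runtime.UnbufferedCharStream;
import org.antlr.v4.runtime.UnbufferedTokenStream;

public class UnbufferedDriver {
    public static void main(String[] args) throws Exception {
        CharStream input = new UnbufferedCharStream(new FileInputStream(args[0]));
        CSVLexer lexer = new CSVLexer(input);
        // copy the token text out of the sliding char buffer; otherwise it is
        // gone once the buffer has moved past it
        lexer.setTokenFactory(new CommonTokenFactory(true));
        TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lexer);
        CSVParser parser = new CSVParser(tokens);
        parser.setBuildParseTree(false);  // don't accumulate a tree for a huge input
        parser.file();                    // start rule; do the work in embedded actions
    }
}

The new CommonTokenFactory(true) is what keeps token.getText() working after the character buffer has slid forward; that is also why you grab the values you need (e.g. via $a.start.getText()) while the rule is being matched.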

Greetings,

Oscar

Martin Cornelius

Sep 2, 2013, 2:39:08 AM
to antlr-di...@googlegroups.com
Thanks a lot for the advice, and sorry for my impatience. Indeed, Terence's excellent book has all the information I was asking for. Having only proceeded to chapter 7.3 so far, I just hadn't noticed it yet. I hope to get my first 'endless file' translator running within a few days now - fantastic!

kind regards, Martin



Terence Parr

Sep 2, 2013, 11:43:54 AM
to antlr-di...@googlegroups.com
Cool. :) Could I entice you (or anybody else) to add a review on Amazon?


Doesn't even have to be long. More reviews help!

Thanks,
Ter


On Sun, Sep 1, 2013 at 11:39 PM, Martin Cornelius <martin.c...@datron.de> wrote:
> Thanks a lot for the advice, and sorry for my impatience. Indeed, Terence's excellent book has all the information I was asking for. Having only proceeded to chapter 7.3 so far, I just hadn't noticed it yet. I hope to get my first 'endless file' translator running within a few days now - fantastic!
>
> kind regards, Martin




Sam Harwell

Sep 2, 2013, 11:58:54 PM
to antlr-di...@googlegroups.com

Hi Martin,

 

There is no way to guarantee that ANTLR 4 will not examine an entire file for making individual decisions. The ALL(*) parsing algorithm requires that the implementation be able to operate on the entire input as necessary to make accurate decisions. When working with the unbuffered char/token streams, the chance of ANTLR 4 loading the entire contents of a file into memory is reduced, but not eliminated. To ensure that streaming input is successful, you will need to do one of the following:

 

1. Instead of feeding an entire file to ANTLR, find a way to break the input into smaller segments which do fit in memory, and process those segments with ANTLR one-at-a-time.

2. Create new implementations of CharStream and TokenStream which operate as random-access views of files on disk, with a large in-memory buffer. In essence, this is a buffered char stream that limits the size of the memory buffer used, with the rest of the data located on disk.

 

The latter sounds like a general solution that could help many people trying to parse very large inputs. I’d be interested in attempting an implementation if I ever find the time, or perhaps evaluating a pull request containing an implementation for inclusion in a future release of the ANTLR 4 runtime.

 

Thank you,

Sam Harwell


osca...@gmail.com

Sep 3, 2013, 3:13:03 AM
to antlr-di...@googlegroups.com

Hello Martin

I want to note that using unbuffered mode I needed some workarounds, and that I have problems getting the values of some fragments of a parser rule (field_uso in my case) that has several alternatives, and where one alternative (field_cast_uso) can be made of several pieces. In my concrete case, field_cast_uso is defined to match strings like CAST(0x9D3A048A AS SmallDateTime).

My parser tries to parse lines of a dump of an MS SQL Server database, like

INSERT [dbo].[t_uNoticias] ([IdArea], [Categoria], [CategoriaDes], [Cat2], [IdNota], [Fecha], [Autor], [Orden], [Titulo], [Borrar], [Activo], [IdForo], [IdReal], [IdEncu], [IdGale], [imgG], [imgP], [IdOld], [Hora], [Doctipo], [DocAnexo], [DocSize], [Desarollo], [Resumen], [terminos], [Aux1], [Aux2], [IdiId], [DesaNumPag], [IdiomasId], [IdVersion], [ImgDes], [GrpCab], [auhoraini], [auhorafin], [auactivo], [subtitulo], [fechaUPD], [fechaGEN], [Params], [VideoTipo], [VideoAncho], [VideoAlto], [VideoUrl], [VideoDes], [FechaExp], [Tags], [IdEnciclomedia], [expira], [comentable], [cintillo], [PiePag], [geobloqueo]) VALUES (175, 714, N'Game40 Noticias     ', 4050, 969586, CAST(0x9C4C0000 AS SmallDateTime), 977557, 0, N'Wii Sports Resort y Wii Motion Plus harán más realistas 12 deportes    ', 0, 1, 0, 973846, 0, 0, N'969586_imgg.jpg', N'969586_imgp.jpg', 0, CAST(0x9D3A048A AS SmallDateTime), NULL, NULL, NULL, N'<p>Ambientado en la isla Wuhu.</p>', N'Cientos de millones de usuarios.', NULL, NULL, NULL, 1, 0, 0, 0, NULL, 0, N'00:00:00', N'00:00:00', 0, NULL, CAST(0x9D3A048A AS SmallDateTime), CAST(0x7223003C AS SmallDateTime), N'XXORIG=20090720l40l40g40_2.Tes', 0, 0, N'0         ', NULL, NULL, CAST(0xAB3502D0 AS SmallDateTime), NULL, NULL, 0, 1, NULL, NULL, NULL)

where there are several fields with a format like  CAST(0x7223003C AS SmallDateTime)

I have defined a parser rule field_cast_uso to match it. When using small data files in buffered mode (with a simpler .g4 file, without assignments like d=field_uso in parser rules), I had no problems getting all the tokens in field_cast_uso. But as the dump files were several hundred megabytes, the parser crashed from memory consumption, so I changed to unbuffered mode. And that was when I had problems getting all the "tokens" of a parser rule.
 
What I don't know is how to recover "all the tokens" in field_cast_uso, because in the "most external" parser rule, insertStat, using

 d=field_uso

I just recover the text fragment

CAST

and in the parser rule

field_cast_uso: OPER_CAST OPEN_PAR z=todo_no_avaricioso CLOSE_PAR { System.out.println("todo_no_avaricioso: " + $z.start.getText() ) ; };

in z I just recover the text fragment

0x9D450455



Next is a fragment of my .g4 file:

insertStat:        a=cmd_insert {System.out.println("cmd_insert: " + $a.start.getText() + ", numlinea: " + numlinea);
                               numlinea++;
                              }
                   dbo_con_corchetes_punto
                   '[' tablename       {
                                   // System.out.println("ID: " + $a.start.getText());
                                }
                       ']'
                   '('          {
                                    // System.out.println("OPEN_PAR: " + '(' );
                                    nombrecampo.clear();                                      
                                }
                       '[' b=nombrecampo   {
                                   // System.out.println("ID: " + $a.start.getText());
                                   nombrecampo.add($b.start.getText());
                                }
                           ']'
                       (
                         COMMA  {
                                    // System.out.println("COMMA: " + $a.start.getText());
                                }
                         WS?
                         '['
                             c=nombrecampo {
                                   // System.out.println("ID: " + $a.start.getText());
                                   nombrecampo.add($c.start.getText());
                                }
                             ']'
                       )*
                       ')'      {
                                   // System.out.println("OPEN_PAR: " + ')' );
                                 
                                }
                   VALUES
                   // LISTA_VALORES
                   '('              {
                                       // System.out.println("OPEN_PAR: " + '(' );
                                       i = 0;
                                       // System.out.println("size: " + nombrecampo.size() ) ;
                                      
                                    }
                       d=field_uso    {
                                      
                                       // !! For some unknown reason, which I suppose has to do
                                       // !! with the fact that we use unbuffered mode, the value of the
                                       // !! field captured when the rule is of the field_cast_uso type
                                       // !! only matches the 'CAST' instead of the whole text
                                      
                                       // System.out.println("ncampo: " + i) ;                                    
                                       // i++;
                                       // System.out.print(nombrecampo.get(i++).replace("\n", '\n') + ": " ) ;   

                                       System.out.print( nombrecampo.get(i++) + ": " ) ;
                                       mystring = $d.start.getText();                                    
                                       System.out.println(mystring);
                                    }
                       (
                          COMMA     {
                                       // System.out.println("COMMA: " + $a.start.getText());
                                    }
                          WS?
                          e=field_uso {
                                       // System.out.println("ncampo: " + i) ;                                    
                                       // i++;
                                       // System.out.print(nombrecampo.get(i++).replace("\n", '\n') + ": " ) ;   

                                       System.out.print( nombrecampo.get(i++) + ": " ) ;
                                       mystring = $e.start.getText();                                  
                                       System.out.println(mystring);
                                    }
                       )*
                       ')'          {
                                       // System.out.println("CLOSE_PAR: " + ')' );
                                       System.out.println("-----------------");
                                    }
                   EOL
                   ;

[... some lines deleted...]

dbo_con_corchetes_punto : '[dbo]' '.' ;

longcampoarray      : '(' INT ')' ;

id_con_corchetes    : '[' ID ']' ;

nombrecampo         : ID ;
tablename           : ID ;

field_uso         : INT
                  | field_cast_uso
                  | STRING_TSQL
                  | 'NULL'
                  ;

field_cast_uso: OPER_CAST OPEN_PAR z=todo_no_avaricioso CLOSE_PAR { System.out.println("todo_no_avaricioso: " + $z.start.getText() ) ; };
 
todo_no_avaricioso: .*? ;

OPEN_PAR       : '(' ;
CLOSE_PAR      : ')' ;
cmd_insert     : 'INSERT' ;
OPER_CAST      : 'CAST' ;
VALUES         : 'VALUES' ;

EOL            : '\r'? '\n' ;
COMMA          : ',' ;

INT            : [0-9]+ ;
ID             : [a-zA-Z0-9_]+ ;

fragment OPEN_COMMENT        : '/*' ;
fragment CLOSE_COMMENT       : '*/' ;
fragment TODO_NONGREEDY      : .+?  ;

COMMENT        : OPEN_COMMENT TODO_NONGREEDY CLOSE_COMMENT EOL? -> skip ;

STRING_TSQL    : 'N\'' ('\'\''|~'\'')* '\'' ; // a doubled single quote ('') is an escaped single quote
STRING         : '\'' ('\'\''|~'\'')* '\'' ; // a doubled single quote ('') is an escaped single quote


Does anybody have any recommendation for getting all the values of the fields in this (unbuffered mode) reading?

Thanks a lot in advance,

Oscar

Martin Cornelius

Sep 3, 2013, 4:55:21 AM
to antlr-di...@googlegroups.com
Hi Sam,

While I could imagine going down path 1 (breaking up the file before ANTLR4 processing), this is not really satisfying. To break up the file, I would have to create another kind of parser 'manually', which is exactly the daunting task I hoped to avoid by using a parser generator. Thus, path 2 currently looks much more attractive to me, especially as it is a more general solution. I could ask to be allowed to spend work on this topic (for a future ANTLR release), but I'm not clear about the circumstances, especially:

- what is a 'pull request'?
- would it be mandatory to provide a Java version (not feasible for me), or would .NET be sufficient?

cheers, Martin

Terence Parr

Sep 3, 2013, 12:24:53 PM
to antlr-di...@googlegroups.com
I will add that a quick look at the grammar will often tell you whether ANTLR will ever read to EOF while making a decision. If you have

start : supercomplicatedrule* ;

then it might.

Ter

Martin Cornelius

Sep 4, 2013, 4:14:26 AM
to antlr-di...@googlegroups.com
Hmm....

my knowledge about the internals of LL parsing is not that profound, but perhaps I could easily gain insights by experimenting a little. I'm wondering if there is an easy way to find out the maximum buffer size needed by an ANTLR4 parser for a specific input. I would not hesitate to add some code to the source of the runtime for that purpose, or just set a breakpoint in the debugger and look at some variables. Could anybody give me a hint where to look or what to tweak?

Martin

Terence Parr

Sep 4, 2013, 2:27:47 PM
to antlr-di...@googlegroups.com
A simple way to look at this is that if you have a grammar with a trailing token, then parser prediction will never have to look arbitrarily far into the future. Imagine that your top-level rule matches a single class:

start : 'class' ID '{' … '}' ;

prediction would generally never look past the trailing '}'.

Ter

Sam Harwell

Sep 5, 2013, 12:35:31 AM
to antlr-di...@googlegroups.com

The only way to guarantee that ANTLR will not examine every token from the decision token through the EOF symbol is to have a syntax error in the input stream.

 

The specification for ParserATNSimulator.adaptivePredict does not require that it return as early as possible. Its only requirement is accuracy, which means it is allowed to examine the entire input (which for the streams currently provided with the runtime means the entire input is loaded in memory). In practice, adaptivePredict often returns earlier than that, but that is an implementation detail that you cannot rely on and could freely change in a future release.
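
If you want to see how far prediction actually reached for a particular input (rather than rely on it), one rough way is to subclass the unbuffered token stream and record how far the read position gets ahead of the earliest outstanding mark. This is only a sketch, not part of the runtime, and the number it reports is just an approximation of the token window the stream had to keep buffered:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.TokenSource;
import org.antlr.v4.runtime.UnbufferedTokenStream;

public class MeasuringTokenStream extends UnbufferedTokenStream<CommonToken> {

    // marker handle -> token index at which the mark was placed
    private final Map<Integer, Integer> markedIndexes = new HashMap<Integer, Integer>();
    private int maxWindow = 0;

    public MeasuringTokenStream(TokenSource tokenSource) {
        super(tokenSource);
    }

    @Override
    public int mark() {
        int marker = super.mark();
        markedIndexes.put(marker, index());
        return marker;
    }

    @Override
    public void release(int marker) {
        markedIndexes.remove(marker);
        super.release(marker);
    }

    @Override
    public void consume() {
        super.consume();
        update();
    }

    @Override
    public void seek(int index) {
        super.seek(index);
        update();
    }

    // tokens from the earliest outstanding mark up to the current position
    // cannot be discarded, so that span approximates the buffer that was needed
    private void update() {
        if (markedIndexes.isEmpty()) return;
        int earliest = Collections.min(markedIndexes.values());
        maxWindow = Math.max(maxWindow, index() - earliest + 1);
    }

    public int getMaxWindow() {
        return maxWindow;
    }
}

Use it in place of UnbufferedTokenStream and print getMaxWindow() after the parse to see the largest number of tokens that had to stay alive at once for your input.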

 

Sam


Martin Cornelius

Sep 5, 2013, 2:52:49 AM
to antlr-di...@googlegroups.com
Hi Terence and Sam, just to make sure I got you right: my grammar has a rule like this

loop : 'repeat' INT '(' millingcommand+ ')' ;

A typical input fragment to the parser will look like this:

repeat 4
(
  // sequence of millions of millingcommands
)


The semantics of this fragment: the milling machine shall repeat the given sequence of several million commands 4 times.

As I understand it, the parser will always buffer every character (and/or token?) from the opening '(' to the trailing ')' in this case, even if I use an UnbufferedCharStream and an UnbufferedTokenStream. Correct?

According to Sam's most recent note, it could generally be even worse: the parser might also decide to buffer even more tokens, in the worst case until EOF is seen. Also correct?

Many thanks for taking the time to answer my dumb questions, Martin

phur...@gmail.com

Sep 5, 2013, 10:51:50 AM
to antlr-di...@googlegroups.com
> I'm going to write a parser for files containing data for industrial milling machines.

Just curious -- are these GCODE files? If so, what dialect?

Martin Cornelius

Sep 6, 2013, 3:40:27 AM
to antlr-di...@googlegroups.com, phur...@gmail.com
Yes and no. In the first place, I am crafting a proprietary DSL that combines common features of procedural languages with functionality known from GCODE. Later on, I plan to develop a translator that will convert from GCODE to this language. Which dialect is not yet clear, but I guess in the first shot it will be just 'minimal' GCODE as specified in the German national standard DIN 66025. In later phases, proprietary extensions to GCODE (perhaps Siemens) might be supported - but that's pie in the sky.

Terence Parr

Sep 16, 2013, 5:34:37 PM
to antlr-di...@googlegroups.com
The answer here is to look at the individual milling commands. If each one is lexically or syntactically easy to distinguish from the next, such as by a terminating ';' or '\n', then it will never look very far ahead.

Ter




waldensi...@gmail.com

Sep 24, 2013, 11:52:24 AM
to antlr-di...@googlegroups.com
On Friday, August 30, 2013 10:12:30 AM UTC-4, Martin Cornelius wrote:
> Hi there,
>
> I'm going to write a parser for files containing data for industrial milling machines. The special problem with these files is that they can be really large, up to several gigabytes. Thus, buffering the complete character content or even all tokens in memory is impossible.

Hi, why is this impossible? The computer hardware is limited? Can't get a memory upgrade on the machine without big expense?

Or are you likely to approach terabyte sizes?

Jim Idle

Sep 24, 2013, 9:52:20 PM
to antlr-di...@googlegroups.com
Not all deployments are made to PCs/Linux systems where such quantities of RAM are available. Embedded Linux in industrial systems for instance. 

What we should do is use MappedByteBuffer to read-only map a RandomAccessFile as the input stream. This will also use NIO and be very fast. Ideally this would be the default input stream (it should not break anything), but it is easy to implement it as a separate input stream.

A memory-mapped read/write file could probably be used for the token stream, but I suspect it would be reasonable to provide a discard() method that the user can call within action code to say they have processed everything up to the current input point and will no longer try to access the tokens prior to that. This precludes using the parse tree walker of course, and assumes that the token stream is one that does not ask the lexer for all the tokens at once when it is asked for the first token (unbuffered).

Another option is to use Apache JCS with a disk file backing, but I suspect that could be slow as it is not an ordered list. Then again, if you have trillions of tokens and need to keep them around, that sacrifice in speed would not be too important.

Apache JCS with a backing file is a good option if your program must assemble data about the millions of lines of code in the input all at once (a symbol table or representative object, for instance).

The new input stream is a trivial thing to implement; a mapped token stream would be a little more involved on the face of it.
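
To make the idea concrete, a sketch of such an input stream could look like the one below. It is not part of any ANTLR release; it assumes single-byte characters (ASCII/Latin-1) and a file under 2 GB, because a single MappedByteBuffer cannot span more than Integer.MAX_VALUE bytes (larger files would need several mapped windows):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.IntStream;
import org.antlr.v4.runtime.misc.Interval;

public class MappedFileCharStream implements CharStream {
    private final MappedByteBuffer data;
    private final int size;
    private final String name;
    private int p = 0;  // index of the current character

    public MappedFileCharStream(String fileName) throws IOException {
        this.name = fileName;
        try (RandomAccessFile file = new RandomAccessFile(fileName, "r");
             FileChannel channel = file.getChannel()) {
            this.size = (int) channel.size();
            this.data = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
        }
    }

    @Override public int LA(int i) {
        if (i == 0) return 0;                    // undefined; mirror ANTLRInputStream
        int index = i > 0 ? p + i - 1 : p + i;   // LA(1) is the current char, LA(-1) the previous one
        if (index < 0 || index >= size) return IntStream.EOF;
        return data.get(index) & 0xFF;
    }

    @Override public void consume() {
        if (p >= size) throw new IllegalStateException("cannot consume EOF");
        p++;
    }

    // the whole file stays accessible through the mapping, so marks need no bookkeeping
    @Override public int mark() { return -1; }
    @Override public void release(int marker) { }

    @Override public int index() { return p; }
    @Override public void seek(int index) { p = Math.min(index, size); }
    @Override public int size() { return size; }
    @Override public String getSourceName() { return name; }

    @Override public String getText(Interval interval) {
        int start = interval.a;
        int stop = Math.min(interval.b, size - 1);
        if (stop < start) return "";
        byte[] chunk = new byte[stop - start + 1];
        for (int i = 0; i < chunk.length; i++) chunk[i] = data.get(start + i);
        return new String(chunk, StandardCharsets.ISO_8859_1);
    }
}

Because the mapping is handled by the OS page cache, the Java heap stays small no matter how big the file is; getText() only materialises the characters actually asked for.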

Jim




Martin Cornelius

Sep 25, 2013, 5:01:32 AM
to antlr-di...@googlegroups.com
Actually, my target system is an embedded Windows machine (running Windows 7 Embedded). Of course, RAM is somewhat limited on this system (currently 2GB), and the files we're going to process are several GB. Anyway, I have meanwhile decided to follow Sam Harwell's advice to split the input files in some way before parsing. However, a truly 'streaming' ANTLR parser is still a very interesting thought...

waldensi...@gmail.com

Sep 25, 2013, 9:38:08 AM
to antlr-di...@googlegroups.com
On Tuesday, September 24, 2013 9:52:20 PM UTC-4, Jim Idle wrote:
> Not all deployments are made to PCs/Linux systems where such quantities of RAM are available. Embedded Linux in industrial systems for instance. 
>
>
> What we should do is use MappedByteBuffer to read only map a RandomAccessFile as the input stream. This will also use NIO and be very fast. Ideally this would be the default input stream (it should not break anything), but it is easy to implement this as a separate input stream. 
>
>
>
> A memory mapped read/write file could probably be used for the token stream, but I suspect that it could be reasonable to provide a discard() method that the user can call within action code to say they have processed everything up to the current input point and will no longer try to access the tokens prior to that. This precludes using the parse tree walker of course and assumes that the token stream is one that does not ask the lexer for all the tokens at once when it is asked for the first token (unbuffered).

Hi Martin and Jim, for super large data (greater than 128 GB of RAM) I've been applying HDF5 (http://www.hdfgroup.org/), which runs on Windows/Linux and has all the file handling needed for hyperslabbing/region blocking of large data (not ANTLR input), and which provides all the mechanics for mapping files. It has Java, Python, C and .NET glue available. It has worked well for massive hierarchical data. There is parallel filesystem support if useful. It's a possibility to avoid writing such a layer.

Jim Idle

Sep 25, 2013, 8:16:21 PM
to antlr-di...@googlegroups.com, antlr-di...@googlegroups.com
I will look into it. :)

Ross Youngblood

Feb 24, 2017, 9:10:58 PM
to antlr-discussion
Hi Terence,
Thanks for the ANTLR4 work, it's great stuff.
I've had tremendous success with ANTLR4 so far, and I'm starting to benchmark it against some legacy C++ compiler (ASCII->binary translator) code. The small-file performance is great. When I hit 110MB ASCII inputs, I run into the constraints, which I can fix with unbuffered token streams... but that isn't likely to let me read 1GB input files without trimming the parse tree somehow.

I'm trying to consider a "generic" way of breaking the input stream into parse tree blocks. I think the "elegant" solution isn't too far away.
I was wondering about some kind of *.g4 rule/directive that would cause a "pause" in the input stream (a simulated EOF). This would then allow processing the current parse tree, after which the existing parse tree could be "pruned" via some other API/directive. This would allow me to keep my parse tree listener code and run huge files at the cost of losing "parse tree history". Assuming I don't care/know what I'm doing (I don't yet)... it seems one could "prune" a parse tree to release the memory and keep going, without all the overhead of shutting down/restarting all the input streams to actually break the file into chunks.

--Ross

Mike Lischke

Feb 25, 2017, 4:46:05 AM
to antlr-di...@googlegroups.com
Hi Ross,

> I'm trying to consider a "generic" way of breaking the input stream into parse tree blocks. I think the "elegant" solution isn't too far away.

This is indeed the way to go. You can create a scanner which finds block boundaries (or just base it on a certain amount of chars read). Then feed that in a loop to your parser. I'm doing that even for smaller files (from just a few KBs up to 256MB database dumps).
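
A sketch of that loop, with MyLexer, MyParser, MyListener and the block start rule standing in for whatever your own grammar defines (and a deliberately crude boundary test):

import java.io.BufferedReader;
import java.io.FileReader;

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class ChunkedParseDriver {
    public static void main(String[] args) throws Exception {
        ParseTreeWalker walker = new ParseTreeWalker();
        MyListener listener = new MyListener();   // your listener; it carries any state across blocks
        StringBuilder block = new StringBuilder();

        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                block.append(line).append('\n');
                // crude boundary rule: a blank line or ~64 MB of text ends a block
                if (line.isEmpty() || block.length() > 64 * 1024 * 1024) {
                    parseBlock(block.toString(), walker, listener);
                    block.setLength(0);           // previous tree and tokens become garbage here
                }
            }
            if (block.length() > 0) {
                parseBlock(block.toString(), walker, listener);
            }
        }
    }

    private static void parseBlock(String text, ParseTreeWalker walker, MyListener listener) {
        MyLexer lexer = new MyLexer(new ANTLRInputStream(text));
        MyParser parser = new MyParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.block();          // start rule for one self-contained block
        walker.walk(listener, tree);
    }
}

Any state you need across blocks lives in the listener (or some other object you keep); the parse trees themselves are thrown away with each iteration.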

> I was wondering about some kind of *.g4 rule/directive that would cause a "pause" in the input stream (simulated EOF).

There is no such thing for ANTLR4 grammars, but you can of course create your own lexer or token stream which returns certain special tokens to the parser to make it stop at certain points.

> This would then allow processing the current parse tree, then the existing parse tree could be "pruned". Then some other API/directive to allow "pruning" the nodes of the existing parse tree.

After each parse run your previous parse tree becomes invalid, mostly because the lexer tokens linked to by the parser rule contexts disappear and new ones are created. That means you have to find your own storage method for your current parse tree before you start parsing the next block.


Mike
--
www.soft-gems.net
