Help optimize grammar

77 views
Skip to first unread message

Covaci Hajni

unread,
Mar 19, 2018, 5:22:19 AM3/19/18
to antlr-discussion

Hi,
I'm using ANTLR 4.6.4 on Windows with Visual Studio 2017 (I've added Antlr to my project through Nuget Manager) .
I'm writing a grammar to parse a file with a specific format. There is a file with the format attached.


My problem is that the generated parser and lexer performs slowly for larger files. The tests that I did came out like this:
for a 500 KB file, running the lexer and parser takes 4 s and for a 22 MB file it takes 467 seconds.


Using the Unbuffered streams didn't help improve performance; neither did disabling the generation of the parse tree.


Here is my grammar:


grammar test;

/*
*Parser rules
*/


testFile                    
: elementCollection COMMENT
EOF;
elementCollection  : element + ;
element                  : WHITESPACE* OPEN content CLOSE WHITESPACE* COMMENT*;
content                   : WHITESPACE* name ( value | elementCollection ) ;
name                       : ANYCHAR+ ;
value                       : ( WHITESPACE | VALUE | ANYCHAR )+ ;

/*
*Lexer rules
*/


fragment NEWLINE    
: '\r'? '\n' | '\r' ;
fragment WS              
: ' ' | '\t' ;
OPEN                        
: '<';
CLOSE                        
: '>';
COMMENT                
: '#' ~[\r\n]
'\r'? '\n' -> skip;
WHITESPACE             : WS | NEWLINE ;
VALUE                       : '' (WHITESPACE* ~['] WHITESPACE*)* '\'' ; ANYCHAR : ~[<>'] ;



This is a snippet of the format I'm trying to parse:
<DMathShowCustom No>
<DMathFunctions `'>
<DMathLargeVert  0.0 pt>
<DFCLMinimums 
  <TestFIndent  0.0">
  <TestLIndent  0.0">
  <CollectionName 
    <elem1 value>
    <elem2 value>
  >
> #end of DFCLMinimums 



Here is the code that I use to run the lexer and parser:


 // using (var fileStream = new FileStream(input, FileMode.Open))   
//{
   var inputstream = new AntlrFileStream(input); 
   var lexer = new TestLexer(inputstream);// new TestLexer(new UnbufferedCharStream(fileStream));
   var stream = new UnbufferedTokenStream(lexer);
   var parser = new TestParser(stream);
   parser.BuildParseTree = false;

   var output = parser.testFile();
 //}



I've read in several places that optimizing the grammar will optimize the lexer and parser. The suggestions for optimizing grammars, that I found, didn't help me.

I would appreciate any suggestion that could help improve the performance of my grammar.


Thank you,
Hajni 

test.txt

John B Brodie

unread,
Mar 19, 2018, 12:49:02 PM3/19/18
to antlr-di...@googlegroups.com, Covaci Hajni

Greetings!

I have only spent a very few minutes looking at your grammar; so apologies if I have misunderstood what is going on...


I truly believe that you are not obtaining an error free parse of ANY input. I am not familiar with Visual Studio (i do not use any microsoft development products) but is there, perhaps, a separate STDERR display (vs STDOUT) that you have overlooked?

I say this because your testFile parser rule requires a COMMENT token after the elementCollection. However your COMMENT lexer rule contains the `-> skip` meta-notation, so COMMENTs are not delivered from the lexer to the parser. BOOM!


a very brief scan of your other rules shows a VERY ambiguous grammar, expensive for ANTLR parsers to act upon. In particular your treatment of WHITESPACE seems to be very inefficient.


maybe once you have a correctly functioning parser/lexer, we can look at performance issues again.


Hope this helps...

   -jbb

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Covaci Hajni

unread,
Mar 20, 2018, 3:49:51 AM3/20/18
to antlr-discussion
Hello,

I had several versions of the grammar and somehow I managed to mix them up when I wrote the post. Sorry for causing confusion.

Here is the grammar that I tested
grammar Test;

/*
* Parser rules
*/
TestFile :  elementCollection  EOF; 
elementCollection       :  element +  ;
element :  WHITESPACE* OPEN   content    CLOSE WHITESPACE*;
content :  WHITESPACE* name  ( value | elementCollection ) ;
name :  ANYCHAR+ ;
value :  ( WHITESPACE | VALUE | ANYCHAR )+  ;

/*
*Lexer rules
*/
fragment NEWLINE : '\r'? '\n' | '\r' ;
fragment WS : ' ' | '\t' ;
OPEN :'<';
CLOSE :'>';
COMMENT : '#' ~[\r\n]* '\r'? '\n' -> skip; 
WHITESPACE : WS | NEWLINE ;
VALUE : '`' (WHITESPACE* ~['] WHITESPACE*)*  '\'' ;
ANYCHAR :  ~[<>`'] ;



And I would like to discuss further about what you mean by "ambiguous grammar" and why the whitespaces rule is inefficient. I'm hopping you have some suggestions (to improve the grammar) that I could try out. 

Thank you,
Hajni 

John B Brodie

unread,
Mar 20, 2018, 9:22:49 AM3/20/18
to antlr-di...@googlegroups.com, Covaci Hajni

Sorry, still very confused...

TestFile rule is a lexer rule because it begins with an uppercase letter. Yet it references a parser rule. Does that actually work?

Sorry for the noise....

Covaci Hajni

unread,
Jun 14, 2018, 9:59:07 AM6/14/18
to antlr-discussion
Hi, 

I succeeded in optimizing my grammar. Thank you John Brodie for your suggestions.
  I'll quote his suggestion here, since it's not listed as a reply; it might help others out. 


There is a fine line between work to be performed in the lexer vs parser. Parsers work best to resolve patterns of strings, e.g. Tokens. While lexers ease the parser's task by grouping sequences of characters into Tokens. You are, except for the quoted string VALUE, passing a single character as a Token - either WHITESPACE or ANYCHAR. This leads to lots of Token Object construction and gives more work to the parser.

So try to move the loops on WHITESPACE and ANYCHAR away from the parser and into the lexer.

However caution is required due to an ANTLR quirk. Whenever 2 lexer rules match exactly the same sequence of characters, the rule appearing first in the grammar file is used to construct the corresponding Token. Your WHITESPACE and ANYCHAR both match blank, carriage return, or newline; but because WHITESPACE is first it wins. You probably should revise ANYCHAR to exclude whitespace.

Also observe that in your VALUE rule you have WHITESPACE* and ~[']* however blanks, carriage returns and newlines are indeed not a ' so these WHITESPACE loops are redundant. And they are spectacularly inefficient because the lexer will construct a Token object for each blank and then THROW IT AWAY in order to construct the VALUE Token. 

And by the way, is a NewLine character really permitted within a VALUE quoted string in your language, very strange.

So VALUE becomes:

VALUE : '`' (~'\'')* '\'' ;



is (foo ) a valid element in your language? eg a name may have a blank as its value. if not maybe you can ->skip whitespace in the lexer. if you need to recover the whitespace in an element like (foo bar baz) - so you want to see the whitespace between the two words of the value; then use {CHANNEL=HIDDEN;} instead of ->skip in the lexer. NB i probably have the syntax of the hidden channel assignment wrong, probably should look it up...



CONCLUSION
My grammar performs much better now. For a 20 MB file it's under 5 seconds and the memory usage is grate too, up to 15 MB. 
I believe two things contributed to this: 1. I've revised my grammar as suggested by John and 2. I've found some info on how to use UnbufferedCharStream and UnbufferedTokenStream and still be able to use the parser ANTLR generates. 

THE FIXED GRAMMAR
grammar test;
 
/*
* Parser rules
*/
 
testFile         :   elementCollection  EOF;
 
elementCollection: element +  ;
element         :  OPEN   content    CLOSE ;
content         :  name  ( value | elementCollection ) ;
name            :  WORD;
value           :  VALUE | WORD* ;
 
/*
*Lexer rules
*/
fragment NEWLINE : '\r'? '\n' | '\r' ;
fragment WS : ' ' | '\t' ;
 
OPEN            :'<';
CLOSE           :'>';
COMMENT         : ('#' ~[\r\n]* NEWLINE) -> skip; 
WHITESPACE      : (WS | NEWLINE) -> skip;
VALUE           : '`' (~'\'')* '\'' ;
WORD            :  ~[# <>`'\t\r\n] +;

THE CODE
  using (var fileStream = new FileStream(inputPath, FileMode.Open))
            {
                var inputstream = new UnbufferedCharStream(fileStream); // this way the file is not loaded into the memory - that happens when using AntlrInputStream
                var lexer = new testLexer(inputstream);
                lexer.TokenFactory = new CommonTokenFactory(true); // make sure to keep the value of the tokens - Otherwise, the getText() method for tokens would try to access the input character stream, which probably would no longer be available since it's unbuffered, so an exception would be thrown
                var stream = new UnbufferedTokenStream(lexer); 

                var parser = new testParser(stream);
                parser.BuildParseTree = false; // stop creating the AST - it takes up too much memory
var listener = new TestParserListener();// my custom listener; where I customize what happens whenever a rule is entered by the parser and exited; (inherits testBaseListener)
                parser.AddParseListener(listener);//the listener will be called whenever a parser rule starts or ends

                parser.testFile();// start parsing 
            }

OTHER LINKS THAT HELPED
Reply all
Reply to author
Forward
0 new messages