Hi,
I'm using ANTLR 4.6.4 on Windows with Visual Studio 2017 (I've added Antlr to my project through Nuget Manager) .
I'm writing a grammar to parse a file with a specific format. There is a file with the format attached.
My problem is that the generated parser and lexer performs slowly for larger files. The tests that I did came out like this:
for a 500 KB file, running the lexer and parser takes 4 s and for a 22 MB file it takes 467 seconds.
Using the Unbuffered streams didn't help improve performance; neither did disabling the generation of the parse tree.
Here is my grammar:
grammar test;
/*
*Parser rules
*/
testFile : elementCollection COMMENT EOF;
elementCollection : element + ;
element : WHITESPACE* OPEN content CLOSE WHITESPACE* COMMENT*;
content : WHITESPACE* name ( value | elementCollection ) ;
name : ANYCHAR+ ;
value : ( WHITESPACE | VALUE | ANYCHAR )+ ;
/*
*Lexer rules
*/
fragment NEWLINE : '\r'? '\n' | '\r' ;
fragment WS : ' ' | '\t' ;
OPEN : '<';
CLOSE : '>';
COMMENT : '#' ~[\r\n] '\r'? '\n' -> skip;
WHITESPACE : WS | NEWLINE ;
VALUE : '' (WHITESPACE* ~['] WHITESPACE*)* '\'' ; ANYCHAR : ~[<>
'] ;
<DMathShowCustom No>
<DMathFunctions `'>
<DMathLargeVert 0.0 pt>
<DFCLMinimums <TestFIndent 0.0">
<TestLIndent 0.0">
<CollectionName
<elem1 value>
<elem2 value>
>
> #end of DFCLMinimums
Here is the code that I use to run the lexer and parser:
// using (var fileStream = new FileStream(input, FileMode.Open))
//{ var inputstream = new AntlrFileStream(input); var lexer = new TestLexer(inputstream);// new TestLexer(new UnbufferedCharStream(fileStream)); var stream = new UnbufferedTokenStream(lexer); var parser = new TestParser(stream); parser.BuildParseTree = false;
var output = parser.testFile(); //}
I've read in several places that optimizing the grammar will optimize the lexer and parser. The suggestions for optimizing grammars, that I found, didn't help me.
I would appreciate any suggestion that could help improve the performance of my grammar.
Thank you,
Hajni
Greetings!
I have only spent a very few minutes looking at your grammar; so apologies if I have misunderstood what is going on...
I truly believe that you are not obtaining an error free parse of ANY input. I am not familiar with Visual Studio (i do not use any microsoft development products) but is there, perhaps, a separate STDERR display (vs STDOUT) that you have overlooked?
I say this because your testFile parser rule requires a COMMENT token after the elementCollection. However your COMMENT lexer rule contains the `-> skip` meta-notation, so COMMENTs are not delivered from the lexer to the parser. BOOM!
a very brief scan of your other rules shows a VERY ambiguous grammar, expensive for ANTLR parsers to act upon. In particular your treatment of WHITESPACE seems to be very inefficient.
maybe once you have a correctly functioning parser/lexer, we can look at performance issues again.
Hope this helps...
-jbb
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
grammar Test;
/** Parser rules*/TestFile : elementCollection EOF; elementCollection : element + ;element : WHITESPACE* OPEN content CLOSE WHITESPACE*;
content : WHITESPACE* name ( value | elementCollection ) ;name : ANYCHAR+ ;value : ( WHITESPACE | VALUE | ANYCHAR )+ ;
/**Lexer rules*/fragment NEWLINE : '\r'? '\n' | '\r' ;fragment WS : ' ' | '\t' ;OPEN :'<';CLOSE :'>';
COMMENT : '#' ~[\r\n]* '\r'? '\n' -> skip;
WHITESPACE : WS | NEWLINE ;VALUE : '`' (WHITESPACE* ~['] WHITESPACE*)* '\'' ;
ANYCHAR : ~[<>`'] ;
Sorry, still very confused...
TestFile rule is a lexer rule because it begins with an uppercase letter. Yet it references a parser rule. Does that actually work?
Sorry for the noise....
There is a fine line between work to be performed in the lexer vs parser. Parsers work best to resolve patterns of strings, e.g. Tokens. While lexers ease the parser's task by grouping sequences of characters into Tokens. You are, except for the quoted string VALUE, passing a single character as a Token - either WHITESPACE or ANYCHAR. This leads to lots of Token Object construction and gives more work to the parser.
So try to move the loops on WHITESPACE and ANYCHAR away from the parser and into the lexer.
However caution is required due to an ANTLR quirk. Whenever 2 lexer rules match exactly the same sequence of characters, the rule appearing first in the grammar file is used to construct the corresponding Token. Your WHITESPACE and ANYCHAR both match blank, carriage return, or newline; but because WHITESPACE is first it wins. You probably should revise ANYCHAR to exclude whitespace.
Also observe that in your VALUE rule you have WHITESPACE* and ~[']* however blanks, carriage returns and newlines are indeed not a ' so these WHITESPACE loops are redundant. And they are spectacularly inefficient because the lexer will construct a Token object for each blank and then THROW IT AWAY in order to construct the VALUE Token.
And by the way, is a NewLine character really permitted within a VALUE quoted string in your language, very strange.
So VALUE becomes:
VALUE : '`' (~'\'')* '\'' ;
is (foo ) a valid element in your language? eg a name may have a blank as its value. if not maybe you can ->skip whitespace in the lexer. if you need to recover the whitespace in an element like (foo bar baz) - so you want to see the whitespace between the two words of the value; then use {CHANNEL=HIDDEN;} instead of ->skip in the lexer. NB i probably have the syntax of the hidden channel assignment wrong, probably should look it up...
grammar test;
/** Parser rules*/
testFile : elementCollection EOF; elementCollection: element + ;element : OPEN content CLOSE ;content : name ( value | elementCollection ) ;name : WORD;value : VALUE | WORD* ;
/**Lexer rules*/fragment NEWLINE : '\r'? '\n' | '\r' ;fragment WS : ' ' | '\t' ; OPEN :'<';CLOSE :'>';
COMMENT : ('#' ~[\r\n]* NEWLINE) -> skip; WHITESPACE : (WS | NEWLINE) -> skip;VALUE : '`' (~'\'')* '\'' ;WORD : ~[# <>`'\t\r\n] +;
using (var fileStream = new FileStream(inputPath, FileMode.Open))
{ var inputstream = new UnbufferedCharStream(fileStream); // this way the file is not loaded into the memory - that happens when using AntlrInputStream var lexer = new testLexer(inputstream); lexer.TokenFactory = new CommonTokenFactory(true); // make sure to keep the value of the tokens - Otherwise, the getText() method for tokens would try to access the input character stream, which probably would no longer be available since it's unbuffered, so an exception would be thrown
var stream = new UnbufferedTokenStream(lexer);
var parser = new testParser(stream); parser.BuildParseTree = false; // stop creating the AST - it takes up too much memory var listener = new TestParserListener();// my custom listener; where I customize what happens whenever a rule is entered by the parser and exited; (inherits testBaseListener) parser.AddParseListener(listener);//the listener will be called whenever a parser rule starts or ends
parser.testFile();// start parsing }