Generated code is very slow, bad performance.

200 views
Skip to first unread message

CerebroTecnologico

unread,
May 20, 2015, 5:38:39 PM5/20/15
to antlr-di...@googlegroups.com
I generated a ECMAScript parser and lexer using the following grammar using ANTLR 4.5 


I am only using the generated ECMAScriptLexer to get a stream of token types. 

ECMAScriptLexer ecmaScriptLexer = new ECMAScriptLexer(new ANTLRInputStream(myJavaScriptInputStream));
ecmaScriptLexer.setStrictMode(true);                
for (Token token = ecmaScriptLexer.nextToken(); token.getType() != ECMAScriptLexer.EOF; token = ecmaScriptLexer.nextToken())
{                
                if ( token.getChannel() != Lexer.HIDDEN )                         
                {
                    writer.write( ECMAScriptLexer.VOCABULARY.getSymbolicName(token.getType()) );                   
                }
}

I found that the lexer has a very bad performance.  ( Even for a small JavaScripts of 50k, it takes a second to get the stream of tokens. I need a much faster lexer.).

I profiled the code and I found that most of the time is spent on MurmuHash. This class is used to compute the hashCode of org.antlr.v4.runtime.atn.LexerATNConfig objects:

                 @Override
public int hashCode() {
int hashCode = MurmurHash.initialize(7);
hashCode = MurmurHash.update(hashCode, state.stateNumber);
hashCode = MurmurHash.update(hashCode, alt);
hashCode = MurmurHash.update(hashCode, context);
hashCode = MurmurHash.update(hashCode, semanticContext);
hashCode = MurmurHash.update(hashCode, passedThroughNonGreedyDecision ? 1 : 0);
hashCode = MurmurHash.update(hashCode, lexerActionExecutor);
hashCode = MurmurHash.finish(hashCode, 6);
return hashCode;
}

Are LexerATNConfig objects mutable ? 
If they are not, could it be possible to compute the hashCode once rather than every time that hashCode() is called ?





Terence Parr

unread,
May 20, 2015, 6:19:47 PM5/20/15
to antlr-di...@googlegroups.com
Very interesting. It might be the case that the lexer is extremely collocated for that grammar. Oh. It has predicates on the left edge of the lexer rules:

Implements : {strictMode}? 'implements’;

put that predicate after the keyword in all similar rules and see if it makes a difference in the speed. You should see a massive increase. If it works, can you send a pull request for that grammar?

Ter
> --
> You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Terence Parr

unread,
May 20, 2015, 6:24:22 PM5/20/15
to antlr-di...@googlegroups.com
actually, it looks like there are plenty of semantic predicates in there. Those should all be moved after at least the recognition of a single character otherwise they are in the main evaluation loop at the start of every token fetch.

Ter

CerebroTecnologico

unread,
May 20, 2015, 6:35:01 PM5/20/15
to antlr-di...@googlegroups.com
Hi Ter, 

Hmm...I'm not an ANTLR grammar expert. I'm not the author of the grammar. 
Could you tell me more specifically what changes I should do to the grammar ?

I know that ECMAScript has some problems in its specification, making it harder to perform the lexical analysis.  (a "/" can be a division or the start of a regular expression literal ). 
I am actually not interested in parsing ECMAScript, just the token recognition. (I need a stream of token types).
Is there any good source of ANTLR4 grammars? I don't need to keep the current grammar file, I could switch to any other grammar recognizing ECMAScript.



-Jorge

CerebroTecnologico

unread,
May 20, 2015, 6:38:06 PM5/20/15
to antlr-di...@googlegroups.com
I'm actually always parsing in strict mode, so I can get rid of  {strictMode}? on those 8 rules. 

CerebroTecnologico

unread,
May 20, 2015, 6:39:13 PM5/20/15
to antlr-di...@googlegroups.com
Getting rid of {strictMode}? in those 8 rules did not make a difference.

Terence Parr

unread,
May 20, 2015, 6:45:42 PM5/20/15
to antlr-di...@googlegroups.com
that’s because there are predicates elsewhere. I’m afraid I would have to dig into this to solve it and I don’t have time.
Ter

CerebroTecnologico

unread,
May 20, 2015, 6:49:54 PM5/20/15
to antlr-di...@googlegroups.com
Thanks anyway. I'll keep searching for grammars without predicates.

Sam Harwell

unread,
May 21, 2015, 1:10:19 AM5/21/15
to antlr-di...@googlegroups.com

If any lexer rule has a predicate at the beginning of the rule, it will prevent ANTLR from ever caching the start state for the lexer DFA. While this is not necessarily a problem (e.g. for very small inputs), it will quickly dominate performance for anything of reasonable size.

 

The other predicate you need to move is the one in RegularExpressionLiteral. The following should work. All I did was move the first predicate after the `/`, so it will only be evaluated when a token starts with a `/` character.

 

RegularExpressionLiteral

: '/' {isRegexPossible()}? RegularExpressionBody '/' RegularExpressionFlags

;

 

Sam

CerebroTecnologico

unread,
Jun 4, 2015, 6:02:23 PM6/4/15
to antlr-di...@googlegroups.com
I needed a fast way to tokenize JavaScript (ECMAScript). 
Unfortunately, I did not have the time to write my own rules, so I was relying on existing ANTLR ECMAScript lexer rules and grammar for ECMAScript.

I ended up not using ANTLR. I took the Rhino Parser and modified it a little bit so that I could attach a listener when a token was recognized. I had to keep a one-token buffer to be able to distinguish between / and /= and a REGEXP token. 

Thanks to all!  

Raz Friman

unread,
Jun 5, 2015, 12:20:05 AM6/5/15
to antlr-di...@googlegroups.com
Sam, Would it be possible for ANTLR to provide an error/warning message if a lexer rule contains a predicate before any other patterns in its definition?

This would let us know if our grammar is affected by this DFA caching issue.

Sam Harwell

unread,
Jun 5, 2015, 12:40:12 AM6/5/15
to antlr-di...@googlegroups.com

Hi Raz,

 

I think it would make sense for IDE extensions to report such an error. For example ANTLRWorks 2 already reports 3 different warnings related to grammar performance.

 

A better long-term solution in the tool/runtime itself would be modifying the behavior to support LL(1) filtering (past the predicate), and only execute predicates if they remain following the lookahead symbol.

 

Thanks,

Sam

Reply all
Reply to author
Forward
0 new messages