Antlr parsing performance affected by unicode input

Andres Solenzal

unread,

Jan 7, 2019, 12:04:18 PM1/7/19

to antlr-discussion

Hi, I have developed a grammar using antlr4 with javascript as target language. The grammar works pretty well and is really fast. But on some inputs that have a mix of unicode and ASCII chars the performance is dropped.

For example:

An expression with 500 different terms with only ascii chars takes 100ms to be parsed.

An expression with 63 different terms in an arabic language takes 4 seconds.

On both expressions a few tokens are used, one to match the OR string and the other one to match a quotedString, the lexer rules are as follows

QUOTED_STRING: '"' (ESC | ~["\\])+ '"';
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;

OR: 'OR';

QUOTED_STRING was defined as '"' .+ '"'; before but getting advice from the Definitive Antlr Reference lead me to change it to the actual version. With this definition the performance was still an issue.

Mike Lischke

unread,

Jan 7, 2019, 12:19:41 PM1/7/19

to 'Fred Curts' via antlr-discussion

Hi, I have developed a grammar using antlr4 with javascript as target language. The grammar works pretty well and is really fast. But on some inputs that have a mix of unicode and ASCII chars the performance is dropped.

For example:

An expression with 500 different terms with only ascii chars takes 100ms to be parsed.

An expression with 63 different terms in an arabic language takes 4 seconds.

What target runtime is that, Java? How do you create your lexer and have you tried to time the lexer alone? You can just set it up as usual and call CommonTokenStream.fill() to have it pull in all input and lex it. Does this take almost 4 secs alone?

Mike
--
www.soft-gems.net

Andres Solenzal

unread,

Jan 7, 2019, 12:25:09 PM1/7/19

to antlr-di...@googlegroups.com

The target runtime is Javascript, the antlr version I'm using is 4.7.2.

And yes, one expression with terms in English like "operand" OR "operand2" using 500+ terms parses in ~100ms. The same version but using arabic characters("كالإخوان المسلمين" OR "للإخوان المسلمين") takes 4 seconds with only 60 terms.

--
You received this message because you are subscribed to a topic in the Google Groups "antlr-discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/antlr-discussion/B6OBAUrITy0/unsubscribe.
To unsubscribe from this group and all its topics, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andres Solenzal

unread,

Jan 7, 2019, 12:27:55 PM1/7/19

to antlr-discussion

The target runtime is Javascript, the antlr version I'm using is 4.7.2.

And yes, one expression with terms in English like "operand" OR "operand2" using 500+ terms parses in ~100ms. The same version but using arabic characters("كالإخوان المسلمين" OR "للإخوان المسلمين") takes 4 seconds with only 60 terms.

Also the grammar is a combined one, lexer and grammar on the same file.

Reply all

Reply to author

Forward