Hi, I have developed a grammar using ANTLR 4 with JavaScript as the target language. The grammar works well and is generally fast, but on inputs that mix Unicode and ASCII characters the performance drops sharply.
For example:
An expression with 500 different terms containing only ASCII characters takes 100 ms to parse.
An expression with 63 different terms in Arabic takes 4 seconds.
Both expressions use only a few tokens: one matches the string OR and the other matches a quoted string. The lexer rules are as follows:
QUOTED_STRING: '"' (ESC | ~["\\])+ '"';
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
OR: 'OR';
QUOTED_STRING was previously defined as '"' .+ '"'; but advice from the Definitive ANTLR Reference led me to change it to the current version. Even with this definition, performance is still an issue.
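
In case it helps to reproduce the problem, inputs of the same shape as the two examples above can be generated with a small script like the one below. This is only a sketch: buildExpression and toArabicDigits are hypothetical helper names, the Arabic word is an arbitrary placeholder rather than the real data, and it assumes the grammar skips whitespace between tokens.

```javascript
// Hypothetical helper (not part of the grammar or of ANTLR): builds an
// expression of `count` distinct quoted terms joined by OR.
function buildExpression(count, makeTerm) {
  const terms = [];
  for (let i = 0; i < count; i++) {
    // The generated terms contain no '"' or '\', so each one matches
    // QUOTED_STRING without going through the ESC fragment.
    terms.push('"' + makeTerm(i) + '"');
  }
  // Assumption: whitespace around OR is skipped by the lexer.
  return terms.join(' OR ');
}

// Converts a non-negative integer to Arabic-Indic digits (U+0660..U+0669),
// so the Arabic terms stay distinct without using any ASCII characters.
function toArabicDigits(n) {
  return String(n)
    .split('')
    .map((d) => String.fromCharCode(0x0660 + Number(d)))
    .join('');
}

// 500 ASCII-only terms: "term0" OR "term1" OR ...
const asciiExpr = buildExpression(500, (i) => 'term' + i);

// 63 Arabic-script terms; \u0643\u0644\u0645\u0629 is an arbitrary Arabic
// word used as a placeholder.
const arabicExpr = buildExpression(
  63,
  (i) => '\u0643\u0644\u0645\u0629' + toArabicDigits(i)
);

console.log('ASCII expression length:', asciiExpr.length);
console.log('Arabic expression length:', arabicExpr.length);
```

Feeding asciiExpr and arabicExpr to the generated lexer and parser separately, and timing each stage, should show whether the slowdown comes from tokenizing or from parsing.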