AdaptivePredict slowdown

64 views
Skip to first unread message

Harry Cordewener

unread,
Aug 26, 2024, 12:56:53 AM8/26/24
to antlr-discussion
Hey group,

I'm running into issues with AdaptivePredict slowing down severely after about 20+ levels of recursion. And I am suspect this is because my grammar is at least somewhat ambiguous in nature.

add(1,add(1,add(1,add(1,1)))) -- The deeper this goes, the longer it takes.

https://github.com/SharpMUSH/SharpMUSH/blob/main/SharpMUSH.Generated/SharpMUSHParser.g4
https://github.com/SharpMUSH/SharpMUSH/blob/main/SharpMUSH.Generated/SharpMUSHLexer.g4

Take note of some ambiguity around 'genericText' because of it being able the match the pattern of a function, and the splitting of a function's arguments with COMMAWS. As that may or may not be relevant. I managed to remove ambiguity here already by stopping 'genericText' from having CPAREN as a valid token when in a function, which helped a lot.

From what I can tell using LL_EXACT_AMBIG_DETECTION, there isn't any ambiguity being hit anymore, but I still suspect this is because of the sheer amount of look-ahead needed to deal with the )s. It handles `add(1,add(1,add(1,add(1,1))))` just fine, but as I hit 100 levels of this, that is already taking 59 seconds to evaluate and get a result for.

I would love some guidance on how to best approach this problem. 

Ken Domino

unread,
Aug 27, 2024, 12:29:38 AM8/27/24
to antlr-discussion
While your grammar has no ambiguity and no full-context fallbacks, there are many issues, some of which explain the performance. In no particular order, here are some of the issues.

1) OBRACE, CBRACE, UNESCAPE are lexer rules marked "skip". But these symbols are used in parser rules! By definition, skip lexer rules do not generate a token, so a parser rule alt that uses the skipped symbol cannot match.

escapedText
    : ESCAPE UNESCAPE
    | ESCAPE ESCAPING_OTHER
    ;

explicitEvaluationStringContents
    : OBRACE explicitEvaluationString CBRACE explicitEvaluationStringContents2*?
...

2) OTHER: ~('\\'|'['|']'|'{'|'}'|'('|')'|'<'|'>'|','|'='|'$'|'%'|';'|':'|'\u001B'|' ')+;

Rule OTHER is completely impossible to read. I cannot tell the alts from the string literals. Use a lexer character set, or if you insist a list of string literals, separate each string literal with a space before and after. Better would be to format this using antlr-format. https://github.com/mike-lischke/antlr-format

3) There are six start rules--too many to know what to use. Define only one EOF-terminated start rule.

4) commaCommandArgs is an EOF-terminated start rule.

commaCommandArgs
    : singleCommandArg (COMMAWS singleCommandArg)*? EOF
    ;

But, it is used on the right-hand side of a parser rule. You should avoid using an EOF-terminated rule on the right-hand side of another parser rule, especially if it not the last symbol on the right-hand side.

eqsplitCommandArgs
    : singleCommandArg (EQUALS commaCommandArgs)? EOF
    ;

6) Parser rule "command" has a common (left) prefix. This rule needs to be left factored.

command
    : firstCommandMatch (RSPACE evaluationString)?
    ;

7) Closures of closures of lists. Please look at the rule for explicitEvaluationStringContents2.

explicitEvaluationStringContents2
    : OBRACE explicitEvaluationString CBRACE explicitEvaluationStringContents2*?
    | explicitEvaluationStringFunction explicitEvaluationStringContents2*?
    | explicitEvaluationStringSubstitution explicitEvaluationStringContents2*?
    | explicitEvaluationText explicitEvaluationStringContents2*?
    ;

explicitEvaluationStringContents2 is right recursive because it is the last symbol on the right-hand side of some (all) alts. Consequently, it defines a list. But, you also take the *-closure of the symbol explicitEvaluationStringContents2, i.e., you are taking the closure of a list. You also take the closure of  explicitEvaluationStringContents2 in explicitEvaluationStringContents, meaning you take the closure of a closure of a list. Antlr cannot decide what derivation to choose in a combinatorial explosion.

8) There are 29 parser rules in the grammar, of which 14 are used on the right-hand side of a parser rule only once. Not counting the five superfluous and confusing start rules, over half of the parser rules are single-use. These rules add nothing to the readability or understanding of the grammar.

9) I recommend that you rewrite the grammar and start from a standard expression grammar. Then, expand on the syntax.

lexer grammar SharpMUSHLexer;
WS:  [ \r\n\f\t]+ -> skip;
OPAREN: '(';
ESCAPE: '\\' -> pushMode(ESCAPING);
OBRACK: '[';
CBRACK: ']';
OBRACE: '{';
CBRACE: '}';
CPAREN: ')';
CCARET: '>';
COMMA: ',';
EQUALS: '=';
PERCENT: '%' -> pushMode(SUBSTITUTION);
SEMICOLON: ';';
COLON: ':';
OANSI: '\u001B' -> pushMode(ANSI);
FUNCHAR: [a-zA-Z][a-zA-Z0-9]+;
NUM: [0-9];
OTHER: ~('\\'|'['|']'|'{'|'}'|'('|')'|'<'|'>'|','|'='|'$'|'%'|';'|':'|'\u001B'|' ')+;

// --------------- SUBSTITUTION MODE -------------
mode SUBSTITUTION;
REG_STARTCARET: [qQ]'<' -> popMode;
REG_NUM: [qQ][0-9] -> popMode;
VWX: [vwxVWX][a-zA-Z] -> popMode;
ARG_NUM: [0-9] -> popMode;
SPACE: [bB] -> popMode;
BLANKLINE: [rR] -> popMode;
TAB: [tT] -> popMode;
DBREF: '#' -> popMode;
ENACTOR_NAME: 'n' -> popMode;
CAP_ENACTOR_NAME: 'N' -> popMode;
ACCENT_NAME: '~' -> popMode;
MONIKER_NAME: [kK] -> popMode;
SUB_PRONOUN: [sS] -> popMode;
OBJ_PRONOUN: [oO] -> popMode;
POS_PRONOUN: [pP] -> popMode;
ABS_POS_PRONOUN: [aA] -> popMode;
CALLED_DBREF: '@' -> popMode;
EXECUTOR_DBREF: '!' -> popMode;
LOCATION_DBREF: [lL] -> popMode;
LASTCOMMAND_BEFORE_EVAL: [cC] -> popMode;
LASTCOMMAND_AFTER_EVAL: [uU] -> popMode;
INVOCATION_DEPTH: '?' -> popMode;
CURRENT_ARG_COUNT: '+' -> popMode;
ITEXT_NUM: [iI][0-9]+ -> popMode;
STEXT_NUM: '$'[0-9]+ -> popMode;
OTHER_SUB: . -> popMode;

// --------------- ESCAPING MODE -----------------
mode ESCAPING;
UNESCAPE: '\\' -> skip, popMode;
ESCAPING_OTHER: ~'\\' -> popMode;

// --------------- ANSI MODE ---------------------
mode ANSI;
CANSI: 'm' -> popMode;
ANSICHARACTER: ~'m'+;

parser grammar SharpMUSHParser;
options { tokenVocab=SharpMUSHLexer; }
commandList : command (SEMICOLON command)*? EOF ;
command : expr ;
expr
    : OPAREN exprList? CPAREN
    | OBRACE exprList? CBRACE
    | PERCENT (
REG_STARTCARET expr* CCARET
| REG_NUM
| ITEXT_NUM
| STEXT_NUM
| VWX
| SPACE
| BLANKLINE
| TAB
| COLON
| DBREF
| ENACTOR_NAME
| CAP_ENACTOR_NAME
| ACCENT_NAME
| MONIKER_NAME
| PERCENT
| SUB_PRONOUN
| OBJ_PRONOUN
| POS_PRONOUN
| ABS_POS_PRONOUN
| ARG_NUM
| CALLED_DBREF
| EXECUTOR_DBREF
| LOCATION_DBREF
| LASTCOMMAND_BEFORE_EVAL
| LASTCOMMAND_AFTER_EVAL
| INVOCATION_DEPTH    
| EQUALS
| CURRENT_ARG_COUNT
| OTHER_SUB
)
    | FUNCHAR OPAREN exprList? CPAREN
    | NUM
    ;
exprList : expr (COMMA expr)* ;

This grammar parses `add(1,add(1,add(1,add(1,add(1,add(1,add(1,add(1,add(1,add(1,add(1,add(1,1))))))))))))` in  ~0.02s, whereas for your original grammar, it took 0.75s.

Harry Cordewener

unread,
Aug 27, 2024, 12:51:26 AM8/27/24
to antlr-discussion
Thank you so much for your in-depth analysis, Ken. 

I went through just moments ago removing quite a bit of ambiguity, to the point that I can use SLL mode, and fixing some of the items that also show up in your reply - which has improved the speed notably, but is still taking 'too long'.

I will take some time to go through and make adjustments on my side using what you've offered up and make further adjustments.
Some parser rules have named parser-rules in order to better cater to the Visitor Pattern. 

I will come back to this again soon.

Harry Cordewener

unread,
Aug 27, 2024, 2:32:27 AM8/27/24
to antlr-discussion
I believe I resolved a good majority of the issues you've seen - and they've been updated in my repository.
Let me comment on one, and inquire about another.

#3 - that is correct. There's a few parser modes for this language, as this is a dynamic DSL. That's where there's multiple start positions. I can definitely understand the confusion at this.
#7 - I would love to learn more about this, as I believe there's definitely something I could be doing better here. After all, I am still facing ambiguity around a few of my tests - and I believe this is the most relevant item. Most of the testing I am doing performance wise, is under 'startPlainString'. As that is the busiest parser mode. 
The language is largely used to create custom strings, so it has an inherent string concatenation to it. To explain what the 'genericText' vs 'beginGenericText' and 'explicitEvaluationString' vs 'explicitEvaluationStringContentsConcatenated' is all about, when it is evaluating from startPlainString:

add(1,2)add(3,4) --> 3add(3,4)
add(1,2)[add(3,4)] --> 37

-----
v I believe this is one of the obvious signs that I have some ambiguity around the concatenation, which makes me narrow in on wanting to learn most about #7 - the closures of lists.

 Add ("add(-1.5,5)","3.5")
   Source: MathFunctionUnitTests.cs line 12
   Duration: 1 ms

  Standard Output: 
Testing: add(-1.5,5)
[01:21:18 INF] VisitFunction: Fun: add, Args: ["-1.5", "5"]


  Standard Error: 
line 1:5 reportAttemptingFullContext d=24 (beginGenericText), input='1'
line 1:11 reportAmbiguity d=24 (beginGenericText): ambigAlts={1, 2}, input='1.5,5)'
line 1:6 reportAttemptingFullContext d=24 (beginGenericText), input='.'
line 1:11 reportAmbiguity d=24 (beginGenericText): ambigAlts={1, 2}, input='.5,5)'
line 1:7 reportAttemptingFullContext d=24 (beginGenericText), input='5'
line 1:11 reportAmbiguity d=24 (beginGenericText): ambigAlts={1, 2}, input='5,5)'
line 1:1 reportAttemptingFullContext d=24 (beginGenericText), input='1'
line 1:4 reportAmbiguity d=24 (beginGenericText): ambigAlts={1, 2}, input='1.5'
line 1:2 reportAttemptingFullContext d=24 (beginGenericText), input='.'
line 1:4 reportAmbiguity d=24 (beginGenericText): ambigAlts={1, 2}, input='.5'
line 1:3 reportAttemptingFullContext d=24 (beginGenericText), input='5'
line 1:4 reportAmbiguity d=24 (beginGenericText): ambigAlts={1, 2}, input='5'

Harry Cordewener

unread,
Aug 27, 2024, 7:06:06 PM8/27/24
to antlr-discussion
I think I worked out what the meaning was, and adjusted the rules further. 
This has monstrously increased the speed of my parser. 
Now to just figure out how to stop it from permitting any function depth higher than like, 100, because at a certain point it gets silly, and it approaches Stack Overflows.

explicitEvaluationString:
    OBRACE explicitEvaluationString CBRACE explicitEvaluationStringConcatenatedRepeat*?
    | OBRACK evaluationString CBRACK explicitEvaluationStringConcatenatedRepeat*?
    | PERCENT validSubstitution explicitEvaluationStringConcatenatedRepeat*?
    | beginGenericText explicitEvaluationStringConcatenatedRepeat*?
;

explicitEvaluationStringConcatenatedRepeat:
OBRACE explicitEvaluationString CBRACE
    | OBRACK evaluationString CBRACK
    | PERCENT validSubstitution
    | genericText
;

Reply all
Reply to author
Forward
0 new messages