antlr4 performance optimzation for open source project?

Ryan Hamilton

unread,

May 16, 2020, 6:55:54 AM5/16/20

to antlr-discussion

Hi,

I'm looking for some assistance and advice for speeding up parsing/lexing of an open source language.

For background, the language repo and g4 file:

https://github.com/timestored/jq/blob/master/jq/src/main/antlr/com/timestored/jq/Hello.g4

I've attempted to speed up performance before:

1. I ran it multithreaded, this was 6x faster but returns occasionally incorrect results.

2. I "simplified" the branches in the Hello.g4 file, particularly where the intellij plugin said it was slow. This got a 50x speedup BUT only worked on single statements.

I think the main issue is that essentially the language should be interpreted almost entirely right to left.

e.g. 2*10-3

would equal 14.

The only way I know how to express it is with this: "expr expr+"

But that causes a lot of possibilities to be examined.

A file I use to check parsing and benchmarking is here:

http://www.timestored.com/jq/online/largeParserFile.q

Ideally the output would match exactly.

Any general advice or exact help would be greatly appreciated.

Thanks,

Ryan

Mike Cargal

unread,

May 16, 2020, 8:01:27 AM5/16/20

to ANTLR List

Not sue how helpful this will be… is this “jq” as in https://stedolan.github.io/jq/ ?? (A quick glance at the grammar leads me to think that’s possibly just a naming coincidence)

If it is, then a bit of playing at https://jqplay.org/jq?q=10%20%2F%20.%20*%203&j=5 indicates that expressions are not all right associative.

If it’s not, then, just can’t help but comment that the right to left evaluation must be about the most counter-intuitive/confusing way to evaluate expressions I’ve ever seen. If this is a grammar you have discretion over (rather than implementing something that’s already established, I’d suggest changing that (and I wouldn’t be surprised if that’s a factor in your performance.

Leaving that aside, I do see a few other things that seem “unusual” (at least to me)

-------------------

I see this pattern in a few places:

multiAssign:(assign* (';' assign)*);

It appears that you’re trying to say that a multiAssign is zero or more assigns separated by a semicolon, but this definition makes the semi-colon effectively optional, since assign* matches 0 or more assigns without semicolons. I suspect you mean

multiAssign:(assign? (';' assign)*);

——————

csvq: (expr?|(expr (',' expr)*));

Prettysure this could just be:

csvq: (expr? (',' expr)*));

—————

MONOP:('til'\|'enlist'\|'first'\|'last'\|'distinct'\|'count'\|'key'\|'where'\|'reverse'\|'null'\|'not'
	\|'raze'\|'rotate'\|'show'\|'system'
	\|'desc'\|'asc'\|'++'\|'--'\|'til'\|'get'\|'abs'\|'all'\|'any'\|'avg'\|'avgs'\|'exp'\|'floor'\|'ceiling'
	\|'cos'\|'sin'\|'tan'\|'acos'\|'asin'\|'atan'\|'exp'\|'log'\|'fills'\|'flip'
	\|'mcount'\|'type'\|'attr'\|'reciprocal'\|'sqrt'
	\|'svar'\|'sdev'\|'var'\|'dev'\|'differ'\|'getenv'\|'group'\|'iasc'
	\|'idesc'\|'::'
	\|'max'\|'maxs'\|'min'\|'mins'\|'med'\|'mmu'\|'read0'\|'read1'\|'prd'\|'prds'\|'exit'\|'neg'\|'inv'
	\|'rand'\|'ratios'\|'ratios'\|'signum'\|'value'
	\|'trim'\|'rtrim'\|'ltrim'\|'upper'\|'lower'\|'string'
	\|'hcount'\|'hdel'\|'hsym'\|'hopen'\|'hclose'
	\|'gtime'\|'ltime'\|'parse'\|'views'\|'tables'); // exotic

I suspect you’ll find it easier going later on if this is a parse rule rather than a Lexer ruler, and each of these keywords is defined as a separate token.

monop: TIL

| ENLIST

| FIRST

…

TIL: ’til’

ENLIST: “enlist”

FIRST: “first”

——————————

Overall, I see quite a few places where I’d suggest that you’re not using parser rules, Lexer rules, and fragments as designed. (For example, I’m even a bit (just a bit) surprised that you’re able to define Lexer rules that include other Lexer rules (generally that’s the point of fragments)

You might be well served to make sure you have a solid grasp of the ANTLR pipeline of tokenizing a stream of characters and then applying parser rules to the stream of tokens. (Or to word it differently, if you’re not solid on that, you definitely want to understand how that process works)

Re:

TIMELIST:TIM (' ' TIM)+ ('n'\|'u'\|'v'\|'t')?;
	TIME:TIM ('n'\|'u'\|'v'\|'t')?;
	TIM:Specials \| DD \| DD? TI \| (DD? TI ':' [0-9] [0-9]) \| (DD? TI ':' [0-9] [0-9] '.' Digits?);
	DD:([-])* Digits 'D';
	TI:([-])* [0-9] [0-9] ':' [0-9] [0-9];

This also hints at a common poor choice that many people make early in grammar development to attempt to encode as much validation as possible into the grammar. I’ve found that it works much better to think of the grammar as a way of saying “this is the only way to interpret this input”, and then you can do your own validation of whether something is a valid time etc. Generally, you’ll wind up with much better error messages. The idea here would be to have enough information in the grammar to distinguish between different inputs, and not much more. Antlr gives pretty good error messages, but they’ll, necessarily be a bit more generic and technical than what you’d be able to do in validation.

=================

Re:

A file I use to check parsing and benchmarking is here:

http://www.timestored.com/jq/online/largeParserFile.q

Ideally the output would match exactly.

OK, clearly not the “jq” that I’m familiar with :)

I’m a bit confused about what “output” you expect to match.

I just retired, and may play around a bit with this just for kicks (briefly), so I’m curious what I’d do to validate changes.

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/a287f6cd-80bd-4951-b770-25b2da8821b1%40googlegroups.com.

Ryan Hamilton

unread,

May 16, 2020, 12:23:04 PM5/16/20

to antlr-discussion

Hi Mike,

The language can be tested live at:

http://www.timestored.com/jq/online

Warning, it needs to download 40mb to run and can be slow.

>> must be about the most counter-intuitive/confusing way to evaluate expressions I’ve ever seen

Confusing to anyone familiar with most programming languages.

I would argue a child that hadn't been taught BODMAS (https://www.skillsyouneed.com/num/bodmas.html) is likely to find strict right-to-left evaluation more intuitive.

You can in fact easily visualize it as a pipeline starting on the right and flowing left consistently, a little similar to bash (flowing left to right) at statement level.

In this language the right-to-left is only at statement level. For example:

a:2*3+1;

b:a*a+1;

Leaves b as 65.

It does NOT attempt to evaluate the second statement first. That would be crazy :)

Regards,

Ryan

To unsubscribe from this group and stop receiving emails from it, send an email to antlr-di...@googlegroups.com.

Ryan Hamilton

unread,

May 16, 2020, 12:25:36 PM5/16/20

to antlr-discussion

>>a common poor choice that many people make early in grammar development to attempt to encode as much validation as possible into the grammar.

Yes, you're right. I did make that mistake and I have been backing some of those decisions out over time.

Mike Cargal

unread,

May 16, 2020, 4:10:08 PM5/16/20

to antlr-discussion

OK, I cloned your repo, and added a "JustParse" class so that I could time just the parser:

When I run it against largeParserFile.q, I get the following errors and it runs in about a 1/3 of a second.

❯ java -cp jq/build/libs/jq-0.0.1-SNAPSHOT-all.jar com.timestored.jq.JustParse largeParserFile.q

largeParserFile.q

line 112:26 extraneous input '\r\n' expecting {';', ')'}

line 113:35 extraneous input '\r\n' expecting {';', ')'}

line 145:8 mismatched input 'a' expecting {';', NEWLINE, WS}

line 145:11 mismatched input 'x' expecting {';', '[', NEWLINE, WS}

line 145:59 extraneous input '}' expecting {<EOF>, ';', NEWLINE, WS}

line 146:1 extraneous input '}' expecting {';', '[', NEWLINE, WS}

line 147:2 mismatched input 'a' expecting {';', '[', NEWLINE, WS}

line 148:2 mismatched input 'a' expecting {';', '[', NEWLINE, WS}

line 148:16 extraneous input '}' expecting {<EOF>, ';', NEWLINE, WS}

line 328:34 extraneous input '\r\n' expecting {';', ')'}

line 329:43 extraneous input '\r\n' expecting {';', ')'}

Parsed 2020-05-16T20:02:02.893095Z -> 2020-05-16T20:02:03.227834Z : 334 milliseconds.

This is without any grammar changes.

Are these expected errors?

Is this a good test file to demonstrate the performance issue you're having? (if so, what kind of performance would you be shooting for?)

Ryan Hamilton

unread,

May 17, 2020, 5:25:47 AM5/17/20

to antlr-di...@googlegroups.com

Thanks for looking into this Mike.

>> I’m a bit confused about what “output” you expect to match.

My "testing" and "benchmarking" currently entirely rely on intellij and that .q file.

In the attached image, you can see I have intellij with antlr plugin open.

When trying to make it faster I:

1. Look at the profiler tab and use that to identify "slow" areas.

2. Modify the grammar

3. Visually inspect the "Parse Tree" (this is the output) to see if it looks the same.

Ideally rather than visual inspection, I would save the parse tree and programmatically compare but I haven't found a way to do that. I do have some simple language tests that check at a higher level everything is working.

When I changed the "expr expr+", I then threw hundreds of simple statements at antlr it got a 50x speedup and 90% correct answers so it seemed to me some massive speedups should be possible. To be honest I'm hoping someone on this list can tell me for sure.

Thanks,

Ryan

You received this message because you are subscribed to a topic in the Google Groups "antlr-discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/antlr-discussion/syt3eWxqxDQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/22a5f375-a4a1-4f7b-ac03-cda5a411dd4e%40googlegroups.com.

--

www.timestored.com
Tel: +44 (0) 20 3397 7535
Email: ry...@timestored.com

antlr.png

profiler-slowspot.png

Mike Cargal

unread,

May 17, 2020, 7:49:11 AM5/17/20

to ANTLR List

I’ve not used the profiling tool you have puled up in IntellliJ (It looks nice, just haven’t used it), so I don’t have any benchmark to compare it to to know if those are unusual numbers.

If I’m finally getting this a bit, you’re trying to work out how to get operator precedence out of the picture. You should probably know that the order of alternative in you expr rule is setting operator precedence (<assco-right> is just indicating which direction tends to bind tighter (probably a bad way to say that), you’ll typically see it with exponentiation:

5 ^ 7 ^ 9:

with left associativity = (5 ^ 7) ^ 9. (i.e. “wrong”)

with right associativity = 5 ^ (7 ^ 9). (i.e. the “correct” interpretation.

"Expr expr+” seems like.a very odd construct. You expect to see a series of expressions in succession? No operators between them? (The “apply” name didn’t reveal much, and to be honest, just a glance at the source file, Q is a bt inscrutable (and I’ve programmed in a lot of languages… maybe that’s the problem). At any rate, it’s not clear what you’re attempting to accomplish with that (was it was something of a shot in the dark at trying to get pure RTL evaluation of your expressions?)

To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/CAKutr2DusUWBe%2B1uRRYw%3DgykD2C5%2BuWFKXyccJGwSfBHqmKngg%40mail.gmail.com.
<antlr.png><profiler-slowspot.png>

Reply all

Reply to author

Forward

MONOP:('til'\|'enlist'\|'first'\|'last'\|'distinct'\|'count'\|'key'\|'where'\|'reverse'\|'null'\|'not'
	\|'raze'\|'rotate'\|'show'\|'system'
	\|'desc'\|'asc'\|'++'\|'--'\|'til'\|'get'\|'abs'\|'all'\|'any'\|'avg'\|'avgs'\|'exp'\|'floor'\|'ceiling'
	\|'cos'\|'sin'\|'tan'\|'acos'\|'asin'\|'atan'\|'exp'\|'log'\|'fills'\|'flip'
	\|'mcount'\|'type'\|'attr'\|'reciprocal'\|'sqrt'
	\|'svar'\|'sdev'\|'var'\|'dev'\|'differ'\|'getenv'\|'group'\|'iasc'
	\|'idesc'\|'::'
	\|'max'\|'maxs'\|'min'\|'mins'\|'med'\|'mmu'\|'read0'\|'read1'\|'prd'\|'prds'\|'exit'\|'neg'\|'inv'
	\|'rand'\|'ratios'\|'ratios'\|'signum'\|'value'
	\|'trim'\|'rtrim'\|'ltrim'\|'upper'\|'lower'\|'string'
	\|'hcount'\|'hdel'\|'hsym'\|'hopen'\|'hclose'
	\|'gtime'\|'ltime'\|'parse'\|'views'\|'tables'); // exotic

TIMELIST:TIM (' ' TIM)+ ('n'\|'u'\|'v'\|'t')?;
	TIME:TIM ('n'\|'u'\|'v'\|'t')?;
	TIM:Specials \| DD \| DD? TI \| (DD? TI ':' [0-9] [0-9]) \| (DD? TI ':' [0-9] [0-9] '.' Digits?);
	DD:([-])* Digits 'D';
	TI:([-])* [0-9] [0-9] ':' [0-9] [0-9];