Need help speeding up lexing step

36 views
Skip to first unread message

Florin Iucha

unread,
Jun 1, 2020, 10:24:26 PM6/1/20
to antlr-discussion
Dear Antlr Experts,

I have created a lexer/parser for the SamX language (semantic authoring markdown), implemented initially in Java, then ported to C++. I am observing that lexing step is really slow. to the point where I have implemented an object pool for the ATNConfig objects which yields about 10% speedup. But on a small input file (15Kb source, 3400 tokens) I am observing ~.5 seconds to just run the lexer (and discard the tokens). For comparison, on the same computer (AMD Ryzen 2700x) I have compiled the CPP14 lexer/parser and used it to parse its own generated parser (850KB, 17400 tokens) and it lexes in .1 seconds!

Would you be so kind to take a look at my lexers in Java and C++ and tell me whether I'm doing something really silly? This is my first significant project with ANTLR, and I have toggled between the original book and the updated version for inspiration. Based on other discussions here and on StackOverflow I have removed all but one of the conditionals on the left.

I am using the current top of the master branch for the C++ runtime, and 4.8 for the Java runtime.

Thank you,
florin

Mike Lischke

unread,
Jun 2, 2020, 2:31:56 AM6/2/20
to antlr-discussion
Hi Florin,


I have created a lexer/parser for the SamX language (semantic authoring markdown), implemented initially in Java, then ported to C++. I am observing that lexing step is really slow. to the point where I have implemented an object pool for the ATNConfig objects which yields about 10% speedup. But on a small input file (15Kb source, 3400 tokens) I am observing ~.5 seconds to just run the lexer (and discard the tokens). For comparison, on the same computer (AMD Ryzen 2700x) I have compiled the CPP14 lexer/parser and used it to parse its own generated parser (850KB, 17400 tokens) and it lexes in .1 seconds!

You are doing quite a lot of custom processing and manual token handling. This is pretty unusual and I have never seen such, even for more complicated languages like Javascript (which also has a fair amount of special handling).

However, I guess your main problem are left hand predicates like:

HASH : { expectListStart }? '#' { expectListStart = false; } ;

Rewrite **all** your lexer rules to use right hand predicates instead. Read also in the Development section here (https://soft-gems.net/the-antlr4-c-target-is-here/) what I found out about slow lexing speed in the C++ target.

Florin Iucha

unread,
Jun 2, 2020, 7:51:00 PM6/2/20
to antlr-di...@googlegroups.com
Hi Mike,

Thank you for taking the time to review my grammars.

There were three uses for predicates:
  1. Allowing a hash mark or asterisk at the beginning of the line to have different meaning than inside the line. I have removed this since it was only a small syntactic nicety. I have pushed the code on the Java project and only tested it locally on the C++ project, that's why you still saw it.
  2. Indentation-driven nesting levels (Python like): synthesizing INDENT / DEDENT tokens as described in the first ANTLR book.
  3. Escaping blocks of text in another language: the EXTCODE token.
Searching on the web and forums it seems the solution for 2 is still canonical. Don't know what the true cost of calling atStartOfInput() is.

Do you have a suggestion for how to handle EXTCODE? What I need there is for a token (CODE_MARKER) to indicate that from that line on, while the indent level is higher than the CODE_MARKER's token level, everything should not be lexed but just stored verbatim (spaces included) into a single token (EXTCODE). This is like multi-line strings in Python. I have looked at the Python3 grammar in the grammars-v4 project for the "longstring" and "longstringitem" but not clear on how I would apply this. The main differences between extcode and multiline is that in SamX there might be useful metadata between a beginning code block, such as its name:

   ```(xml)(#example)
       <section xml:lang="es">
          <citation>Mother McCree</citation>
          <title>Greeting</title>
          <p>Feliz Navidad</p>
       </section>

Also I still need to track the indentation through an extcode block so I can reproduce it verbatim and to know when the block ends since there is no explicit marker.

I am wondering, is there a simple way to daisy-chain lexers the way the lexer is piped into a parser? From reading random posting it seems it should be possible (especially since they share quite a bit of the implementation)? I am thinking whether converting steps 2 and 3 into separate "mini-lexers" that are run ahead of the main lexer to "help" the main lexer focus on the actual tokens instead of indentation and escaped code.

Thank you for all your work on Antlr4!
florin







--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/2DE6FA5F-040A-4733-B3A1-DBEEDB0C9C4E%40googlemail.com.


--
Don't question authority, they don't know either!

Florin Iucha

unread,
Jun 2, 2020, 7:51:02 PM6/2/20
to antlr-di...@googlegroups.com
I have read "Islands in the Stream" several times - I have a feel it might help with the EXTCODE, but still can't wrap my head around 'lexical modes'.

florin

Florin Iucha

unread,
Jun 2, 2020, 9:24:08 PM6/2/20
to antlr-di...@googlegroups.com
I think I got the hang of the lexical modes, and updated my lexers to use modes instead of tracking that state in the variables. However, this turns out to be even slower (~20% according to perf).

The only predicate left is in newline - the input file has 166 lines - can that impact the performance so severely?

florin

Susan Jolly

unread,
Jun 2, 2020, 10:02:30 PM6/2/20
to antlr-di...@googlegroups.com

Sorry I'm not enough of an ANTLR guru to look at your grammar and give any
advice. However I found that I was able to use lexer modes more
effectively in my grammar after studying this elegant XML lexer written by
Parr. YMMV of course.

https://github.com/antlr/grammars-v4/blob/master/xml/XMLLexer.g4

Best wishes,
SusanJ

Florin Iucha

unread,
Jun 2, 2020, 11:49:04 PM6/2/20
to antlr-di...@googlegroups.com
Thank you Susan. I will take a look.

It seems `atStartOfInput` is called once for every token and probably pessimizing the search. I am wondering if I can push a synthetic token into the lexer to act as anchor for the indent on the first line.

florin

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.

Eric Vergnaud

unread,
Jun 3, 2020, 12:36:07 AM6/3/20
to antlr-di...@googlegroups.com
I find that INDENT/DEDENT is better done in a TokenStreamRewriter

Does EXTCODE not have an end pattern?

Envoyé de mon iPhone

Le 3 juin 2020 à 09:24, Florin Iucha <florin...@gmail.com> a écrit :



Florin Iucha

unread,
Jun 3, 2020, 8:29:44 AM6/3/20
to antlr-di...@googlegroups.com
Hi Eric,

I have searched the web for an example for "TokenStreamRewriter+INDENT+DEDENT" but could not find any concrete code. Could you point me to an open-source example that I can review?

EXTCODE is a "line" in the codeBlockDef and it ends like all blocks with a DEDENT: https://github.com/0x8000-0000/samx-java/blob/master/src/main/antlr/SamXParser.g4#L224

florin

Eric Vergnaud

unread,
Jun 3, 2020, 9:17:02 AM6/3/20
to antlr-discussion
Hi,

sorry if the TokenStreamRewriter ref was misleading, what I meant is adjust these tokens outside the lexer, during parser consumption;

Eric

Florin Iucha

unread,
Jun 4, 2020, 8:36:47 PM6/4/20
to antlr-di...@googlegroups.com
Removing the call to atStartOfInput() reduces the lexing duration by a factor of 20x for my language which is where I would expect it to be.

florin
Reply all
Reply to author
Forward
0 new messages