Hi Mike,
Thank you for taking the time to review my grammars.
There were three uses for predicates:
1. Allowing a hash mark or asterisk at the beginning of a line to have a different meaning than inside the line. I have removed this, since it was only a small syntactic nicety. I pushed the change to the Java project but only tested it locally on the C++ project, which is why you still saw it there.
2. Indentation-driven nesting levels (Python-like): synthesizing INDENT / DEDENT tokens as described in the first ANTLR book.
3. Escaping blocks of text in another language: the EXTCODE token.
From searching the web and forums, it seems the solution for the second use (synthesizing INDENT / DEDENT tokens) is still the canonical one. I don't know what the true cost of calling atStartOfInput() is.
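For reference, the indent-stack idea behind the INDENT / DEDENT synthesis can be sketched outside of any lexer actions. This is only a minimal illustration in plain Python, not the grammar's actual code: the function name `indent_tokens` and the `(kind, value)` tuples are made up for the example, and it assumes indentation consists of spaces only.

```python
def indent_tokens(text):
    """Yield (kind, value) pairs, synthesizing INDENT/DEDENT from leading spaces.

    Hypothetical helper for illustration; assumes space-only indentation.
    """
    stack = [0]  # indentation widths of the currently open blocks
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank lines do not change the nesting level
        width = len(raw) - len(raw.lstrip(" "))
        if width > stack[-1]:
            # Deeper than the current block: open a new one.
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            # Shallower: close blocks until we are back at an enclosing level.
            stack.pop()
            yield ("DEDENT", width)
        if width != stack[-1]:
            raise ValueError(f"inconsistent dedent to column {width}")
        yield ("LINE", raw.strip())
    while len(stack) > 1:
        # Close any blocks still open at end of input.
        stack.pop()
        yield ("DEDENT", 0)
```

Calling `list(indent_tokens("a\n  b\nc\n"))` walks the three lines and emits one INDENT before `b` and one DEDENT before `c`. The per-line work is a width count plus a few stack operations, which is the part a predicate at the start of each line would guard.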
Do you have a suggestion for how to handle EXTCODE? What I need is for a token (CODE_MARKER) to indicate that, from that line on, as long as the indent level is higher than that of the CODE_MARKER token, nothing should be lexed; everything should instead be stored verbatim (spaces included) in a single token (EXTCODE). This is similar to multi-line strings in Python. I have looked at the "longstring" and "longstringitem" rules in the Python3 grammar in the grammars-v4 project, but I am not clear on how I would apply them. The main difference between EXTCODE and multi-line strings is that in SamX there may be useful metadata at the beginning of a code block, such as its name:
    ```(xml)(#example)
        <section xml:lang="es">
        <citation>Mother McCree</citation>
        <title>Greeting</title>
        <p>Feliz Navidad</p>
        </section>
Also, I still need to track indentation through an EXTCODE block, both to reproduce the block verbatim and to know when it ends, since there is no explicit end marker.
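To make the requirement concrete, here is a rough line-based sketch of the behavior I am after, written in plain Python rather than as ANTLR rules. Everything in it is an assumption made for illustration: the marker shape (three backticks followed by parenthesized metadata groups), the names `CODE_MARKER_RE` and `split_extcode`, and the choice to keep blank lines inside a block while handing trailing blank lines back to the surrounding text.

```python
import re

# Hypothetical marker shape: optional spaces, three backticks, then zero or
# more "(...)" metadata groups such as (xml)(#example).
CODE_MARKER_RE = re.compile(r"^( *)`{3}((?:\([^)]*\))*)\s*$")


def indent_of(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def split_extcode(text):
    """Yield ("LINE", str), ("CODE_MARKER", [metadata]) and ("EXTCODE", str)."""
    lines = text.split("\n")
    i = 0
    while i < len(lines):
        m = CODE_MARKER_RE.match(lines[i])
        if not m:
            yield ("LINE", lines[i])  # ordinary line, left for the main lexer
            i += 1
            continue
        marker_indent = len(m.group(1))
        yield ("CODE_MARKER", re.findall(r"\(([^)]*)\)", m.group(2)))
        i += 1
        body = []
        # The block runs until the first non-blank line that is indented no
        # deeper than the marker; blank lines inside the block are kept.
        while i < len(lines) and (
            not lines[i].strip() or indent_of(lines[i]) > marker_indent
        ):
            body.append(lines[i])
            i += 1
        # Trailing blank lines are handed back to the surrounding text.
        while body and not body[-1].strip():
            body.pop()
            i -= 1
        yield ("EXTCODE", "\n".join(body))  # verbatim, leading spaces included
```

The EXTCODE value keeps every leading space, so the original indentation can be reproduced, and the end of the block falls out of the indent comparison without any explicit end marker.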
I am wondering whether there is a simple way to daisy-chain lexers, the way a lexer is piped into a parser. From reading various postings, it seems this should be possible (especially since lexers and parsers share quite a bit of the implementation). I am considering converting uses 2 and 3 above (indentation and EXTCODE) into separate "mini-lexers" that run ahead of the main lexer, so that the main lexer can focus on the actual tokens instead of indentation and escaped code.
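The shape I have in mind is roughly the following, sketched in plain Python with generators standing in for the lexer stages. None of this is ANTLR API: the stage names (`raw_lines`, `with_indentation`, `main_lexer`) and the tuple formats are invented for the illustration, and the "main lexer" here just splits words. In the ANTLR4 Java runtime the equivalent would presumably be a wrapper implementing TokenSource around the generated lexer, since CommonTokenStream is constructed from a TokenSource, but I have not tried that.

```python
def raw_lines(text):
    """Stage 0: one ("LINE", indent_width, content) item per non-blank line."""
    for raw in text.splitlines():
        if raw.strip():
            yield ("LINE", len(raw) - len(raw.lstrip(" ")), raw.strip())


def with_indentation(items):
    """Stage 1: turn per-line indent widths into INDENT/DEDENT items."""
    stack = [0]  # widths of the currently open blocks
    for kind, width, content in items:
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", "")
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", "")
        yield (kind, content)
    for _ in stack[1:]:
        yield ("DEDENT", "")  # close blocks still open at end of input


def main_lexer(items):
    """Stage 2: stand-in for the real lexer; it never sees indentation."""
    for kind, content in items:
        if kind == "LINE":
            for word in content.split():
                yield ("WORD", word)
            yield ("NEWLINE", "")
        else:
            yield (kind, content)  # pass structural items through unchanged


def pipeline(text):
    """Chain the stages the way a lexer is chained into a parser."""
    return main_lexer(with_indentation(raw_lines(text)))
```

Each stage only consumes the output of the previous one, so an EXTCODE stage could be slotted in ahead of the indentation stage without the main lexer changing at all.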
Thank you for all your work on ANTLR4!
florin