Customize C++ Lexer to generate INDENT and DEDENT tokens

Marc Jacobi

Nov 29, 2019, 4:05:43 AM

to antlr-discussion

I am trying to parse indentation-based text, like Python.
I looked at the Python 3 grammar (on GitHub) and do not completely understand the code. To get a better feel for how it works, I was planning to experiment with it (in a separate project).

I do not want to put code into the grammar, so I aim to inject a custom Lexer base class to perform the indentation logic.

Looking at the C++ Lexer code, I am at a loss as to where and how to implement this logic.

Any tips would be helpful.

Related: I aim to produce a parse tree that has the correct nesting based on these indents. As this is my first real (more complex) grammar project, I am not sure if this is feasible.
I figured I would do as much as possible in the parser in terms of generating an AST...

Thanks,

Marc

Róbert Einhorn

Nov 29, 2019, 12:58:27 PM

to antlr-discussion

Hi Marc,

I hope this helps:
Starter Python

Robert

Marc Jacobi

Dec 5, 2019, 7:05:44 AM

to antlr-discussion

Thanks. I have looked at the code and concluded (I think) that my initial expectations were wrong, although I have not run the code to verify my findings.

What it seems to do is just emit INDENT and DEDENT tokens, whereas what I was looking for is a way to make indented content a child of the parent scope in the parse tree.
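
For illustration, the token-emitting behaviour described here can be sketched in plain C++ without ANTLR. This is not the code from the linked starter project: the names `Kind`, `Tok` and `tokenizeIndented` are hypothetical, and the sketch assumes space-only indentation and ignores blank lines. In an ANTLR lexer, comparable logic would presumably live in an overridden `nextToken()` of a custom lexer base class.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch (not ANTLR types).
enum class Kind { Indent, Dedent, Text };

struct Tok {
    Kind kind;
    std::string text;  // only meaningful for Kind::Text
};

// Turn indentation changes between lines into INDENT/DEDENT tokens,
// using a stack of the currently open indentation widths.
// Assumption: indentation consists of spaces only; blank lines are ignored.
std::vector<Tok> tokenizeIndented(const std::string& source) {
    std::vector<Tok> out;
    std::vector<std::size_t> indents{0};  // column 0 is always open
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t width = line.find_first_not_of(' ');
        if (width == std::string::npos) continue;  // empty or all-space line

        if (width > indents.back()) {
            // Deeper than the enclosing block: open one new level.
            indents.push_back(width);
            out.push_back({Kind::Indent, ""});
        } else {
            // Shallower: close levels until the widths match again.
            while (width < indents.back()) {
                indents.pop_back();
                out.push_back({Kind::Dedent, ""});
            }
            if (width != indents.back())
                throw std::runtime_error("inconsistent dedent");
        }
        out.push_back({Kind::Text, line.substr(width)});
    }
    // Close every block that is still open at end of input.
    for (; indents.size() > 1; indents.pop_back())
        out.push_back({Kind::Dedent, ""});
    return out;
}
```

Each INDENT is matched by exactly one DEDENT, so the flat token sequence still carries the full nesting information; what the parser does with it is a separate matter.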

Is this conclusion correct?

And is what I am trying to achieve feasible?

M.

Marc Jacobi

Dec 8, 2019, 3:06:54 AM

to antlr-discussion

Is it possible to have the parser output its tree in the hierarchy of the encountered INDENTs?

So

text

[INDENT] text

would output textContext with a child textContext...

How can I get this output from the parser?
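
To make the shape being asked about concrete, here is a minimal hand-written sketch, outside ANTLR, that turns a flat sequence of TEXT, INDENT and DEDENT tokens into nested nodes. The types `TokenType`, `Token` and `Node` and the functions `parseBlock` and `buildTree` are hypothetical names invented for this illustration; it is not the ANTLR-generated parser and says nothing about what that parser produces.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token and node types for this sketch (not ANTLR classes).
enum class TokenType { Text, Indent, Dedent };

struct Token {
    TokenType type;
    std::string text;
};

struct Node {
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Parse the siblings of one indentation level, starting at tokens[pos].
// Grammar idea: block : (TEXT (INDENT block DEDENT)?)* ;
// An indented block becomes the children of the TEXT line just before it.
void parseBlock(const std::vector<Token>& tokens, std::size_t& pos,
                std::vector<std::unique_ptr<Node>>& siblings) {
    while (pos < tokens.size() && tokens[pos].type == TokenType::Text) {
        auto node = std::make_unique<Node>();
        node->text = tokens[pos++].text;
        if (pos < tokens.size() && tokens[pos].type == TokenType::Indent) {
            ++pos;  // consume INDENT
            parseBlock(tokens, pos, node->children);
            if (pos >= tokens.size() || tokens[pos].type != TokenType::Dedent)
                throw std::runtime_error("expected DEDENT");
            ++pos;  // consume DEDENT
        }
        siblings.push_back(std::move(node));
    }
}

// Entry point: returns the top-level nodes of the whole token sequence.
std::vector<std::unique_ptr<Node>> buildTree(const std::vector<Token>& tokens) {
    std::vector<std::unique_ptr<Node>> roots;
    std::size_t pos = 0;
    parseBlock(tokens, pos, roots);
    if (pos != tokens.size())
        throw std::runtime_error("unexpected token");
    return roots;
}
```

For the input `text`, `INDENT`, `text`, `DEDENT`, `buildTree` returns a single root node whose `children` holds one node for the indented line.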

M.

Mike Lischke

Dec 8, 2019, 4:44:34 AM

to antlr-discussion

> Is it possible to have the parser output its tree in the hierarchy of the encountered INDENTs?
>
> So
>
> text
> [INDENT] text
>
> would output textContext with a child textContext...

No, you cannot change that. The parse tree always has the same structure. However, it's easy to get the indentation. The TextContext contains the tokens it was made from. Get the token stream index of its start token. Then look in the BufferedTokenStream one index before that start index; the token there must be the whitespace between the text token and the token before it. Count the tabs/spaces after the last line break in it to get the indentation.
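
The counting step can be sketched as a small standalone function. It assumes the indentation is the run of spaces and tabs between the last line break in the whitespace text and the token that follows, and that a tab advances to the next multiple of a configurable tab width; `indentationOf` and `tabWidth` are hypothetical names. With the ANTLR C++ runtime, the input string would presumably come from something like `tokens->get(ctx->getStart()->getTokenIndex() - 1)->getText()`, which only works if the grammar sends whitespace to a hidden channel instead of skipping it.

```cpp
#include <cstddef>
#include <string>

// Width of the indentation encoded in a whitespace token's text.
// Only the characters after the last line break count, because they
// are the ones sitting on the same line as the following token.
// If there is no line break, the whole text is treated as indentation
// (for example on the first line of the input).
// Assumption: a tab advances to the next multiple of tabWidth.
std::size_t indentationOf(const std::string& whitespace,
                          std::size_t tabWidth = 4) {
    const std::size_t lastBreak = whitespace.find_last_of("\r\n");
    const std::size_t start =
        (lastBreak == std::string::npos) ? 0 : lastBreak + 1;

    std::size_t column = 0;
    for (std::size_t i = start; i < whitespace.size(); ++i) {
        if (whitespace[i] == '\t')
            column += tabWidth - (column % tabWidth);  // jump to next tab stop
        else if (whitespace[i] == ' ')
            ++column;
    }
    return column;
}
```

Comparing the result for consecutive lines gives the relative nesting. Whether tabs and spaces may be mixed at all is a policy decision this sketch does not make.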

Mike
--
www.soft-gems.net
