Switching token streams (include files) for ANTLR4

1,401 views
Skip to first unread message

Kevin Cummings

unread,
Aug 30, 2016, 12:14:57 PM8/30/16
to antlr-di...@googlegroups.com
What is the preferred method for switching token streams in order to
handle "include" files? ISTR that there was a way to push a new token
lexer onto a stack in ANTLR3 and pop it back when it hit EOF. I
currently need both a new file stream and possibly a string stream as
well. I'm currently using the new Cpp target if that helps.

--
Kevin J. Cummings
kjc...@verizon.net
cumm...@kjchome.homeip.net
cumm...@kjc386.framingham.ma.us
Registered Linux User #1232 (http://www.linuxcounter.net/)

Jim Idle

unread,
Aug 30, 2016, 9:58:56 PM8/30/16
to antlr-di...@googlegroups.com
I did implement a push pop mechanism directly in the C runtime, but it was only in the C runtime. I am sure that it could be added to the v4 Cpp target - it is useful. But I can't speak to what other priorities are there for the Cpp target. 

It is also easy to write a preprocessor that streams everything as one stream, tagging lines with the filename and line number. Much as the C preprocessor. But you probably knew that.

Jim

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussion+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gerald Rosenberg

unread,
Aug 31, 2016, 12:26:00 AM8/31/16
to antlr-discussion
In Antlr3, multiplexing the input streams is desirable, given the parser does an AST rewrite. i.e., nearly impossible to merge custom ASTs.

In Antlr4, functional merger of parse trees is fairly straight-forward:

1) parse the base file without expanding the include statements
2) walk the parse tree, parsing the files referenced by the include statements
    - store each resulting parse tree in the include statement context of the base parse tree
    - recurse as appropriate
3) build a symbol table
4) walk the base parse tree
    - at each include statement context, pass the symbol table to a walker for the included parse tree,
    - in all other contexts, evaluate the context subject to the then accumulated contents of the symbol table

I have used this approach several times with success. And, the design is relatively easy to maintain.

Mike Lischke

unread,
Aug 31, 2016, 4:39:15 AM8/31/16
to antlr-di...@googlegroups.com
I did implement a push pop mechanism directly in the C runtime, but it was only in the C runtime. I am sure that it could be added to the v4 Cpp target - it is useful. But I can't speak to what other priorities are there for the Cpp target. 

Currently, when you set a token stream in a parser the parser is reset. One would have to check what happens without the reset and if parsing continues properly, when you  exchange the token stream. Simply changing the char stream in a lexer doesn't work, btw, as token streams can be buffered.


Jim Idle

unread,
Aug 31, 2016, 8:15:38 AM8/31/16
to antlr-di...@googlegroups.com
Yes it was done in the lexer, not the Parser of course. 





--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.

Jim Idle

unread,
Aug 31, 2016, 10:00:11 PM8/31/16
to antlr-di...@googlegroups.com
I was thinking about this and I think your tree based solution won't work for the majority of languages that allow includes without an awful lot of later processing while walking the that is a lot easier to do in a pre-processing or direct include stage. Things like

jim1.h
#define MACRO(c) c ? dec() : inc();

#ifdef __CPLUS
...
#endif

main.c
#include "jim1.h"

#ifdef FRED
 stufff
#endif

And lots of other cases. 

That can be done later of course, but it's surely a lot of work?

Jim


--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussion+unsubscribe@googlegroups.com.

Kevin Cummings

unread,
Aug 31, 2016, 11:33:11 PM8/31/16
to antlr-di...@googlegroups.com
On 08/30/16 21:58, Jim Idle wrote:
> I did implement a push pop mechanism directly in the C runtime, but it
> was only in the C runtime. I am sure that it could be added to the v4
> Cpp target - it is useful. But I can't speak to what other priorities
> are there for the Cpp target.

Thanks Jim,

Yes, this was exactly what I was thinking of and looking for. I know it
isn't high priority, but it would be nice to know what state needs to be
saved/restored so that tokenizing can proceed normally around changing
the input char stream.

I know that the "include" text can be skipped.

When doing the macro part of it, I would need to recognize that an IDENT
is recognized as a macro (in the lexer), skip it, and switch the char
stream to a buffer containing the replacing text and then continue the
token recognition with the string buffer. Then, restore the original
char stream upon the end of the replacement string. Yeah, the
interaction of the 2 methods may get tedious, trying to keep them
compatible with each other.

> It is also easy to write a preprocessor that streams everything as one
> stream, tagging lines with the filename and line number. Much as the C
> preprocessor. But you probably knew that.

Yeah, I wasn't looking forward to doing that. B^) I would have to
lex/parse most of the source code (and include files) looking for the
macro definitions and replace them in a temp source file that would then
need to be lexed parsed again in order to produce the parse trees for
even further processing. yeuk! but, after I solve the above problems,
I may yet have to return to process some "if" directives....

I wrote a PL/1 frontend that both handled %include and %replace in PCCTS
(ANTLR-1, C target), and also handled the Full PL/1 pre-processor
language (if, do, etc). But that was ANTLR-1 where I could easily
inject tokens into the token stream. In ANTLR2 (C++ target), there was
the #[ID,text] token syntax for creating tokens "on-the-fly". The next
time I tried, was in ANTLR-3 (C target), and while I wasn't successful,
I came close. Now I am trying again in ANTLR-4, now that it has C++
support. B^)

> Jim

Jim Idle

unread,
Aug 31, 2016, 11:44:08 PM8/31/16
to antlr-di...@googlegroups.com
I bet you could look at what I did in the v3 runtime and just try to implement it in the v4 runtime. I imagine that the state I preserved in v3 won't be too many miles away from v4.

Also, you could just implement a pre-processor in m3 and pipe the output to your input stream. Still might be work though!

Jim

Gerald Rosenberg

unread,
Sep 1, 2016, 12:45:08 AM9/1/16
to antlr-di...@googlegroups.com
Hi Jim,

It might seem so at first, but in practice very little additional work is required.

Consider the simplest case where the included files are specified by the same grammar (as is the case for C/C++ style includes). The added work is limited to the include context handling - to fire a new instance of the parser against the included file on a first walker pass and, for subsequent passes, fire up new instances of the walker.

Notably, the lexer/parser and walker/listener code is the same for the base and include files.

If a symbol/state table is used, it is passed to each new walker instance on start and recovered at the walk's end. The continuity of the table is maintained - there is no need to be aware of whether the table is being evaluated in connection with the base file or an include. Well, except for reporting error locations. Since the base and include parse-trees are not actually merged, all of their line/col location information is accurate relative to it's tree. Just need to track the name of the "current" parse-tree's file being walked - easily done using the table itself.

Defines are just scoped additions to the table. Each ifdef context is evaluated in the walk against the table, and either the block's subtree is walked or not.

There is one case (at least) that is not easily handled by a stream switching approach: where the name of an included file is a computed value. For example:

@Include "config.nom"         // specifies the value of __JIM
@Define base = "someDir/"

@Ifdef __JIM
@Define file = "jim.nom"
@Else
@Define file = "gerald.nom"
@Endif

@Include $base + $file

For stream switching, you would need a custom pre-processor parser to figure out the name of the include file.

The functional tree merging approach handles this with no additional coding required.

Best,
Gerald
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.

Jim Idle

unread,
Sep 1, 2016, 2:52:17 AM9/1/16
to antlr-di...@googlegroups.com
Interesting approach, thanks for sharing it.

On Thu, Sep 1, 2016 at 12:45 PM, Gerald Rosenberg <gbrose...@gmail.com> wrote:
It might seem so at first, but in practice very little additional work is required.

Consider the simplest case where the included files are specified by the same grammar (as is the case for C/C++ style includes). The added work is limited to the include context handling - to fire a new instance of the parser against the included file on a first walker pass and, for subsequent passes, fire up new instances of the walker.

Notably, the lexer/parser and walker/listener code is the same for the base and include files.

If a symbol/state table is used, it is passed to each new walker instance on start and recovered at the walk's end. The continuity of the table is maintained - there is no need to be aware of whether the table is being evaluated in connection with the base file or an include. Well, except for reporting error locations. Since the base and include parse-trees are not actually merged, all of their line/col location information is accurate relative to it's tree. Just need to track the name of the "current" parse-tree's file being walked - easily done using the table itself.

Defines are just scoped additions to the table. Each ifdef context is evaluated in the walk against the table, and either the block's subtree is walked or not.

There is one case (at least) that is not easily handled by a stream switching approach: where the name of an included file is a computed value. For example:

@Include "config.nom"         // specifies the value of __JIM
@Define base = "someDir/"

@Ifdef __JIM
@Define file = "jim.nom"
@Else
@Define file = "gerald.nom"
@Endif

@Include $base + $file

For stream switching, you would need a custom pre-processor parser to figure out the name of the include file.

The functional tree merging approach handles this with no additional coding required.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussion+unsubscribe@googlegroups.com.

Mike Lischke

unread,
Sep 1, 2016, 3:52:10 AM9/1/16
to antlr-di...@googlegroups.com
Consider the simplest case where the included files are specified by the same grammar (as is the case for C/C++ style includes). The added work is limited to the include context handling - to fire a new instance of the parser against the included file on a first walker pass and, for subsequent passes, fire up new instances of the walker.

Doesn't that mean you have an own parser instance for every include file? This is exactly the situation that people wanted to overcome by inventing the token stream push/pop mechanism, as it might use a lot of memory otherwise, depending on the include structure.

Gerald Rosenberg

unread,
Sep 1, 2016, 1:54:30 PM9/1/16
to antlr-discussion
The parser, and walker, instances are immediately created, run, and released. For the parser, only the generated parse-trees are retained. For the walkers, only net additions to the table are retained.

The aggregate memory size of the base and include parse-trees is likely little different from that of a unitary parse-tree produced using the stream switching approach.

Hard to reason whether the stream switching or functional merging will require more CPU/memory allocation cycles.  On a warm VM, the repeated creation/release of a class is a fairly low-overhead operation. Even if functional merging consumes more cycles, I believe the practical difference will be negligible. 

Gerald

Mike Lischke

unread,
Sep 2, 2016, 3:46:33 AM9/2/16
to antlr-di...@googlegroups.com
> The parser, and walker, instances are immediately created, run, and released. For the parser, only the generated parse-trees are retained. For the walkers, only net additions to the table are retained.



True, there is no reason to keep the parsers alive, once they did their job.

Mike
--
www.soft-gems.net

Kevin Cummings

unread,
Sep 2, 2016, 8:24:21 PM9/2/16
to antlr-di...@googlegroups.com
Sorry, I mislabeled the email. It should have read: Switching input
Char streams to the Lexer. I'm trying to handle #include files (or
their equivalent) and some trivial macro replacements, I have not yet
got to the stage where I'm dealing with parser tree processing. B^{

I'm currently playing with a few compiler front-ends where this is
necessary do to the language specifications, and I'm trying to avoid
having a pre-processor pass (the original compiler implementations
didn't have them).

Thanks for the response though.
> kjc...@verizon.net <javascript:>
> cumm...@kjchome.homeip.net <javascript:>
> cumm...@kjc386.framingham.ma.us <javascript:>
> Registered Linux User #1232 (http://www.linuxcounter.net/)
>
> --
> You received this message because you are subscribed to the Google
> Groups "antlr-discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to antlr-discussi...@googlegroups.com
> <mailto:antlr-discussi...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Stan Sokorac

unread,
Dec 30, 2016, 12:06:25 AM12/30/16
to antlr-discussion
Gerald,

Sorry to revive a few months old thread -- I'm using a similar approach to yours to handle includes, and I only now ran into this discussion. Apart from the includes, though, I didn't think that I could avoid a pre-processing step to handle complex macros, and I wanted to see if you had any thoughts on that. 

So, going with Jim's example macro:

#define MACRO(c) c ? dec() : inc();

...if I have some code that uses the macro:

while (MACRO(val)) { print_value(val); }

...it seems that I need to either preprocess the file to swap MACRO(val) with 'val ? dec() : inc()' and then run the parser again, or switch token streams at MACRO and serve up the macro value, no? You don't fire up a new parser to parse the macro code, do you? It's not a given that the macro code is parseable outside of current parse context, as it could contain just a code fragment.

Stan

Gerald Rosenberg

unread,
Dec 30, 2016, 12:34:40 AM12/30/16
to antlr-di...@googlegroups.com
I would expect that the parser rule for your standard `while` statement would evaluate `MACRO(val)` as a standard method statement. So your parse tree would be something like

     statement WHILE (
            method MACRO (
                 parameter "val"
             )
     )

On walking the parse tree, evaluating method calls, you can first check to see if the method name matches an existing #define and do the substitution then. 

As for parsing the #define itself, the parser used will necessarily be specific to the syntax of the #define 'language'. Notably, the define parsing must produce a symbol for the symbol table or a parse-tree equivalent for functional substitution into the parse tree.

Feasibility in an particular instance will be highly dependent on the #define language and how much you are abusing good programming practices ;).

Otherwise, pre-processing may be the only option.

Gerald

Stanislav Sokorac

unread,
Dec 30, 2016, 10:31:47 AM12/30/16
to antlr-di...@googlegroups.com
At what point does the macro content get parsed, though? It could contain a whole bunch of statements, or a half-statement, such as:

#define TEST_CLASS (NAME) \
   class test_\NAME {  \
      int test_id;   

TEST_CLASS(foo)
  void do_test() { ... };
}

The code above will not parse unless you perform macro substitution.

Unless you put some limits on what can be in a macro, it seems like there's no way to avoid preprocessing... at which point, you might as well deal with includes in the preprocessor, too :(.

Stan

On Thu, Dec 29, 2016 at 11:34 PM, Gerald Rosenberg <gbrose...@gmail.com> wrote:
I would expect that the parser rule for your standard `while` statement would evaluate `
MACRO(val)` as a standard method statement. So your parse tree would be something like

....
     statement WHILE (
            statement MACRO (
val)
    )

On Thursday, December 29, 2016 at 9:06:25 PM UTC-8, Stan Sokorac wrote:

#define MACRO(c) c ? dec() : inc();

...if I have some code that uses the macro:

while (MACRO(val)) { print_value(val); }

...it seems that I need to either preprocess the file to swap MACRO(val) with 'val ? dec() : inc()' and then run the parser again, or switch token streams at MACRO and serve up the macro value, no? You don't fire up a new parser to parse the macro code, do you? It's not a given that the macro code is parseable outside of current parse context, as it could contain just a code fragment.

Stan

On Wednesday, August 31, 2016 at 11:45:08 PM UTC-5, Gerald Rosenberg wrote:
Hi Jim,

It might seem so at first, but in practice very little additional work is required.

Consider the simplest case where the included files are specified by the same grammar (as is the case for C/C++ style includes). The added work is limited to the include context handling - to fire a new instance of the parser against the included file on a first walker pass and, for subsequent passes, fire up new instances of the walker.

Notably, the lexer/parser and walker/listener code is the same for the base and include files.

If a symbol/state table is used, it is passed to each new walker instance on start and recovered at the walk's end. The continuity of the table is maintained - there is no need to be aware of whether the table is being evaluated in connection with the base file or an include. Well, except for reporting error locations. Since the base and include parse-trees are not actually merged, all of their line/col location information is accurate relative to it's tree. Just need to track the name of the "current" parse-tree's file being walked - easily done using the table itself.

Defines are just scoped additions to the table. Each ifdef context is evaluated in the walk against the table, and either the block's subtree is walked or not.

There is one case (at least) that is not easily handled by a stream switching approach: where the name of an included file is a computed value. For example:

@Include "config.nom"         // specifies the value of __JIM
@Define base = "someDir/"

@Ifdef __JIM
@Define file = "jim.nom"
@Else
@Define file = "gerald.nom"
@Endif

@Include $base + $file

For stream switching, you would need a custom pre-processor parser to figure out the name of the include file.

The functional tree merging approach handles this with no additional coding required.

Best,
Gerald

On Wednesday, August 31, 2016 at 7:00:11 PM UTC-7, Jim Idle wrote:
I was thinking about this and I think your tree based solution won't work for the majority of languages that allow includes without an awful lot of later processing while walking the that is a lot easier to do in a pre-processing or direct include stage. Things like

jim1.h
#define MACRO(c) c ? dec() : inc();

--
You received this message because you are subscribed to a topic in the Google Groups "antlr-discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/antlr-discussion/FUQbEtonUlw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to antlr-discussion+unsubscribe@googlegroups.com.

Mike Lischke

unread,
Dec 30, 2016, 10:35:50 AM12/30/16
to antlr-di...@googlegroups.com
The code above will not parse unless you perform macro substitution.

Unless you put some limits on what can be in a macro, it seems like there's no way to avoid preprocessing... at which point, you might as well deal with includes in the preprocessor, too :(.

Why is a preprocess such a big issue for you? It doesn't need to be a separate step in your parsing pipeline. You can do that at input stream level for instance. Have you seen my Resource Parser project (http://www.soft-gems.net/index.php/java/windows-resource-file-parser-and-converter)? This is parsing .rc files, which are in many aspects like normal .h files. So it can handle #include, macro substitution etc., but handles all the preprocessor stuff outside of the main parse run. Maybe it can give you some inspiration...


Gerald Rosenberg

unread,
Dec 30, 2016, 1:23:33 PM12/30/16
to antlr-discussion
The techniques that I outlined will not extend to this use (abuse) of a pre-processor.
To unsubscribe from this group and all its topics, send an email to antlr-discussi...@googlegroups.com.

Stanislav Sokorac

unread,
Dec 30, 2016, 1:51:47 PM12/30/16
to antlr-di...@googlegroups.com
The only issue is that it's a bit cumbersome to have to tag the preprocessor output stream with original locations in order to have subsequent parse errors reported accurately. Being able to treat each include file separately is nice and tidy, but extensive use of macros to replace bits of code is ruining that nice plan :). 

Thanks for the link to your project, I will definitely take a look!

Stanislav Sokorac

unread,
Dec 30, 2016, 1:53:15 PM12/30/16
to antlr-di...@googlegroups.com
Ha :). Yeah, that's what I figured. Unfortunately, there's plenty of such "abuse" in the code I'm dealing with, so it looks like I'm going to have to go with a full preprocessing step, or some kind of input stream switching.

To unsubscribe from this group and all its topics, send an email to antlr-discussion+unsubscribe@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages