Still floundering trying to grok "transformation" in Antlr 4

Steve Ebersole

unread,

Nov 13, 2014, 1:29:58 PM11/13/14

to antlr-di...@googlegroups.com

Following up on my initial discussion, I am still having problems understanding the practical steps to performing transformations with Antlr 4. Antlr 4 does not support tree re-writing, so I get that that will not be an option. The 2 approaches that get mentioned for use with Antlr 4 include:

"decoration" - I understand the decorator pattern in principal. However, I have no idea how to apply a decoration to the parse tree nodes in Antlr 4. I have seen mention (here, for example) that decorations are covered in the Antlr 4 book, but I could find no reference to "decorat" in a search of my copy.
"symbol tables" - I understand the concept of building up symbol tables in theory. But I just don't see how this helps in my cases.

I was writing up an email on my Antlr 3 vs Antlr 4 thoughts for our dev mailing list. Here are some trees from a simple query (select c.headquarters.state.code from Company c) that I used to illustrate my concerns with Antlr 4 for our use. Simple syntactic analysis will produce a tree something like:

[QUERY]

[SELECT]

[DOT]

[IDENT, "c"]

[IDENT, "headquarters"]

[IDENT, "state"]

[IDENT, "code"]

[FROM]

[SPACE]

[SPACE_ROOT]

[IDENT, "Customer"]

[IDENT, "c"]

There is not a lot of semantic (meaning) information here. A more semantic representation of the query would look something like:

[QUERY]

[SELECT]

[ATTRIBUTE_REF]

[ALIAS_REF, "<gen:1>"]

[IDENT, "code"]

[FROM]

[SPACE]

[PERSISTER_REF]

[ENTITY_NAME, "com.acme.Customer"]

[ALIAS, "c"]

[JOIN]

[INNER]

[ATTRIBUTE_JOIN]

[IDENT, "headquarters"]

[ALIAS, "<gen:0>"]

[JOIN]

[INNER]

[ATTRIBUTE_JOIN]

[IDENT, "state"]

[ALIAS, "<gen:1>"]

The first tree would be the Antlr 4 parse tree, and unless I miss something regardless of whatever else I do, Antlr 4 is always going to make me walk the tree in that form (aka, following those rules).

I understand that ideally we would probably try to align the initial tree more with the structure of the second tree. Problem is that that is not always doable. The big culprit for me is these DOT-IDENT structures because they can represent a wide range of things based on context and other analysis. In this example, for example, I could definitively know that "c.headquarters..." represents an attribute-reference because the path starts with an alias-reference; but of course I cannot know that unless I could analyze the FROM clause first, which I cannot do with Antlr 4 AFAIK.

As I said, I just don't see how externalized symbol tables would help me here if that means I am still walking the tree based on the parse rules. Am I maybe missing something there?

And any pointers on how to practically handle something like this with decoration? Can decoration handle stuff like this?

Thanks for any help.

David Whitten

unread,

Nov 13, 2014, 2:10:34 PM11/13/14

to antlr-di...@googlegroups.com

Could you walk the parse-tree and generate a semantic tree?

It seems that if you have a way to do a tree transform, you could

have code at each tree-division point of the parse tree that would create a new tree with the
data you have.

for example, if the action on the [DOT] node looked to see
if it was a sub-node of a [DOT] node and did nothing,
but created a new [ENTITY_NAME] node when it was the "top" of a subtree that has [IDENT] and [DOT] nodes below it.

This would allow you to build a new tree that was organized the way you would want, doing a single pass visit of all the parse-tree nodes.

Do I understand your problem, or is there a complexity that I don't understand? (or is this method so onerous for some grammars and trees that
it isn't practical to do it this way ?)

Best Wishes,
Dave Whitten
713-870-3834

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Steve Ebersole

unread,

Nov 13, 2014, 2:22:15 PM11/13/14

to antlr-di...@googlegroups.com, whi...@netcom.com

Hi Dave, thanks for the reply!

I think you do understand it, yes. My concern is not so much creating the new tree. I know I can do that. My concern is then walking that new tree. If I understand correctly, I wont be able to use the Antlr 4 generated walker (listeners/visitors)...

David Whitten

unread,

Nov 13, 2014, 3:23:55 PM11/13/14

to antlr-di...@googlegroups.com

Okay, I see.

Perhaps you can define your new tree structure as a grammar, then
feed that grammar into ANTLR and have it generate the walker/listener/visitor

code for you? You wouldn't use the "semantic" grammar to generate a tree
while it is parsing, but just to allow you to walk the new generated trees.
ANTLR is a pretty cool tool to do a lot of grunt work for you.

Best Wishes,
Dave Whitten
713-870-3834

Steve Ebersole

unread,

Nov 13, 2014, 8:55:43 PM11/13/14

to antlr-di...@googlegroups.com, whi...@netcom.com

I had wondered about this myself. This essentially works just like our Antlr 2 and 3 based solution with tree grammars. In essence your "semantic grammar" *is* an Antlr 2/3 tree grammar. I don't know if that will work in practice. I guess it could work as long the "new generated tree" implements/extends the proper Antlr interface/class - I have no idea about the requirements there. I will definitely look into this some over the next few days. Thanks!

I would still like to understand how tree decoration works in practice, if anyone has pointers. Do you see decoration working here?

David Whitten

unread,

Nov 13, 2014, 10:03:32 PM11/13/14

to Steve Ebersole, antlr-di...@googlegroups.com

Maybe you could put both grammars into the same .g4 file, but I would
be concerned that the rules that were solely for the semantic tree
would be used by parsing, or that the generated parser might try to
use them.

Does ANTLR support two grammars that are independent, perhaps with
differing input streams ? Maybe one for parsing from an TCP/IP
connection and another from a text file?

I could see how the different input sources might have a different
line protocol. A verbose one for communication with humans, and
another that was much terser for communicating machine-to-machine.

Dave
713-870-3834

Gerald Rosenberg

unread,

Nov 14, 2014, 12:32:48 AM11/14/14

to antlr-di...@googlegroups.com

On Thursday, November 13, 2014 10:29:58 AM UTC-8, Steve Ebersole wrote:

Following up on my initial discussion, I am still having problems understanding the practical steps to performing transformations with Antlr 4. Antlr 4 does not support tree re-writing, so I get that that will not be an option. The 2 approaches that get mentioned for use with Antlr 4 include:
"decoration" - I understand the decorator pattern in principal. However, I have no idea how to apply a decoration to the parse tree nodes in Antlr 4. I have seen mention (here, for example) that decorations are covered in the Antlr 4 book, but I could find no reference to "decorat" in a search of my copy.

Look at `org.antlr.v4.runtime.tree.ParseTreeProperty`. Basically a Map that allows you to associate your own class object(s) with the node instances in the parse tree.

"symbol tables" - I understand the concept of building up symbol tables in theory. But I just don't see how this helps in my cases.

In your example, not much may be needed. If the aliases have no scope, then just a map of alias name to value is needed. If they do have some defined scope, you will need a simple push-down style symbol table.

Remember, walking the parse tree is very cheap. And, that you can strategically act on the visitor enter and/or exit of just the nodes of current interest.

First walk, decorate the tree with your node-type specific objects and do basic decorator object init stuff (whatever you need).
Second walk, gather your alias definitions from the clauses where aliases are defined.
Third walk, on visitor exit from each node, carry up any meaningful data to a decorator object of your choice. When you get to a field that matches the alias, resolve against the symbol table to pull in the association.
Fourth walk, whatever your specific requirements are.

At this point you will likely have filled out decorator objects for the primary statement nodes, query, select, from, etc., each with fields with the appropriate characterizing data.

For analysis, rather than walking an AST and looking at the type of node you are on, walk the parse tree and look at what the decorator then says about the node. How semantic the parse tree is depends solely on how semantic you make the decorators. The content of the parse tree and an AST derived from the same source is always going to be the same.

[QUERY]

[SELECT]

[DOT]

[IDENT, "c"]

[IDENT, "headquarters"]

[IDENT, "state"]

[IDENT, "code"]

[FROM]

[SPACE]

[SPACE_ROOT]

[IDENT, "Customer"]

[IDENT, "c"]

Don't mean to be snarky, but I am sure you can build a richer parse tree than this to start from.

Steve Ebersole

unread,

Nov 14, 2014, 8:48:59 AM11/14/14

to antlr-di...@googlegroups.com

On Thursday, November 13, 2014 11:32:48 PM UTC-6, Gerald Rosenberg wrote:

Remember, walking the parse tree is very cheap. And, that you can strategically act on the visitor enter and/or exit of just the nodes of current interest.

First walk, decorate the tree with your node-type specific objects and do basic decorator object init stuff (whatever you need).
Second walk, gather your alias definitions from the clauses where aliases are defined.
Third walk, on visitor exit from each node, carry up any meaningful data to a decorator object of your choice. When you get to a field that matches the alias, resolve against the symbol table to pull in the association.
Fourth walk, whatever your specific requirements are.

At this point you will likely have filled out decorator objects for the primary statement nodes, query, select, from, etc., each with fields with the appropriate characterizing data.

I can understand this in principal. Of course devil's in the details. And having never built a parser this way, its daunting looking up that hill. The big concern/unknown for me is the fact that I am still walking the parse tree rules, not the more rich "semantic rules". David's suggestion is "safer" to me because in this regard because it follows the paradigm I am familiar with from my Antlr v2/v3 work. Although even there I am uncertain of properly building a separate parse tree such that subsequent Antlr-generated walkers can walk them.

But at least I better understand the details of implementing decoration in Antlr. Thanks!

For analysis, rather than walking an AST and looking at the type of node you are on, walk the parse tree and look at what the decorator then says about the node. How semantic the parse tree is depends solely on how semantic you make the decorators. The content of the parse tree and an AST derived from the same source is always going to be the same.

This is not the first time I have heard reference to parse tree and AST as different concepts in Antlr 4. But I have not seen concrete examples or discussions of what exactly an AST is (as different from a parse tree) in Antlr v4. Can anyone explain this to me? Is an AST just the "semantic tree" I choose to build or not as I walk the parse tree, but external to the parse tree?

As far as "how semantic I make the decorators", well that's the rub. It's a lot easier to find your way through the trees when you at least understand the forest and understand where you need to get on the other side of it :)

[QUERY]
[SELECT]
[DOT]
[DOT]
[DOT]
[IDENT, "c"]
[IDENT, "headquarters"]
[IDENT, "state"]
[IDENT, "code"]
[FROM]
[SPACE]
[SPACE_ROOT]
[IDENT, "Customer"]
[IDENT, "c"]

Don't mean to be snarky, but I am sure you can build a richer parse tree than this to start from.

Perhaps. I certainly would never claim that the above is the best possible structure. But it is the best possible structure I know how to come up with with my limited, non-practical experience with Antlr v4 so far.

By "richer" I assume you mean "more semantic"? Ala, something like:

[QUERY]

[SELECT]

[ATTRIBUTE_REF]

[DOT]

[IDENT, "c"]

[IDENT, "headquarters"]

[IDENT, "state"]

[IDENT, "code"]

...

?

I can see where this would be nice, especially in the decoration case because I would be able to attach semantic information/decoration to the [ATTRIBUTE_REF] node and just never walk the rest of its sub-tree (if I understand you).

The problem is that I really cannot do that *I think*. These DOT-IDENT references have many possibly meanings:

1) Java class name (e.g., com.acme.Brick)

2) Java static field reference (e.g., com.acme.Color.RED)

3) attribute reference (like above)

4) SQL function (the dots come from Oracle's packages concept)

I could resolve (1) and (2) definitively even while building the parse tree; I could attempt to resolve the references from ClassLoader. But accessing the ClassLoader for every DOT-IDENT just to attempt to determine its type is going to be expensive (I know, we do something similar now). (4) could be resolved by the fact that it is usually followed by a open-parenthesis. But, it is not always followed by an open-parenthesis.

I can come up with a scheme to definitively resolve function references. But I would really prefer to stay away from ClassLoader access as the means for semantic decision making.

So, ultimately not sure how I can really make this tree richer. Initially anyway. I can once I know the basis for all attribute references (which I know after I process the FROM clause).

Kevin Cummings

unread,

Nov 14, 2014, 3:43:05 PM11/14/14

to antlr-di...@googlegroups.com

On 11/14/2014 08:48 AM, Steve Ebersole wrote:
>
>
> For analysis, rather than walking an AST and looking at the type of
> node you are on, walk the parse tree and look at what the decorator
> then says about the node. How semantic the parse tree is depends
> solely on how semantic you make the decorators. The content of the
> parse tree and an AST derived from the same source is always going
> to be the same.
>
>
> This is not the first time I have heard reference to parse tree and AST
> as different concepts in Antlr 4. But I have not seen concrete examples
> or discussions of what exactly an AST is (as different from a parse
> tree) in Antlr v4. Can anyone explain this to me? Is an AST just the
> "semantic tree" I choose to build or not as I walk the parse tree, but
> external to the parse tree?

Steve, AST stands for "Abstract Syntax Tree". If you make no changes to
it, it is the same as the Parse tree. But, in earlier versions of
ANTLR, when you could "exclude" tokens from the AST, and otherwise
"transform" it, while you are parsing, that made it different from the
Parse Tree which is simply every token in the order it was parsed. The
AST could be re-written to be "rooted" and easier to walk, while it was
being parsed. This "feature" is currently missing from ANTLR4.

I am an "old school" compiler engineer. All of the compilers I have
worked on created ASTs that were then subsequently transformed by the
various phases of the compiler, right up until the code generator which
would then generate "code" from the final AST. All of the semantic
analysis that happened along the way would transform the AST into
something that the code generator could deal with.

> As far as "how semantic I make the decorators", well that's the rub.
> It's a lot easier to find your way through the trees when you at least
> understand the forest and understand where you need to get on the other
> side of it :)

"decorating" your tree is simply the act of applying your semantic
analysis to your AST so that further phases can better deal with your
tree. Whether that means changing the current AST or generating a new
one entirely is up to you. Older versions of ANTLR would allow you to
modify the AST in tree walkers. Newer versions want to deprecate that
in favor of Listeners/Visitors. For really serious code generation,
LLVM provides a homogeneous AST that can be re-used. Modern thinking is
that if you need this sort of thing, then your parser should probably
transform the ANTLR AST into an LLVM AST that can then be massaged
further down the road. Ter posted some example code of how to interface
with LLVM a couple of years ago.

For small DSLs, ANTLR in its current form seems to work well. for
Larger DSLs, a transformation into something like the LLVM AST seems
like a no-brainer.

Disclaimer: I wrote a PL/I front-end in ANTLR/DLG/Sorcerer 20 years ago
(written in C). I have also ported an LALR(1) big-data grammar to
ANTLR2 8 years ago (in C++). I started to write some old-style compiler
frontends in ANTLR3, but found the lack of direct C++ support
frustrating. The lack of AST transformations (and native C++ support)
is keeping me from using ANTLR4.

--
Kevin J. Cummings
kjc...@verizon.net
cumm...@kjchome.homeip.net
cumm...@kjc386.framingham.ma.us
Registered Linux User #1232 (http://www.linuxcounter.net/)

Gerald Rosenberg

unread,

Nov 14, 2014, 9:48:35 PM11/14/14

to antlr-di...@googlegroups.com

On Friday, November 14, 2014 5:48:59 AM UTC-8, Steve Ebersole wrote:

I can understand this in principal. Of course devil's in the details. And having never built a parser this way, its daunting looking up that hill.

What? Writing a grammar in Antlr4 is nearly identical to writing an Antlr3 grammar. Just without the tree rewrite rules embedded in the parser grammar.

And, if you really want an AST, just build it. The Antlr3 AST nodes are just wrappers of the Antlr3 tokens class. Pull the Antlr3 CommonTree class source into your project and use with the Antlr4 tokens class to get all the same parent/child tree construction plumbing that Antlr3 uses to build its AST. If memory serves, you just need CommonTree, BaseTree and Tree.

The rewrite rules are just a meta-notation describing how to select and hang parser nodes together. Implement each Antlr3 rewrite rule in an Action added to the corresponding Antlr4 parser rule -- makes it look almost like Antlr3.

Each Action just needs to copy meaningful parser nodes (the ones you would identify in rewrite rules) into new 'semantic' AST wrapper nodes (? extends CommonTree) and then hang them in your desired AST order to progressively build the tree. The node copying plumbing of Antlr3 is still present in Antlr4 tokens. The end result is an AST that is little different than what Antlr3 would produce.

Then walk your AST (Antlr3 TreeVisitor) -- no parse tree needed or used. Again, all of the plumbing for walking an AST as used in Antlr3 will be present in the wrapped Antlr4 tokens.

Yes it is a bit more work, but the work is little more than rearranging deck chairs on the Titanic, except there are no icebergs and you are given the Titanic to start with.

If you really want help, then show a true representative sample of your input and your Antlr4 grammar.

Steve Ebersole

unread,

Nov 19, 2014, 4:38:42 PM11/19/14

to antlr-di...@googlegroups.com

Gerald, I appreciate the help so far, but try to understand that when someone has not programmed a certain language or style or approach etc it is often daunting to know where to start. I am glad you feel comfortable programming that way. I have never done it.

http://github.com/sebersole/hibernate-antlr4-poc is the project where I am tracking this PoC. Again all this is a proof-of-concept to try to glean whether Antlr 4 will work for our needs. So the grammar is pretty simplified at this point.

src/main/antlr4/org/hibernate/hql/antlr/HqlLexer.g4 is the lexer grammar

src/main/antlr4/org/hibernate/hql/antlr/HqlParser.g4 is the parser grammar

As for the input, its a query language; we receive it from the users. So there is no *the* input. I can show you some of the (what I see as) more complicating aspects. Again, it mainly comes down to the different semantics of these DOT-IDENT sequences and getting that encoded well into the tree. And please be aware that a lot of this is mandated to me by a spec group and I have no control to be able to "refine" the language to remove the ambiguities.

Anyway, here is a contrived example query to illustrate the ambiguities I have:

select (my_oracle_package.special_current_time_func - o.expectedShipDate) * com.acme.TimeUtils.DAYS_TO_HOURS

from com.acme.Order o

join o.shippingGroup s

on ...

join com.acme.ShippingArea sa

on s.shipTo.zipCode = sa.zipCode

where s.fulfilled = false

--
You received this message because you are subscribed to a topic in the Google Groups "antlr-discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/antlr-discussion/hzF_YrzfDKo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to antlr-discussi...@googlegroups.com.

Gerald Rosenberg

unread,

Nov 19, 2014, 11:31:19 PM11/19/14

to antlr-di...@googlegroups.com, st...@hibernate.org

If you are looking for assurances, then quite clearly, yes, translation from HQL/JPA to SQL is exactly the kind of problem Antlr4 is designed for. Given the HQL/JPA/SQL specs, there is no real ambiguity in the translation -- conceptually it is not too far off from being a 1-to-1 translation, but a pain because the differences are at such a low-level.

Your question then is really just about the mechanics. The project is no doubt large, but you are looking at just a handful of techniques to be mastered and then applied many times over. Ter has written a number of blog posts on Antlr.org about the preference of using a parse tree over an AST. Read them, or read them again - the simple takeaway is that ASTs become brittle under progressive transformations - they are in all meaningful ways self-modifying code. In a project as large as this, the stability that an immutable parse tree gives far outweighs any perceived benefit of re-arranging the AST.

FWIW, in hindsight, I am not particularly fond of parse trees. Hopefully, Antlr5 will allow the parser grammar to directly define key aspects of the form and content of the tree produced. But still an immutable tree once produced, with generated listeners & visitors as now provided. Having just the one level of transformation could significantly simplify analysis. More than one, in purely practical terms, is an anti-pattern.

So, for now, use the tool as designed. The structured parse-tree driven approach it most directly supports will be, if not enthusiastically, welcomed in the end.

The individual parse tree nodes are actually as 'semantic' as any AST nodes that Antlr3 would produce. In the generated parser, each different parser rule is represented by a unique 'context' class object. The queryExpression rule will be represented by a queryExpressionContext() class containing lists and references to querySpecContext(), union_keyContext(), intersect_keyContext(), except_keyContext(), all_keyContext(), and querySpecContext() classes.

Take another look at the generated parser in this light -- you should recognize that each instance of IDENTIFIER will occur in a sub-branch of the parse tree well-characterized by the series of parent contexts that connect it to the tree. An IDENTIFIER instance referenced from a parent instance of type aliasReferenceContext() must represent an alias.

Now, the last aspect of your question -- where to begin -- is simply with the grammar. Looks like it is a conversion from the Antlr2 grammar. It is a good start, but there are better ways of handling things, such as key words, that will make the grammar produce a simpler and more helpful parse tree for analysis.

For example, the @members methods could be removed, with the *_key rules being simplified to explicit elements:

all_key: ALL IDENTIFIER ;

You can add labels that will become fields in the context class, which can make it a bit easier to access the discrete elements of a context class

all_key: k=ALL id=IDENTIFIER ;

In the lexer, the tokens block could be removed with corresponding token rules being defined. The Antlr4 lexer offers 'modes' that allow well-defined subsets of tokens to be recognized. For example, many of the tokens are only valid between a 'select' and 'from'. Not sure it is appropriate for a mode, but worth considering.

Steve Ebersole

unread,

Nov 20, 2014, 8:03:33 AM11/20/14

to antlr-di...@googlegroups.com

On Wed, Nov 19, 2014 at 10:31 PM, Gerald Rosenberg <gbrose...@gmail.com> wrote:

If you are looking for assurances, then quite clearly, yes, translation from HQL/JPA to SQL is exactly the kind of problem Antlr4 is designed for. Given the HQL/JPA/SQL specs, there is no real ambiguity in the translation -- conceptually it is not too far off from being a 1-to-1 translation, but a pain because the differences are at such a low-level.

Your question then is really just about the mechanics. The project is no doubt large, but you are looking at just a handful of techniques to be mastered and then applied many times over. Ter has written a number of blog posts on Antlr.org about the preference of using a parse tree over an AST. Read them, or read them again - the simple takeaway is that ASTs become brittle under progressive transformations - they are in all meaningful ways self-modifying code. In a project as large as this, the stability that an immutable parse tree gives far outweighs any perceived benefit of re-arranging the AST.

I will have a look for these blog posts. Any in particular?

FWIW, in hindsight, I am not particularly fond of parse trees. Hopefully, Antlr5 will allow the parser grammar to directly define key aspects of the form and content of the tree produced. But still an immutable tree once produced, with generated listeners & visitors as now provided. Having just the one level of transformation could significantly simplify analysis. More than one, in purely practical terms, is an anti-pattern.

So, for now, use the tool as designed. The structured parse-tree driven approach it most directly supports will be, if not enthusiastically, welcomed in the end.

Really it is is the control over the production of that initial tree that I found myself missing the most. So I *think* I agree with what you are saying about Antlr 5. But of course we have to use what we have available today

The individual parse tree nodes are actually as 'semantic' as any AST nodes that Antlr3 would produce. In the generated parser, each different parser rule is represented by a unique 'context' class object. The queryExpression rule will be represented by a queryExpressionContext() class containing lists and references to querySpecContext(), union_keyContext(), intersect_keyContext(), except_keyContext(), all_keyContext(), and querySpecContext() classes.

When I said semantic before, I meant in terms of control over the produced output (with gated semantic predicates, etc). Specifically in terms of producing differing sub-trees or nodes in one type of token (IDENT, e.g.) based on where/how it is occurs (aliasReference vs. attributeReference ...). Maybe I am not using parsing vocab correctly; that is definitely possible :)

Take another look at the generated parser in this light -- you should recognize that each instance of IDENTIFIER will occur in a sub-branch of the parse tree well-characterized by the series of parent contexts that connect it to the tree. An IDENTIFIER instance referenced from a parent instance of type aliasReferenceContext() must represent an alias.

So glad you chose this one as the example :) The trouble here is that I don't know that an IDENTIFIER is an "alias reference" (and if you notice in the grammar that rule is hanging anyway). Not in the design of the grammar anyway. Take a simple example query:

select c.hqAddress from Company c ...

'c' is an "alias reference". But using the tools I know from my work with Antlr 2/3, I cannot know that during parse because it also happens to be a forward reference. In Antlr 4 I can do this in the visitors by subsequently walking the parse tree multiple times, starting with the fromClause. After that I know all the possible alias references. But I have so many practical questions in how to code this. I will look through the blogs you mentioned first though before I start asking things that may be answered there. But to be honest this is exactly the kind of thing I am talking about in regards to learning curve and it being daunting: this process of designing the process from the ground, because that design assumes knowledge of the end result.

Now, the last aspect of your question -- where to begin -- is simply with the grammar. Looks like it is a conversion from the Antlr2 grammar. It is a good start, but there are better ways of handling things, such as key words, that will make the grammar produce a simpler and more helpful parse tree for analysis.

For example, the @members methods could be removed, with the *_key rules being simplified to explicit elements:

all_key: ALL IDENTIFIER ;

You can add labels that will become fields in the context class, which can make it a bit easier to access the discrete elements of a context class

all_key: k=ALL id=IDENTIFIER ;

In the lexer, the tokens block could be removed with corresponding token rules being defined. The Antlr4 lexer offers 'modes' that allow well-defined subsets of tokens to be recognized. For example, many of the tokens are only valid between a 'select' and 'from'. Not sure it is appropriate for a mode, but worth considering.

You misunderstand the intent of these *_key rules. They are meant to handle keywords used as not-keywords. For example:

select e from Event e where current_timestamp() between e.from and e.to

Here we have 2 occurrences of the "from" token, but with very different semantics.

This *_key rule approach is actually different than what we do in out legacy translator. The legacy approach was to "hard code" the keywords but to essentially catch RecognitionException in cases where we are expecting an IDENT and to treat the keyword token as an IDENT. This new approach is also a PoC in leveraging structure/syntax to define where keywords could occur.

I had wondered about whether I might be able to leverage these modes.

Terence Parr

unread,

Nov 20, 2014, 10:55:19 AM11/20/14

to antlr-di...@googlegroups.com

Hi guys. I’m sorry to be so silent on these posts but I’ve had to take a few months off of ANTLR to prep two courses this semester.

That said, I will reiterate my thoughts briefly on ASTs vs parse trees and my motivations for moving away from ASTs for antlr 4.

ASTs are best applied in a compiler situation where you have a series of operations you want to perform. Source transformation I think is best done when you have the sub phrase names and structure left in the tree ala parse trees. It also means that there is no mystery about where the whitespace and comments go relative to the important tokens.

I have built 15 pass translators over AST’s trying to slowly morph them into the appropriate output tree. Even for Fortran 77 to parallel Thinking Machines Fortran, I look back and think that was a mistake. Every phase of the tree had a slightly different architecture, which meant a different tree walker that was nearly identical to the other 14. One change in the tree structure rippled through and was highly annoying and bug prone. As with ANTLR itself and some other transformations I’ve done recently, I find that it is best to use a listener or visitor to construct an internal model that represents what you’re going to generate. You can walk tree 1 million times collecting information and organizing your internal model. As you can imagine, the transformation from ANTLR grammar to generated recursive descent parser is nontrivial. Once I have an internal model, it’s trivial to emit structured text using templates.

Tree transformations *are* in fact super useful however in some circumstances, whether it’s an AST or parse tree. When you have things that go from language X to language X, there are lots of simplifications and identity transformations that make sense such as “<expr> + 0” -> “<expr>”. I don’t believe it’s a good idea to convert, say, “x+y” to “push x, push y, add”, for example using tree transformations. Now, you have to build a tree walker for the new language.

random thoughts…

Ter

--

You received this message because you are subscribed to the Google Groups "antlr-discussion" group.

To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.

Gerald Rosenberg

unread,

Nov 20, 2014, 4:43:03 PM11/20/14

to antlr-di...@googlegroups.com

On Thursday, November 20, 2014 7:55:19 AM UTC-8, the_antlr_guy wrote:

Tree transformations *are* in fact super useful however in some circumstances, whether it’s an AST or parse tree. When you have things that go from language X to language X, there are lots of simplifications and identity transformations that make sense such as “<expr> + 0” -> “<expr>”.

Very much in agreement. Just looking for the ability to remove gratuitous cruft that occurs in a pure parse tree and to make slight but conceptually meaningful changes in the form of the tree, provided that the automatically generated visitor can walk the tree produced.

Since you appear to be open to allowing some limited flexibility in tree generation, the next step would be to work out the use cases. First set that comes to mind allow for a normal form.

A++ ==> expr op -> op expr+

(A++) ==> lParen expr op rParen -> op expr+

++A ==> op expr -> op expr+

(++A) ==> lParen op expr rParen -> op expr+

A OP B ==> expr op expr -> op expr+

A ? B : C ==> expr op=op1 expr op2 expr -> op expr+

Should I open an enhancement request issue and pursue the discussion there?

Terence Parr

unread,

Nov 20, 2014, 4:59:15 PM11/20/14

to antlr-di...@googlegroups.com

Hi. I don’t grok your notation / what your goal is below.
Ter

Gerald Rosenberg

unread,

Nov 20, 2014, 6:16:58 PM11/20/14

to antlr-di...@googlegroups.com, st...@hibernate.org

On Thursday, November 20, 2014 5:03:33 AM UTC-8, Steve Ebersole wrote:

On Wed, Nov 19, 2014 at 10:31 PM, Gerald Rosenberg <gbrose...@gmail.com> wrote:

When I said semantic before, I meant in terms of control over the produced output (with gated semantic predicates, etc). Specifically in terms of producing differing sub-trees or nodes in one type of token (IDENT, e.g.) based on where/how it is occurs (aliasReference vs. attributeReference ...). Maybe I am not using parsing vocab correctly; that is definitely possible :)

This is where a change in mind set for Antlr4 is needed. Often, where a complicated predicate is used in Antlr3, it is better handled by a lexer mode, by expanding the set of parser rules, or by analysis during a tree walk.

For the most part, neither predicates nor modes are required. In the case of an IDENTITY instance (or instance of any other node) the parentage of the instance provides its unique characterization. Just associate a decorator instance with the node instance and add relevant information collected during a walk. Not sure how it could be made any more straight forward.

'c' is an "alias reference". But using the tools I know from my work with Antlr 2/3, I cannot know that during parse because it also happens to be a forward reference. In Antlr 4 I can do this in the visitors by subsequently walking the parse tree multiple times, starting with the fromClause. After that I know all the possible alias references.

Right, no problems here.

But to be honest this is exactly the kind of thing I am talking about in regards to learning curve and it being daunting: this process of designing the process from the ground, because that design assumes knowledge of the end result.

Not to wax too philosophical here, but I believe there are no interesting problems where one knows the end result from the outset. And, life is too short to waste on uninteresting problems.

You misunderstand the intent of these *_key rules. They are meant to handle keywords used as not-keywords.

But, as written, they only match when used as a keyword. In any case, try writing each rule as a positive local assertion. Rely on rule->subrule and rule ordering to steer the parser to the correct subsets of rules to evaluate. Treat predicates as a (near) last resort. Start with the lexer and unit tests to experiment with how you can break down a source text into manageable tokens before leaping to the most complicated aspects of the parser grammar. Yes, you, like anyone else, will likely wind up rewriting the lexer and parser several times over before you find the right groove.

Gerald Rosenberg

unread,

Nov 20, 2014, 6:48:14 PM11/20/14

to antlr-di...@googlegroups.com

Pseudo-code ==> matching part of a grammar rule -> reordered for the tree or, equivalently, a partial description of the context class that will be generated.

In Antlr4, each parser rule and each labeled subrule is represented by a separate context class. By allowing a reordering, you can normalize and have Antlr then generate just a single context class for them all. The example is not the best, since labeled subrules produce context subclasses of the rule context -- and you can put common code in the rule context class. But, you still wind up with a bunch of minimally distinguishable context classes with slight distinctions in element ordering that are (I find) bothersome.

As mentioned earlier, the goal is merely to clean up and simplify the generated tree - more for ease of conceptualization than any performance or other hard benefit.

On Thursday, November 20, 2014 1:59:15 PM UTC-8, the_antlr_guy wrote:

Hi. I don’t grok your notation / what your goal is below.

> A++ ==> expr op -> op expr+

Terence Parr

unread,

Nov 20, 2014, 7:38:16 PM11/20/14

to antlr-di...@googlegroups.com

On Nov 20, 2014, at 3:48 PM, Gerald Rosenberg <gbrose...@gmail.com> wrote:
>
> Pseudo-code ==> matching part of a grammar rule -> reordered for the tree or, equivalently, a partial description of the context class that will be generated.

I think I get your direction. Sam made an option that would let you share the same base context class across methods. that would help.

Ter

>
> In Antlr4, each parser rule and each labeled subrule is represented by a separate context class. By allowing a reordering, you can normalize and have Antlr then generate just a single context class for them all. The example is not the best, since labeled subrules produce context subclasses of the rule context -- and you can put common code in the rule context class. But, you still wind up with a bunch of minimally distinguishable context classes with slight distinctions in element ordering that are (I find) bothersome.
>
> As mentioned earlier, the goal is merely to clean up and simplify the generated tree - more for ease of conceptualization than any performance or other hard benefit.
>
> On Thursday, November 20, 2014 1:59:15 PM UTC-8, the_antlr_guy wrote:
> Hi. I don’t grok your notation / what your goal is below.
>
> > A++ ==> expr op -> op expr+
> > (A++) ==> lParen expr op rParen -> op expr+
> > ++A ==> op expr -> op expr+
> > (++A) ==> lParen op expr rParen -> op expr+
> > A OP B ==> expr op expr -> op expr+
> > A ? B : C ==> expr op=op1 expr op2 expr -> op expr+
>
>
>
>

Reply all

Reply to author

Forward