ANTLR 3.5 C - More info in the AST tree from the parse?

241 views
Skip to first unread message

Jeff Spindler

unread,
Aug 15, 2013, 11:03:25 AM8/15/13
to antlr-di...@googlegroups.com

This may be a really stupid question, and I'm sorry if it's been answered before...
When walking through the AST tree, I only find tokens; none of the parser structure seems to be there - so how can I analyze a token without the parsed context?

I made up a toy example to explain what I mean:


grammar names
;
options
{
    language
= C;
    output
= AST;
   
ASTLabelType=pANTLR3_BASE_TREE;
}
name
: firstname lastname
 
| lastname COMMA^ firstname
 
;
COMMA
 
: ','
 
;
firstname
 
: ID;
lastname
 
: ID;
ID  
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
   
;
WS  
:   ( ' '
       
| '\t'
       
| '\r'
       
| '\n'
       
) {$channel=HIDDEN;}
   
;




Looking at the AST, I see IDs and COMMAs.  When I encounter an ID in the tree, how would I tell if its from the firstname rule or the lastname rule?
// names.cpp : Defines the entry point for the console application.
//


#include "stdafx.h"
#include "namesLexer.h"
#include "namesParser.h"


int parse(const char* pszInput);
int WalkTree(pANTLR3_BASE_TREE tree);


int _tmain(int argc, _TCHAR* argv[])
{
 parse
("Henry James");
 parse
("Henry, James");
 
return 0;
}


int parse(const char* pszInput)
{
 pANTLR3_INPUT_STREAM input
;
 pnamesLexer lex
;
 pANTLR3_COMMON_TOKEN_STREAM tokens
;
 pnamesParser parser
;
 
const char* pszName = "doodah";


 size_t bufferSize
= strlen(pszInput);


 input
= antlr3StringStreamNew(
 
(pANTLR3_UINT8)pszInput,
 ANTLR3_ENC_8BIT
,
 bufferSize
,
 
(pANTLR3_UINT8)pszName);


 lex
= namesLexerNew(input);
 tokens
= antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, TOKENSOURCE(lex));
 parser
= namesParserNew(tokens);


 namesParser_name_return
 r
= parser->name(parser);


 pANTLR3_BASE_TREE tree
= r.tree;




 
int rr = WalkTree(tree);


 parser
->free(parser);
 tokens
->free(tokens);
 lex
->free(lex);
 input
->close(input);
 
return 1;
}


pANTLR3_BASE_TREE getChild
(pANTLR3_BASE_TREE tree, unsigned i)
{
 
//assert(i < tree->getChildCount(tree));
 
return (pANTLR3_BASE_TREE) tree->getChild(tree, i);
}


const char* getText(pANTLR3_BASE_TREE tree)
{
 
return (const char*) tree->getText(tree)->chars;
}




int WalkTree(pANTLR3_BASE_TREE tree)
{
 pANTLR3_COMMON_TOKEN tok
= tree->getToken(tree);
 
if(tok)
 
{
 pANTLR3_BASE_TREE ltree
, rtree;
 pANTLR3_BASE_TREE ptree
;
 
const char* s = getText(tree);
 
switch(tok->type)
 
{
 
}
 
}
 
else
 
{
 
int k = tree->getChildCount(tree);
 
int r = 0;
 
for(int i = 0; i < k; i++)
 
{
 r
= WalkTree(getChild(tree, i));
 
}
 
return r;
 
}


 
return 0;
}






I suppose I could put clues in the token's user1 field as below, but I'm wondering if there's a simpler way?

firstname
 
: ID { $ID->user1='F'; };
lastname
 
: ID { $ID->user1='L'; };


Mike Lischke

unread,
Aug 15, 2013, 11:24:58 AM8/15/13
to antlr-di...@googlegroups.com

Looking at the AST, I see IDs and COMMAs.  When I encounter an ID in the tree, how would I tell if its from the firstname rule or the lastname rule?

AFAIK you can't. You only get lexer tokens in your tree, no parser rules.

I'd love to get a "parser rule path" for each token that was taken to match it. This way implementing code completion should be a lot simpler.

Jim Idle

unread,
Aug 16, 2013, 4:53:50 AM8/16/13
to antlr-di...@googlegroups.com
In your example, firstname will be first id and lastname the second id:

name:
     ID ID // first last
   | ^(COMMA ID ID) // last first
;



But generally you rewrite to be unambiguous:
tokens { NAME; }
name:
( first=ID last=ID
   | last=ID COMMA first=ID)

      -> ^(NAME $first $last)
;

Then

name : ^(NAME ID ID) // first last ;

Hope that helps?

Jim
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jim Idle

unread,
Aug 16, 2013, 4:54:59 AM8/16/13
to antlr-di...@googlegroups.com, antlr-di...@googlegroups.com
You don't need parser rule context. You use rewrites to disambiguate the tree. See my answer. 

Jim

Mike Lischke

unread,
Aug 16, 2013, 8:52:00 AM8/16/13
to antlr-di...@googlegroups.com
You don't need parser rule context. You use rewrites to disambiguate the tree. See my answer. 


That's not the same Jim. I have a parser rule qualified_identifier that is used in different places (sometimes referencing a table, sometimes a column in SQL syntax etc.). With your suggestion I can only determine that I have a qualified identifier, but not the context I would need e.g. in code completion to bring up a sensible list of choices (either a list of tables or columns).


On Aug 15, 2013, at 11:24 PM, Mike Lischke <mike.l...@icloud.com> wrote:



Looking at the AST, I see IDs and COMMAs.  When I encounter an ID in the tree, how would I tell if its from the firstname rule or the lastname rule?

AFAIK you can't. You only get lexer tokens in your tree, no parser rules.

I'd love to get a "parser rule path" for each token that was taken to match it. This way implementing code completion should be a lot simpler.


Jim Idle

unread,
Aug 16, 2013, 10:14:12 AM8/16/13
to antlr-di...@googlegroups.com
I think that you missed the point Mike. You rewrite or otherwise structure your tree such that there are no ambiguities. Code completion is different, and perhaps a tree is too expensive for that, but a working tree contains no ambiguities from the parse. It's context has nothing to do with the parse, or even [perhaps the initial] tree structure.

I understand what you say, but think that you are confounding different problems.

Jim


Mike Lischke

unread,
Aug 17, 2013, 5:30:49 AM8/17/13
to antlr-di...@googlegroups.com

I admit I hijacked this thread to speak out a related wish I have for quite some time... :-)

Jim Idle

unread,
Aug 18, 2013, 8:45:17 PM8/18/13
to antlr-di...@googlegroups.com
That's ok. It was a reasonable point :)

Jim

Jeff Spindler

unread,
Aug 19, 2013, 10:30:02 AM8/19/13
to antlr-di...@googlegroups.com
Thanks both of you for the interesting discussion!  I understand now that i can't get what i want by looking at the tokens in isolation, but need to get some other context information in there somehow - either by token order or other means.

The grammar was easy to rewrite because it was so simple.  My real world case is rather more complicated - I have ID tokens which the grammar parses in a variety of contexts each of which makes what I want to do with the tokens different.   I'm going to go with attaching info to the user1, user2 fields to get what I want.  (I'm replacing a Visual Parse++ grammar with ANTLR in a rather old bit of code.)

Incidentally, I tried Jim's suggestion as below, but got crashes when i tried to compile it:


grammar names;
options
{
    language
= C;
    output
= AST;
   
ASTLabelType=pANTLR3_BASE_TREE;
}
//name: firstname lastname
// | lastname COMMA^ firstname
// ;


//firstname
// : ID;
//lastname
// : ID;
tokens
{ NAME; }



name
: ^(NAME ID ID) // first last
 
;



NAME
:

( first=ID last=ID
   
| last=ID COMMA first=ID)
     
-> ^(NAME $first $last)
 
;



COMMA
: ','
 
;



ID  
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
   
;


WS  
:   ( ' '
       
| '\t'
       
| '\r'
       
| '\n'
       
) {$channel=HIDDEN;}
   
;


error(10): internal error : C:\Users\jspindler\Projects\Parser research\cpp-examples\names\names.g : java.lang.NullPointerException
1>  org.antlr.grammar.v3.DefineGrammarItemsWalker.rewrite_atom(DefineGrammarItemsWalker.java:4510)
1>  org.antlr.grammar.v3.DefineGrammarItemsWalker.rewrite_tree(DefineGrammarItemsWalker.java:4438)
1>  org.antlr.grammar.v3.DefineGrammarItemsWalker.rewrite_element(DefineGrammarItemsWalker.java:4325)
1>  org.antlr.grammar.v3.DefineGrammarItemsWalker.rewrite_alternative(DefineGrammarItemsWalker.java:4198)
1>  org.antlr.grammar.v3.DefineGrammarItemsWalker.rewrite(DefineGrammarItemsWalker.java:3959)
1>  org.antlr.grammar.v3.DefineGrammarItemsWalker.block(DefineGrammarItemsWalker.java:1927)
1>  org.antlr.grammar.v3.DefineGrammarItemsWalker.rule(DefineGrammarItemsWalker.java:1517)
1>  org.antlr.grammar.v3.DefineGrammarItemsWalker.rules(DefineGrammarItemsWalker.java:1119)
1>  org.antlr.grammar.v3.DefineGrammarItemsWalker.grammarSpec(DefineGrammarItemsWalker.java:591)
1>  org.antlr.grammar.v3.DefineGrammarItemsWalker.grammar_(DefineGrammarItemsWalker.java:313)
1>  org.antlr.tool.Grammar.defineGrammarSymbols(Grammar.java:791)
1>  org.antlr.tool.CompositeGrammar.defineGrammarSymbols(CompositeGrammar.java:370)
1>  org.antlr.Tool.process(Tool.java:506)
1>  org.antlr.Tool.main(Tool.java:98)


... so while it looks like token rewriting looks like it might be useful, I'm not really sure how to use it successfully.

Kevin J. Cummings

unread,
Aug 19, 2013, 1:30:34 PM8/19/13
to antlr-di...@googlegroups.com
On or about 08/19/2013 10:30 AM, Jeff Spindler stated for us to ponder:
> Thanks both of you for the interesting discussion! I understand now
> that i can't get what i want by looking at the tokens in isolation, but
> need to get some other context information in there somehow - either by
> token order or other means.
>
> The grammar was easy to rewrite because it was so simple. My real world
> case is rather more complicated - I have ID tokens which the grammar
> parses in a variety of contexts each of which makes what I want to do
> with the tokens different. I'm going to go with attaching info to the
> user1, user2 fields to get what I want. (I'm replacing a Visual Parse++
> grammar with ANTLR in a rather old bit of code.)
>
> Incidentally, I tried Jim's suggestion as below, but got crashes when i
> tried to compile it:


> name :^(NAME ID ID)// first last
> ;


That production is a tree grammar production. When walking the tree, it
shows that a valid "name" tree is NAME token with 2 children, both of
which are ID tokens. You would use a rule similar to this in your tree
grammar file.

> NAME:
> (first=ID last=ID
> |last=ID COMMA first=ID)
> ->^(NAME $first $last)
> ;

Because you Capitalized NAME, it is a token rule and appears in the
lexer. Lexer rules cannot refer to parser rules. Lexer rules get run
over the input stream, and convert the input stream into a TOKEN stream.
The parser then reads the TOKEN stream and parses it into the parse
tree. (Oversimplification.)

--
Kevin J. Cummings
kjc...@verizon.net
cumm...@kjchome.homeip.net
cumm...@kjc386.framingham.ma.us
Registered Linux User #1232 (http://www.linuxcounter.net/)

Jeff Spindler

unread,
Aug 19, 2013, 2:12:26 PM8/19/13
to antlr-di...@googlegroups.com
If that second name defintion was lowercase, I got:

1>  error(101): C:\Users\jspindler\Projects\Parser research\cpp-examples\names\names.g:21:1: rule name redefinition
1>  warning(105): C:\Users\jspindler\Projects\Parser research\cpp-examples\names\names.g:18:10: no lexer rule corresponding to token: NAME

.. and it didn't seem right to have two 'name's on the lefthand side.

On Thursday, August 15, 2013 11:03:25 AM UTC-4, Jeff Spindler wrote:

Jim Idle

unread,
Aug 20, 2013, 12:52:06 AM8/20/13
to antlr-di...@googlegroups.com
In the original reply, NAME was declared as an imaginary token, so it can
be used in the rewrite and the second snippet ^(NAME...) was meant for the
tree grammar.

Can you explain what the "crashes" were? The user1 etc should not be
needed to place the tokens in your tree in context - I put those there for
additional information such as file numbers and things like that. If you
have to use them in your tree grammar just for context ("this is a name"),
then it means the tree structure is not correct - logically if you knew
what it was in your parser grammar, but no longer know in your tree
grammar, then you have lost information, whereas your tree should have
additional semantic clarity, not less! :)

If you post your actual grammar, we can help you with the real issues. It
may be something to do with the formulation of your rules for instance.

Jim

Jim Idle

unread,
Aug 20, 2013, 12:53:23 AM8/20/13
to antlr-di...@googlegroups.com

You missed out the

 

tokens {NAME;}

 

Part of my post.

 

Jim

 

From: antlr-di...@googlegroups.com [mailto:antlr-di...@googlegroups.com] On Behalf Of Jeff Spindler


Sent: Tuesday, August 20, 2013 2:12 AM
To: antlr-di...@googlegroups.com

Jeff Spindler

unread,
Aug 20, 2013, 9:29:18 AM8/20/13
to antlr-di...@googlegroups.com
I'm wondering if you saw my first post from yesterday where I posted my updated grammar based on your suggestions.  Line 17  (immediately after I commented out the original grammar, I have the tokens {NAME;} line.  After the block of code I posted the crash info. "error(10): internal error : 
"C:\Users\jspindler\Projects\Parser research\cpp-examples\names\names.: java.lang.NullPointerException
1>  org.antlr.grammar.v3.DefineGrammarItemsWalker.rewrite_atom(DefineGrammarItemsWalker.java:4510)..."


I'm still confused as to whether the two name: rules after the token { NAME;} are supposed to be both lowercase or one is supposed to be uppercase ( I had guessed that the second one was supposed to be uppercase).

Thanks for your help!

Jeff

Jeff Spindler

unread,
Aug 21, 2013, 5:15:54 PM8/21/13
to antlr-di...@googlegroups.com
Oh, just figured out I didn't want the first

name : ^(NAME ID ID) // first last 
 
;
... just second one that did the rewrite to the imaginary token.
Reply all
Reply to author
Forward
0 new messages