accessing token text in a lexer rule action


Greg D

Oct 30, 2013, 5:43:34 PM
to antlr-di...@googlegroups.com
I've checked the book and the source code, but I can't see any simple way to get the token text within a lexer rule action. The getText() method yields only the final rule text when the token is aggregated with 'more'.

There is an unwieldy alternative, culled from the CommonToken 'create' method. I am concerned that it may fail in certain cases, because it is guarded by some conditions. I'm also confused because 'create' appears to use the '_text' property in preference to the alternative, yet that's not the text the token has at completion.

There is some code below to demonstrate this.  Have I missed a simpler, more robust, way to get the full token text?

lexer grammar LLexer ;

COLON2
    : '::' -> more ;
DIGITS
    : [0-9]+ { System.out.println("at DIGITS: getText()  is '"
                   + getText()
                   + "'.");
               System.out.println("at DIGITS: alt method is '"
                   + _tokenFactorySourcePair.b.getText(Interval.of(_tokenStartCharIndex,getCharIndex()-1))
                   + "'.");
             } ;


 sample.txt
123::456


log
$ antlr4 LLexer.g4
$ javac *.java
$ grun L tokens -tokens sample.txt
at DIGITS: getText()  is '123'.
at DIGITS: alt method is '123'.
at DIGITS: getText()  is '456'.
at DIGITS: alt method is '::456'.
line 1:8 token recognition error at: '\n'
[@0,0:2='123',<1>,1:0]
[@1,3:7='::456',<1>,1:3]
[@2,9:8='<EOF>',<-1>,2:0]
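
To make the two results in the log concrete, here is a small plain-Java model of the index arithmetic. It is only a sketch and not ANTLR code: the class name, method names, and the hard-coded indices are assumptions chosen to match the sample input, with the token start at 3 (where '::' began) and the DIGITS match starting at 5.

```java
// Simplified model, not ANTLR code: it mimics how a token's start index
// survives a 'more' command while each rule match starts further along.
final class MoreTextDemo {
    // Text from the token start up to, but excluding, the current index.
    // This corresponds to Interval.of(start, index - 1), whose bounds are inclusive.
    static String aggregatedText(String input, int tokenStart, int index) {
        return input.substring(tokenStart, index);
    }

    // Text matched by the most recent rule only.
    static String ruleText(String input, int ruleStart, int index) {
        return input.substring(ruleStart, index);
    }

    public static void main(String[] args) {
        String input = "123::456";
        int tokenStart = 3; // '::' was matched by COLON2, which issued 'more'
        int ruleStart = 5;  // DIGITS starts matching here
        int index = 8;      // position just after '456'
        System.out.println(ruleText(input, ruleStart, index));        // 456
        System.out.println(aggregatedText(input, tokenStart, index)); // ::456
    }
}
```

The second value agrees with the token dump, where token 1 spans characters 3 to 7.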


Jim Idle

Oct 31, 2013, 12:24:50 AM
to antlr-di...@googlegroups.com
Rather than addressing this question, perhaps you could first step back and say what you are trying to achieve by doing this. It may be that there is an overall simpler answer than this. I have only ever needed token text at token completion or via LA() so I am having a difficult time understanding what you are trying to get from this.

Cheers,

Jim


--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Greg D

Oct 31, 2013, 11:43:20 AM
to
The code above is crafted just to show the distinction.

The purpose of the underlying code is to process parametrised macros in the source. Since existing macro definitions may be redefined in the source, I believe it is simpler to manage them at lexer-time. As a result, I am using token actions to maintain the pool of macros and to inject new tokens when a macro is referenced. This creates the need to access the token text in lexer rules. Some of the tokens are aggregated with '-> more' or '{ more(); }', leading to the subject question of this topic.
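
As a rough illustration of the approach described above, the following plain-Java sketch keeps a redefinable macro pool and queues replacement token texts when a macro is referenced. Everything in it is hypothetical: the class, the method names, and the "$1" placeholder convention are assumptions made for the example, not part of ANTLR or of the code being discussed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of lexer-time macro handling; none of these names
// come from ANTLR or from the code discussed in this thread.
final class MacroPool {
    private final Map<String, String[]> macros = new HashMap<>();
    private final Deque<String> pending = new ArrayDeque<>();

    // Defining a macro again replaces the earlier body.
    void define(String name, String... body) {
        macros.put(name, body);
    }

    // On a reference, queue the body for injection; "$1" stands for the argument.
    // Returns false when the name is not a known macro.
    boolean reference(String name, String argument) {
        String[] body = macros.get(name);
        if (body == null) {
            return false;
        }
        for (String tokenText : body) {
            pending.addLast(tokenText.equals("$1") ? argument : tokenText);
        }
        return true;
    }

    // A lexer would drain this queue before reading more input.
    String nextInjected() {
        return pending.pollFirst();
    }

    public static void main(String[] args) {
        MacroPool pool = new MacroPool();
        pool.define("twice", "$1", "+", "$1");
        pool.reference("twice", "x");
        String text;
        while ((text = pool.nextInjected()) != null) {
            System.out.println(text); // x, then +, then x
        }
    }
}
```

Both `define` and `reference` need the text of the token that triggered them, which is why the aggregated text matters in a design like this.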

The book's examples use getText(), but only on self-contained lexer rules. So, I don't know if the behaviour shown above is:
  • intentional, because it makes more sense for getText() to yield the local text (i.e. my problem to live with)
  • inadvertent, and an issue should be created
  • wrong-headed, because there is already a robust mechanism to get the aggregated text

My guess is the last, since that's where I am in my learning of ANTLR v4 and parsing in general. I'm afraid I don't know what LA() means, and so can't phrase this reply in those terms.

Greg D

Nov 1, 2013, 2:55:52 PM
to antlr-di...@googlegroups.com
Working around this with:

@lexer::members {
    // method to get text for a token, even if aggregated by 'more'
    public String getTokenText() {
        if (_tokenFactorySourcePair.b != null) {
            return _tokenFactorySourcePair.b.getText(Interval.of(_tokenStartCharIndex,getCharIndex()-1));
        } else {
            return getText();
        }
    }
}




Terence Parr

Nov 8, 2013, 12:38:19 PM
to antlr-di...@googlegroups.com
Yeah, I don't think we have a very convenient way to do this if I remember.

_text is an override I believe if you do a setText, otherwise it takes the interval from the input stream.  getText() is more about the whole string as you've discovered.
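
The override-then-interval fallback described above can be sketched in plain Java as follows. This is a simplified stand-in under stated assumptions, not the ANTLR source: the class and its fields are hypothetical and only model the behaviour where an explicitly set text wins and the input interval is used otherwise.

```java
// Simplified stand-in for a lexer's text handling; the class and its
// fields are hypothetical and only model the behaviour described above.
final class TextOverrideDemo {
    private final String input;
    private int tokenStart;   // where the current token began
    private int index;        // current input position
    private String override;  // set by setText(), otherwise null

    TextOverrideDemo(String input) {
        this.input = input;
    }

    // Begin a new token and forget any earlier override.
    void startToken(int start) {
        tokenStart = start;
        index = start;
        override = null;
    }

    void consume(int n) {
        index += n;
    }

    void setText(String text) {
        override = text;
    }

    // The override wins if present; otherwise read the input interval.
    String getText() {
        return override != null ? override : input.substring(tokenStart, index);
    }

    public static void main(String[] args) {
        TextOverrideDemo lexer = new TextOverrideDemo("123::456");
        lexer.startToken(0);
        lexer.consume(3);
        System.out.println(lexer.getText()); // 123
        lexer.setText("one-two-three");
        System.out.println(lexer.getText()); // one-two-three
    }
}
```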

May I ask why you need to operate in the lexer with raw actions? Sometimes it's needed, but we are hoping people can get away without it.

Ter


--
Dictation in use. Please excuse homophones, malapropisms, and nonsense. 

Terence Parr

Nov 8, 2013, 12:39:54 PM
to antlr-di...@googlegroups.com
Thanks for the explanation. It's reasonable to ask for the text of just the current rule, but we have no convenient way to do this. You could grab _input.index() at the start of the lexer rule (though it will slow things down). Wait, I think we only allow actions at the end, come to think of it... hmm... Maybe Sam has a good answer.
Ter



Terence Parr

Nov 8, 2013, 12:44:23 PM
to antlr-di...@googlegroups.com
_tokenStartCharIndex is set before the loop that gets 'more' text for a token. I thought you wanted the text of just the currently matching lexer rule. Do you want it to match the text of even fragment rules?

How is the method you provided above different from the default? Both will include the "more" text, if I'm reading the source code right.
Ter



Greg D

Nov 8, 2013, 6:01:21 PM
to antlr-di...@googlegroups.com
My eyeball trace of the code agrees with your expectation, but that is not what is happening.  I boiled the behaviour down to the "test case" shown in the 1st post of this thread.

I haven't done a build on my git clone yet, so I couldn't do a proper code trace.  More importantly, I couldn't tell from the book (the only spec I have) exactly what was supposed to happen with the aggregated text of a 'more' command.

Greg

Terence Parr

Feb 19, 2014, 2:33:39 PM
to antlr-di...@googlegroups.com
Hi. I believe that Sam has fixed this for 4.2. The following unit test shows that getText() actions are evaluated at the input position dictated by the action placement (I've also just updated the documentation to say that).

@Test public void testActionPlacement() throws Exception {
    String grammar =
        "lexer grammar L;\n"+
        "I : ({System.out.println(\"stuff fail: \" + getText());} 'a' | {System.out.println(\"stuff0: \" + getText());} 'a' {System.out.println(\"stuff1: \" + getText());} 'b' {System.out.println(\"stuff2: \" + getText());}) {System.out.println(getText());} ;\n"+
        "WS : (' '|'\\n') -> skip ;\n" +
        "J : .;\n";
    String found = execLexer("L.g4", grammar, "L", "ab");
    String expecting =
        "stuff0: \n" +
        "stuff1: a\n" +
        "stuff2: ab\n" +
        "ab\n" +
        "[@0,0:1='ab',<1>,1:0]\n" +
        "[@1,2:1='<EOF>',<-1>,1:2]\n";
    assertEquals(expecting, found);
}

Ter
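
For readers without the ANTLR test harness, the expected output of the unit test above can be reproduced with a plain-Java model of the second alternative of rule I. This is only a sketch of the idea that each action sees the text consumed up to its own position; the class and method names are assumptions, and no ANTLR code is involved.

```java
// Plain-Java model of the second alternative of rule I in the test above.
// Not ANTLR code: each "action" just records the text consumed so far.
final class ActionPlacementDemo {
    static String[] run(String input) {
        String[] out = new String[4];
        int start = 0; // token start
        int index = 0; // current input position
        out[0] = "stuff0: " + input.substring(start, index); // before 'a'
        index++;                                              // match 'a'
        out[1] = "stuff1: " + input.substring(start, index); // after 'a'
        index++;                                              // match 'b'
        out[2] = "stuff2: " + input.substring(start, index); // after 'b'
        out[3] = input.substring(start, index);               // end of rule
        return out;
    }

    public static void main(String[] args) {
        for (String line : run("ab")) {
            System.out.println(line);
        }
    }
}
```

The four lines it prints correspond to the first four lines of the `expecting` string in the test.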

