Access to previous token?


Thomas Beale

Oct 18, 2015, 7:17:47 PM
to antlr-discussion
I have a syntax in which the following can occur:

xxx matches {/regex/}  

where 'regex' above is some normal PCRE regex, e.g. '^[a-f]+$' or whatever. The slash delimiters are meant to be reminiscent of sed, vi etc. However, elsewhere in the syntax, paths are possible, i.e. patterns like

/aaaa/bbb

Now, in my old yacc/lex based system, I could easily match the /regex/ pattern only if the previous token had been a '{'. 

I know that 'semantic predicates' exist in Antlr, but I can't see how to do the above easily - I would have expected an Antlr built-in like {lastTokVal = '{'} or similar. I'm probably missing something simple here (I have the Antlr4 book, but I didn't see anything in there).

Any pointers appreciated.

Jim Idle

Oct 18, 2015, 7:22:36 PM
to antlr-discussion
Without knowing anything else about your lexer I suggest that you just have a token:

REGEX: '{/' .*? /* non-greedy; use .* for greedy, depending on what you want */ '/}' ;

Then take off the delimiters when you process it. 

Jim






Thomas Beale

Oct 18, 2015, 7:29:45 PM
to antlr-discussion
It's not as simple as that - the {} are part of a central rule that looks like this:

c_complex_object: type_id '[' ID_CODE ']' SYM_MATCHES '{' c_attribute_def+ '}' ;

This language uses {} in a similar way as the C-based languages. In the example I provided, a regex pattern like /xxxx/ just happens to be one primitive type of thing that can occur between {}. Paths can never appear on their own between braces, so in my old system, this was easy to detect and deal with.

- thomas

Jim Idle

Oct 18, 2015, 7:41:32 PM
to antlr-discussion
But if you can only have {/.../} and not {/.../ /.../} then you can have 

SYM_MATCHES (REGEXP | ... ) ...

Or you can use lexer modes to trigger path or regexp. It's a bit difficult without all the context. 



Mike Lischke

Oct 19, 2015, 2:47:35 AM
to antlr-di...@googlegroups.com
You could try LA(-1), i.e. check the previous token in the token stream in your predicate. The correct syntax of that call depends on the target language. Another variant would be to keep track of the current token in an ivar (e.g. by overriding nextToken()) and check this ivar to validate the regex input.
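
As a rough illustration of the second variant (remembering the last token and consulting it before deciding what a '/' means), here is a minimal stand-alone Java sketch. It does not use the ANTLR runtime; the class `PrevTokenLexer` and its methods are hypothetical names invented for this example, and the token shapes are simplified assumptions rather than the real grammar.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone tokenizer; it does not use the ANTLR runtime.
final class PrevTokenLexer {

    /** Splits input into tokens, treating /.../ as a regex only right after '{'. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        String previous = null;              // last token emitted, null at the start
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { // whitespace is skipped and never remembered
                i++;
                continue;
            }
            String token;
            if (c == '/' && "{".equals(previous)) {
                // Only here is a slash allowed to open a regex.
                int end = closingSlash(input, i + 1);
                token = end < 0 ? "/" : "REGEX(" + input.substring(i + 1, end) + ")";
                i = end < 0 ? i + 1 : end + 1;
            } else if (Character.isLetterOrDigit(c) || c == '_') {
                int start = i;
                while (i < input.length()
                        && (Character.isLetterOrDigit(input.charAt(i)) || input.charAt(i) == '_')) {
                    i++;
                }
                token = input.substring(start, i);
            } else {
                token = String.valueOf(c);   // single-character punctuation such as { } /
                i++;
            }
            tokens.add(token);
            previous = token;
        }
        return tokens;
    }

    /** Index of the first unescaped '/' at or after 'from', or -1 if there is none. */
    private static int closingSlash(String s, int from) {
        for (int i = from; i < s.length(); i++) {
            if (s.charAt(i) == '\\') {
                i++;                         // skip the escaped character
            } else if (s.charAt(i) == '/') {
                return i;
            }
        }
        return -1;
    }
}
```

With this sketch, `xxx matches {/^[a-f]+$/}` yields a single REGEX token between the braces, while `/aaaa/bbb` is split into separate slash and name tokens because no '{' precedes the first slash.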

Mike
--
www.soft-gems.net

Thomas Beale

Oct 19, 2015, 4:55:22 AM
to antlr-discussion

well, if you work through the current rules, they already say SYM_MATCHES ( REGEXP | ...). However, there are some other rules that match paths, e.g.

adl_path          : adl_path_segment+ ;
adl_relative_path : adl_path_element adl_path ;
adl_path_segment : '/' adl_path_element ;

These paths can occur within a stream of other tokens, ultimately inside a {} pair, but never on their own. Currently in my testing environment (IntelliJ + plugin), the REGEX lexer rule matches bits of path, when it sees '/' something something '/'. But I only want that REGEX rule to be used when the '/xxxx/' section is the only thing inside the {}.

This seems a pretty standard requirement for processing any non-trivial language.

The full grammars are here on Github.

Thomas Beale

Oct 19, 2015, 7:34:08 AM
to antlr-discussion

This sounds more like the approach I am looking for. Are there examples of this kind of override? If not, I'd need to know how nextToken() works. From looking at the API, I suspect something like:

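@Override
public Token nextToken() {
  // _token still holds the most recently emitted token here (null before the first call)
  if ( _token != null && "{".equals(_token.getText()) ) {
    // set something local, and check it in a predicate on the regex rule
  }
  return super.nextToken();
}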


I'm not a Java programmer so the syntax may be rough.

Mike Lischke

Oct 19, 2015, 7:51:08 AM
to antlr-di...@googlegroups.com

> well, if you work through the current rules, they already say SYM_MATCHES ( REGEXP | ...). However, there are some other rules that match paths, e.g.
>
> adl_path          : adl_path_segment+ ;
> adl_relative_path : adl_path_element adl_path ;
> adl_path_segment : '/' adl_path_element ;
>
> These paths can occur within a stream of other tokens, ultimately inside a {} pair, but never on their own. Currently in my testing environment (IntelliJ + plugin), the REGEX lexer rule matches bits of path, when it sees '/' something something '/'. But I only want that REGEX rule to be used when the '/xxxx/' section is the only thing inside the {}.
>
> This seems a pretty standard requirement for processing any non-trivial language.

ANTLR-generated parsers are state machines that base their decisions solely on the current and future tokens, never on already seen ones.

However, in your case a validating semantic predicate might help (http://stackoverflow.com/questions/3056441/what-is-a-semantic-predicate-in-antlr). This is executed *after* a match was done, but you can reject that match with this predicate if some condition is (not) met. In your case you could check whether the generated AST contains only that /xxx/ part.
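
One way to picture the "match first, then reject" idea is a small check that runs over the tokens found between the braces. The sketch below is plain Java with no ANTLR dependency; `LoneRegexCheck` and `isLoneRegex` are hypothetical names, and the token list is assumed to be whatever the parser matched inside the `{}` pair.

```java
import java.util.List;

// Hypothetical helper; in a real grammar a check like this would be called from a predicate or action.
final class LoneRegexCheck {

    /**
     * Returns true only when the tokens between '{' and '}' are exactly:
     * a slash, some content without a bare slash, and a closing slash.
     */
    static boolean isLoneRegex(List<String> tokensInsideBraces) {
        int n = tokensInsideBraces.size();
        if (n < 3) {
            return false;                    // need at least '/', content, '/'
        }
        if (!"/".equals(tokensInsideBraces.get(0)) || !"/".equals(tokensInsideBraces.get(n - 1))) {
            return false;                    // must start and end with a slash
        }
        for (int i = 1; i < n - 1; i++) {
            if ("/".equals(tokensInsideBraces.get(i))) {
                return false;                // an inner bare slash looks like a path, not a regex
            }
        }
        return true;
    }
}
```

Under these assumptions, the tokens `/`, `^[a-f]+$`, `/` are accepted as a regex, while `/`, `aaaa`, `/`, `bbb` are rejected and can be left to the path rules.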

Mike

Mike Lischke

Oct 19, 2015, 7:58:44 AM
to antlr-di...@googlegroups.com
> This sounds more like the approach I am looking for. Are there examples of this kind of override? If not, I'd need to know how nextToken() works. From looking at the API, I suspect something like:
>
> @Override
> public Token nextToken() {
>   // _token still holds the most recently emitted token here (null before the first call)
>   if ( _token != null && "{".equals(_token.getText()) ) {
>     // set something local, and check it in a predicate on the regex rule
>   }
>   return super.nextToken();
> }


Just acting on a given token in the nextToken function is probably not what you need. You can easily use a predicate in your grammar instead of overriding the function. I haven’t used this approach myself either, but saw it in the ECMA 3 grammar here: http://research.xebic.com/es3/.


Mike Lischke

Oct 19, 2015, 8:00:17 AM
to antlr-di...@googlegroups.com
Look at the areRegularExpressionsEnabled() there, which might have something you can use in your case.

Eric Vergnaud

Oct 19, 2015, 8:06:59 AM
to antlr-discussion
Hi,

I use predicates a bit, and they work fine provided that they are placed at the beginning of alternatives within a rule.
Also note that they are not evaluated during prediction, only during actual parsing. 

Eric

Thomas Beale

Oct 19, 2015, 8:22:35 AM
to antlr-discussion

that seems to be a dead link.


Thomas Beale

Oct 20, 2015, 8:03:02 AM
to antlr-discussion
some further progress... I changed the regex matching to inline in the parser-rule, as follows:

regex_constraint: '/' regex1 '/' | '^' regex2 '^' ;
regex1: ( '_' | '.*' | '\\.' | '\\/' | ~'/' )+ ; // TODO: not clear why first 4 matches are needed, but they work.

This actually does work (i.e. regexes enclosed in // are now disambiguated from paths that contain slashes). I don't yet understand what the lexer is doing behind the scenes when a literal is found in a parse rule, but I assume that if that literal can be found at the current point in the input stream, it trumps any other lexer pattern matching. Anyway, in the above, I originally just had:

regex1: ( '\\/' | ~'/' )+ ;

i.e. just match quoted slashes, or otherwise, non-slashes. But a regex in the text containing '\.' (quoted dot, i.e. literal dot) wouldn't match, so I added '\\.'. Underscores also don't match, hence the underscore; same for the '.*' pattern.

I don't understand what is going on here, but I assume that the ~'/' in a parse rule doesn't match 'any non-slash', but some smaller set of characters.

thoughts?

Jim Idle

Oct 20, 2015, 11:36:27 PM
to antlr-di...@googlegroups.com
When you put literals directly in the parser, you are relying on the code generator to create lexer rules with the correct precedence. The parser does NOT drive the lexer; the code generator just writes the lexer rules for you. So if you create these lexer rules yourself and put them in the right place (the earlier a rule appears in the source, the higher its precedence), then you will achieve the same thing but won't be confused as to how it is happening.

With this:

( '_' | '.*' | '\\.' | '\\/' | ~'/' )+ ; 
You are creating a token for '_', a token for the literal '.*', etc., and allowing any of these tokens to repeat. I doubt that this is what you want? Also, if you have manually created a token for '_' and also specify the literal in the parser, then you run the risk of confusing the lexer.
Jim

Thomas Beale

Oct 21, 2015, 5:48:42 AM
to antlr-discussion
So just to continue that line of discussion for a moment (forgetting about the specific regex rule we were talking about), I have a lot of rules with single character syntax elements, usually various kinds of brackets, other punctuation from the syntax. Examples - see here. Are you saying that this is an unreliable thing to do in Antlr? 

The CPP grammar does the same kind of thing. And my testing so far indicates that the parser works in the intended fashion for these kinds of constructs.

So I wonder if what you are saying is in fact that inline lex patterns that are not fixed literals are the problem, i.e. choices of any kind - because I think fixed literals are probably fine.

Sorry if this is basic, I'm still learning the differences from the more yacc/lex architecture I am used to.

Jim Idle

Oct 21, 2015, 7:08:02 AM
to antlr-discussion
Yeah. No worries mate ask away. So yes I am advising you not to use literals in the parser itself but to use the existing lexer rules where you have them and new ones where you don't. Using literals in the parser isn't a crime and can be useful, but when you are fairly new it can confuse the bejesus out of you ;)

If what you have does the job, then create the lexer rules and use the token names :)

In general I list lexer tokens in the following order (some of this is arbitrary):

Reserved key words 
Character tokens like '{'
Constructs like ID
Comments and white space
Final rule being:

ANY: . ;  /* indicates a character that is illegal and aids with error messages */

Jim
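
The ordering advice above follows from how an ANTLR lexer chooses between rules: the rule that matches the most input wins, and a tie goes to the rule listed first. The stand-alone Java sketch below imitates that selection with `java.util.regex`; it is not the ANTLR runtime, and `RulePrecedenceDemo`, `exampleRules` and the simplified rule patterns are hypothetical names and assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of lexer rule selection; not the ANTLR runtime.
final class RulePrecedenceDemo {

    /** Tokenizes input with ordered rules: longest match wins, ties go to the earlier rule. */
    static List<String> tokenize(Map<String, Pattern> orderedRules, String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : orderedRules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly greater: an equally long match from a later rule never displaces an earlier one.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = rule.getKey();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            out.add(bestRule + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return out;
    }

    /** Rules in the suggested order: keywords, character tokens, ID-like constructs, catch-all last. */
    static Map<String, Pattern> exampleRules(boolean withRegexRule) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("SYM_MATCHES", Pattern.compile("matches"));
        rules.put("SLASH", Pattern.compile("/"));
        if (withRegexRule) {
            rules.put("REGEX", Pattern.compile("/[^/]+/"));
        }
        rules.put("ID", Pattern.compile("[A-Za-z_][A-Za-z0-9_]*"));
        rules.put("ANY", Pattern.compile("."));   // catch-all for otherwise illegal characters
        return rules;
    }
}
```

In this sketch, tokenizing `/aaaa/bbb` with the REGEX rule present gives `REGEX:/aaaa/` followed by `ID:bbb`, because the longer match beats the single slash regardless of order. Without that rule the same input splits into slash and ID tokens, and `matches` resolves to SYM_MATCHES rather than ID only because the keyword rule is listed first.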

Thomas Beale

Oct 21, 2015, 7:43:56 AM
to antlr-discussion
In terms of lexer rule order - yes, that's typical I would say. But I am using the 'import' facility extensively, because various bits of our grammars are reused in different places, so there are actually around 8 files now, most of mixed parser/lex rules. So far I have not found a tool (using IntelliJ plugin) that can display a post-processed single file equivalent form of the source files, in order to see the effective lex rule order. So I only *think* I have the order right ...!

And then we get back to the difficulty, that while in a regex, I want to consume everything that is not a slash or quoted slash, but I don't want to consume chars like that ordinarily - and that gets back to the original post:)