fragment rule 'x' contains an action or command which can never be executed

744 views
Skip to first unread message

Christer Palm

unread,
Apr 14, 2014, 9:41:26 PM4/14/14
to antlr-di...@googlegroups.com
Hi!

I know the topic of actions in fragments have been discussed before, here and in a couple of issues on github.
AFAICT the general stance seems to be that they're not needed because you could either;
  1. Break the language up into smaller tokens and handle the combination in the parser, using lexical modes to handle context if necessary
  2. Get rid of the fragments by pushing their rules into the top-level rule, which, as of 4.2 can have multiple actions
  3. Manually "re-parse" the token text, if that's what you're after

Now, I going to take another stab at arguing for the need for actions in fragments.
I'm working on a parser for an existing grammar with some quite complex literals for which I fail to find a reasonable way to apply solutions 1 & 2;
  1. I can't seem to break the literals up into several tokens because I need the lexical context to disambiguate them from other constructs and they don't have any start/end symbols to trigger a lexical mode shift like "" string literals, <> XML elements or /**/ comments.
  2. The literals are defined by dozens of fragment rules which frequently have multiple references to lower-level fragments. Combining them into a single rule would create a huge unmaintainable mess.

In a sense, the literals in question are island grammars which are ambiguous with the rest of the language with no clear way of distinction other than trial and error. Crazy? Yes, but I didn't invent the language.
On the other hand, I've got ANTLR to properly tokenize the language so solution 3 works fine since all I'm trying to do is breaking the literals into its elements in order to parse them into value objects.
However, it seems a bit crazy and error-prone to have to re-implement the non-trivial lexical analysis of the literals in my value object factory.
Or is there a better way to accomplish this?

A grossly simplified example to illustrate the type of grammar I'm working with;

NUMBER: DIGITS ;
IDENTIFIER: [a-zA-Z] [a-zA-Z0-9]+ ;
OPERATOR: '+' | '-' ;

POINT: POINT2D | POINT3D { /* action */ } ;
fragment POINT2D: 'X' COORDINATE 'Y' COORDINATE ;
fragment POINT3D: 'X' COORDINATE 'Y' COORDINATE 'Z' COORDINATE ;
fragment COORDINATE: ( '+' | '-' )? DIGITS { /* I want action here, too */ } ;
fragment DIGITS: [0-9]+ ;

I.e. X+1 is a POINT, but XY is an IDENTIFIER and A+1 is IDENTIFIER OPERATOR NUMBER.
When matching a POINT I want to get at its COORDINATE's.

Cheers,
Chris

Jim Idle

unread,
Apr 14, 2014, 9:57:37 PM4/14/14
to antlr-di...@googlegroups.com
You are trying to do too much in the lexer. Just lex out the fundamental units and put these constructs in to the parser and your issues should go away.

Jim


--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sam Harwell

unread,
Apr 14, 2014, 10:01:45 PM4/14/14
to antlr-di...@googlegroups.com

Arguments in fragment rules could be useful. In fact, the original implementation allowing actions at arbitrary locations allowed them, but this functionality was removed in commit 214b715. There are several problems with this functionality, but the primary problem is it produced a major breaking change to many complicated grammars involving multiple modes. For example, consider the following grammar:

https://github.com/tunnelvisionlabs/antlrworks2/blob/master/org-antlr-works-editor/src/org/antlr/works/editor/st4/experimental/TemplateLexer.g4

 

In this grammar the rule TemplateExpression_LBRACE references rule LBRACE. If actions from invoked rules were executed, the semantics of TemplateExpression_LBRACE would now result in two calls to pushMode(AnonymousTemplate). In addition to sneaky but major breaking changes to existing grammars, the mental effort required to maintain an understanding of the impact of actions in invoked rules vastly outweighed any possible benefit of allowing this scenario. We chose instead to implement LexerATNSimulator in a manner that always ignores these actions, which preserves existing behavior for pre-4.2 grammars, but still allows additional action placement starting with 4.2.

 

Sam

--

Christer Palm

unread,
Apr 15, 2014, 3:28:14 AM4/15/14
to antlr-di...@googlegroups.com
On 2014-04-15 03:57, Jim Idle wrote:
> You are trying to do too much in the lexer. Just lex out the
> fundamental units and put these constructs in to the parser and your
> issues should go away.
Right, but there's a good reason for that.
Looking at my example (which btw lacks a POINT1D for the comment on X+1
to make sense, I was a bit tired when writing that) I might rewrite it as:

NUMBER: DIGITS ;
IDENTIFIER: [a-zA-Z] [a-zA-Z0-9]+ ;
OPERATOR: '+' | '-' ;

POINTX: 'X' ;
POINTY: 'Y' ;
POINTZ: 'Z' ;

point1d: POINTX OPERATOR? NUMBER ;
point2d: POINTX OPERATOR? NUMBER POINTY OPERATOR? NUMBER ;
point3d: POINTX OPERATOR? NUMBER POINTY OPERATOR? NUMBER POINTZ
OPERATOR? NUMBER ;

Which (correct me if I'm wrong) will cause me to loose the
disambiguation between POINTxD and INDENTIFIER. Now every 'X' will give
me a POINTX, but for instance a single 'X' all by itself is an
IDENTIFIER. Unless the parser can reject a token and make the lexer
"backtrack" and return an alternate match, it won't work.
Now, in this particular example it might be possible to patch up the
situation with some clever parser rules, but in my actual grammar the
token boundaries are sometimes completely different depending on the
alternative.

Cheers,
Chris

Christer Palm

unread,
Apr 15, 2014, 5:40:18 PM4/15/14
to antlr-di...@googlegroups.com
On 2014-04-15 04:01, Sam Harwell wrote:

Arguments in fragment rules could be useful. In fact, the original implementation allowing actions at arbitrary locations allowed them, but this functionality was removed in commit 214b715. There are several problems with this functionality, but the primary problem is it produced a major breaking change to many complicated grammars involving multiple modes. For example, consider the following grammar:

https://github.com/tunnelvisionlabs/antlrworks2/blob/master/org-antlr-works-editor/src/org/antlr/works/editor/st4/experimental/TemplateLexer.g4

 

In this grammar the rule TemplateExpression_LBRACE references rule LBRACE. If actions from invoked rules were executed, the semantics of TemplateExpression_LBRACE would now result in two calls to pushMode(AnonymousTemplate). In addition to sneaky but major breaking changes to existing grammars, the mental effort required to maintain an understanding of the impact of actions in invoked rules vastly outweighed any possible benefit of allowing this scenario. We chose instead to implement LexerATNSimulator in a manner that always ignores these actions, which preserves existing behavior for pre-4.2 grammars, but still allows additional action placement starting with 4.2.


Thanks for the backgrounder. I was already somewhat familiar with this history from reading previous threads on the subject, but it's still very interesting.
Yes, thinking about it, I do agree that there are certainly conflicting use-cases for lexer actions wrt whether subrule actions should fire or not.
AFAICT the additional action placement enhancement is primarily needed to support the use case of marking/capturing subexpressions. However, many of the situations where you would want to do subexpression marking/capturing (i.e. "composite" tokens) are also the situations where you'd want to use fragments. Which makes it quite unfortunate that you'd have to choose between one or the other.

A couple of ideas;

 - With the "new style" action verbs, ANTLR is able to support certain verbs in certain places depending on whether their semantics make sense in that context. If new verbs were added to support subexpression marking/capturing they could/should be allowed in subrules. Verbs like type, pushMode and popMode OTOH shouldn't be invoked in subrules and should probably throw an error if used in a fragment rather than just triggering a warning like they do today. Old style {}-actions probably needs to be handled just like before to avoid the "sneaky" side effects.

 - Would it make sense to execute subrule actions in fragments but not non-fragment subrules?

Cheers,
Chris

Jim Idle

unread,
Apr 15, 2014, 9:08:09 PM4/15/14
to antlr-di...@googlegroups.com
I have never found a situation where something like this cannot be done in the parser but without knowing exactly what you are parsing it is difficult to see what you need to parse. If your language is such that tokens change according to context, then apart from the language being badly specified (at least for other than syntax directed hand crafted parsing and lexing), then if you cannot use lexer modes, you will need to make the tokens as generic as possible. 

In the above case your best choice would be something akin to this:

atom: POINTX OPERATOR? NUMBER (POINTY OPERATOR? NUMBER (POINTZ OPERATOR? NUMBER)?)?
 | ident
;

ident: // In your parse tree walker check i and k for null, the one that is not null is your IDENT
 | i=IDENT
 | k=keyword 
 ;

keyword:
 | POINTX
 | POINTY
 | POINTZ
;

In v3 you would need:

atom: (POINTX OPERATOR? NUMBER)=> POINTX OPERATOR? NUMBER ... etc

In v4 you could also use:

ident: i=(IDENT| POINTX| POINTY|POINTXZ) ;

And ignore the token type of i in your walker.

But whatever your actual situation, trying to solve it in the lexer without using modes will usually be very difficult and in any case usually degrades the error reporting capabilities of your program. In general:
  1. Simplest lexer with hopefully no actions
  2. Simplest parser with no actions unless it is useful to do things like build known symbols at this stage
  3. Walk the parse tree, disambiguate, verify semantics, usually generate some model of the translation unit
  4. Process the model
Always defer error processing to as far down the tool chain as you can as you will then have more context available to give more accurate error messages. 

I hope this is of help to you.

Jim




Cheers,
Chris

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussion+unsubscribe@googlegroups.com.

Christer Palm

unread,
Apr 16, 2014, 3:20:25 AM4/16/14
to antlr-di...@googlegroups.com
On 2014-04-16 03:08, Jim Idle wrote:
> I have never found a situation where something like this cannot be
> done in the parser but without knowing exactly what you are parsing it
> is difficult to see what you need to parse. If your language is such
> that tokens change according to context, then apart from the language
> being badly specified (at least for other than syntax directed hand
> crafted parsing and lexing), then if you cannot use lexer modes, you
> will need to make the tokens as generic as possible.
>
> In the above case your best choice would be something akin to this:
> [...]
Thanks for taking your time Jim!

I wouldn't argue against the language beeing badly specified, but it's a
legacy grammar and I don't get to specify it. I can tell you it's even
hard to write a hand-crafted parser for it.

Let's stick to my example of POINTs because I think it's a trivial
example which clearly demonstrates the core of the problem without
mudding it with the complexity of the complete grammar. I don't want to
be a bully, but I think we also need to keep in mind that the language
/is/ indeed trivially disambiguated by the ANTLR lexer as long as
literal == token, which I am perfectly happy with per se as that's
exactly what my backend wants as tokens.

Unfortunately, your proposed solution fails for input like "X1" which is
improperly recognized as an ident. If my understanding is correct, this
is because the ANTLR lexer will choose the longest match if there are
several alternatives. So "X1" becomes IDENTIFIER rather than POINTX NUMBER.

Here's, by the way, a full working example:

----
grammar Point;

@header {
package antlr;
}

NUMBER: DIGITS ;
OPERATOR: '+' | '-' ;
POINT: POINT1D | POINT2D | POINT3D ;
IDENTIFIER: [a-zA-Z] [a-zA-Z0-9]+ ;

fragment POINT1D: 'X' COORDINATE ;
fragment POINT2D: 'X' COORDINATE 'Y' COORDINATE ;
fragment POINT3D: 'X' COORDINATE 'Y' COORDINATE 'Z' COORDINATE ;
fragment COORDINATE: ( '+' | '-' )? DIGITS ;
fragment DIGITS: [0-9]+ ;

point: POINT { System.out.println("POINT: " + $POINT.text); };
operator: OPERATOR { System.out.println("OPERATOR: " + $OPERATOR.text); };
identifier: IDENTIFIER { System.out.println("IDENTIFIER: " +
$IDENTIFIER.text); };
number: NUMBER { System.out.println("NUMBER: " + $NUMBER.text); };
expression: point | operator | identifier | number ;
input: expression+ ;
----

Cheers,
Chris

Kevin Cummings

unread,
Apr 16, 2014, 1:53:38 PM4/16/14
to antlr-di...@googlegroups.com
Chris, as an innocent bystander here looking on, can I ask:

If X+1 is a Point, is X + 1 also a Point? If the white space is allowed, you will almost have to do the recognition in the parser.

--
Kevin J. Cummings
kjc...@verizon.net
cumm...@kjchome.homeip.net
cumm...@kjc386.framingham.ma.us
Registered Linux User #1232
(http://www.linuxcounter.net/)
> --
> You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.

Terence Parr

unread,
Apr 16, 2014, 1:56:48 PM4/16/14
to antlr-di...@googlegroups.com
If you want to get really crazy, like me, you can pass all characters to the parser as tokens and do scanner list parsing. There is a tiny example in the source code for the book in extras I think.
T

Christer Palm

unread,
Apr 17, 2014, 9:37:10 AM4/17/14
to antlr-di...@googlegroups.com
On 2014-04-16 19:56, Terence Parr wrote:
> If you want to get really crazy, like me, you can pass all characters to the parser as tokens and do scanner list parsing. There is a tiny example in the source code for the book in extras I think.
>
:-)
Right, I read about that in the book (it's mentioned in 5.6 Drawing the
Line Between Lexer and Parser) and it did caught my mind as something to
look deeper into. I haven't looked at the referenced grammar
(code/extras/CSQL.g4) yet though so I really can't say much about that
alternative at this point.

I guess that using a parser that isn't designed to be scannerless in
this way is not without some interesting issues, but I will certainly
look into it.

> On Apr 16, 2014, at 10:53 AM, Kevin Cummings <cumm...@kjchome.homeip.net> wrote:
>
> Chris, as an innocent bystander here looking on, can I ask:
>
> If X+1 is a Point, is X + 1 also a Point? If the white space is allowed, you will almost have to do the recognition in the parser.
>
"X + 1" would not be a POINT so that wouldn't be an issue in this case.

But for the sake of argument I don't see why an ANTLR generated lexer
would have any particular problems with ambiguities involving whitespace
characters. I mean, to the lexer, whitespace characters are just like
any other characters. For example, if you have:

FOOBAR: 'foo' WS 'bar' ;
IDENTIFIER: [a-zA-Z]+ ;
WS: ' '+ -> skip ;

Then "foo bar" becomes FOOBAR because that's the first rule that matches
the input and "foo baz" becomes IDENTIFIER IDENTIFIER. I'd say it's
really no different than relying on the lexer to be able to tell
keywords from identifiers.

Cheers,
Chris

Jim Idle

unread,
Apr 20, 2014, 9:19:28 PM4/20/14
to antlr-di...@googlegroups.com
The problem with a lexer only approach though is that input like this:

X4Y something else

Will cause your lexer to just fall over and the error message won't be very good. However if you are expecting all your input to be syntactically sound (such as for a converter) then this might bot be an issue for you. But i would just use IDENTIFIER and then code to check for the co-ordinate form in the parser.

Good luck :)

Jim




Cheers,
Chris

Christer Palm

unread,
Apr 23, 2014, 6:24:25 PM4/23/14
to antlr-di...@googlegroups.com
On 2014-04-21 03:19, Jim Idle wrote:
> The problem with a lexer only approach though is that input like this:
>
> X4Y something else
>
> Will cause your lexer to just fall over and the error message won't be
> very good. However if you are expecting all your input to be
> syntactically sound (such as for a converter) then this might bot be
> an issue for you.
First of all I wouldn't call it a lexer-only approach. I do have a
parser which takes the POINTs, IDENTIFIERs, NUMBERs and OPERATORs and
does all the usual parsery things with them.

I would rather compare my use case to the usual practice of having the
lexer match floating-point literals as a single token. The difference
beeing that my literals are a bit more complex and I don't have the
benefit of boilerplate methods like Double.parseDouble() to re-scan and
parse the token text.

I do parse user input from the wild and I do agree that lexer-level
errors won't be as good as those emitted by the parser. Although I'm not
too sure that parser level errors would generally be a lot more
meaningful in this case.

By the way, the example you give /will/ match, as IDENTIFIER("X4Y")
IDENTIFIER("something") IDENTIFIER("else"), which illustrates another
problem with these kind of grammars - the high likelyness that a simple
typing mistake produces something totally different than what the user
intended. It may even be valid input. Consider, for comparison, "double
x = 12f+3;" in Java. If the user actually meant 12e+3 he's in for a
surprise.

> But i would just use IDENTIFIER and then code to check for the
> co-ordinate form in the parser.
Well, first of all you can't, because not all POINT's have the form of
IDENTIFIER's. For example "X+1".

But even if I could, my actual grammar needs ~1000 lines of code to
break apart my POINT equivalents. The vast majority of that code is just
replicating the work the lexer just did to match the token, including
backtracking and sh*t, which is just a royal pain to implement and get
right and kind of the major reason to why I want to use a parser
generator in the first place rather than hand-coding it. Which is
exactly my motivation for the ability to capture the matching text of
lexer sub-rules. Now, it may sound like a stupid corner case - but for
comparison you should consider that Double.parseDouble() needs close to
250 lines of code to do its seemingly simple thing.

> Good luck :)
Thanks Jim!

As a side note, I have experimented with the "token per character"
approach suggested by Terence.
I did have problems with NullPointerException's in the default error
handling strategy which turned out to be due to the fact that the
CharsAsTokens TokenSource provided with the book example didn't
implement get/setTokenFactory() properly. Once I fixed that it seems to
work quite well, despite its uglyness. I highly doubt that it performs
very well though, although I haven't ran any performance tests on it yet.

Cheers,
Chris

David Whitten

unread,
Apr 23, 2014, 6:42:50 PM4/23/14
to antlr-di...@googlegroups.com
On Wed, Apr 23, 2014 at 6:24 PM, Christer Palm <pa...@nogui.org> wrote: 

As a side note, I have experimented with the "token per character" approach suggested by Terence.
I did have problems with NullPointerException's in the default error handling strategy which turned out to be due to the fact that the CharsAsTokens TokenSource provided with the book example didn't implement get/setTokenFactory() properly. Once I fixed that it seems to work quite well, despite its uglyness. I highly doubt that it performs very well though, although I haven't ran any performance tests on it yet.


Cheers,
Chris

could you share the "fixed" getTokenFactory() and setTokenFactory()
and tell us why the original one was broken ?

Thanks
David 

Terence Parr

unread,
Apr 23, 2014, 6:44:08 PM4/23/14
to antlr-di...@googlegroups.com
hahahha. yeah,I noticed they were broken when I tried to use it for something real the other day myself. here you go:


@Override
public TokenFactory<?> getTokenFactory() {
return CommonTokenFactory.DEFAULT;
}


Ter
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages