(known) Issues with importing grammars?


Athanasios Anastasiou

May 5, 2014, 11:13:53 AM
to antlr-di...@googlegroups.com
Hello everyone

I seem to be having some trouble using tokens that are defined in imported grammars, and I am not sure whether I have come across something that is documented or known.

The situation is this:

//---File: Alpha.g4---
lexer grammar Alpha;
...
...
NUM:[0-9];
SOMETOKEN:NUM+ '.' NUM+;
//-----------------------------------------

//---File: Beta.g4---
lexer grammar Beta;

import Alpha,Gamma; //(Gamma contains yet more token definitions that are common across various grammars)

somerule:SOMESTANDARDTEXT SOMETOKEN;

//SOMETOKEN: [0-9]+ '.' [0-9]+;
//-----------------------------


Entering the test rig (in ANTLRWorks 2.1), I get a "line 1:23 mismatched input '6.2' expecting SOMETOKEN" error.
If I now uncomment the SOMETOKEN definition within Beta.g4 (and comment it out in Alpha), then the token is parsed without any problem.

Any ideas as to what I am doing wrong? (Is this something to do with inheritance? E.g. when importing, are all rules and tokens assumed to be private or protected?)

Looking forward to hearing from you
AA

Terence Parr

May 5, 2014, 12:56:07 PM
to antlr-di...@googlegroups.com
Hi. You probably want rule NUM to be a fragment rule as opposed to matching a digit all by itself, right? Let's start with that and see if it fixes it.
T


--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Dictation in use. Please excuse homophones, malapropisms, and nonsense. 

Athanasios Anastasiou

May 7, 2014, 5:20:42 AM
to antlr-di...@googlegroups.com
Hello

Thanks for the suggestion. My question is more general, as I am coming across a lot of tokens which end up containing other tokens. I could therefore be doing something conceptually wrong as far as ANTLR is concerned, and perhaps I need to convert some tokens to rules composed of irreducible and well-defined tokens. Having said that, can I please ask:

1) Is a fragment still usable and visible on its own?

2) Are token references (tokens referenced inside other tokens) resolved properly? (e.g. MINUS:'-'; PLUS:'+'; SIGN:(MINUS|PLUS); NUM: SIGN? [0-9]+;)

All the best
AA



--
You received this message because you are subscribed to a topic in the Google Groups "antlr-discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/antlr-discussion/sHqXIskCjBE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to antlr-discussi...@googlegroups.com.

Athanasios Anastasiou

May 7, 2014, 3:58:04 PM
to antlr-di...@googlegroups.com
Hello again

I tried a number of different things, including specifying certain parts of the grammar as fragments, but I still get the same error.

The test case is as follows:

alpha:NAMESTR;
NAMESTR:[a-zA-Z][a-zA-Z0-9_]+;

The test string is "blah" and the error message is

"line 1:0 mismatched input 'blah' expecting NAMESTR"

I am trying to debug a complex grammar (spanning 3 files) in ANTLRWorks 2 and I can only progress up to a point before I start receiving these errors, which I find a bit puzzling.

Any help or pointer towards what might be the problem would be greatly appreciated.

All the best
AA

Terence Parr

May 7, 2014, 5:18:23 PM
to antlr-di...@googlegroups.com
On Wed, May 7, 2014 at 2:20 AM, Athanasios Anastasiou <athana...@gmail.com> wrote:

> 1) Is a fragment still usable and visible on its own?

Nope. It is literally a fragment that cannot be seen by the parser.

> 2) Are token references (tokens referenced inside other tokens) resolved properly? (i.e. MINUS:'-'; PLUS:'+'; SIGN:(MINUS|PLUS); NUM: SIGN? [0-9]+;

Sure :)

Ter

Terence Parr

May 7, 2014, 5:21:06 PM
to antlr-di...@googlegroups.com
On Wed, May 7, 2014 at 12:58 PM, Athanasios Anastasiou <athana...@gmail.com> wrote:

> The test case is as follows:
>
> alpha:NAMESTR;
> NAMESTR:[a-zA-Z][a-zA-Z0-9_]+;

Hi. That should be * not +.

> The test string is "blah" and the error message is
>
> "line 1:0 mismatched input 'blah' expecting NAMESTR"

When this sort of thing happens, it is typically a token type mismatch. If you are not using import, then you must make sure to use the tokenVocab option.

Import should bring in all of the token definitions properly. It could be that the grammars are being compiled by ANTLR individually and in the wrong order.

The simplest thing to do in your situation is to keep your lexer in a completely separate grammar. Then you would only need to do an import on one grammar, but you will need tokenVocab as I mentioned.
T
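
A minimal sketch of the * versus + remark above, using Python's re module as a stand-in for the ANTLR lexer rule (the stand-in and the names NAMESTR_PLUS, NAMESTR_STAR and matches are assumptions of this illustration, not part of the thread). With +, NAMESTR needs at least two characters, so a one-letter name is rejected, whereas "blah" is accepted by both forms. That suggests the quantifier on its own is probably not what triggers the 'blah' error.

```python
import re

# Python regexes standing in for the ANTLR rule NAMESTR (illustration only).
NAMESTR_PLUS = re.compile(r"[a-zA-Z][a-zA-Z0-9_]+")  # as posted: at least two characters
NAMESTR_STAR = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")  # as suggested: one character is enough


def matches(pattern, text):
    """Return True if the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None


for text in ("blah", "x"):
    print(text, matches(NAMESTR_PLUS, text), matches(NAMESTR_STAR, text))
# blah True True
# x False True
```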

Athanasios Anastasiou

May 9, 2014, 4:29:10 AM
to antlr-di...@googlegroups.com
Hello Terence

I have not had the time to apply these modifications yet, but I would just like to thank you for your responses.

There are a lot of places in this grammar where simple tokens are composed of other tokens. I have now started decoupling them as much as I can, so that the leaves of the tree are simple irreducible tokens and anything "composable" is a rule. It is not straightforward, though, because each correction in a definition has to be propagated to a few different files. I went from having something I thought I was going to debug at the syntax level to something that needs to be debugged at the source level :)

As a side note, another thing I noticed, and I think it applies to both ANTLRWorks 2 and the SDK, is that sometimes they will function normally even if a rule has not been defined or is erroneously defined... It usually happens with grammars spread over a number of different files.

All the best
AA




Terence Parr

May 9, 2014, 12:50:41 PM
to antlr-di...@googlegroups.com
OK, thanks for the heads up. Please let me know if you find a simple example where it is not handling missing rules correctly across files.
T

Athanasios Anastasiou

May 13, 2014, 4:04:01 PM
to antlr-di...@googlegroups.com
Hello

Can we please confirm whether the following is a bug or something that I am doing wrong?

As the grammar I am dealing with is by now spread over a number of different files, I decided to "re-build" it bottom-up, testing each element as I go along.

It turns out that this "expecting [something]" error is not due to inheritance, or to ANTLR getting its tokens mixed up across different grammars and requiring the tokenVocab option.

Here is an example: I wish to test CODE_STR, and prior to it I only have a few simple declarations. Keeping VALUE_STR commented out, rule u works as expected. If I uncomment VALUE_STR, I get the "line 1:0 mismatched input '0.0.0.1278' expecting CODE_STR" error.

In fact, it must be something to do with the '.' and '-' characters of VALUE_STR (i.e. a token that is completely unrelated to the one I am trying to parse). I have tried escaping those characters in the set definition, but whether they are escaped or not doesn't seem to make any difference.

//VALUE_STR                  :[a-zA-Z0-9\._\-]+;
NAMESTR                    :[a-zA-Z][a-zA-Z0-9_]*;
ALPHANUM_CHAR              :[a-zA-Z0-9_]+;
NUM                        :[0-9]+;
CODE_STR                   :([0]|[1-9][0-9]*)('.'([0]|[1-9][0-9]*))*;

u:CODE_STR;

What do you think?

Terence Parr

May 13, 2014, 5:02:56 PM
to antlr-di...@googlegroups.com
Uh. All of them can match the same input, say, 4, right?

I think you need to figure out which char sequences are which lexemes.

Ter



Athanasios Anastasiou

May 13, 2014, 5:16:49 PM
to antlr-di...@googlegroups.com
I don't think I made it clear, but the above are part of a larger definition.

Each one of these separate tokens will later be used to construct larger definitions.

In my case, I wanted to test whether CODE_STR really did recognise "1.4.5".

To do this, I inserted a "dummy" rule called "u". But upon the first test, I received that error.

I then started removing definitions until "u" worked as expected.

The last definition I removed, the one that seemed to make "u" work, was VALUE_STR.

VALUE_STR does not participate in the definition of u or CODE_STR.

Trying to understand why VALUE_STR causes this problem, I concluded that it is probably the \. or \- characters, because when I remove them, rule "u" seems to succeed in parsing the test case, at least within the TestRig.

Rule "u" is extremely simple; it should not be returning that error, and yet it does.





Jim Idle

May 13, 2014, 10:37:56 PM
to antlr-di...@googlegroups.com
I think that you are misunderstanding how this all works.

You seem to be thinking in terms of a rule "calling" a LEXER rule. However, LEXER rules are completely independent of the parser rules.

So your u rule may be "expecting" to get a CODE_STR, but in fact the parser will ask the lexer for the next token and the lexer will return whatever matched the input stream. As many of your tokens overlap, the lexer will match the first one in your list, VALUE_STR, and so your parser will throw a syntax error. When you comment out VALUE_STR, the lexer just finds the next one in the list, and so on until you comment out all but CODE_STR.

In your tokens above pretty much all of them can match the same input as the others, so this is never going to work. Are you not getting any errors/warnings about overlapping tokens when you generate from the grammars?

Basically you need to go back to basics and read the online tutorials and preferably the book(s) as well or I think that you are going to find yourself frustrated and unable to continue.

Jim
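
To make the point above concrete, here is a minimal sketch in Python of how a lexer picks a token without consulting the parser. It assumes the rule stated elsewhere in this thread (the longest match wins, and ties go to the rule listed first) and uses Python regexes as stand-ins for the posted lexer rules; the function tokenize and the RULES table are hypothetical names for this illustration, and this is not how ANTLR is implemented internally.

```python
import re

# Lexer rules from the post above, in grammar order, rewritten as Python regexes.
RULES = [
    ("VALUE_STR", r"[a-zA-Z0-9._\-]+"),
    ("NAMESTR", r"[a-zA-Z][a-zA-Z0-9_]*"),
    ("ALPHANUM_CHAR", r"[a-zA-Z0-9_]+"),
    ("NUM", r"[0-9]+"),
    ("CODE_STR", r"(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"),
]


def tokenize(text, rules=RULES):
    """Split text into (rule name, lexeme) pairs: longest match wins, ties go to the earlier rule."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strictly greater, so an earlier rule keeps the tie.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("0.0.0.1278"))             # [('VALUE_STR', '0.0.0.1278')]
print(tokenize("0.0.0.1278", RULES[1:]))  # [('CODE_STR', '0.0.0.1278')]
print(tokenize("4", RULES[1:]))           # [('ALPHANUM_CHAR', '4')]
```

With VALUE_STR present, both VALUE_STR and CODE_STR match all ten characters and the earlier rule wins, so the parser never sees a CODE_STR. With VALUE_STR removed, CODE_STR is the only rule that matches the whole input. The '.' and '-' in VALUE_STR matter only because they let it match as much of the input as CODE_STR does.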





Athanasios Anastasiou

May 14, 2014, 3:50:16 AM
to antlr-di...@googlegroups.com
Thank you very much for the clarification Jim, it was really helpful. Yes, that's pretty much how I thought things were working: that the parser would follow the rules rather than check whether they agree with the input.

I am transcribing these rules from a set of yacc resources, but it seems some translation needs to be performed here as well if no token should overlap with any other token, and that might drastically alter the way many of them are expressed.

No, I do not get any "overlapping tokens" errors.

Thank you for your help

All the best
AA


Athanasios Anastasiou

May 15, 2014, 12:47:21 PM
to antlr-di...@googlegroups.com
Hello everyone

Alright, I had another look at the overall framework with respect to what I am trying to express here, and I have come to a slightly awkward conclusion which I thought I would verify with you.

Here is the briefest example of what I am trying to say:

NUM:'0'|ONUM;
ONUM:[1-9];
DOT:'.';
SIGN:'+'|'-';

SOMETOKEN: ('0'|ONUM NUM*) (DOT ('0'|ONUM NUM*))*; //That is, 0.0.0.0.0 OR 1.1.0 OR 123245.34723.34334 BUT NEVER 0000000.012312.087

SOMEOTHERTOKEN:(NUM+ DOT NUM+) (DOT NUM+)*; //Any number sequence delimited by a single dot and containing at least 2 elements.

YETANOTHERTOKEN: SIGN? NUM+ (DOT NUM+)?; // A very simple floating point number;

With these three sequences (and they are not the only ones), I am again running into the same problem of having two or more rules for parsing the same expressions. Perhaps some re-ordering of the rules would cull a few erroneous states, but sooner or later ANTLR could come across some input that fits any of these rules (simple example: "4.5" can be parsed by SOMEOTHERTOKEN or YETANOTHERTOKEN equally well).


I suppose that the answer here is that I should leave only one way of expressing a list of numbers delimited by dots (am I correct about this?).

Also, am I right in understanding that any checking of what has been parsed would have to be handled later by code rather than at the parsing level? (I wouldn't like this, but if that's the way it's done... that's the way it has to be done.)

David Whitten

May 15, 2014, 2:56:42 PM
to antlr-di...@googlegroups.com
Could you turn SOMETOKEN, SOMEOTHERTOKEN, and YETANOTHERTOKEN into rules instead of token definitions?


Athanasios Anastasiou

May 16, 2014, 5:22:14 AM
to antlr-di...@googlegroups.com
Hello David

Thank you for your response.

It was one of the things I tried earlier too, but I don't think that this would solve the problem of rules competing for the same input (at least in the specific example I give above).

The whole thing is reminiscent of prefix codes: codes constructed in such a way that they don't need start/stop markers in a stream. I guess that this is ultimately what the parser is doing with the tokens produced by the lexer (http://en.wikipedia.org/wiki/Prefix_code).

All the best
AA







David Whitten

May 16, 2014, 12:23:59 PM
to antlr-di...@googlegroups.com
If your tokens are unique, i.e. every character string generates only a single token, then you let the rules fight it out over tokens.

Any system that has no keywords will have this problem.  The same text string can be used as a keyword or as an identifier.  Essentially, things that look like keywords in one part of the parse are simply identifiers in a different part of the parse.

It seems your problem is that your tokens contain other tokens. To my knowledge, this is not what you want to do. Whatever you find to be your base tokens should be your only tokens. Combining tokens together to make bigger units is the job of the parser rules. Token processing does not have backtracking. Parser rules do have backtracking and thus can back up when a token that violates the definition of a particular rule is encountered.

To the rest of the mailing list: If I am wrong about ANTLR and the previous paragraph, I would like to know very soon, as grammar creation, testing, and use is a taxing intellectual effort, and I want to know if ANTLR is not suitable for the context-dependent task I am embarking on.

David

Terence Parr

May 16, 2014, 1:19:21 PM
to antlr-di...@googlegroups.com

On May 16, 2014, at 9:23 AM, David Whitten <whi...@netcom.com> wrote:

> If your tokens are unique, i.e. every character string only generates a single token then you let the rules fight it out over tokens.

Yep. The longest token wins, and if a single sequence is matched by multiple rules, ANTLR chooses the token mentioned first in the grammar.

> It seems your problem is that your tokens contain other tokens. To my knowledge, this is not what you want to do.

this typically means you need to move some of the structure specification up into the parser and use simpler tokens

> Whatever you find are your base tokens, should be your only tokens. Combining tokens together to make bigger units is the job of the parser rules.

yep!

> Token processing does not have backtracking. Parser rules do have backtracking and thus can back up when a token that violates the definition of a particular rule is encountered.

Well, I wouldn’t talk about backtracking because nothing really backtracks in ANTLR anymore. Just assume that it does the right thing given a grammar specification if you don’t get a tool error.

>
> To the rest of the mailing list: If I am wrong about ANTLR and the previous paragraph, I would like to know very soon, as grammar creation, testing, and use is a taxing intellectual effort, and I want to know that ANTLR is not suitable for the context dependent task I am embarking on.

Context-dependent languages are difficult to build parsers for. Doing so requires semantic predicates, clever use of rules, or parser/lexer interaction.

Ter

David Whitten

May 16, 2014, 1:52:23 PM
to antlr-di...@googlegroups.com
Thank you Terence for taking the time to review what I said.

I am rather concerned about your statement that backtracking doesn't work any longer.

To be clear, if I have the input:

   COMMAND ARG1 COMMAND ARG2

and I have a rule

rulelist ::== ruleitem+

ruleitem ::== rule1 | rule2 | rule21 

rule1 ::== COMMAND ARG1
rule2 ::== COMMAND ARG2
rule21 ::== COMMAND ARG1 COMMAND ARG2


which parse would be recognized ? 
(I don't know a notation to express this, if there is a better one, please tell me)

rulelist
    ruleitem
        rule21
            rule1
            rule2
or

rulelist
    ruleitem
        rule1
    ruleitem
        rule2

One of my goals is to take code that already exists and transform it into other code.
I want to put in general rules so that all input can be parsed,
but also include special-case rules that I can recognize, and generate special-case code.

Am I right in thinking that the ANTLR tool is capable of this?

Athanasios Anastasiou

May 16, 2014, 3:13:37 PM
to antlr-di...@googlegroups.com

Hello everyone

Thank you for your responses.

I think that my "problem" is a bit more complicated than that. Some dot-delimited numeric lists denote versions and are supposed to have at least two elements (i.e. 1.2); one of these versions is constrained to any series of 1.1 (or more). The other can be any digit delimited by dots up to any depth. A third rule (which I had to remove, or rather re-express) was parsing floating point numbers (1.1), and another one was parsing generic "values", including dot-delimited numbers. A fourth rule was parsing identifiers, including accessing member attributes.

The majority of those I am re-expressing so that they make more sense.

But when it comes to float vs. "dot numeric" (of any depth, but at least two), I am at a loss. I don't know how to get ANTLR to recognise the difference.

In an earlier question of mine about expressing constraints like [0-9]{0,3} in ANTLR, another gentleman (whose name I can't find right now, sorry) suggested that this kind of checking should occur as a semantic check.

I think that the use case I describe above could support some (distant :) ) future modification of ANTLR's meta-language so that it can express such token constraints. In this way the meta-language would be a stand-alone model of the language it describes, containing all constraint information in one place, rather than half of it in the specification and half of it somewhere in the code.

In the meantime, any ideas on how to deal with that little problem are welcome :) (preferably not involving 'code'... I don't know which language could be used to build a parser in)

All the best
AA

Sam Harwell

May 16, 2014, 4:19:13 PM
to antlr-di...@googlegroups.com
Hi David, 

It's not that backtracking doesn't work. It's simply unnecessary for ANTLR 4. Backtracking is only used in previous versions as a fallback mechanism to handle very long or nondeterministic lookahead sequences. The ALL(*) algorithm used in ANTLR 4 does not have any such limitation, so there is no problem left for backtracking to solve. 

Sam

Sam Harwell

May 16, 2014, 4:26:09 PM
to antlr-di...@googlegroups.com
Until someone invents a radically improved error handling mechanism, every semantic constraint you move from the parser to code that analyzes the parse tree will substantially improve the end user experience when that error appears in their code. The corollary is that if you express all of your semantics in the parser, then the tool you are creating will be severely impaired when it comes to informing users about any error in their source code.

Sam
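
One way to read the advice above, sketched in Python under some assumptions: the lexer emits a single generic dotted-number token, and code that runs after the parse decides which of the readings discussed earlier in the thread the text satisfies. The function classify_dotted and the reading names "code", "version" and "float" are hypothetical; they loosely mirror SOMETOKEN, SOMEOTHERTOKEN and YETANOTHERTOKEN from the earlier post and are not an ANTLR API.

```python
import re

DIGITS = re.compile(r"[0-9]+")
NO_LEADING_ZERO = re.compile(r"0|[1-9][0-9]*")


def classify_dotted(text):
    """Return the set of readings a dotted numeric lexeme could have."""
    signed = text.startswith(("+", "-"))
    parts = (text[1:] if signed else text).split(".")
    if not all(DIGITS.fullmatch(part) for part in parts):
        raise ValueError(f"not a dotted number: {text!r}")
    readings = set()
    if len(parts) <= 2:
        # Optional sign, then one or two digit groups.
        readings.add("float")
    if not signed and len(parts) >= 2:
        # Any digit groups, at least two of them.
        readings.add("version")
    if not signed and all(NO_LEADING_ZERO.fullmatch(part) for part in parts):
        # Every group is 0 or has no leading zero.
        readings.add("code")
    return readings


for text in ("4.5", "1.4.5", "0000000.012312.087", "-1.5"):
    print(text, sorted(classify_dotted(text)))
# 4.5 ['code', 'float', 'version']
# 1.4.5 ['code', 'version']
# 0000000.012312.087 ['version']
# -1.5 ['float']
```

The first line of output shows why the lexer alone cannot decide: "4.5" satisfies all three readings, so the choice has to come from where the token appears in the parse. A check like this can also report a specific message (for example, "leading zero in component 2") instead of a generic mismatched-input error.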



David Whitten

May 16, 2014, 4:49:21 PM
to antlr-di...@googlegroups.com
Sam,
thank you for your reply.

I'm sorry if I am repeating my question, but I didn't understand how to apply your answer to my question in the previous e-mail.
I am hoping that ANTLR is able to support multiple rules (not multiple ways to create a token) that can provide two different interpretations of a token stream. I think I am happy with the principle that the rule matching the largest number of tokens is best; in my example, ruleitem parsing as rule21 would match the largest number.

Again, in my example both possible trees would match the same number of tokens.
How do I control which parse tree is built by ANTLR 4?

Sam Harwell

May 16, 2014, 5:14:43 PM
to antlr-di...@googlegroups.com
In the event of an ambiguity in the parser, the alternative which appears first among the ambiguous alternatives is used. This means for the specific grammar you posted, rule21 would never be parsed because rule1 is also viable and appears first. 

The parse tree it produces is the second one you listed. 

Sam
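
The tie-break described above can be illustrated with a toy enumeration in Python. This is a sketch under assumptions, not how ALL(*) works: it simply lists every way to split the token stream into ruleitems and then prefers the parse that uses the earliest-listed alternative at the first point where the parses differ. The names ALTERNATIVES, all_parses and chosen_parse are hypothetical.

```python
# Alternatives of ruleitem in grammar order (rule names from the earlier post).
ALTERNATIVES = [
    ("rule1", ["COMMAND", "ARG1"]),
    ("rule2", ["COMMAND", "ARG2"]),
    ("rule21", ["COMMAND", "ARG1", "COMMAND", "ARG2"]),
]


def all_parses(tokens, alternatives):
    """Yield every way to split tokens into a sequence of (index, name) ruleitems."""
    if not tokens:
        yield []
        return
    for index, (name, body) in enumerate(alternatives):
        if tokens[:len(body)] == body:
            for rest in all_parses(tokens[len(body):], alternatives):
                yield [(index, name)] + rest


def chosen_parse(tokens, alternatives=ALTERNATIVES):
    """Among all parses, prefer the earliest-listed alternative at each decision."""
    parses = list(all_parses(tokens, alternatives)) if tokens else []
    if not parses:
        raise ValueError("no parse")
    best = min(parses, key=lambda parse: [index for index, _ in parse])
    return [name for _, name in best]


tokens = ["COMMAND", "ARG1", "COMMAND", "ARG2"]
print(chosen_parse(tokens))                                         # ['rule1', 'rule2']
print(chosen_parse(tokens, [ALTERNATIVES[2]] + ALTERNATIVES[:2]))   # ['rule21']
```

Under this model, the answer to "how do I control which parse tree is built" is the order of the alternatives: listing rule21 before rule1 would presumably make the special case win, while the order in the original post yields rule1 followed by rule2.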

Athanasios Anastasiou

May 21, 2014, 8:15:51 AM
to antlr-di...@googlegroups.com
Hello everyone

Sam, thank you very much for your response. Error handling is indeed a major feature, especially for complex input. I am not suggesting that it be neglected. The specific point I am making is that ANTLR's meta-language is good not only for constructing parsers but also for documenting and specifying a language. From this point of view, it would make sense to keep all the definitions in one place. In other words, have a g4 file be a self-contained description of the characteristics of a language, without having to look for further details somewhere else, in some source file.

Perhaps the checking I am trying to do here is easy enough that it can be carried out in a short "action" snippet, in which case it could be added to the ANTLR file too.

In my case, I can't start exploring this yet because I am re-organising the existing yacc tokens into something that makes more sense in ANTLR in a rule-by-rule fashion... otherwise too many rules clash and I get errors that are complex to resolve.