Nested rule acting up


Marcin Wyszyński

May 13, 2016, 5:56:55 PM
to antlr-discussion
Hi folks,

I'm using a heavily modified Objective-C grammar to parse implementation files and I got stuck trying to work out nested generics/protocols. The rule chain goes something like this:

protocolReferenceList : '<' protocolList '>' ;

protocolList
: protocolName (',' protocolName)* ;

protocolName
: protocolReferenceList | IDENTIFIER ;



And the input is:

@implementation OCKDocument {
    OCKHTMLPDFWriter *_writer;
}

- (instancetype)initWithTitle:(NSString *)title elements:(NSArray<id<OCKDocumentElement> > *)elements {
    return self;
}

@end

The problem is that when there is a whitespace between two angle brackets everything parses nicely. But if I use `NSArray<id<OCKDocumentElement>>` instead I get a parsing error:

line 5:87 no viable alternative at input 'NSArray<id<OCKDocumentElement>>'
line 5:65 extraneous input '<' expecting {'auto', 'bycopy', 'byref', 'char', 'const', 'double', 'enum', 'extern', 'float', 'id', 'in', 'inout', 'instancetype', 'int', 'long', 'oneway', 'out', 'register', 'short', 'signed', 'static', 'struct', 'typedef', 'union', 'unsigned', 'void', 'volatile', 'NS_OPTIONS', 'NS_ENUM', '__weak', '__unsafe_unretained', '(', '{', ';', ':', '*', IDENTIFIER}
line 5:87 mismatched input '>>' expecting {',', '>'}
line 5:91 extraneous input ')' expecting {'auto', 'bycopy', 'byref', 'char', 'const', 'double', 'enum', 'extern', 'float', 'id', 'in', 'inout', 'instancetype', 'int', 'long', 'oneway', 'out', 'register', 'short', 'signed', 'static', 'struct', 'typedef', 'union', 'unsigned', 'void', 'volatile', 'NS_OPTIONS', 'NS_ENUM', '__weak', '__unsafe_unretained', '(', IDENTIFIER}

I'm relatively new to ANTLR, so could you kindly point me to what's going on here? I would have thought the whitespace or its absence should not matter here. Grammar attached.

Thank you in advance!
ObjC.g4

John B. Brodie

May 13, 2016, 6:38:11 PM
to antlr-di...@googlegroups.com

`>>` is defined as a binaryOperator


Marcin Wyszyński

May 14, 2016, 2:28:47 AM
to antlr-discussion
I'm afraid this isn't it. Removing `>>` from the list of binary operators changes nothing. I would have been surprised if it did, because the `binaryOperator` rule should not be matched there.

Eric Vergnaud

May 14, 2016, 8:17:06 AM
to antlr-discussion
Removing '>>' from the binary operators does not change anything because it is also defined by SHIFT_R.
In brief, you have a '>>' token that is recognized by the lexer, so you will never match '>>' as two separate '>' tokens.
Be aware that the lexer runs before the parser and is not context-sensitive. A token is a sequence of characters, and it is always recognized before the grammar rules get a chance to match it.
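
A minimal sketch of that clash may help. It is not the attached grammar: the grammar name `ClashSketch`, the token names `LT` and `GT`, and the simplified `IDENTIFIER` rule are assumptions made for illustration (in particular, `id` is treated as a plain identifier here). Only `SHIFT_R` comes from the thread.

```antlr
lexer grammar ClashSketch;   // hypothetical name, illustration only

LT         : '<' ;
GT         : '>' ;
SHIFT_R    : '>>' ;          // the longest match wins, so ">>" never becomes GT GT
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS         : [ \t\r\n]+ -> skip ;

// Input:  NSArray<id<OCKDocumentElement>>
// Tokens: IDENTIFIER LT IDENTIFIER LT IDENTIFIER SHIFT_R
//
// Input:  NSArray<id<OCKDocumentElement> >
// Tokens: IDENTIFIER LT IDENTIFIER LT IDENTIFIER GT GT
```

With the space, the lexer emits two `GT` tokens and a rule such as `protocolReferenceList` can close both lists. Without it, the parser receives a single `SHIFT_R` where it expects ',' or '>', which matches the "mismatched input '>>'" message in the original post.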

Eric 

Jim Idle

May 15, 2016, 10:04:02 PM
to antlr-di...@googlegroups.com
My advice is to do and take note of the following:
  • Remove all the 'x' literals from your parser and place them in your lexer as real tokens - they are very confusing for people starting out with ANTLR
  • Order the tokens so you can detect clashes
  • You cannot have two tokens that match the same sequence. ANTLR can and will warn you about this for simple things like character sequences, so I am surprised it is not already telling you this. So, create one token with a more abstract name such as LCHEVRON: '<' ;
  • The token names are not relevant for ANTLR; they are only for you, so it does not matter what you call them. If they are reused in different places, then use a generic name or the name that is likely to be most common. You cannot call the same sequence one name at one point and a different name at another.
  • Remember that the parser has NO influence at all on the lexer, it does not tell the lexer to return one token instead of another.
  • Your clash is between '<<' and '<', so unless you lexically detect when you need one or the other and so can use lexer modes, then you need to lose the '<<' token (remember you cannot have two of these and the parser cannot help the lexer in any way)
  • When you need to look for the << operator, code it in the parser as LCHEVRON LCHEVRON
  • If you do not wish to allow X < < Y (a space in the middle), then have a semantic check (or an action in the parser) that raises an error if the line numbers of the two tokens are not the same or the second is not immediately after the first.
  • Buy a copy of the book and read it - it will explain a lot of the above. If you cannot, then find one of the online tutorials (search the list archives) and read that - they are probably worth a read even if you have the book.
  • You might want to develop a parser for a simpler language than ObjC before tackling that one - get some experience.

That will solve your issue with << and probably a whole lot more that you are not yet aware of.
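
The token-related points above could be sketched roughly as follows. This is an untested illustration, not a patch for the attached ObjC.g4: the grammar name `ChevronSketch`, the rules `shiftLeft` and `shiftRight`, the labels `a` and `b`, and the `COMMA` token are hypothetical, `RCHEVRON` is simply the mirror image of `LCHEVRON`, and the predicates assume a target language with Java-style expressions.

```antlr
grammar ChevronSketch;       // hypothetical name, illustration only

// The rule chain from the original post, using named tokens.
protocolReferenceList : LCHEVRON protocolList RCHEVRON ;
protocolList          : protocolName (COMMA protocolName)* ;
protocolName          : protocolReferenceList | IDENTIFIER ;

// Shift operators are assembled in the parser from single chevrons.
// The predicate is a semantic check: it fails unless both tokens sit
// on the same line in adjacent columns, so "> >" is not accepted as a shift.
shiftRight : a=RCHEVRON b=RCHEVRON {$a.line == $b.line && $a.pos + 1 == $b.pos}? ;
shiftLeft  : a=LCHEVRON b=LCHEVRON {$a.line == $b.line && $a.pos + 1 == $b.pos}? ;

// One token per character sequence; no '<<' or '>>' tokens remain.
LCHEVRON   : '<' ;
RCHEVRON   : '>' ;
COMMA      : ',' ;
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS         : [ \t\r\n]+ -> skip ;
```

Because the lexer no longer knows about '>>', the nested case `NSArray<id<OCKDocumentElement>>` reaches the parser as two `RCHEVRON` tokens, and the expression rules would refer to `shiftRight` wherever they previously used the '>>' literal.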

Jim



John B. Brodie

May 15, 2016, 11:06:42 PM
to antlr-di...@googlegroups.com

Greetings!

Jim's comments are, of course, spot on, but the OP's issues were with '>>' and not '<<'.

So in Jim's remarks replace `<<` with `>>` and LCHEVRON with RCHEVRON and you should be good to go.

However, there are similar issues with `<<` that you have not encountered yet, so apply Jim's suggestions as written too.

Jim Idle

May 16, 2016, 3:56:51 AM
to antlr-di...@googlegroups.com
He will have the same issue with << though ;)