ANTLR4, "extraneous input" messages?

brende...@gmail.com

May 1, 2013, 11:04:43 PM
to antlr-di...@googlegroups.com
I've got a little grammar now to play with, so I wrote a quick test following the pattern of the expression parser from the ANTLR book.

It seems like it might be working (the output of the parse tree as text is a little hard to read). But I keep getting messages like "extraneous input" which don't make sense to me.

A bit of background before I dump the whole thing on you. I'm experimenting with a wiki-like grammar because I thought it would be fun and different enough from the examples in the book to allow me to make my own mistakes. The input is meant to look like wiki text with a bit of JSP thrown in. For example:

!{pagecommand}
!{othercommand}

Some paragraph with *markup*.

Another paragraph (they start with two newlines).

OK, so that's an example of the super simple input I'd like to parse. When I made some test vectors, however, I got these odd error messages written to standard error (not by me). For example, for the test vector 'a few words', which should be a single paragraph by itself, I got:

'a few words': (wikiPage (header a few words) body coda)
line 1:0 extraneous input 'a few words' expecting {<EOF>, '!{', SPCTAB, NL, WS}

The first line is printed by me: the input text, then the parse tree. The second line is printed by the ANTLR parser. It seems to be looking for a header, but the header should be optional. Also, while the test string appears in the parse tree, it appears at the position where a header should be, which doesn't make any sense to me. But maybe I'm reading it wrong.

Here are the first few lines of the grammar:

grammar WikiParser;

wikiPage : header body coda ;

header : (WS* bangCommand)* ;

bangCommand : '!{' LETTER+ '}' ;

body : (newline line)* ;

line : wikiText+ ;

coda : WS* ;

newline : (SPCTAB* NL)+ ;

To me, this says that all of the top-level components are optional. Mostly I do this by appending * to the parser rules, allowing them to match nothing. Again, perhaps I'm reading it wrong though.

Breaking it down, the header lines have optional white space (WS*) before the actual header line (bangCommand).

Then the body consists of lines preceded by newlines. The newlines aren't optional, but the whole body is, because the group is starred and may repeat zero times: (newline line)*

Finally the coda is there just to eat up any extra whitespace in the file.
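
The reading above (starred rules may match nothing) can be checked with a rough character-level analogy. The Java sketch below rewrites header and coda as regular expressions and tests them against the empty string. It uses java.util.regex rather than ANTLR, so it says nothing about tokenization; the class name EmptyMatchDemo and the accepts helper are made up for illustration.

```java
import java.util.regex.Pattern;

final class EmptyMatchDemo {
    // Regex stand-ins for grammar pieces (assumption: character-level analogues, not ANTLR rules).
    static final String WS = "[ \\t\\r\\n]";               // WS : [ \t\r\n] ;
    static final String BANG_COMMAND = "!\\{[a-zA-Z]+\\}"; // bangCommand : '!{' LETTER+ '}' ;

    // header : (WS* bangCommand)* ;
    static final Pattern HEADER = Pattern.compile("(?:" + WS + "*" + BANG_COMMAND + ")*");

    // header? : the same rule wrapped in an extra optional
    static final Pattern OPTIONAL_HEADER = Pattern.compile("(?:" + HEADER.pattern() + ")?");

    // coda : WS* ;
    static final Pattern CODA = Pattern.compile(WS + "*");

    // True when the rule matches the whole input, start to end.
    static boolean accepts(Pattern rule, String input) {
        return rule.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts(HEADER, ""));                                 // empty input
        System.out.println(accepts(HEADER, "!{pagecommand}\n!{othercommand}")); // two bang commands
        System.out.println(accepts(HEADER, "a few words"));                      // plain text is not a header
        System.out.println(accepts(OPTIONAL_HEADER, ""));                        // extra ? changes nothing here
        System.out.println(accepts(CODA, ""));                                   // empty coda
    }
}
```

In this analogy the starred header and coda both accept the empty string, and wrapping header in an additional ? accepts exactly the same inputs, so the "everything is optional" reading holds at the rule level.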

OK, that's the simple explanation of what I've got. Here are the gory details.

Output I get (the error messages are kind of jumbled up with the regular output text):
<stdout>
-------------------------------------------
run:
'': (wikiPage header body coda)
line 1:0 extraneous input ' ' expecting {<EOF>, '!{', SPCTAB, NL, WS}
' ': (wikiPage (header ) body coda)
' ': (wikiPage (header \t) body coda)
line 1:0 extraneous input '\t' expecting {<EOF>, '!{', SPCTAB, NL, WS}
'!{bang}': (wikiPage (header !{bang}) body coda)
line 1:0 extraneous input '!{bang}' expecting {<EOF>, '!{', SPCTAB, NL, WS}
'a few words': (wikiPage (header a few words) body coda)
line 1:0 extraneous input 'a few words' expecting {<EOF>, '!{', SPCTAB, NL, WS}
'Sentence one.

Sentence two.': (wikiPage (header Sentence one.\n\nSentence two.) body coda)
line 1:0 extraneous input 'Sentence one.\n\nSentence two.' expecting {<EOF>, '!{', SPCTAB, NL, WS}
BUILD SUCCESSFUL (total time: 0 seconds)
--------------------------------------------
</stdout>

Full grammar:
<grammar>
------------------------------------------------
grammar WikiParser;

wikiPage : header body coda ;

header : (WS* bangCommand)* ;

bangCommand : '!{' LETTER+ '}' ;

body : (newline line)* ;

line : wikiText+ ;

coda : WS* ;

newline : (SPCTAB* NL)+ ;

wikiText : escape | link | markup | plain ;

escape : '\\' ANY ;

link : '<' LINK_BODY? '>' ;

markup : MARKUP_TEXT MARKUP_TEXT+ ;

plain : LETTER | DIGIT | SPC ;

LINK_BODY : ~[<">]+ ;

LETTER : [a-zA-Z] ;

SPC : [ ] ;

SPCTAB : [ \t] ;

DIGIT : [0-9] ;

// all punctuation except < > and \
MARKUP_TEXT : [`~!@#$%^&*()_-+={[\]{}|:";',./] ;

NL : '\r'? '\n' ;

// any ASCII, other text may be included later
ANY : [\u0021-\u007E] ;

WS : [ \t\r\n] ;
-----------------------------------------------
</grammar>
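
A note on the error messages above: each one reports the entire test string as a single piece of extraneous input, which suggests the lexer is handing the parser one big token. ANTLR lexers pick the rule that matches the longest stretch of input and break ties in favour of the rule listed first, and LINK_BODY (~[<">]+) can match almost anything. The Java sketch below imitates that behaviour with java.util.regex for a subset of the lexer rules. It is only an approximation, not the generated WikiParserLexer; the class name LongestMatchDemo and its tokenize helper are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatchDemo {
    // Regex stand-ins for a subset of the lexer rules, kept in grammar order.
    // The implicit '!{' literal is listed first because ANTLR places literals
    // taken from parser rules ahead of the explicit lexer rules.
    static Map<String, Pattern> rules() {
        Map<String, Pattern> r = new LinkedHashMap<>();
        r.put("'!{'", Pattern.compile("!\\{"));
        r.put("LINK_BODY", Pattern.compile("[^<\">]+")); // ~[<">]+
        r.put("LETTER", Pattern.compile("[a-zA-Z]"));
        r.put("SPC", Pattern.compile("[ ]"));
        r.put("SPCTAB", Pattern.compile("[ \\t]"));
        r.put("NL", Pattern.compile("\\r?\\n"));
        r.put("WS", Pattern.compile("[ \\t\\r\\n]"));
        return r;
    }

    // Splits the input into tokens: the longest match wins, and on a tie
    // the rule that appears earlier in the map is kept.
    static List<String> tokenize(String input) {
        Map<String, Pattern> rules = rules();
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                // No rule in this subset matches; report one character and move on.
                tokens.add("ERROR(" + input.charAt(pos) + ")");
                pos++;
            } else {
                tokens.add(bestName + "(" + input.substring(pos, bestEnd) + ")");
                pos = bestEnd;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a few words"));
        System.out.println(tokenize("!{bang}"));
        System.out.println(tokenize(" "));
    }
}
```

Under these assumptions every test vector comes out as a single LINK_BODY token: 'a few words' and '!{bang}' because LINK_BODY is the longest match, and ' ' because LINK_BODY ties with SPC, SPCTAB and WS but is listed first. That would fit the reported "expecting" set, which contains '!{', SPCTAB, NL and WS but not LINK_BODY.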

And the full test harness:

<java>
-----------------------------------------------
package com.techdarwinia.wiki;

import com.techdarwinia.wiki.parser.WikiParserLexer;
import com.techdarwinia.wiki.parser.WikiParserParser;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class Parser
{
    private String[] testVectors = {
        "", // no input
        " ", // just whitespace
        "\t",
        "!{bang}", // one bang command
        "a few words", // one paragraph
        "Sentence one.\n\nSentence two.", // two paragraphs
    };

    public static void main( String[] args )
    {
        // Test Parser...
        new Parser().runTests();
    }

    private void runTests()
    {
        for( String test : testVectors ) {
            simpleTest( test );
        }
    }

    private void simpleTest( String test )
    {
        ANTLRInputStream ains = new ANTLRInputStream( test );
        WikiParserLexer wpl = new WikiParserLexer( ains );
        CommonTokenStream tokens = new CommonTokenStream( wpl );
        WikiParserParser wikiParser = new WikiParserParser( tokens );
        ParseTree parseTree = wikiParser.wikiPage();
        System.out.println( "'" + test + "': " +
            parseTree.toStringTree( wikiParser ) );
    }
}
-----------------------------------------
</java>

Norman Dunbar

May 2, 2013, 4:28:57 AM
to antlr-di...@googlegroups.com
Morning,

wikiPage : header body coda ;

At the moment, you are saying "a wikipage is a mandatory header then a
mandatory body then a mandatory coda".

You need to make the header optional:

wikiPage : header? body coda ;

This says that the header is optional, and need not be present, but the
body and coda must be.

Then your header rule need only specify what is to be recognised as a
header, when one is present. The optionality of a header is not part of
recognising the header itself, but it is part of recognising a wikipage,
where the header is optional. (I think!)



HTH

Cheers,
Norm.

brende...@gmail.com

May 2, 2013, 12:18:36 PM
to antlr-di...@googlegroups.com, nor...@dunbar-it.co.uk
Thanks for the reply.

I hope not, because that would make the grammar very limited in how it can be used. I'll try it out though and see if it makes a difference.

P.S. Anyone else find Google Groups to be pretty poor? It always double-spaces all quoted text. Really annoying.
