Reporting syntax errors with Neotoma

6 views
Skip to first unread message

Tony Arcieri

unread,
Jan 20, 2011, 12:54:06 AM1/20/11
to Sean Cribbs, graeme defty, Reia Mailing List, Neotoma
Hi Sean, just wanted to touch base with you on what I think is one of the biggest problems with building a large grammar (like Reia's) in neotoma:

At present, if a parse error occurs, it isn't localized to the "nearest" nonterminal. That is to say, syntax errors within a complex construct consisting of many nested nonterminals report the error occurring on a line/column which is relatively distant from where the actual syntax error occurred.

As I am still unfortunately uneducated as to the inner workings of PEGs, I don't know why this should be. I have no idea if it's a problem with Neotoma itself, or the Reia PEG grammar. However, I'm very much interested in moving off of the leex/yecc grammar that Reia is currently using for a number of reasons. If we can get this worked out, and incorporate some of the changes from the "memory" branch of Neotoma, I think it may be practical to switch Reia from a LALR grammar to a Neotoma-based PEG.

--
Tony Arcieri
Medioh! Kudelski

Graeme Defty

unread,
Jan 20, 2011, 7:48:41 PM1/20/11
to Reia Mailing List, Neotoma
Sean,

I am sure you are aware of the kind of situation referred to here, but for those wondering what the fuss is about, the following is a small example of the situation which arose in the PEG for Reia:

Consider a language which allows for a series of module declarations, each of which contains a number of function declarations. Function declarations just have a character string as a body.

As an example:

mod one

   func one_a

     xxx

   endf

   func one_b

    yyy

   endf

endm

mod two

  func two_a

    zzz

  endf

endm


A PEG to define this could be as follows:

prog <- space mod+ !. ~;

mod <- 'mod' space val func+ 'endm' space ~;

func <- 'func' space val val 'endf' space ~;

val <- !'endf' [_a-z]+ space ~;

space <- (' ' / '\t' / '\n')+ ~;



And all in the garden is rosy, until we get an invalid program such as the following:

mod one

  func one_a

    xxx

  endf

  bunc one_b

    yyy

  endf

endm


Now the parser will go through the following steps:


try to recognise a program by recognising 0-n modules

try to recognise a module by :

recognising 'mod' – success

try to recognise 0-n funcs

recognise a func by:

recognise 'func' – success

recognise val – success

recognise 'endf' – success

func recognised – success

recognise a func by:

recognise 'func' – fail ('bunc' found)

func not recognised – fail

recognise 0-n funcs – success (one found)

recognise 'endm' – failure! ('bunc' found)

Recognise module – failure


The test looks as follows:

>  test:parse("mod one func one_a xxx endf bunc one_b yyy endf endm  ").                               
{fail,{expected,{at_least_one,{string,"endm"}},
                {{line,1},{column,1}}}}

So the parser returns with a failure in recognising a module and although the error message highlights what was being sought ('endm'), the error is pointing at the 'mod' keyword, whereas the error is clearly a badly-specified func and the error pointer should be on the mis-spelled 'bunc' keyword.

Well, there may be a way to code the PEG to make the messages better. I would not like to claim that I have thought long and hard about it, partly because this seems to me a bit like putting the cart before the horse. Getting a complex PEG right is already hard enough! And I guess this is a large part of my motivation. For a programmer trying to use a compiler, vague or imprecise error messages are an irritation, but for the compiler constructor it is even worse trying to re-engineer what has gone wrong to trace back to an error in the PEG.

Perhaps what is needed is something akin to the 'soft cut' of some functional languages which says “no backtracking beyond this point”.

Another idea that occurred to me was the ability to bracket an alternation sequence in another character ( '{' and '}'?) to say that errors in the alternate list should not make it out of the alternate list.

Then again, perhaps all that is needed is for the deepest error to pass its location back up the stack and have the error report contain the location of the error and the highest level that contains it.

I have read a few paper of late on PEGs, but the focus is invariably on improving the time or memory requirements. I am not seeing anyone trying to address the behaviour of a PEG in terms of error reporting. Maybe it is hard ;-)

Regards,

g

____________________________________________________________
Reply all
Reply to author
Forward
0 new messages