Consider a language which allows for a series of module declarations, each of which contains a number of function declarations. Function declarations just have a character string as a body.
As an example:
mod one
func one_a
xxx
endf
func one_b
yyy
endf
endm
mod two
func two_a
zzz
endf
endm
A PEG to define this could be as follows:
prog <- space mod+ !. ~;
mod <- 'mod' space val func+ 'endm' space ~;
func <- 'func' space val val 'endf' space ~;
val <- !'endf' [_a-z]+ space ~;
space <- (' ' / '\t' / '\n')+ ~;
And all in the garden is rosy, until we get an invalid program such as the following:
mod one
func one_a
xxx
endf
bunc one_b
yyy
endf
endm
Now the parser will go through the following steps:
try to recognise a program by recognising 0-n modules
try to recognise a module by :
recognising 'mod' – success
try to recognise 0-n funcs
recognise a func by:
recognise 'func' – success
recognise val – success
recognise 'endf' – success
func recognised – success
recognise a func by:
recognise
'func' – fail ('bunc' found)
func not recognised – fail
recognise 0-n funcs – success (one found)
recognise
'endm' – failure! ('bunc' found)
Recognise module – failure
The test looks as follows:
> test:parse("mod one func one_a xxx endf bunc one_b yyy endf endm ").
{fail,{expected,{at_least_one,{string,"endm"}},
{{line,1},{column,1}}}}
So the parser returns with a failure in recognising a module and although the error message highlights what was being sought ('endm'), the error is pointing at the 'mod' keyword, whereas the error is clearly a badly-specified func and the error pointer should be on the mis-spelled 'bunc' keyword.
Well, there may be a way to code the PEG to make the messages better. I would not like to claim that I have thought long and hard about it, partly because this seems to me a bit like putting the cart before the horse. Getting a complex PEG right is already hard enough! And I guess this is a large part of my motivation. For a programmer trying to use a compiler, vague or imprecise error messages are an irritation, but for the compiler constructor it is even worse trying to re-engineer what has gone wrong to trace back to an error in the PEG.
Perhaps what is needed is something akin to the 'soft cut' of some functional languages which says “no backtracking beyond this point”.
Another idea that occurred to me was the ability to bracket an alternation sequence in another character ( '{' and '}'?) to say that errors in the alternate list should not make it out of the alternate list.
Then again, perhaps all that is needed is for the deepest error to pass its location back up the stack and have the error report contain the location of the error and the highest level that contains it.
I have read a few paper of late on PEGs, but the focus is invariably on improving the time or memory requirements. I am not seeing anyone trying to address the behaviour of a PEG in terms of error reporting. Maybe it is hard ;-)
Regards,
g
{fail,{expected,{string, "endm"},{{line,1},{column,30}}},{{line,1},{column,1}}}
At that point maybe something more like a backtrace structure would be better?
However, this still leaves unresolved a few issues:
1) In the case of ordered choice, do you return:
- the first failure? (longest match, usually)
- the last failure?
- all failures?
2) At what point do the errors become too verbose?
3) How do you account for "expected" or "normal" failures (causes of
backtracking)?
Sean
So, the answer to me seems to be (and one I was avoiding because it
can come out ugly and hard to munge) is to wrap the errors all the way
up, without popping out the line/column number. Then you'd have
something like:
{fail,{expected,{string, "endm"},{{line,1},{column,30}}},{{line,1},{column,1}}}
At that point maybe something more like a backtrace structure would be better?
However, this still leaves unresolved a few issues:
1) In the case of ordered choice, do you return:
- the first failure? (longest match, usually)
- the last failure?
- all failures?
2) At what point do the errors become too verbose?
3) How do you account for "expected" or "normal" failures (causes of
backtracking)?