Automatic error message generation


Tony Arcieri

Jul 21, 2010, 7:02:48 PM
to neoto...@googlegroups.com
In this message on erlang-questions:


...it's suggested that it's possible to automatically trace parse errors to a failed character lexing rule.

Could the suggested approach be applied to neotoma, allowing LALR-like automatic generation of parse error messages?  This has been the one sticking point which has prevented me from further exploring a Neotoma-based parser for Reia.

--
Tony Arcieri
Medioh! A Kudelski Brand

Sean Cribbs

Jul 22, 2010, 5:31:28 PM
to neoto...@googlegroups.com
Tony,

That's an interesting idea and I'd love to explore it. Neotoma
currently returns the first error (rather than the last) and doesn't
wrap the error on the way out, only returning what it expected at the
leaf of the tree. (For what it's worth, not many compilers give more
help than this.)

Right now I'm stuck on the internal conversion to binaries (instead of
lists) and am looking into using QuickCheck to help with that. It may
require some rewriting. There are also some changes I'd like to make to
memoization so that the memos are not stored out-of-process in ets
(which can be a significant performance hit for large inputs).

Sean

Tony Arcieri

Jul 22, 2010, 5:38:07 PM
to neoto...@googlegroups.com
On Thu, Jul 22, 2010 at 3:31 PM, Sean Cribbs <seanc...@gmail.com> wrote:
That's an interesting idea and I'd love to explore it. Neotoma
currently returns the first error (rather than the last) and doesn't
wrap the error on the way out, only returning what it expected at the
leaf of the tree. (For what it's worth, not many compilers give more
help than this.)

Returning what was expected is very useful; however, an LALR parser will give me the unexpected, erroneous token.

Ideally it'd be great to get all the information needed to generate a message like:

"Unexpected [ at line: 42, col 8, expecting: ("
 
but I'd be perfectly happy with just:

"Unexpected [ at line: 42"

which is what I presently get out of yecc.
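
For illustration, both message formats above could be built from a failure offset roughly as follows. This is a minimal Python sketch, not neotoma or yecc code: `line_and_column` and `format_parse_error` are hypothetical names, and it assumes the parser can report the 0-based character offset at which it gave up.

```python
def line_and_column(text, offset):
    """Return the 1-based (line, column) for a 0-based character offset."""
    line = text.count("\n", 0, offset) + 1
    last_newline = text.rfind("\n", 0, offset)  # -1 while still on line 1
    column = offset - last_newline
    return line, column


def format_parse_error(text, offset, expected=None):
    """Build an 'Unexpected X at line: N' message (hypothetical helper).

    `expected` is an optional list of descriptions of what the parser
    wanted at `offset`; when given, it is appended to the message.
    """
    line, column = line_and_column(text, offset)
    found = text[offset] if offset < len(text) else "end of input"
    message = f"Unexpected {found} at line: {line}, col {column}"
    if expected:
        message += ", expecting: " + " or ".join(expected)
    return message
```

Calling `format_parse_error(source, offset, ["("])` yields the longer form; leaving out `expected` yields the shorter one, with the column still included.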

Sean Cribbs

Jul 24, 2010, 2:51:52 PM
to neoto...@googlegroups.com
Oh, I see what you mean. That should be easy to provide, perhaps with
a flag in the grammar that says how many bytes of input to return with
the error?

Sean

Tony Arcieri

Jul 24, 2010, 3:09:06 PM
to neoto...@googlegroups.com
On Sat, Jul 24, 2010 at 12:51 PM, Sean Cribbs <seanc...@gmail.com> wrote:
Oh, I see what you mean.  That should be easy to provide, perhaps with
a flag in the grammar that says how many bytes of input to return with
the error?

Yeah, something like that, although I'm sure you could pick a sane default :)
 

Tony Arcieri

Jul 24, 2010, 5:03:41 PM
to neoto...@googlegroups.com
On Sat, Jul 24, 2010 at 12:51 PM, Sean Cribbs <seanc...@gmail.com> wrote:
Oh, I see what you mean.  That should be easy to provide, perhaps with
a flag in the grammar that says how many bytes of input to return with
the error?

And you know, this is a lot more useful than what yecc gives you.  Yecc gives you just the failed token, whereas it'd be a lot more useful to have the failed character sequence, including the trailing context.  Having the column number gives you the same information, but having trailing context lets you deduce the parse error at a glance.
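
As a rough picture of the trailing-context idea, here is a Python sketch under stated assumptions: `trailing_context` and its `max_chars` parameter are hypothetical (standing in for the grammar flag discussed above), it counts characters rather than bytes, and it stops at the end of the current line so the snippet stays readable.

```python
def trailing_context(text, offset, max_chars=20):
    """Return up to max_chars of input starting at the failure offset.

    The snippet never runs past the end of the line containing `offset`.
    The default of 20 is an arbitrary placeholder, not a neotoma setting.
    """
    end_of_line = text.find("\n", offset)
    if end_of_line == -1:
        end_of_line = len(text)
    return text[offset:min(offset + max_chars, end_of_line)]
```

The returned snippet starts with the failing character, so it can be appended directly to an "Unexpected ..." message.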
 

Tony Arcieri

Sep 15, 2010, 12:33:24 AM
to neoto...@googlegroups.com
So, to dredge this thread up again...

I guess the thing that would be the most useful is the actual set of characters that failed to parse (i.e. the largest substring of characters which almost formed something valid but failed due to the next character), probably with the next character (likely the failing one) included for context, rather than just firing off some arbitrarily sized glob of trailing context.

Would that make sense to display back to the person authoring programs in the described language, namely those people who have made grammatical mistakes? People are used to seeing parse errors in the form of failing tokens (that is, if their program tokenized to begin with). I'm wondering if you could extract something close to that when a parse error occurs.

Sean Cribbs

Sep 16, 2010, 9:12:51 AM
to neoto...@googlegroups.com
So, I think we might be able to do that, or at least something close
to it. We could modify p_seq so that it returns all of the successful
subtrees trailed by the unsuccessful one, and p_choose so that it
determines the longest parse tree that failed (generally that's going
to be the first one, one reason why I have it return the first
failure). The result of the failure could be massaged a little bit
before returning, too.

Sean
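
The change described above can be sketched in a few lines. This is a Python stand-in for illustration only, not neotoma's Erlang implementation: the `Ok` and `Fail` records, the `p_string` leaf parser, and the parser signature `(text, pos)` are all assumptions. Only the names `p_seq` and `p_choose` come from the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Ok:
    value: object
    pos: int       # offset just past the matched input


@dataclass
class Fail:
    pos: int       # offset where matching stopped
    expected: str  # what the leaf parser wanted at that offset
    matched: list = field(default_factory=list)  # subtrees that succeeded first


def p_string(literal):
    """Leaf parser: match an exact string at the current offset."""
    def parse(text, pos):
        if text.startswith(literal, pos):
            return Ok(literal, pos + len(literal))
        return Fail(pos, repr(literal))
    return parse


def p_seq(parsers):
    """Match parsers in order; on failure, keep the subtrees matched so far."""
    def parse(text, pos):
        values = []
        for parser in parsers:
            result = parser(text, pos)
            if isinstance(result, Fail):
                # Successful subtrees first, then whatever the inner
                # failure had already collected (flattened).
                return Fail(result.pos, result.expected, values + result.matched)
            values.append(result.value)
            pos = result.pos
        return Ok(values, pos)
    return parse


def p_choose(parsers):
    """Try alternatives in order; if all fail, report the farthest failure."""
    def parse(text, pos):
        best = None
        for parser in parsers:
            result = parser(text, pos)
            if isinstance(result, Ok):
                return result
            # Strict comparison: on a tie the earlier alternative wins.
            if best is None or result.pos > best.pos:
                best = result
        return best
    return parse
```

With a failure in hand, `text[fail.pos]` is the unexpected character, `fail.expected` is what the leaf wanted, and `fail.matched` holds the input that almost formed something valid.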
