I need a little help with error detection/recovery/repair in either or
both of LEX and YACC (a little snippet of code, even just a few
productions, would be tremendously appreciated! :).
The documentation I have on YACC mentions (very briefly) the use of the
"error" keyword in productions, but doesn't give a good example of how
this can be used to provide good error messages (i.e. what line, what
token, what was expected, etc.) and how the error could be corrected (or at
least recovered from to allow the parse to continue).
I'd appreciate any help I could get.
Thanx for your time,
- Darryl
PS. Is it best to keep all (most) symbol table interactions
(i.e. lookup, insertions) in the scanner? I'm a little new
to LEX/YACC in case you haven't noticed :)
--
Darryl Friesen | Client Services
Darryl....@usask.ca | Dept of Computing Services
frie...@sask.usask.ca | University of Saskatchewan
[I certainly do the basic symtab lookups in the scanner, though only in
the parser or later can I assign attributes to the symbols. Re error
handling, it's a pain in yacc, not least because many older yaccs have
bugs in the error rule code. See the Levine (me), et al., lex&yacc and
Schreiner and Friedman, both of which discuss error recovery. The FAQ has
the complete references. -John]
If you're really interested in error recovery, forget about using yacc. I
would suggest taking a look at the GMD tools, also known as Cocktail. The
toolbox is a suite of utilities for constructing compilers, translators,
etc. It has an LALR(1) parser generator called, coincidentally, 'lalr'.
The tool 'lalr' features automatic error recovery: it can insert and
delete tokens from the input stream so that parsing can continue.
Additionally, it is claimed that the generated parsers run about twice as
fast as yacc-generated ones. They also have a scanner generator called
'rex' that is supposed to be much faster than 'lex'. Check it out. You
can get it from:
gatekeeper.dec.com:/pub/plan/gmd
--Mike
------------------------
Michael Collison
Open Software Foundation
11 Cambridge Center
Cambridge, MA 02142
email: coll...@osf.org
* classic literature on lex&yacc
* a very helpful visualization of yacc's workings
* scanners: lex vs hand-written
* some notes on error recovery
Some suggestions if you are just starting with yacc (& lex):
* Read the existing documentation carefully (and parts of it more than
once). The classical papers by their authors come with every unix
documentation set. They are not part of the online man pages, but
can be found as "supplementary documents" (sometimes as copies of the
original (e.g. Ultrix), sometimes somewhat revamped (e.g. SunOS)).
You should read them not only because they are the "official
references", but also because they are nice tutorials with many
practical hints. (Many of them become apparent only after some
practice.)
%r 39
%K CSTR
%R Comp. Sci. Tech. Rep. No. 39
%I Bell Laboratories
%C Murray Hill, New Jersey
%A M. E. Lesk
%T Lex \(em A Lexical Analyzer Generator
%D October 1975
%r 32
%K CSTR
%R Comp. Sci. Tech. Rep. No. 32
%I Bell Laboratories
%C Murray Hill, New Jersey
%A S. C. Johnson
%T Yacc \(em Yet Another Compiler-Compiler
%D July 1975
J. R. Levine et al., "lex & yacc" (O'Reilly & Associates) is a marvellous
book, not only for beginners. Schreiner/Friedman is very comparable;
I prefer Levine et al. by a slight margin.
The following item is, again, quite old but nevertheless the most
readable introduction to LR parsing and YACC theory I ever came
across. If you want to know how "default reductions" affect you,
it is worth reading:
%T LR Parsing
%A A. V. Aho
%A S. C. Johnson
%J Comp. Surveys
%V 6
%N 2
%P 99-124
%D June 1974
* Get Bob Corbett's "byacc" and the small but effective modifications
by J. Roskind. The latter will give you a tree-ish trace which was,
at least for me, the key to understanding the initially intricate
mechanism of error recovery ("Ah, *now* yyerror() gets called, but
we haven't started resyncing yet...and *now* we are synced
again...and *here* comes the {action}..."). If you have trouble
locating Roskind's "skeleton.c" replacement, I can mail it to you.
[It's in the compilers archive as part of Roskind's C++ grammar -John]
* Unlike the case for yacc, it is often somewhat hard to make a
decision for or against the use of lex. Scanners are simple and
quick to formulate with lex. However, it is not uncommon that the
generated scanner can't be taken as is, that fiddling around with
the macros won't get you what you want, and that you end up
post-processing the generated source code: a somewhat complicated
setup involving sed scripts and Makefiles which I wouldn't inflict
on absolute beginners. If the tokens are simple, a hand-written
scanner will be just a 20-line C routine (see the sketch below) and
easier to maintain, debug, and adapt to the environment. (It
certainly pays off to know what a "%*c%[a-z]" format means to
sscanf!)
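  To make that "20-line C routine" concrete, here is a minimal sketch
  of such a hand-written scanner. The token codes NUMBER and IDENT and
  the fixed-size yytext buffer are illustrative choices of mine, and
  the sketch omits buffer-overflow checks:

    #include <stdio.h>
    #include <ctype.h>

    enum { NUMBER = 258, IDENT };  /* token codes, as yacc would assign */
    char yytext[64];               /* text of the current token, lex-style */

    int yylex(void)
    {
        int c, i = 0;

        while ((c = getchar()) == ' ' || c == '\t' || c == '\n')
            ;                               /* skip white space */
        if (c == EOF)
            return 0;                       /* yacc's end marker */
        if (isdigit(c)) {                   /* a number: [0-9]+ */
            do { yytext[i++] = c; } while (isdigit(c = getchar()));
            ungetc(c, stdin);
            yytext[i] = '\0';
            return NUMBER;
        }
        if (isalpha(c)) {                   /* an identifier: [a-z][a-z0-9]* */
            do { yytext[i++] = c; } while (isalnum(c = getchar()));
            ungetc(c, stdin);
            yytext[i] = '\0';
            return IDENT;
        }
        yytext[0] = c; yytext[1] = '\0';    /* single-character token */
        return c;
    }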
* "eyacc" was a yacc variant developed at Berkeley for the
implementation of their pc/pi/pix Pascal compiler. On most systems,
the man page just said: "eyacc is a yacc variant providing better
error recovery" and left you alone. It has nowadays been dropped
from most unix distributions. A better man page would have told
you:
"Eyacc is an old version of yacc(1). Eyacc fully enumerates
test actions in its parser when an error token is in the
look-ahead set. This prevents the parser from making
undesirable reductions when an error occurs before the error
is detected. The table format is different in eyacc than it
was in the old yacc, as minor changes had been made for
efficiency reasons."
before leaving you alone, too. Now, you can read the whole story
in three sources:
``Practical LR Error Recovery'' by Susan L. Graham, Charles
B. Haley and W. N. Joy; SIGPLAN Conference on Compiler
Construction, August 1979.
``Practical Syntactic Error Recovery'' by Susan L. Graham and
Steven P. Rhodes; Communications of the ACM, Vol.18 No.11,
November 1975
and of course the eyacc source itself, if you are lucky enough to have
a BSD source license. I take the liberty of revealing an excerpt from
the READ_ME:
"[note also that] these changes are really useful only with a
fairly large set of error recovery routines which are part of
both "pi" and "pxp". The routines are language independent,
but the method will only work on languages which have some
redundancy in them... it is probably ill suited for C, but
would work fine in ALGOL-60, ALGOL-W, EUCLID, LIS, SP/K, PL/1,
ALPHARD, CLU, ..."
So, how does all this relate to you? To put it in a nutshell: the
default reductions in yacc prevent many useful diagnostics like
"what production is currently active?" or "which tokens are legal
here?". Many programmers have unsuccessfully tried to reconstruct
such information from yacc-generated parsers. I found some code for
computing valid lookahead symbols here and there, but it was never
correct. The eyacc history shows that it requires quite some work
and expertise to do a truly proper job.
This isn't to say that you can't generate parsers with yacc which can deal
with errors. Just indicating the current line number and yytext when
barfing "syntax error" already goes a long way. The "error" mechanism is
not as complicated as it seems. In my experience, beginners are often
over-zealous with productions like
assignment : lhs '=' rhs ';'
| lhs '=' error { puts ("rhs kaput"); } ';'
where yacc can provide much better resyncing if you restrict yourself to a
catch-all "statement: ... | assignment | error". Make yourself clear what
happens (with the Corbett/Roskind combi mentioned above) before sinking
time into guessing and sprinkling productions which might never get
reduced anyway. Look into yacc grammars written by old-hands. You'll
find very good examples not only on effective error handling but also on
the nature of actions. In my early yacc days, I was too keen on
delegating action stuff to external procedures, nowadays I just go ahead
and plug the tree structures together inline:
idlist : idlist ',' id { FOO *l;
                         /* walk to the end of the old list... */
                         for (l = $1; l->next; l = l->next)
                             ;
                         l->next = $3; $$ = $1;  /* ...append the new id */
                         /* wham bam thank you mam */ }
The hardest thing in error recovery is often preventing partly undefined
data structures and memory leaks from data structures which have already
been allocated and built at lower levels, but never got connected to the
main parse tree.
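To make the "line number and yytext" advice above concrete, here is a
minimal sketch of a reporting yyerror(). It assumes the scanner maintains
yylineno (flex can be told to, and a hand-written scanner does it
trivially); note that older yaccs declare yyerror() without the const:

    #include <stdio.h>

    extern int  yylineno;   /* line counter kept by the scanner */
    extern char *yytext;    /* offending token (flex-style; old lex
                               declares yytext as a char array)   */

    void yyerror(const char *msg)
    {
        fprintf(stderr, "line %d: %s near \"%s\"\n",
                yylineno, msg, yytext);
    }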
Martin Neitzel
(PS: Digging through the eyacc sources reminded me of one of the few
comments in Kernighan's "pic" yacc source. It just said:
/* WITCHCRAFT */ and, lo and behold, it was! :-)
>I need a little help with error detection/recovery/repair in either or
>both of LEX and YACC (a little snippet of code, even just a few
>productions, would be tremendously appreciated! :).
Here is a yacc production that uses the special error symbol to trap
parsing errors in a semicolon-terminated statement (as in C):
Statement : Stmt semicolon_tok
| error semicolon_tok
;
Whenever yacc encounters a parsing error, it effectively backs up the
stack until it can match the input to an error production. When it matches
such input, a call to yyerror() is made. You can supply your own
yyerror() routine, which prints whatever error messages you like. When
yyerror() returns, parsing continues just as if Statement had been
produced.
You need to select your error productions strategically, so that if a
syntax error is encountered, (1) the nonterminal that gets produced does
not affect other symbols in the production (in my example, producing an
error Statement consumes all input up to and including a semicolon, which
should mean that the next statement is ready to be parsed), and (2) which
nonterminal to produce is not ambiguous.
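Another common place for such a strategically chosen error production is a
block or compound statement; the names below are illustrative, but the
pattern is the standard one:

    CompoundStmt : '{' StmtList '}'
                 | '{' error '}'
                 ;

Here an error anywhere inside the block consumes input up to the closing
brace, so the text following the block can again be parsed normally.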
> PS. Is it best to keep all (most) symbol table interactions
> (i.e. lookup, insertions) in the scanner? I'm a little new
> to LEX/YACC in case you haven't noticed :)
This aspect of the design is really up to you. You should consider what
kind of information the symbol table will make use of, then figure out
where the bulk of that information fits into your program -- in the
scanner or in the parser.
I hope this helps. I didn't see any other posted responses to the
question, so I hope this is not too redundant, even though the question
was posted 4 days ago. :)
--
Byron C. Darrah | bus. mail: attn: Byron C. Darrah, Mail Station: B223
PHONE: (714) 732-9381 | Hughes Aircraft Company, GSD.
FAX: (714) 732-1953 | 1901 W. Malvern Ave.
bd...@atr-2s.hac.com | Fullerton, California, USA
| 92634
BD> Whenever yacc encounters a parsing error, it effectively backs
BD> up the stack until it can match the input to an error
BD> production. When it matches such input, a call to yyerror()
BD> is made.
The timing is incorrect. When yacc recognizes an error it calls yyerror()
immediately (and then only if it isn't already in recovery mode (three
token shift rule)). *Then* it starts popping states hoping to uncover an
"error" state. Next, the reduction of a production involving "error"
_may_ require the shift of further expected input as specified, which in
turn means that the parser reads and ignores all other tokens. Your
example, by the way, does exactly that:
BD> Statement : Stmt semicolon_tok
BD> | error semicolon_tok
After the error has been reported and the "Statement: error semicolon_tok"
production has been uncovered, the parser will stick in this state and
ignore all input until you finally feed it a semicolon. Nothing else will
get you back on track. (Byron already pointed this out correctly.)
If you insist on doing it this way, you should clear the error state with
a yyerrok action after the semicolon:
Statement: ... | error ';' {yyerrok;}
More sync tokens than just ';' are presumably in order, say a '}'
or 'endif'. Before computing lookahead sets yourself, it's often
sufficient to let yacc do it by itself. That resolves to a simple
Statement: ... | error
That will rather aggressively try to make sense out of the tokens at the
error point. (Namely with respect to the context a Statement can appear
in.) So your attention shifts to preventing cascading errors. The three
token tolerance rule may appear very dumb, but it does its job remarkably
well. If you're not satisfied with this approach, a hand-written action
that yylex()es up to a suitable continuation token might help you.
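A sketch of what such statement-level resynchronization might look like;
the token names and the alternatives besides "error" are hypothetical:

    statement : assignment
              | if_stmt
              | error ';'   { yyerrok; }  /* resync at end of statement */
              | error '}'   { yyerrok; }  /* ...or at a closing brace   */
              ;

Whether the '}' should be consumed here or better left to an enclosing
block production depends on your grammar; check the behaviour with the
trace before settling on it.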
No matter how you do it, the truly big problem is how not to run
semantically amok. For example, you might miss the end of a block and
continue with the wrong scope still active. Sometimes you can exploit
heuristic rules of thumb in the lexer to (in)validate an assumption; the
sketch below shows the idea. A "#define" in a C source is likely to be
found at the external level, a brace at the left margin more likely
belongs to the function body than to an inner block, and a nested comment
start looks spurious.
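One way to feed such a heuristic into the recovery code is lex's '^'
anchor, which matches only at the beginning of a line. The DEFINE token
and the two note_*() hooks below are hypothetical names for whatever
your recovery code provides:

    %%
    ^"#define"  { note_suspected_outer_level(); return DEFINE; }
    ^"}"        { note_suspected_function_end(); return '}'; }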
Martin Neitzel
Is there a good way to handle error recovery in lex/yacc for a LISP-like
language of the form (keyword (keyword token) ... ), where recovery is
triggered at the balancing parenthesis?
If the lexical analysis returns KEYWD for "(keyword", would you need
to add an error token on each production? E.g.
name: KEYWD1 ... ')'
| KEYWD2 ... ')'
...
| error ')'
;
Is there an easier way to do this, like
name: KEYWD1 ... endp
| KEYWD2 ... endp
;
endp: ')'
| error ')'
;
Would the best way to skip tokens during error recovery with balanced
parentheses be to add code in lex to count parenthesis nesting, then
set an error state that skips tokens until the matching ')' is found?
If lex skips tokens until a matching ')' during the error state, does
that eliminate the need to add error tokens to each production?
Would it be useful to add action code to save the nesting level for
use in error recovery?
I don't know how much more structure than '(keyword ... )' the language
at hand has to offer, but:
"Lisp-like language" essentially means your grammar is nothing more than
nested lists. Parsing nested lists itself is trivial; you need nothing
more than your {atom, '(', ')'}-lexer and a level counter.
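For illustration, the complete yacc grammar for such nested lists can be
as small as this (ATOM being the only real token):

    %token ATOM
    %%
    list  : '(' items ')'
          ;
    items : /* empty */
          | items ATOM
          | items list
          ;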
Unbalanced parentheses are the _only_ kind of errors you can have at all.
Good recovery would mean properly re-balancing '(keyword's with ')'s at
the point of error, but you lack the necessary syntactic sugar to do
this. That's what the Simplicity Of Lisp is about... ;-) (Now you see
what Algol68 funnyisms a la ".case ... .esac" were good for. rof.)
Your only option for doing better recovery would be to concentrate on
semantic issues. ("Hey, these symbols come from the outer environment.
Perhaps a ')' was missing." "(FOO wants an even number of args." "(BAR
takes only numerical args." ...)
These things are clearly beyond the scope of the lex/yacc parsing engines
themselves.
Martin Neitzel