
Compiler Books?


v796

Oct 27, 2003, 9:05:42 PM

Hi,

Of the following compiler books, which are the "best" and easiest to
read for implementing a compiler for a Pascal/C-like language?

1. A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles,
Techniques, and Tools, Addison-Wesley, 1986 (the "Dragon book").
2. C. Fischer and R. LeBlanc. Crafting a Compiler, Benjamin Cummings,
1991.
3. C. Fischer and R. LeBlanc. Crafting a Compiler with C, Benjamin
Cummings.
4. A. C. Holub. Compiler Design in C, Prentice-Hall Inc., 1993.
5. Appel. Modern Compiler Implementation in C: Basic Design, Cambridge
University Press.
6. Appel. Modern Compiler Implementation in Java: Basic Design,
Cambridge University Press.
7. Fraser and Hanson. A Retargetable C Compiler: Design and
Implementation, Addison-Wesley.
8. Dhamdhere. Compiler Construction, Macmillan.
9. Holmes. Object Oriented Compiler Construction, Prentice Hall.
10. Holmes. Building Your Own Compiler with C++, Prentice Hall.
11. Wirth. Compiler Construction, Addison-Wesley.
12. Wilhelm and Maurer. Compiler Design, Addison-Wesley.

Any comments are greatly appreciated.

Sincerely,
v796.
[Many of them have capsule reviews in the FAQ. -John]

VBDis

Nov 1, 2003, 4:07:24 AM

vick...@rediffmail.com (v796) writes:

>Of the following compiler books, which are the "best" and easiest to
>read for implementing a compiler for a Pascal/C-like language?

IMO top-down parsers are easier to understand than bottom-up
parsers. Even if both types can be used for C and Pascal, bottom-up
parsers are commonly described and used for C, and top-down parsers
for Pascal and other "Wirthian" languages.
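
To make the comparison concrete, here is a minimal sketch of a
hand-written top-down (recursive descent) parser for a toy
Pascal-style expression grammar. The grammar and all the names are
invented for illustration only; they are not taken from any real
compiler.

/* expr   -> term   { ("+" | "-") term   }
 * term   -> factor { ("*" | "/") factor }
 * factor -> NUMBER | "(" expr ")"
 */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *src;                 /* input cursor */
static int expr(void);                  /* forward declaration */

static void skip_ws(void) { while (isspace((unsigned char)*src)) src++; }

static int factor(void)
{
    skip_ws();
    if (*src == '(') {                  /* "(" expr ")" */
        src++;
        int v = expr();
        skip_ws();
        if (*src != ')') { fprintf(stderr, "expected ')'\n"); exit(1); }
        src++;
        return v;
    }
    if (isdigit((unsigned char)*src))   /* NUMBER */
        return (int)strtol(src, (char **)&src, 10);
    fprintf(stderr, "expected a number or '('\n");
    exit(1);
}

static int term(void)
{
    int v = factor();
    for (;;) {
        skip_ws();
        if (*src == '*')      { src++; v *= factor(); }
        else if (*src == '/') { src++; v /= factor(); }
        else return v;
    }
}

static int expr(void)
{
    int v = term();
    for (;;) {
        skip_ws();
        if (*src == '+')      { src++; v += term(); }
        else if (*src == '-') { src++; v -= term(); }
        else return v;
    }
}

int main(void)
{
    src = "2 * (3 + 4) - 5";
    printf("%d\n", expr());             /* prints 9 */
    return 0;
}

Each nonterminal becomes one function, which is a large part of why
this style is usually found easy to read and to extend by hand.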

If you are free in the design of the language, you may choose a
Pascal-like language, for a simpler implementation of the parser and
compiler. But if your language has to be somewhat compatible with C,
you have to go the harder way.

Just my opinion, comments from the community are welcome ;-)

DoDi

Henry Spencer

Nov 1, 2003, 5:02:25 PM

VBDis <vb...@aol.com> wrote:
>IMO top-down parsers are easier to understand than bottom-up
>parsers. Even if both types can be used for C and Pascal, bottom-up
>parsers are commonly described and used for C, and top-down parsers
>for Pascal and other "Wirthian" languages.
>...if your language has to be somewhat compatible with C,
>you have to go the harder way.

Sorry, not so. C was designed for top-down parsing; the first C
compiler, Dennis Ritchie's own, used recursive descent with some
flourishes.

Even ANSI C is not that hard to parse top-down. It requires some work
and thought, not just a mechanical transcription of the standard's
grammar. There are a few places where it helps to peek ahead one
extra token, to decide which way to go in a timely way. And of course
you need some interaction with the symbol table to get typedefs right.
Some of this is annoying but none of it is prohibitive; I've done it.
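
For a concrete picture of the symbol-table interaction (often called
the "lexer hack"), here is a hypothetical minimal sketch: the lexer
consults a typedef table before classifying an identifier, so that a
top-down parser can tell a declaration such as "foo * bar;" from an
expression statement. All names and data structures below are
invented, not taken from Ritchie's or any other actual compiler.

#include <stdio.h>
#include <string.h>

enum token { TOK_IDENT, TOK_TYPE_NAME /* , ... */ };

static const char *typedef_names[64];   /* toy symbol table */
static int n_typedefs;

static void add_typedef(const char *name)
{
    typedef_names[n_typedefs++] = name;
}

static int is_typedef_name(const char *name)
{
    for (int i = 0; i < n_typedefs; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return 1;
    return 0;
}

/* The lexer asks the symbol table before handing the parser a token. */
static enum token classify_identifier(const char *spelling)
{
    return is_typedef_name(spelling) ? TOK_TYPE_NAME : TOK_IDENT;
}

int main(void)
{
    add_typedef("size_t");          /* after seeing "typedef ... size_t;" */
    printf("%d\n", classify_identifier("size_t") == TOK_TYPE_NAME);  /* 1 */
    printf("%d\n", classify_identifier("x") == TOK_IDENT);           /* 1 */
    return 0;
}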

Incidentally, Bjarne Stroustrup is on record as saying that letting
Aho and Johnson talk him out of writing a recursive-descent parser for
the early C++ was a major mistake. C++ is extremely difficult to
parse using pure bottom-up techniques, and needs a lot of kludges,
while doing it top-down is fairly straightforward and is common in
production compilers.
--
MOST launched 30 June; first light, 29 July; 5arcsec | Henry Spencer
pointing, 10 Sept; first science, early Oct; all well. | he...@spsystems.net

Mohd Hanafiah Abdullah

Nov 1, 2003, 5:03:43 PM

VBDis <vb...@aol.com> wrote:
>vick...@rediffmail.com (v796) writes:

>IMO top-down parsers are easier to understand than bottom-up
>parsers. Even if both types can be used for C and Pascal, bottom-up
>parsers are commonly described and used for C, and top-down parsers
>for Pascal and other "Wirthian" languages.
>
>If you are free in the design of the language, you may choose a Pascal
>like language, for simpler implementation of the parser and
>compiler. But if your language has to be somewhat compatible with C,
>you have to go the harder way.


I agree that top-down parsers are easier to understand than bottom-up
ones. And this applies to C or Pascal or any other Algol-like
language. Plus, top-down parsers are easier to implement by hand if
you choose to, although there are parser generators out there.
Personally I have manually written top-down parsers using LL(k)
grammars for both C and Pascal, and of course I spent more time on C
due to the language's various features and idiosyncrasies. But overall
the degree of difficulty was the same.

I have never used parser generators such as yacc/bison or antlr.
It would be nice to know people's opinions on manual implementation vs
automation of parsers, especially from those who have done both; in
terms of efficiency, readability, maintainability,
ease-of-development, and so forth.

But a parser, although fun, is a small portion of an optimizing
compiler, and to me the more challenging phases are the semantic
analyzer, optimizer, and code generator.

Napi
--
http://www.cs.indiana.edu/hyplan/napi.html

Henry Spencer

Nov 2, 2003, 7:46:21 PM

Mohd Hanafiah Abdullah <na...@cs.indiana.edu> wrote:
>I have never used parser generators such as yacc/bison or antlr.
>Would be nice to know people's opinion on manual implementation vs
>automation of parsers, especially from those who have done both...

My opinion:

Automation wins for readability (less clutter obscuring the syntax),
efficiency (table walking is often faster than procedure calling), and
above all, ease of experimentation (particularly because bottom-up
parsing, which is more powerful and imposes fewer constraints, lends
itself to automation but not to hand-crafting).

Manual implementation wins for ease of semantics integration (easy to
interweave with the syntax, and top-down means more context is known),
ease of adding kludges to get around difficult spots, ease of
producing good error messages (again, better knowledge of context),
and avoiding dependence on major tools that may not be available
everywhere.

(One caveat: as I've hinted at, people often assume that hand-crafted
means top-down and automated means bottom-up, and that confuses the issues
some. Automated top-down also exists.)


--
MOST launched 30 June; first light, 29 July; 5arcsec | Henry Spencer
pointing, 10 Sept; first science, early Oct; all well. | he...@spsystems.net

[Automation also wins for accuracy. You can be quite confident that
the parser that yacc or another generator writes for you parses
exactly the grammar you gave it, but it's very easy to leave
undiagnosed holes in a hand-written RD parser. -John]

Henry Spencer

Nov 8, 2003, 6:36:45 AM

>[Automation also wins for accuracy. You can be quite confident that
>the parser that yacc or another generator writes for you parses
>exactly the grammar you gave it, but it's very easy to leave
>undiagnosed holes in a hand-written RD parser. -John]

Yes and no and kind of. The key problem is having to transform the
representation used in design to the one needed for implementation.
If that transformation isn't trivial, it's easy to make mistakes in
it.

But I think this problem can hit you even in the automation case. If
the automation has serious constraints on what it will accept, or you
used an eccentric design notation, the transformation can be
non-trivial even though both are grammars.

(As a case in point, it was not until 1984 -- hmm, might have been
1983, my memory is a bit vague -- that there was an LALR(1) grammar
for C. Even though C had been described using grammars since its
origins in the early 70s, writing a *correct* LALR(1) grammar for it
was hard. Steve Johnson's late-70s PCC did use a yacc parser, but
cheated a lot. In particular, PCC refused to accept a number of
unusual syntactic forms that were officially legal C. This was not
something you could discover by a quick comparison of the yacc grammar
to the C Reference Manual grammar.)

Jeff Kenton

Nov 21, 2003, 5:43:10 AM

One more to add to your list:

Keith Cooper & Linda Torczon: Engineering a Compiler

I got my first look at it yesterday, but haven't read it through yet. It
appears to be clearer than anything else I've seen in years -- latest
algorithms and great explanations.

v796 wrote:
> Of the following compiler books, which are the "best" and easiest to
> read for implementing a compiler for a Pascal/C-like language?

--

-------------------------------------------------------------------------
= Jeff Kenton Consulting and software development =
= http://home.comcast.net/~jeffrey.kenton =
-------------------------------------------------------------------------

Jeff Kenton

Nov 21, 2003, 5:45:03 AM

My two cents:

1. Efficiency: someone I work with, who's often right, claims that you
can always write a parser by hand that's faster than what yacc and lex
produce. Other people claim exactly the opposite.

2. Readability and maintainability: probably goes to the tools,
although some of that is superficial. There's often a lot of magic
hidden in the details that isn't as obvious on second read as you
thought it was the first time through.

3. Ease of development: here's where I dislike lex and yacc. The
actual code produced is impossible to read, and it's really painful to
debug when you get it wrong. Furthermore, you always have to invent
trickery to handle certain cases. It's often at the boundary between
lexing and parsing, or between the parser and whatever calls it, where
you start to think the tools are hurting you more than they're
helping.

Overall, I prefer to hand write my own. I find it easier to get
exactly what I need that way, and haven't found that execution speed
or development time suffers.

Mohd Hanafiah Abdullah wrote:
> I have never used parser generators such as yacc/bison or antlr.
> Would be nice to know people's opinion on manual implementation vs
> automation of parsers, especially from those who have done both; in
> terms of efficiency, readability, maintainability,
> ease-of-development, and so forth.

--

Chris F Clark

Dec 4, 2003, 12:40:57 AM

Mohd Hanafiah Abdullah asked:

> I have never used parser generators such as yacc/bison or antlr.
> Would be nice to know people's opinion on manual implementation vs
> automation of parsers, especially from those who have done both; in
> terms of efficiency, readability, maintainability,
> ease-of-development, and so forth.

Jeff Kenton replied:


> 1. Efficiency: someone I work with, who's often right, claims that you
> can always write a parser by hand that's faster than what yacc and lex
> produce. Other people claim exactly the opposite.

(Preface: my responses are *very* biased as I am the author of a tool
which is "actively" marketed!)

This is trivially true, if one considers the fact that one can take the
output of any given tool and then profile and modify it so that the
resulting hand-written tool is faster.

However, in contrast, if someone figures out how to write a faster
output for a specific tool, you may have to rearchitect your
hand-tweaked version to take advantage of those improvements. Note that
that doesn't happen very often (in 15 years of working on a tool,
there were approximately 4 times where we specifically worked on
speeding the generated parsers up).

All that being said, generally hand-written tools can be faster
than machine generated ones (provided you know what you are doing).
The difference in execution speed may or may not be significant.
If you don't know what you are doing, machine generated ones are
likely as fast. More importantly, this can be a case of premature
optimization. Why hand-craft a fast lexer or parser if that's not
the bottleneck in your application? You would be better off
measuring the real bottleneck and fixing that!

As evidence of the above, in several years of using automated tools
to build a variety of projects for clients, I have only twice had
cases where adjusting the lexing/parsing code made a significant
impact on the client application, and these applications (including
the ones where it didn't matter) were essentially parsers. Most of
the time, fixing up malloc/free and other low-level issues was more
important.

> 2. Readability and maintainability: probably goes to the tools,
> although some of that is superficial. There's often a lot of magic
> hidden in the details that isn't as obvious on second read as you
> thought it was the first time through.
>
> 3. Ease of development: here's where I dislike lex and yacc. The
> actual code produced is impossible to read, and it's really painful to
> debug when you get it wrong. Furthermore, you always have to invent
> trickery to handle certain cases. It's often at the boundary between
> lexing and parsing, or between the parser and whatever calls it, where
> you start to think the tools are hurting you more than they're
> helping.

These are the real issues. Your end goal should probably be to
produce a correct lexer/parser for your target language with the
minimum effort, so you can expend that effort on some other part of
the tool.

A real-life example is particularly illustrative here. I work with
a group that builds a Verilog compiler for use by chip designers.
Verilog is a typically nasty language, with a C-like (but not C)
preprocessor, places where whitespace is important and places where
it isn't, messy identifier and numeric rules, and what-have-you.
In our product, we have at least two different Verilog parsers.
One is hand-written and one is generated by a tool.

The hand-written one was borrowed from a previous product and has
slowly gathered more frills, knots, and warts. The
hand-written one parses only a subset of Verilog (although the
subset continually grows) and silently skips any text it can't
parse. That form of error-recovery is important for it. It is used
to parse texts that may or may not be correct Verilog (e.g. code
still being written, text documents that aren't intended to be
Verilog, binary files, what-have-you). Its goal is to extract as
much valid Verilog as it can out of said documents, syntax
highlight them and do some other "trivial" tasks--some of the tasks
not being trivial at all. It has taken a couple of person years to
get it to parse what it does as correctly as it does, not counting
the time to initially write the precursor version.

The machine generated one was created by transliterating the 1995
IEEE spec into the dialect that the generation tool uses and making
some adjustments to account for the hard-to-parse parts and for
local dialect extensions. In addition, the language of the tool
was used to describe the resulting tree we wanted built. A second
pass was added to translate the tree that the generator automatically
builds into a different representation used by the "back-end"
of our product. The initial coding of the grammar took under a
month, and since that time we have made only a few changes to it,
on the order of two more person-months. However, the translation
phase to the second representation was a far larger effort and has
consumed on the order of one to two person years. Only about six
months ago did we feel that we had the translation fully stable.

At this point, I'm willing to draw the following conclusions:

1) The machine generated parser is generally more reliable, more
robust, and easier to maintain than the hand-written one.
Moreover, no one is completely sure what dialect the hand-written
one actually parses. In addition, the hand-written one has spawned
numerous spin-offs, where part of the hand-written one was cloned
and reused in other parts of the product (to parse other small
fragments of the input).

However, the hand-written one *DOES* do a far better job of
responding to ill-formed texts. Also, the architect of the
hand-written one is far more comfortable with it and finds its code
simpler to follow. Most parser generators do a good job at
generating parsers which handle correct input texts or texts with
some small amounts of errors in them that the user needs to correct
(and even there certain lexical errors typically destroy the
parser's ability to recover). They do not do a good job at
generating parsers that can pick fragments of correct code out of a
sea of garbage, nor any other tasks they weren't designed to do.
But, note that some parser generators (e.g. Quinn Tyler Jackson's
meta-S) are now beginning to support a grep-mode, which may solve
that problem.

2) The machine generated one was easier to get to a high-level of
functioning first. For complicated languages it is easier to find
the right hacks to apply to a machine generated parser than it is
to write the same correct code by hand. If we had been able to use
the parse (AST) tree coming out of the parser as our internal
representation, the parsing part of the project would have been
done in a trivial amount of time.

However, a parse tree is often not the correct internal representation
of the input text. It often takes a lot of work to generate a
data structure that has the correct semantics (especially when it
isn't clear what the correct semantics are--several of our internal
representations have had major shifts in how they model the
semantics to reflect our evolving understanding of how to map the
input text into a working abstract model). As far as I can tell,
parser generators have not eliminated that work. That said, we did
not use a tool with attribute grammar support (such as Eli) and
perhaps some of those issues would have been mitigated through
that. Still, the hard part is often interfacing the output of the
parser generator with parts of the project that may have stringent
requirements that aren't "compatible". However, my impression is
that these tasks are intrinsically hard and that hand-generating
the parser would not have mitigated them (and would have made them
harder, since a hand generated parser does not build an AST
representation automatically, and that was a stepping stone at
least). I know that's true for the project I'm working on. In
fact, I want more automation, not less.

I have ideas (drawn from my practical experience) that will be added
to the tool over time. Some of them will improve that problem.
None of them are likely to be magic wands.

3) The output of a tool (and its associated library, if any) is more
opaque than hand-written code. Moreover, the more a tool enables
you to do (and worse, enables you to do efficiently), the worse the
problem gets. Doing a task takes code. Doing a generalized
version of the same task takes even more code. Doing an efficient
generalized version of the same task takes still more code. Doing
that and also including options for tasks that aren't being done
takes even more. . . .

I have not yet seen a tool that generates totally simple code and
code that handles all the complicated cases easily. Again, I have
some ideas here (on how to make the output more tailorable so that
simple problems can get simpler code generated), but the problem is
intrinsically difficult. The two items are polar opposites. You
can't get both at the same time. The best one can hope for is a
balance that is right for one's own project.

That said, the tradeoff in having already (mostly?) debugged code
that implements some of the things that I will need (e.g. nested
include file support, AST construction, i18n error messages)
already in the library, lets me build most things faster and not
have to debug my own typo errors as I reimplement them. That
weighs heavily in favor of a tool (and library) in my book. I'll
take a little opaqueness to get that.

Some relevant statistics: I cannot recall the last time I debugged
the *code* generated by the tool. I debug approximately two
library problems a year (across the entire tool community's user
base). I debug approximately four "table generation" problems per
year, mostly dealing with features in the tool that aren't in the
underlying theoretical model. Most of the bugs are both obscure
(happening in corners of the feature set) and obvious (when the
failure happens the tool or the generated application fails in a
dramatic way). There has so far been only one bug in the tool that
has affected the project I was describing above. There was also
only one case where the grammar did not describe the language
correctly and the grammar needed to be debugged.

Recommendations:

If you are doing something trivial or need some obscure ability not
supported by any tool you can find, don't use a tool. We will
probably never replace the hand-written parser in our product with
a machine generated one (although we consider it from time to time,
especially when needing to parse in a new context), because we
don't trust the tool we use to gracefully implement a grep/
maximal-munch-correct-fragments-only type parsing. In addition,
sometimes, when I have to implement a small state machine, I will
do so by hand, especially in embedded contexts where firing up a
whole object created by the tool would be expensive at runtime and
would provide no explicit benefits.

However, if you have a non-trivial language that may grow and/or
change, get the best (most automated) tool you can. It *will* save
you work, especially if that tool goes beyond simple lexing and
parsing to AST construction (or other features relevant to your
application). For example, ask someone about JavaCC with JJTree or
JTB or ask someone about ANTLR (aka PCCTS ver 2) or about PCCTS ver
1 and Sorcerer. There are news-groups for both tools
(comp.compilers.tools.javacc and comp.compilers.tools.pccts), so it shouldn't be hard
to get info and opinions, with the one caveat being that many of
the posters are "new" users (e.g. students or experimenters doing
projects, who currently have questions) so it is hard (but not
impossible) to get a response from a seasoned user. Most seasoned
users have simply finished their parsing part and moved on to other
tasks--and that's the point.

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : com...@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
19 Bronte Way #33M voice : (508) 435-5016
Marlboro, MA 01752 USA fax : (508) 251-2347 (24 hours)

Quinn Tyler Jackson

Dec 4, 2003, 3:32:29 AM

Mohd Hanafiah Abdullah asked:

> > I have never used parser generators such as yacc/bison or antlr.
> > Would be nice to know people's opinion on manual implementation vs
> > automation of parsers, especially from those who have done both; in
> > terms of efficiency, readability, maintainability,
> > ease-of-development, and so forth.

Jeff Kenton replied:

> > 1. Efficiency: someone I work with, who's often right, claims that you
> > can always write a parser by hand that's faster than what yacc and lex
> > produce. Other people claim exactly the opposite.
>

And then Chris Clark commented:

> This is trivially true, if one considers the fact that one can take the
> output of any given tool and then profile and modify it so that the
> resulting hand-written tool is faster.
>
> However, in contrast, if someone figures out how to write a faster
> output for a specific tool, you may have to rearchitect your
> hand-tweaked to take advantage of those improvements. Note, that
> that doesn't happen very often (in 15 years of working on a tool,
> there were approximately 4 times where we specifically worked on
> speeding the generated parsers up).

It has been my experience (as a parsing tool architect) that the tool route
is, overall, the better of the two routes for several reasons. (For any
value of "better"....)

1. Optimizations made to the parsing engine are reflected across all (or
most, anyway) generated parsers that use that engine. Time and time again
when I have improved the Meta-S engine, other parsers have benefited by
those same improvements, without any consideration for those other parsers
having been taken during optimization.

For instance, tweaking the engine to get 10% more speed out of a parser for
language A resulted in parser B gaining 50% more speed. What was a small
bottleneck for A turned out to be a big bottleneck for B. Had parsers A and B
been hand-written, any optimizations made to A would have had to have been
applied to B specifically.

2. Generated parsers are generated from specifications that can be checked
in some automated way for "correctness."

For instance, left-recursion is always wrong in an LL(k) (for any k) parser.
Finding left-recursion in a grammar specification isn't particularly
difficult to do by hand, but people get tired eyes, miss a step now and
again, and might miss it in a large specification if the moon is in the
right phase. Finding left-recursion in a grammar specification using
automated means is guaranteed to never suffer from boredom-induced oversight
on the part of the parser generator. It's not enough to hand-write a parser
from a specification that is known to be left-recursion free, since the
"hand writing" part of the process can introduce human error. The only way
to *always* catch left-recursion errors in an LL(k) parser is to generate
that parser from a specification known to be left-recursion free.
Hand-written parsers cannot be checked in an automatic way for this error.
(Proof: any algorithm that could find left-recursion in a hand-written
parser would imply a solution to the halting problem.)
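
As an illustrative-only sketch of that kind of automated check (not
taken from any particular generator), direct and indirect left
recursion can be found by taking the transitive closure of the "can
appear leftmost" relation between nonterminals. For brevity this
ignores nullable prefixes, which a real tool would also have to
handle.

#include <stdio.h>

#define N 3   /* toy grammar with 3 nonterminals: E, T, F               */
              /* E -> E + T | T     T -> T * F | F     F -> ( E ) | id  */

/* leftmost[a][b] = 1 if some production  a -> b ...  exists */
static int leftmost[N][N] = {
    /* E */ {1, 1, 0},
    /* T */ {0, 1, 1},
    /* F */ {0, 0, 0},      /* F's productions start with terminals */
};

int main(void)
{
    int reach[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            reach[i][j] = leftmost[i][j];

    /* Warshall's transitive closure */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = 1;

    const char *name[N] = { "E", "T", "F" };
    for (int a = 0; a < N; a++)
        if (reach[a][a])
            printf("left recursion through %s\n", name[a]);
    return 0;
}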

And that's *just* left-recursion. Any number of other detectable errors can
be overlooked in a hand-written parser. Automated parser generators never
tire of running all grammar specifications through the standard series of
error checks. As long as the algorithms used to find the errors are correct,
the grammars that pass through these tests, and the parsers they
subsequently generate, are guaranteed to be correct. Hand-written parsers
are only as correct as they happen to be -- and the amount of brainpower to
determine just what that is ... well, anyway ... I think I've made point 2.

3. Parsers generated from a specification can be modified and maintained in
ways that hand-written ones typically aren't. (Chris said enough on this
already.)

4. Parsers generated from a specification can be written by domain
specialists. Hand-written parsers can be written by domain specialists who
also happen to be programmers in the host language. Anyone who knows (or
can read) the HTML specification can implement a correct HTML grammar using
a decent tool, and can get all of the efficiency of whatever host language
the tool was implemented in. It takes an HTML expert who also happens to be
a good C++ programmer to hand-code an HTML parser as a C++ class (or set of
classes) from scratch. Domain specialists are not always experienced
programmers, and even if they are, may or may not be inclined to spend time
learning the theory necessary to write efficient parsers for their domain
specialties.

5. Parser generators can (more and more these days) produce parsers that can
be hosted by a multitude of language environments. One specification can be
used by a Delphi, C++, VB, C#, or Java host if the engine happens to support
interfaces to each of these. The grammar writer need not know how to program
in ANY of these environments for this to come about.

6. Parser generators encourage standardization and community. There are
well-known ways to write grammar subsections that parse function parameter
lists, white space, and dangling-else's. Most of these well known ways are
expressed in the syntax of parser generator specifications. Delphi, C++, VB,
C#, Java, Perl programmers can benefit from the knowledge of others who have
already solved the same parsing issues at a level of abstraction above the
implementation language. (Without parser generators, many of the questions
found in this newsgroup could not be concisely expressed, and answers
provided by experts would adhere to no standard.)

Cheers,

--
Quinn Tyler Jackson

Rodney M. Bates

Dec 8, 2003, 5:21:03 AM

Quinn Tyler Jackson wrote:
>
> 6. Parser generators encourage standardization and community. There are
> well-known ways to write grammar subsections that parse function parameter
> lists, white space, and dangling-else's. Most of these well known ways are
> expressed in the syntax of parser generator specifications. Delphi, C++, VB,
> C#, Java, Perl programmers can benefit from the knowledge of others who have
> already solved the same parsing issues at a level of abstraction above the
> implementation language. (Without parser generators, many of the questions
> found in this newsgroup could not be concisely expressed, and answers
> provided by experts would adhere to no standard.)
>

Perhaps I am being quixotic, but:

There are lots of languages being designed that have very cobbled-up
syntax that doesn't fit any clean formal model. It seems that mostly
only parser writers notice this consciously.

Is there any hope that more widespread use of parser generators by
people designing syntax in the first place would improve this situation?

Or was that what you were saying?

----------------
Rodney M. Bates

Nick Roberts

Dec 8, 2003, 5:27:42 AM

I don't really wish to negate what Jeff, Chris, and Quinn have said,
but I do think that it is sometimes advantageous to code a parser by
hand, so I just want to provide a quick 'other side of the coin' here.

Quinn Tyler Jackson wrote:

> It has been my experience (as a parsing tool architect) that the tool route
> is, overall, the better of the two routes for several reasons. (For any
> value of "better"....)
>
> 1. Optimizations made to the parsing engine are reflected across all (or
> most, anyway) generated parsers that use that engine. Time and time again
> when I have improved the Meta-S engine, other parsers have benefited by
> those same improvements, without any consideration for those other parsers
> having been taken during optimization.

One /potential/ problem with parser generators is that 'pessimisms'
(bugs!) also get propagated in this way. You /might/ optimise all
your generated parsers by introducing an optimisation into the parser
generator, but you /could/ also break all of them, by introducing a
bug into the parser generator. Obviously, the (good) parser generator
writers know this, and take pains not to do it.

> 2. Generated parsers are generated from specifications that can be checked
> in some automated way for "correctness."

A possible problem is that this could instill a false sense of
confidence (there are many errors in a specification that a generator
cannot detect). Obviously the solution is the old "Don't do that!"
Don't be so silly as to think that if the tool accepted it, it must be
right.

But it's a gotcha you have to know about to avoid. Some people do use
a parser generator foolishly thinking that it will enable them to
solve a complex problem quickly and easily (especially if they are the
kind of people who believe advertising literature :-)

> 3. Parsers generated from a specification can be modified and maintained in
> ways that hand-written ones typically aren't. (Chris said enough on this
> already.)

Absolutely true. Sometimes, however, the reverse is true. If you are
clever about hand writing a parser, you can build factorisations into
it to make certain anticipated future tweaks and modifications easy
(and safe). The specification language of a parser generator generally
prevents such factorisations.

I am currently writing an Ada compiler. The Ada syntax is very stable
(even at the introduction of a new revision, which happens about once
every ten years). Thus the need for changes to the parser due to
changes in the syntax will be minimal. I anticipate changes due to
changes in the output data (parsed tree) will also be minimal. Where I
do anticipate most change and improvement will be in the
sophistication and breadth of the error diagnostics. This desire to be
able to (eventually) achieve very sophisticated diagnostics is one of
the main things that puts me off using a parser generator tool.

> 4. Parsers generated from a specification can be written by domain
> specialists. Hand-written parsers can be written by domain specialists who
> also happen to be programmers in the host language.

Also perfectly true, but if it so happens that the parsed language and
parser (host) language are the same, the problem (almost) goes away.

I am writing my Ada compiler in Ada!

> 5. Parser generators can (more and more these days) produce parsers that can
> be hosted by a multitude of language environments. One specification can be
> used by a Delphi, C++, VB, C#, or Java host if the engine happens to support
> interfaces to each of these. The grammar writer need not know how to program
> in ANY of these environments for this to come about.

Fine, but it's not very often a particular parser needs to be hosted by
more than one language. And implementing a generator that targets multiple
host languages might open up more cracks for bugs to creep in (obviously
this is another product quality issue).

> 6. Parser generators encourage standardization and community. There are
> well-known ways to write grammar subsections that parse function parameter
> lists, white space, and dangling-else's. Most of these well known ways are
> expressed in the syntax of parser generator specifications. Delphi, C++, VB,
> C#, Java, Perl programmers can benefit from the knowledge of others who have
> already solved the same parsing issues at a level of abstraction above the
> implementation language. (Without parser generators, many of the questions
> found in this newsgroup could not be concisely expressed, and answers
> provided by experts would adhere to no standard.)

As Devil's advocate, one might also argue that really new and innovative
approaches to parsing (and its associated activities, such as syntax error
diagnostics) are likely to be somewhat stifled by this standardisation.

But I'm fairly sure that, for most people who need a parser, "it works" is
a much higher priority than "it's totally rad, man" (or whatever ;-)

Also, I would assert that forums such as comp.compilers tend to do much to
foster and aid such innovation.

--
Nick Roberts

Marco van de Voort

Dec 20, 2003, 4:51:37 PM

On 2003-12-04, Chris F Clark <c...@shell01.TheWorld.com> wrote:
>
> This is trivially true, if one considers the fact that one can take the
> output of any given tool and then profile and modify it so that the
> resulting hand-written tool is faster.
>
> However, in contrast, if someone figures out how to write a faster
> output for a specific tool, you may have to rearchitect your
> hand-tweaked to take advantage of those improvements. Note, that
> that doesn't happen very often (in 15 years of working on a tool,
> there were approximately 4 times where we specifically worked on
> speeding the generated parsers up).

Interesting, if the main reason to choose hand-coded parsers is speed.

What if it was quality of error generation? Could you argue your case if
my primary objective was error generation?

> (and even there certain lexical errors typically destroy the
> parsers ability to recover). They do not do a good job at
> generating parsers that can pick fragments of correct code out of a
> sea of garbage, nor any other tasks they weren't designed to do.
> But, note that some parser generators (e.g. Quinn Tyler Jackson's
> meta-S) are now beginning to support a grep-mode, which may solve
> that problem.

I don't mind that as much. Simply issuing an error on "strange" input
is enough. But what if I'm more interested in the viewpoint of a user
trying to fix bad programs (to be parsed by my parser) -- what would
you do?

Chris F Clark

Dec 22, 2003, 4:14:01 AM

I wrote:
>
> This is trivially true, if one considers the fact that one can take the
> output of any given tool and then profile and modify it so that the
> resulting hand-written tool is faster.
...

To which Marco van de Voort asked:

> Interesting, if the main reason to chose handcoded tools is speed.
>
> What if it was quality of error generation? Could you argue your case if
> my primary objective was error generation?

I'm not certain I could make the argument (either way) for error
generation. Certainly, the general state of the art of error
reporting in generated parsers is poor.

On the other hand, many tools now give the user a good hand at
certain parts of the error reporting. For example, the machine
generated parser usually has encoded within it the list of tokens that
were expected. In addition, for LL machine generated parsers, the
generated parser knows exactly where in the parse the error occurred
and what was attempting to be parsed--the LR machine knows that too,
but in a more limited fashion.

A hand generated parser is likely to be written in the recursive
descent style. Thus, it is going to have roughly the same properties
as a machine generated LL parser, which is likely to be a machine
generated recursive descent parser. The key point being that at any
point in the resulting parse, when an erroneous token is detected, the
parser can easily print out a relatively good error message that is
precisely tailored to the problem at hand.
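
As a hypothetical illustration (all token names and helpers below are
invented), this is the kind of spot-specific message a recursive
descent parser can produce; as the next paragraph notes, it amounts to
an extra error production in the grammar whose semantics is the
message.

#include <stdio.h>

/* statement -> expression ";"
 *            | expression          (error rule: report the missing ";")
 */
enum token { TOK_NUM, TOK_PLUS, TOK_SEMI, TOK_RBRACE, TOK_EOF };

static enum token input[] = { TOK_NUM, TOK_PLUS, TOK_NUM, TOK_RBRACE, TOK_EOF };
static int pos;
static enum token cur(void) { return input[pos]; }
static void advance(void)   { if (input[pos] != TOK_EOF) pos++; }

static void parse_expression(void)      /* expression -> NUM { "+" NUM } */
{
    if (cur() == TOK_NUM) advance();
    while (cur() == TOK_PLUS) { advance(); if (cur() == TOK_NUM) advance(); }
}

static void parse_statement(void)
{
    parse_expression();
    if (cur() == TOK_SEMI) {
        advance();
    } else {
        /* We know exactly what was being parsed and what was expected,
         * so the message can be precise; recovery is "pretend the ';'
         * was present", i.e. the semantics of the error production.  */
        fprintf(stderr, "missing ';' after expression (before '}'?)\n");
    }
}

int main(void)
{
    parse_statement();
    return 0;
}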

However, if you think about it, that precise handling of the erroneous
token is almost exactly equivalent to adding a new rule to the grammar
that says that at that particular point the token is "legal" and
whatever error message to be printed is the appropriate semantics for
that rule. In the long run, I would be happier adding such rules to
my grammar and having a tool validate that I haven't introduced an
ambiguity. In fact, if I were working on such a problem, I would most
like a tool that implemented "fuzzy grammars"*, so that I could make
my error productions part of the "fuzzy set" and have them only apply
when they didn't introduce ambiguities. However, that's just me.

*) I have at least a pair of great papers on the concept of fuzzy
grammars--grammars that have numbers associated with the rules that
tell "how much" a particular rule is "part" of the grammar or not,
1.0 being entirely applicable, 0.0 being not at all applicable, with
the most applicable rule being the one that is applied in an
ambiguous case. Unfortunately, the papers are buried in the stuff
I'm preparing to move to my new house, so I can't properly cite them
and give credit to the author--I think it was Peter Anvin....

The second point on this topic, which I think I mentioned in another
thread, is that many (most, to my mind) errors are actually sub-lexical,
occurring at the single-character level and not at the parsing level.
Even most hand-written parsers use some form of separate lexer (which
is mostly context insensitive), so when I make the error of omitting
the closing quote from my character string, it swallows far too much
relevant text and the resulting error recovery isn't important,
because the basic problem, the missing character, is not in the parser's
purview at all.

Carl Cerecke

Dec 23, 2003, 5:22:20 AM

Chris F Clark wrote:

> The second point on this topic, which I think I mentioned in another
> thread, is that many (most to my mind) errors are actually sub-lexical
> occuring at the single character level and not at the parsing level.

As part of my recently completed PhD, I analysed about 200,000
incorrect Java programs written by novice programmers. Nearly all are
correct lexically. That is, the program constitutes a valid lexical
stream. About half of the programs had no syntax errors, only
semantic errors. (Using the term "program" loosely here of course. If
the files really were programs, they wouldn't have any errors, and I
wouldn't have looked at them. They are mostly "almost programs",
although some wouldn't even be classed as that; you'd be surprised
what some novice programmers submit to a compiler!).

Anyway, the point is that almost all errors involve a lexically valid
token stream. However....

> Even most hand-written parsers use some form of separate lexer (which
> is mostly context insensitive), so when I make the error of omitting
> the closing quote from my character string, it swallows far too much
> relevant text and the resulting error recovery isn't important,
> because the basic problem the missing character is not in the parsers
> purview at all.

...the most difficult errors to repair mostly fall into two related
categories: comment delimiter problems and string delimiter problems.

Some novice programmers really have a problem remembering /* opens a
comment, and */ closes a comment - often transposing one or the other or
both. Suddenly, the parser is asked to make sense of the tokens <star>
<slash> <ident> <ident> <ident> ... and gets rather confused. Seeing
this has convinced me that it is better for the comment delimiters of a
language to be single-character tokens that are not used for any other
purpose. Also, notice how the above stream of tokens could look like a
malformed expression to a parser attempting recovery.

Even though missing delimiter problems are really a lexical issue from
the programmer's point of view, it's really the parser that has to
deal (gracefully!) with the problem, while the lexer often remains
blissfully ignorant.

Cheers,
Carl.

Joachim Durchholz

Dec 27, 2003, 7:12:46 PM

Carl Cerecke wrote:

> Chris F Clark wrote:
>
>>The second point on this topic, which I think I mentioned in another
>>thread, is that many (most to my mind) errors are actually sub-lexical
>>occuring at the single character level and not at the parsing level.
>
> As part of my recently completed PhD, I analysed about 200,000
> incorrect Java programs written by novice programmers. Nearly all are
> correct lexically.

Hmm... I think the mostly-lexically-correct property is correlated with
lack of experience on the part of the programmers.
Once people get used to the language, syntax errors tend to vanish
almost entirely, and what remains are single-character typos which tend
to either mangle a name (generating a "name not declared" error, which
isn't in the domain of parsing), or damage a token so that it goes into
another token class (generating a lexical error, i.e. not a parse error
either).

So I guess the kind of error handling needed depends largely on
programmer experience.

>>Even most hand-written parsers use some form of separate lexer (which
>>is mostly context insensitive), so when I make the error of omitting
>>the closing quote from my character string, it swallows far too much
>>relevant text and the resulting error recovery isn't important,
>>because the basic problem the missing character is not in the parsers
>>purview at all.
>
> ...the most difficult errors to repair mostly fall into two related
> categories: comment delimiter problems and string delimiter problems.
>
> Some novice programmers really have a problem remembering /* opens a
> comment, and */ closes a comment - often transposing one or the other or
> both. Suddenly, the parser is asked to make sense of the tokens <star>
> <slash> <ident> <ident> <ident> ... and gets rather confused. Seeing
> this has convinced me that it is better for the comment delimiters of a
> language to be single-character tokens that are not used for any other
> purpose. Also, notice how the above stream of tokens could look like a
> malformed expression to a parser attempting recovery.

Actually, in my learning days, I have seen a parser complain about a
missing expression between * and / in code like this:
a = */ now things get hairy */
<an-incredibly-complicated>
<multi-line expression>
It took me several seconds to figure out what was happening (but having
read about lexing and parsing, I knew where that error originated - I've
got no idea how a complete newbie to programming would have fared).

However, I think that the main problem here is that C uses the same
characters for the begin and end comment delimiters. If comments were
delimited using (* ... *), I'm pretty sure that any errors would stick
out to the view of the programmer, so even if he doesn't understand the
error message he'll quickly see what to fix.
Actually (*...*) is the Pascal convention, and I /never/ had problems
with comment delimiters when I learnt programming using Pascal. Of
course, that's just anecdotal evidence, a wider statistical base would
be beneficial to say anything with certainty.

Regards,
Jo

Chris F Clark

Dec 27, 2003, 7:17:35 PM

Carl Cerecke makes an interesting point:

> As part of my recently completed PhD, I analysed about 200,000
> incorrect Java programs written by novice programmers. Nearly all are
> correct lexically.
...

> Anyway, the point is that almost all errors involve a lexically valid
> token stream.

My vision of the problem was probably biased by viewing the errors
made by my colleagues, who are hardly novice programmers. Most of the
errors I have seen as we work on programs are "typographical" ones,
usually involving a single missed, transposed, or slightly modified
character (e.g. ' and " are just a shift-key away on my keyboard).
Note, we also do a lot of editing by cut-and-paste, so errors where
the region to be moved is missing a few characters on either end are
also common--again, that tends to exaggerate the number of
typographical-style errors.

Now, perhaps helping professional programmers out is not as important,
as we are generally capable of finding and fixing our own errors.
However, the typographical errors (especially misplaced delimiters)
are ones that cause the most cascading errors. This probably also
accounts for the maxim that many programmers including myself follow,
which is fix the first error and ignore the rest--one can do a little
better than that, but quite quickly one learns when to punt on chasing
down subsequent error messages. About the only time I pursue all
errors is when I have modified a procedure's argument list, and then
the error instances are almost always the correct list of calls to
that procedure.

In contrast, and this adds weight to your hypothesis, while less
common for professional programmers, the most difficult to find errors
involve complicated and subtle uses of the language in question. The
errors that my colleagues and I have the hardest time finding involve
(in C++) subtle declarations where the exact syntax is not always
transparent. Any one of templates, macros, and a variety of
constructors can quickly result in a declaration which when misspelled
gives a totally uninformative error message. And the errors are often
of the semantic ilk, where the declaration was misinterpreted by the
compiler to be of a different semantic class than the one intended and
the errors are thus unrelated to what we were attempting to write.

I would be interested to hear if you came up with a strategy to handle
these semantic errors, when the user has submitted a set of valid
declarations and statements, but where there is a mismatch between the
semantics of one or more of the entries--e.g. the user has declared a
function, but meant to declare a variable and the uses would have all
been correct if the variable had been declared, but were in error
because the compiler "knew" that the identifier represented a function
(from the syntactically correct, but unintended function declaration)
and (the uses) were thus contextually illegal.

The one thing I will say that the semantic problem (as I posed it
above) has in common with the syntactic problem is that in both cases
the correct answer can only be seen when viewed in a complete context,
where "correct" fragments are modified until the total errors are
minimized.

In the lexical case, the parser assumes that the tokens the lexer has
handed to it are correct. However, when one delimiter is missed, the
tokens the lexer hands to the parser are not the correct ones--they
are gibberish (lexically correct gibberish). And if one looked at the
lexical problem in terms of finding a correctly parsable program with
the *minimum editing distance*, one would immediately see that the
correct program would have a different set of tokens. However, I've
never seen a lexer/parser that attempted to find the program with the
minimum editing distance from the original source.
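
To make the "minimum editing distance" idea concrete, here is a
small, illustrative-only sketch that computes the classic Levenshtein
distance between two token sequences. A compiler that really worked
this way would have to search for the nearest *parsable* sequence
rather than compare against a known-good one, which is the hard part;
the names and integer token codes below are invented.

#include <stdio.h>

#define MAXTOK 64

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Edit distance between token sequences a[0..na) and b[0..nb). */
static int token_edit_distance(const int *a, int na, const int *b, int nb)
{
    int d[MAXTOK + 1][MAXTOK + 1];
    for (int i = 0; i <= na; i++) d[i][0] = i;
    for (int j = 0; j <= nb; j++) d[0][j] = j;
    for (int i = 1; i <= na; i++)
        for (int j = 1; j <= nb; j++)
            d[i][j] = min3(d[i - 1][j] + 1,                           /* delete  */
                           d[i][j - 1] + 1,                           /* insert  */
                           d[i - 1][j - 1] + (a[i - 1] != b[j - 1])); /* replace */
    return d[na][nb];
}

int main(void)
{
    /* "if ( x y ) ;"  vs  "if ( x == y ) ;"  -- '==' (code 6) inserted */
    int broken[] = { 1, 2, 3, 3, 4, 5 };
    int fixed[]  = { 1, 2, 3, 6, 3, 4, 5 };
    printf("%d\n", token_edit_distance(broken, 6, fixed, 7));   /* prints 1 */
    return 0;
}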

Similarly, one could view certain semantic problems the same way.
Given a syntactically correct set of declarations, how can one find
the minimally modified set that generates no errors? In my
example, where a function was declared when a variable was intended,
in most languages it would take fewer changes to the source text to
change the declaration type of the identifier from function to
variable than it would to correct all (mis)uses of the identifier.

Note, that a compiler that followed these principles would work (and
particularly report errors) in a completely different fashion than any
that I am aware of today.

The common error reporting scheme is to find oneself in an
"impossible" situation, where something cannot be correct and to
report an error complaining about the current predicament. Note that
the error recovery scheme almost always assumes that what was
successfully processed before was correct and that the internal
database contains only true facts. Because the software (compiler) is
not robust and we cannot assume that the program (compiler) will not
crash at some future point due to these inconsistencies exposing
weak points in the program (compiler) we are writing, we attempt to
report the error right away and perhaps make a graceful exit without
doing further damage to the program's (compiler's) internal data
structures.

An "error minimizing compiler" would have to be able to assume that it
was robust enough not to fail and then collect all the inconsistencies
and then try to modify things to resolve those inconsistencies. Once
the compiler knew which sets of modifications resolved all the
inconsistencies, it would then report the set(s) of modifications that
were minimal. By the way, I'm not confident enough that I could write
a sufficiently robust compiler to allow pursuing such an error
recovery strategy.

Looking forward to more exchanges,

Oliver Zeigermann

Jan 2, 2004, 8:39:54 AM

Chris F Clark wrote:
> In the lexical case, the parser assumes that the tokens the lexer has
> handed to it are correct. However, when one delimiter is missed, the
> tokens the lexer hands to the parser are not the correct ones--they
> are gibberish (lexically correct gibberish). And if one looked at the
> lexical problem in terms of finding a correctly parsable program with
> the *minimum editing distance*, one would immediately see that the
> correct program would have a different set of tokens. However, I've
> never seen a lexer/parser that attempted to find the program with the
> minimum editing distance from the original source.

This sounds much like robust *natural* language parsing, where you
constantly have malformed sentences and still try to understand their
meaning. I have seen approaches using propagation of constraints
describing the structure of valid sentences. This incorporates weak
constraints that only claim certain properties *should* be satisfied. I
can't remember how the parsing was actually done or what variables were
constrained over what domain, but I could find out if anyone is
interested.

Oliver

Mark Sayers

Jan 7, 2004, 5:59:13 AM

I thought I would throw my 2 cents in too.....

From my experience of the professional world, I would agree that most
of the common errors that occur are due to fragments of code being
copied and pasted from one location to another and not altered
correctly.

In many cases it's quite possible to do this within a function such
that the variables are indeed valid, but not amended in such a way
that they produce the desired effect. This isn't something which I
believe can be remedied by a better (or worse) compiler or by my
experience with a given programming language. It is simply a matter
of taking care when making changes to code and testing that something
performs the task it is intended to.

At the end of the day a computer program can be correct in terms of
lexical analysis and syntax, but a compiler can't be expected to know
what the programmer wanted to happen and amend their code to do this
when they have made a mistake of this type. It's a matter of human
error, not compiler error or design.

It is possible, and more the norm, that typos can be detected;
however, variables with similar names can be problematic with regard
to human error. At the end of the day, visually checking that the
code is correct and testing that it does what it says on the packet
is the only really sensible approach to producing reliable and robust
software.

My personal belief is that too great a percentage of time is spent
coding and not enough on the design of the software to begin with, in
order to produce easily maintainable code that is readable. Often
some people seem to go for the slap-it-together-and-cross-your-fingers
approach, or a design which is over-complex, or try to incorporate
efficiency tricks in the code which can be better facilitated with
compiler optimisations instead of by hand.

At the end of the day a readable and reliable piece of code, which is
then optimised for performance or speed by the compiler or other
tools, is a more valuable commodity than one which pulls all kinds of
tricks and is confusing and baffling not only to other people trying
to maintain the code but often to the programmer themselves when
trying to remedy problems at a later date.

It is a fair comment to say that many newbie programmers do make
common mistakes due to their lack of understanding of what they are
trying to achieve and how to actually achieve it; however, the bigger
picture of how these people are taught to write code when they get
into the workplace is more of an issue.

It's all good and proper to make mistakes when you're learning
something and developing your skills, but in the commercial world
making mistakes costs money, and the time spent checking things and
determining a problem earlier rather than later is a more effective
use of resources (though project managers do not always seem to
believe it at the time).

Mark

Jeff Kenton

Jan 10, 2004, 4:35:59 AM

Mark Sayers wrote:
> My personal belief is that too great a percentage of time is spent
> coding and not enough on the design of the software to begin with, in
> order to produce easily maintainable code that is readable.

I have to add something on the question of readability. I have
worked at a place where the deliberate policy was to eliminate most
comments and white space, for the purpose of being able to see more
code on a single screenful. The president went so far as to strip
comments from existing code. And then wondered why new hires took so
long to learn the internals.

-------------------------------------------------------------------------
= Jeff Kenton Consulting and software development =
= http://home.comcast.net/~jeffrey.kenton =
-------------------------------------------------------------------------

[We all know that you can't legislate against stupidity. -John]

Oliver Zeigermann

Jan 23, 2004, 4:12:43 AM

For anyone interested, here are some links to constraint-based robust
parsing of *natural* language:

http://nats-www.informatik.uni-hamburg.de/~wolfgang
http://nats-www.informatik.uni-hamburg.de/~wolfgang/papers/nle2003.ps.gz
http://nats-www.informatik.uni-hamburg.de/~wolfgang/papers/its-ws.ps.gz

The main problem in adapting this to the parsing of programming
languages seems to be that you would have to completely rewrite your
grammar as constraints.

Oliver
