
The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?


Roger L Costello

Jun 9, 2022, 12:50:50 PM
Hi Folks,

Pages 84-85 of the dragon book [1] say:

There are several reasons for separating the analysis phase of compiling into
lexical analysis and parsing.

1. Simpler design is perhaps the most important consideration. The separation
of lexical analysis from syntax analysis often allows us to simplify one or
the other of these phases. For example, a parser embodying the conventions for
comments and white space is significantly more complex than one that can
assume comments and white space have already been removed by a lexical
analyzer. If we are designing a new language, separating the lexical and
syntactic conventions can lead to a cleaner overall language design.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to
construct a specialized and potentially more efficient processor for the task.
A large amount of time is spent reading the source program and partitioning it
into tokens. Specialized buffering techniques for reading input characters and
processing tokens can significantly speed up the performance of a compiler.

3. Compiler portability is enhanced. Input alphabet peculiarities and other
device-specific anomalies can be restricted to the lexical analyzer. The
representation of special or non-standard symbols, such as the up-arrow in
Pascal, can be isolated in the lexical analyzer.

Specialized tools have been designed to help automate the construction of
lexical analyzers and parsers when they are separated.
----------------------------
Those seem like compelling reasons for separating the lexical analysis from
parsing, so why does ANTLR not do so; i.e., why does ANTLR combine them?

/Roger

[1] Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman.

Alexei A. Frounze

Jun 9, 2022, 9:15:57 PM
On Thursday, June 9, 2022 at 9:50:50 AM UTC-7, Roger L Costello wrote:
> 2. ...
> A large amount of time is spent reading the source program and partitioning it
> into tokens. Specialized buffering techniques for reading input characters and
> processing tokens can significantly speed up the performance of a compiler.

In any decent compiler this time is relatively small compared to the time
spent on optimization. Unless we're talking about JIT, which is a different
kind of compiling.

> 3. Compiler portability is enhanced. Input alphabet peculiarities and other
> device-specific anomalies can be restricted to the lexical analyzer. The
> representation of special or non-standard symbols, such as the up-arrow in
> Pascal, can be isolated in the lexical analyzer.

ASCII is supported out of the box nowadays. Unicode is available, and merely
parsing and storing Unicode code points is straightforward (processing
Unicode text, especially displaying it, is problematic, but that should
hardly affect a programming language or its compiler much).

> Those seem like compelling reasons for separating the lexical analysis from
> parsing,
...
> [1] Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman.

The book is old and doesn't quite reflect the current state of things, IMO.

Alex
[My impression is that the lexer can often take significant time since it has to look
at every character in the input, but the parser is fast unless you're doing something
strange like very ambiguous Earley parsing. -John]

gah4

Jun 10, 2022, 11:41:38 AM
On Thursday, June 9, 2022 at 9:50:50 AM UTC-7, Roger L Costello wrote:

> Page 84-85 of the dragon book [1] says:

> There are several reasons for separating the analysis phase of compiling into
> lexical analysis and parsing.

OK, no-one else has said it yet, so I will.

With the small memories and slow computers of 60 or 70 years ago, time
and space were important.

Then we had microcomputers, where we had to relearn everything learned
on mainframes years before; that was 40 or 50 years ago.

But by 30 or 20 years ago, memories were big and computers fast.
While programs needing compiling have also gotten larger, I am not
convinced that either speed or size is much of a problem now.

The separation of tasks, modular programming, often makes sense,
so it is likely still reasonable to keep them separate. But not, I
believe, because of speed or space.

(Well, maybe compilers running on an Arduino Nano still face a size or
speed limit. But most don't do that.)

There is still the old saying:

"premature optimization is the root of all evil"

Hans-Peter Diettrich

Jun 10, 2022, 11:43:49 AM
On 6/9/22 4:52 PM, Roger L Costello wrote:

> Those seem like compelling reasons for separating the lexical analysis from
> parsing, so why does ANTLR not do so; i.e., why does ANTLR combine them?

I cannot answer the ANTLR question but want to point out why IMO
(traditional) languages based on tokens should have a tokenizer in the
first place.

We encountered a problem with scannerless parsers and traditional
programming languages in Meta§. A tokenizer can use a couple of
whitespace or control characters or tokens as token separators. Without
such a rule it's very hard to flag all places in a grammar where
whitespace is *required* as a token separator.

As an example: in an "else if" sequence, whitespace is required between
"else" and "if", while in other contexts, e.g. "else{", whitespace is not
required. This problem may be solved by a regex, but why should a grammar
be inflated by adding a token-termination clause to *each* keyword?
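
A minimal ANTLR-style sketch of what the tokenizer buys you (the rule
names are mine, invented for illustration). Because the lexer applies
longest match to the raw input, "elseif" comes back as a single
identifier while "else if" comes back as two keyword tokens, and no
parser rule ever has to mention whitespace:

ELSE : 'else' ;
IF   : 'if' ;
ID   : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS   : [ \t\r\n]+ -> skip ;   // consumed here; the parser never sees it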

Fortran is another example where keyword recognition requires
backtracking or similar means due to ignorance of spaces e.g. in the
well known "DO10I = 1," snippet.

DoDi
[A separate lexer also makes it a lot easier to skip comments. For Fortran
you had to do a prepass to see whether a statement was an assignment or something
else but after that you could tokenize without backtracking. -John]

Christopher F Clark

Jun 11, 2022, 7:26:16 PM
Since Terence borrowed that idea from our version of Yacc++, I feel
qualified to answer. (And we in turn borrowed plenty of ideas from
ANTLR, so it's all fair.)

First, they aren't really merged. The scannerless-parser people do merge
the two, but ANTLR still keeps them separate: it generates a lexer and a
parser as two distinct entities, and there are slight differences between
them (as there should be). And you can actually put them in separate
files (and compile them separately) if you want.

All that happens is that the two parts use roughly the same notation
(and if you don't need the lexer-specific features, it is essentially a
subset of the parsing language, and vice versa). So you learn that
notation once and you know it for both parts.

Moreover, by combining the two parts in one file, you know that the
parts "go together" and you have less problems with mismatches,
especially not the kind where you update one but then have an "old"
version of the other which doesn't quite match. It also allows you to
introduce "tokens" (especially "keywords") in the parser. (Note you
lose this if you compile the parts separately.)
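
For concreteness, here is a hedged sketch of what such a combined
grammar might look like (the grammar name and rules are mine, invented
for illustration, not taken from any shipped example):

grammar Mini;                 // one file, yet ANTLR still generates
                              // MiniLexer and MiniParser separately
stat : 'return' expr ';' ;    // the literal 'return' in a parser rule
                              // implicitly defines a keyword token
expr : ID ('+' ID)* ;
ID   : [a-z]+ ;
WS   : [ \t\r\n]+ -> skip ;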

A slightly more advanced version of that allows you to have tokens
that are only matched in certain contexts. ANTLR has some of this
implemented, so if you have a place in your parser where you want to
match ">" but not worry about ">>", you can use the literal token in
the grammar and it will override the longest match rule, provided it
doesn't introduce a conflict. (You also lose this feature with
separate compilation.)
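
To make that concrete, a sketch of the kind of grammar this enables
(the rule names are mine, and the override behavior is as I described
it above, not something guaranteed by every ANTLR version):

genType : ID '<' genType (',' genType)* '>'  // literal '>' here, so
        | ID ;                               // "List<List<int>>" can
                                             // close two type lists
                                             // instead of lexing '>>'
shift   : genType '>>' genType ;             // '>>' still usable where
                                             // shifts are actually legal
ID      : [a-zA-Z]+ ;
WS      : [ \t\r\n]+ -> skip ;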

So, note that by keeping the implementations separate (they really are
two phases), you have kept benefit 1: your parser never sees whitespace
and comments, unless you want it to. You can still get benefit 2 from
the implementation; I don't know whether the ANTLR-generated lexer does
so (or not). And since ANTLR is Unicode-based, benefit 3 is not an issue.

All of this is why we merged the two files into one in Yacc++, and
ANTLR started doing the same thing. We also merged them because our
original target was going to be a syntax-directed-editor version of
Emacs, and these ideas made sense in that regard.

--
******************************************************************************
Chris Clark email: christoph...@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

George Neuner

Jun 11, 2022, 7:38:45 PM
On Thu, 9 Jun 2022 14:52:59 +0000, Roger L Costello
<cost...@mitre.org> wrote:


>[elided quote] seem like compelling reasons for separating the lexical analysis
>from parsing, so why does ANTLR not do so; i.e., why does ANTLR combine them?
>
>/Roger
>
>[1] Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman.

Technically ANTLR does /not/ combine them ... it generates a separate
lexer and parser.

What ANTLR /does/ do is allow specifying both the parser and the lexer
in a single input file. It can also recognize textual constants within
the parser specification and generate lexer rules for them.

Note that Yacc and Bison also recognize textual constants in the parser
grammar and generate a token id for the (separately specified) lexer
to return. ANTLR goes further only because it can: ANTLR can create a
lexer, whereas Yacc and Bison cannot.

George

Anton Ertl

Jun 12, 2022, 2:25:50 PM
George Neuner <gneu...@comcast.net> writes:
>Note that Yacc and Bison also recognize textual constants in the parser
>grammar and generate a token id for the (separately specified) lexer
>to return.

If you have a rule in yacc/bison:

E: T '+' T

the "token id" for '+' is the ASCII-code of +. Bison generates token
ids only for tokens defined with %token. So if you instead write

E: T PLUS T

you have to define

%token PLUS

and the value of PLUS is communicated to the scanner through the
.tab.h file.

Also note that for the last version of yacc that I have seen
documentation for, if you have a rule

S: L ":=" E

there is no token for ":=", but instead what you get is equivalent to

S: L ':' '=' E

Bison is more capable; you can, e.g., define

%token BECOMES ":="

- anton
--
M. Anton Ertl
an...@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/