
Has lexing and parsing theory advanced since the 1970's?


Roger L Costello

Sep 16, 2021, 12:56:25 PM
Lately I have been learning to use Flex & Bison. As I understand it, Flex &
Bison (and other parser generators) are built on solid theory. As a result,
those parser generators were created quickly and cheaply using structured,
well-known techniques. Contrast with parsers developed prior to the theory:
The earliest parsers, back in the 1950s, used utterly ad hoc techniques to
analyze the syntax of the source code of the programs they were parsing. During
the 1960s, the field got a lot of academic attention and by the early 1970s,
parsing was no longer an arcane art. In the 1970s Aho, Ullman, Knuth, and many
others put parsing techniques solidly on their theoretical feet.

One of the first techniques they (Aho, Ullman, Knuth, and others) espoused was
to separate lexing from parsing. Lexing built upon regular expressions, which
built upon Finite Automata (FA) theory and Nondeterministic Finite Automata
(NFA) theory. FA and NFA were brilliantly melded together with the famous
Kleene Theorem. Parsing built on top of a rich theory of grammars -- Context
Free Grammars, Context Sensitive Grammars, etc. -- that Chomsky formulated.
Neat!
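
To make the lexer/parser separation concrete, here is a minimal sketch of
how the two halves look in Flex and Bison notation. It is a toy of my own
(file names and all), untested as written, so treat it as illustrative
rather than authoritative:

    /* scan.l -- the lexing half: characters in, tokens out */
    %option noyywrap
    %{
    #include <stdlib.h>
    #include "parse.tab.h"   /* token codes generated by bison -d */
    %}
    %%
    [0-9]+      { yylval = atoi(yytext); return NUMBER; }
    [-+*/()\n]  { return yytext[0]; }
    [ \t]       { /* white space is discarded here, between tokens */ }
    %%

    /* parse.y -- the parsing half, written purely in terms of tokens */
    %{
    #include <stdio.h>
    int yylex(void);
    void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
    %}
    %token NUMBER
    %left '+' '-'
    %left '*' '/'
    %%
    input : /* empty */
          | input line
          ;
    line  : expr '\n'      { printf("%d\n", $1); }
          ;
    expr  : NUMBER
          | expr '+' expr  { $$ = $1 + $3; }
          | expr '-' expr  { $$ = $1 - $3; }
          | expr '*' expr  { $$ = $1 * $3; }
          | expr '/' expr  { $$ = $1 / $3; }
          | '(' expr ')'   { $$ = $2; }
          ;
    %%
    int main(void) { return yyparse(); }

Something like "bison -d parse.y && flex scan.l && cc parse.tab.c
lex.yy.c" should build it; the point is only that the grammar never
mentions characters and the scanner never mentions grammar rules.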

That said, Flex & Bison is old. Has lexing/parsing theory advanced since the
1970’s? If yes, are there parser generators available today which are based on
those advances in lexing/parsing theory? Or does Flex & Bison still represent
the state-of-the-art in terms of the underlying theory it uses?

[Having been there in the 1970s I can report that the linguists and the computer
scientists were dimly aware of each other but didn't work together and used
different terms, e.g., linguists say phrase structure grammars where we would
say CFG.
Flex is a rewrite of lex, which was a Bell Labs summer project by Eric
Schmidt to make it easier to turn regular expressions into DFAs (not
NFAs). The code was awful; flex was a cleaner rewrite that avoided the
AT&T software license. Bison was originally a GPL rewrite of yacc but it
has since added reentrant parsers and GLR parsing for ambiguous grammars.
They knew about GLR in the 1970s, but Yacc had to run on a 16 bit PDP-11
and the data structures were too big. -John]

Anton Ertl

Sep 16, 2021, 1:56:57 PM

Roger L Costello <cost...@mitre.org> writes:
>That said, Flex & Bison is old. Has lexing/parsing theory advanced since the
>1970s? If yes, are there parser generators available today which are based on
>those advances in lexing/parsing theory? Or does Flex & Bison still represent
>the state-of-the-art in terms of the underlying theory it uses?

It seems to me that after the success of BNF (context-free grammars)
with Algol 60, people wanted to go further, and Algol 68 used a Van
Wijngaarden grammar (a two-level grammar). But this formalism has not
been used for describing and implementing later languages, and even
Algol 60 put more in the grammar (in particular, the type difference
between flags and numbers) than later languages. Most languages use
just BNF and describe the rest of the language in prose, a few use a
formal semantics, but I guess that's not what the question was about.
So on the language definition side, there have been no advances in
grammar formalism, because none are needed.
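
To illustrate what putting "more in the grammar" means, here is a
paraphrase from memory (not a quotation of the Report) of how Algol 60
keeps Booleans and numbers apart syntactically:

    <if clause>             ::= if <Boolean expression> then
    <Boolean expression>    ::= <simple Boolean> | ...
    <arithmetic expression> ::= <simple arithmetic expression> | ...

Later languages typically just say <expression> after "if" and leave the
type check to the semantic phase.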

On the implementation side, there have been many attribute grammar
generators (e.g., Ox), but few uses. In the back end various more or
less formal techniques have been used for instruction selection; e.g.,
lcc uses the tree-parser generator lburg.

- anton
--
M. Anton Ertl
an...@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

Kaz Kylheku

Sep 17, 2021, 11:37:29 AM
On 2021-09-14, Roger L Costello <cost...@mitre.org> wrote:
> Lately I have been learning to use Flex & Bison. As I understand it, Flex &
> Bison (and other parser generators) are built on solid theory. As a result,
> those parser generators were created quickly and cheaply using structured,
> well-known techniques. Contrast with parsers developed prior to the theory:
> The earliest parser back in the 1950s used utterly ad hoc techniques to
> analyze the syntax of the source code of programs they were parsing. During
> the 1960s, the field got a lot of academic attention and by the early 1970s,
> parsing was no longer an arcane art. In the 1970s Aho, Ullman, Knuth, and many
> others put parsing techniques solidly on their theoretical feet.
>
> One of the first techniques they (Aho, Ullman, Knuth, and others) espoused was
> to separate lexing from parsing.

This is for one particular pragmatic reason. The LALR(1) parsing
technique uses one token of lookahead to make key parsing decisions.

If you combine lexing with parsing, then some of your original lookahead
symbols may turn into sequences of symbols, which share the same initial
element.

E.g. suppose you have a "++" token and a "+=" token, either of which may
appear as the lookahead symbol that determines which way to reduce.

If we don't separate lexing from parsing, so that we have a
character-level grammar, then we no longer have "++" and "+=" tokens;
each of these is two grammar symbols.

And so in that same reduce situation, we no longer have two different
lookahead tokens for "++" and "+=". The lookahead symbol is just "+".

Fixing this requires techniques like multiple symbols of lookahead:
LR(k). (Not that that is new or anything.)
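
With a separate lexer, by contrast, longest match ("maximal munch")
settles the question before the parser ever looks: its one token of
lookahead is already INC or ADD_ASSIGN (token names invented here for
illustration), never a bare "+". In flex terms, roughly:

    "++"   { return INC; }         /* one lookahead token for the parser */
    "+="   { return ADD_ASSIGN; }  /* a different single token */
    "+"    { return '+'; }         /* plain plus only if the longer rules fail */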

There are certain efficiencies to be had from lexing as a separate pass.
Dedicated lexical analysis can recognize tokens "blindingly" fast,
using buffering techniques integrated with the recognizer.

> Lexing built upon regular expressions, which
> built upon Finite Automata (FA) theory and Nondeterministic Finite Automata
> (NFA) theory. FA and NFA were brilliantly melded together with the famous
> Kleene Theorem. Parsing built on top of a rich theory of grammars -- Context
> Free Grammars, Context Sensitive Grammars, etc. -- that Chomsky formulated.
> Neat!

However, LR parsing is built on regular expressions again, with
some sophistication thrown in to handle nesting.

You know from ad hoc programming with regexes that you can handle some
forms of nesting if you invoke a regex more than once on a string.

For instance, balanced parentheses can be recognized by a regex-driven
algorithm: use a regex to find the instances of an opening parenthesis,
followed by non-parenthesis characters [^()], followed by a closing
parenthesis; reduce by removing each such match from the input; and
iterate.
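
Here is a throwaway sketch of that loop, my own illustration using POSIX
regcomp/regexec rather than anything from a parser generator:

    /* Balanced-parenthesis check by iterated regex reduction: repeatedly
       delete an innermost "( ...no parens... )" group; the input was
       balanced iff no parenthesis is left at the end. */
    #include <regex.h>
    #include <stdio.h>
    #include <string.h>

    static int balanced(char *s)
    {
        regex_t re;
        regmatch_t m;
        regcomp(&re, "[(][^()]*[)]", REG_EXTENDED);
        while (regexec(&re, s, 1, &m, 0) == 0)      /* found a group: reduce */
            memmove(s + m.rm_so, s + m.rm_eo, strlen(s + m.rm_eo) + 1);
        regfree(&re);
        return strpbrk(s, "()") == NULL;            /* any stray paren left? */
    }

    int main(void)
    {
        char good[] = "((a)(b))c", bad[] = "((a)c";
        printf("%d %d\n", balanced(good), balanced(bad));   /* prints: 1 0 */
        return 0;
    }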

That sort of gives us an intuition behind LR parsing. At the core of an
LR parser is a state machine, and if we ignore its push-down aspects
(how its actions maintain a stack), it's just a finite state machine
that can be described by a regex. The terminating states of the regex
basically rewrite something in the stack of symbols, or push something
onto it, and then choose some state to jump to for the next iteration.

> That said, Flex & Bison is old. Has lexing/parsing theory advanced since the
> 1970’s? If yes, are there parser generators available today which are based on
> those advances in lexing/parsing theory? Or does Flex & Bison still represent
> the state-of-the-art in terms of the underlying theory it uses?

Firstly, Flex and Bison have enough power in them to handle
state-of-the-art *languages*.

Parsing theory has long outstripped the practical realm.

Loosely speaking, if you design a computer language whose syntax is so
complicated with difficult ambiguities that it benefits from being
parsed by some technique only available from a 2020 dated paper, it's
probably an awful language from a syntactic point of view, which
ill serves its users in that regard.

(Look at C++, by the way; it's famous for having a large, awful grammar.
Yet the GNU C++ implementation is a hand-maintained recursive-descent
parser that's about a megabyte and a half of C++ code, all in one file.
So for all the complexity, C++ can be handled using approaches that take
advantage of approximately zero advances in parsing since 1965.)

Now Bison is actively developed. There are avenues of improvement in
dimensions other than parsing theory. Firstly, Bison does have some
theoretical muscle in it, in the form of GLR parsing for ambiguous
grammars.

There are other pragmatic improvements in it like support for push
parsers, and re-entrant parsers.

It has support for multiple programming languages. Its C++ support goes
as far as supporting AST nodes that are objects which can have
destructors.

Fairly recently, support was added to Bison for multiple start symbols
in the same grammar, so you can parse just a subset of a language.

Another thing I believe I saw on the mailing list recently is that
contributions were merged into Bison which allow it to generate
counterexamples for ambiguities, so that it's easier to debug bad
grammars. Previous implementations of Yacc-like parsers just give you
diagnostics like "state 123: reduce-reduce conflict" whose root cause
can be hard to unravel.
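
For reference, these are roughly the knobs involved; I am quoting the
directive names from Bison's manual as best I remember them, and they
are independent options rather than a combination to use all at once:

    %glr-parser                  /* GLR parsing for ambiguous grammars        */
    %define api.pure full        /* reentrant parser: no global yylval/yylloc */
    %define api.push-pull push   /* push parser: the caller feeds tokens in   */

and recent versions can be run with -Wcounterexamples to print example
inputs for each reported conflict (check your version's documentation).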

> [Having been there in the 1970s I can report that the linguists and the computer
> scientists were dimly aware of each other but didn't work together and used
> different terms, e.g., linguists say phrase structure grammars where we would
> say CFG.
> Flex is a rewrite of lex, which was a Bell Labs summer project by Eric
> Schmidt to make it easier to turn regular expressions into DFAs (not
> NFAs). The code was awful; flex was a cleaner rewrite that avoided the
> AT&T software license. Bison was originally a GPL rewrite of yacc but it
> has since added reentrant parsers and GLR parsing for ambiguous grammars.
> They knew about GLR in the 1970s, but Yacc had to run on a 16 bit PDP-11
> and the data structures were too big. -John]

Bison was originally written by the same person, Robert Corbett, who
went on to write Berkeley Yacc.

https://en.wikipedia.org/wiki/Berkeley_Yacc

Berkeley Yacc is far less actively maintained than Bison and has fewer
features. It does support reentrant parsing in a way that is roughly
compatible with Bison 2.5.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[No coincidence that it was the same guy, since bison was a fork of
byacc.
The lookahead example of ++ vs += isn't very good since an LALR
parser only needs to look ahead when it reduces, but in general you're
right, it'd be a problem. A much worse problem in combined parsers is
getting rid of noise like comments and white space. In a separate lexer
you can easily discard them between tokens, in a combined lex/parse you
have to include them everywhere they can occur. -John]

Christopher F Clark

Sep 18, 2021, 1:13:13 PM
Roger L Costello <cost...@mitre.org> asked:

> That said, Flex & Bison is old. Has lexing/parsing theory advanced since the
> 1970’s? If yes, are there parser generators available today which are based on
> those advances in lexing/parsing theory? Or does Flex & Bison still represent
> the state-of-the-art in terms of the underlying theory it uses?

Yes, compiler theory has advanced since the 1970s.

Multiple token lookahead is now supported by a variety of compiler
generators.

And, many modern parser generators allow regular expressions on the
right-hand side of rules. Conversely, you see lexer generators that
accept context-free (recursive) rules to define tokens (while still
allowing regular expressions).
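
For example (the exact notation varies from tool to tool), a rule with a
regular expression on its right-hand side such as

    argument_list : expression ( ',' expression )* ;

stands in for the recursive pair you would write for yacc or bison:

    argument_list : expression
                  | argument_list ',' expression
                  ;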

One of the key ways you often see multiple token lookahead implemented
is in the form of predicated grammars. These work as follows: if you
see this, try applying this rule first, and if it doesn't match, try
the other rules. Often the lookahead is the complete rule, i.e. try the
rule and use it if it matches, which amounts to a form of backtracking
parsing. This has evolved into "Parsing Expression Grammars" (PEGs),
where you list the rules in the order you want them to match. And,
"packrat parsing" has turned this into a practical method with [near?]
linear performance. Thus, in most cases, you aren't paying an
N-squared penalty for backtracking.
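
As a compressed illustration of the packrat idea (a toy of my own, not
the output of any PEG tool): each rule becomes a function that memoizes
its result per input position, so the backtracking implied by ordered
choice never re-parses the same span.

    /* Toy PEG:   expr <- term '+' expr / term
                  term <- '(' expr ')' / 'x'
       Each parse function returns the index just past its match, or -1. */
    #include <stdio.h>

    #define N 128
    static const char *in;                     /* the input string     */
    static int memo_expr[N], memo_term[N];     /* -2 means "not tried" */

    static int expr(int p);

    static int term(int p)
    {
        if (memo_term[p] != -2) return memo_term[p];   /* packrat hit */
        int r = -1;
        if (in[p] == '(') {                            /* 1st choice  */
            int e = expr(p + 1);
            if (e >= 0 && in[e] == ')') r = e + 1;
        }
        if (r < 0 && in[p] == 'x') r = p + 1;          /* 2nd choice  */
        return memo_term[p] = r;
    }

    static int expr(int p)
    {
        if (memo_expr[p] != -2) return memo_expr[p];
        int r = -1, t = term(p);
        if (t >= 0 && in[t] == '+') r = expr(t + 1);   /* 1st choice  */
        if (r < 0) r = t;                              /* ordered fallback */
        return memo_expr[p] = r;
    }

    int main(void)
    {
        in = "(x+x)+x";
        for (int i = 0; i < N; i++) memo_expr[i] = memo_term[i] = -2;
        int end = expr(0);
        puts(end >= 0 && in[end] == '\0' ? "match" : "no match");
        return 0;
    }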

You see a particularly interesting evolution of that in ANTLR4, where
you can apply precedence and associativity rules to grammars with
left-recursion and still in most cases get an LL-like parser out as
the generated result. However, it interprets most rules in a PEG-like
fashion, giving them an implicit precedence. The result is grammars
that are relatively easy to read, and mostly do the obvious correct
thing.

The GLR approach of making a parsing forest is an alternative to that.
It won't be long before someone combines the two approaches: you can
specify the order in which to try productions, but the generator will
recognize cases where two or more apply and keep enough information
around to allow the choice to be made at reduction time, letting you
say, in the case of ambiguity, "I want this alternative to apply", or
make the choice "both" and let the semantics disambiguate it.

Beyond that, there are extensions to what language class the parser
generators allow.

To my mind, the most notable one is adaptive grammars. These are
extensions that allow one to add to or modify the rules the grammar
supports while the parser is running. The most obvious example is
adding new "types" or "keywords" to the grammar. This is a more
sophisticated solution to the typedef problem, and it can result in
grammars where properly scoped type analysis is all done as part of
parsing.
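
For comparison, the classic non-adaptive workaround is lexer feedback:
the scanner consults a symbol table that the parser's actions fill in,
and classifies the same spelling as either an identifier or a type name.
A rough flex-style sketch, with the helpers invented for illustration:

    [A-Za-z_][A-Za-z0-9_]*   {
            yylval.sym = lookup_symbol(yytext);   /* hypothetical helper */
            return is_typedef_name(yylval.sym) ? TYPE_NAME : IDENTIFIER;
        }

An adaptive grammar moves that decision into the grammar itself, with
proper scoping, instead of into a side channel between parser and lexer.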

Another variant is conjunctive grammars. These allow you to take the
intersection of two rules and match only if both match. This lets one
make finer-grained decisions about what the legal sentences are. Their
effect is similar to (but often more general than) that of predicated
grammars, and similar techniques can be used to implement them.
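
The standard example (written out from memory, after Okhotin's work on
conjunctive grammars) is the non-context-free language a^n b^n c^n,
obtained by intersecting two context-free requirements with '&':

    S -> A B & D C        (A B gives a^i b^n c^n,  D C gives a^n b^n c^j)
    A -> a A | epsilon    B -> b B c | epsilon
    D -> a D b | epsilon  C -> c C | epsilon

Only strings satisfying both conjuncts are derived, which is exactly
a^n b^n c^n.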

Finally, the regular expression model is amenable to extension with
the PCRE operators, in particular captures and back-references, but
also assertions (which are essentially predicates) and greedy versus
non-greedy matching. I have not yet seen a parser generator that
implements that, but it is actually a fairly small tweak to LR
matching: captures are simply a special form of non-terminal where all
occurrences (within a scope) need to match the same "string".
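
A back-reference is precisely that "same string" constraint in
miniature. A tiny check with POSIX basic regular expressions, which are
specified to support \1 (verify against your libc; this is a sketch):

    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        regex_t re;
        /* \(x*\)y\1 : whatever run of x's was captured must repeat after y */
        regcomp(&re, "^\\(x*\\)y\\1$", 0);                   /* 0 = basic REs */
        printf("%d\n", regexec(&re, "xxyxx", 0, NULL, 0) == 0);  /* 1: match  */
        printf("%d\n", regexec(&re, "xxyx",  0, NULL, 0) == 0);  /* 0: no     */
        regfree(&re);
        return 0;
    }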

---------

However, beyond all of those tweaks to the lexing and parsing engines
and generators, there is a bigger change. Parser generator writers
have realized that simply getting a correct parse is only a small
fragment of building a front-end.

In particular, one doesn't simply want a raw parse tree. One wants an
AST that discards irrelevant information and, more particularly,
highlights relevant information and makes transformations and
traversals easier.

There have been several advancements along that line.

Steve Johnson (the original yacc author) wrote a paper "Yacc meets
C++" which shows how to turn an AST into C++ classes with relevant
methods. The result is a very object-oriented view of what an AST is.

Alternately, ANTLR incorporates the Visitor and Listener design
patterns into the parse trees it builds. This makes semantic
traversals much easier.

There are also tools like sourcer which treat parse trees as
s-expressions and work something like hygienic macros on them,
allowing one to write transformation rules to reshape the tree as
desired.

Even more along those lines are the "pattern matching" (destructuring)
operators found in many functional languages. Those also allow one to
transform trees at the semantic level.

Finally, much of this has gone to formalizing more standardized IRs.
Much of the parsing work is now simply taking the input and reshaping
it to match an IR used by a sophisticated backend.

--
*****************************************************************************
Chris Clark email:
christoph...@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris

Ev Drikos

Sep 29, 2021, 5:10:47 PM
On Thursday, September 16, 2021 at 7:56:25 PM UTC+3, Roger L Costello wrote:
>
> That said, Flex & Bison is old. Has lexing/parsing theory advanced since the
> 1970’s? If yes, are there parser generators available today which are based on
> those advances in lexing/parsing theory? Or does Flex & Bison still represent
> the state-of-the-art in terms of the underlying theory it uses?

Hello,

The routines that recognize tokens are still called scanners and those
that parse the input are still called parsers. That said, this long
list may give a clue as to what the answer to your last question might be:
https://en.wikipedia.org/wiki/Comparison_of_parser_generators

Yet, I haven't used most of them. So, I'll give you an example with
Syntaxis, a tool I've coded that isn't included in the above list.

If we try to parse this erroneous Fortran line with the command 'fcheck'
(binary available at https://github.com/drikosev/Fortran) we see that
the expected tokens in the error message contain both a '=' and a name:

program ? ; end

Note that the command 'fcheck' uses a deterministic parser (built by
Syntaxis) and the expected tokens in an error message are pre-computed.

To my knowledge, the ability of a parser to shift two distinct
terminals simultaneously in one transition isn't an advance in theory,
but I guess several tools mentioned in the Wikipedia list above may
provide similar or better goodies (e.g. Chris Clark described an ANTLR4
feature that advances theory directly).

Regards,
Ev. Drikos