
Object-oriented parsers


Steve Boswell

Aug 29, 1991, 1:00:38 AM

This is being posted to comp.compilers (because it deals with parsing),
comp.object (because it deals with OOP), and comp.lang.eiffel (because it
shows a real practical use for a language extension I suggested).

From the articles I've seen, it doesn't look like anyone really has a good
idea of what an object-oriented parser would be like. Let me suggest this
line of thinking.

Lexical/syntactic/semantic style parsing, while affording division of labor,
is really only suited to procedural programming. bison++ (and yacc++ too, I
assume) is not really object-oriented, since it uses a class for data hiding
and encapsulation only, rather than the more important dynamic binding
properties. A completely different strategy is needed. Here is an example
of a lexical class under my paradigm.

                IDENTIFIER

LOCAL_VARIABLE   CLASS_ATTRIBUTE
        \           /
         \         /
          V       V
READ_ONLY  WRITABLE      FUNCTION  PROCEDURE
     \        /              |        /
      \      /               |       /
       V    V                V      V
       VALUE                OPERATION
           \                  /
            \                /
             V              V
            IDENTIFIER_BASE

(VALUE could inherit from CONSTANT_BASE, and READ_ONLY could inherit from
EXPRESSION_BASE also, just to give you an idea of what a final inheritance
network might look like.) Thus, once an identifier is lexed, the symbol
table can be consulted and the proper descendant of IDENTIFIER_BASE can be
created and attached to IDENTIFIER (or returned by IDENTIFIER if IDENTIFIER is
a pure parsing class.) IDENTIFIER is allowed to know about all the
descendants of IDENTIFIER_BASE. Then dynamic binding on the resulting types
can be used to weed through the possibilities in whatever rule is being
parsed. Here is a (much simpler) example of a rule that would do this.

            STATEMENT

ASSIGNMENT   IF   SWITCH   PROCEDURE_CALL
      \       \     |        /
       \       \    |       /
        \       \   |      /
         V      V   V     V
          STATEMENT_BASE

The parsing method of STATEMENT would have to tell whether an initial
IDENTIFIER starts an ASSIGNMENT or PROCEDURE_CALL. If it's a WRITABLE, then
ASSIGNMENT is parsed; and if it's a PROCEDURE, then PROCEDURE_CALL is parsed.
In CLOS (which looks like the only suitable language for this) it'd be
something like this. Assume that get-next-token is a function that returns
the next token and terminal is the base class of all terminals (lex tokens)
in the language.

;; All methods of one CLOS generic must take the same number of
;; arguments, so the token-fetching driver gets its own name:
(defmethod parse-next ((this statement))
  (parse this (get-next-token)))

(defmethod parse ((this statement) (target writable))
  ...parse an assignment statement)

(defmethod parse ((this statement) (name procedure))
  ...parse a procedure call)

(defmethod parse ((this statement) (token if_token))
  ...parse an if statement)

(defmethod parse ((this statement) (token switch_token))
  ...parse a switch statement)

(defmethod parse ((this statement) (token terminal))
  ...signal a parse error)

So by dynamic binding, the correct parsing method would be called.
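For readers without CLOS at hand, the whole scheme can be sketched in Python (all class and function names here are invented for illustration): the lexer consults the symbol table to build the proper IDENTIFIER_BASE descendant, and `functools.singledispatch` then plays roughly the role of CLOS method selection on the token's class.

```python
from functools import singledispatch

# Hypothetical token hierarchy mirroring the diagrams above.
class Terminal: pass
class IdentifierBase(Terminal): pass
class Writable(IdentifierBase): pass
class Procedure(IdentifierBase): pass
class IfToken(Terminal): pass

def lex_identifier(name, symbol_table):
    """Consult the symbol table and return the proper descendant."""
    return symbol_table.get(name, IdentifierBase)()

@singledispatch
def parse_statement(token):
    raise SyntaxError("parse error at %r" % token)   # base case

@parse_statement.register(Writable)
def _(token):
    return "assignment"       # ...parse an assignment statement

@parse_statement.register(Procedure)
def _(token):
    return "procedure-call"   # ...parse a procedure call

@parse_statement.register(IfToken)
def _(token):
    return "if-statement"     # ...parse an if statement

# Dynamic binding picks the right method from the token's class.
symtab = {"x": Writable, "print_it": Procedure}
assert parse_statement(lex_identifier("x", symtab)) == "assignment"
```

Single dispatch on the token argument is enough for this sketch; CLOS multimethods would also let the statement class participate in the selection.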

For those of you on comp.lang.eiffel, this sort of structure could be faked
together by putting all the parse methods in STATEMENT_BASE and having the
user program explicitly calculate the dynamic binding, but then why bother to
design your parser like this at all? The language extension I suggested
would lend itself to this quickly.

Discussion welcome.
--
Steve Boswell
wha...@ucsd.edu
wha...@gnu.ai.mit.edu
--
Send compilers articles to comp...@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.

Kevin W Wall

Aug 29, 1991, 6:32:14 PM
In article <91-0...@comp.compilers> wha...@gnu.ai.mit.edu (Steve Boswell) writes:
>Lexical/syntactic/semantic style parsing, while affording division of labor,
>is really only suited to procedural programming.

Yes, but I'm not sure I see how your approach still provides us with that
very important division of labor. (See below for details.)

...Text + example of class hierarchy deleted...

>Thus, once an identifier is lexed, the symbol table can be consulted and
>the proper descendent of IDENTIFER_BASE can be created and attached to
>IDENTIFER (or returned by IDENTIFIER if IDENTIFIER is a pure parsing class.)
>IDENTIFIER is allowed to know about all the descendents of IDENTIFIER_BASE.
>Then dynamic binding on the resulting types can be used to weed through the
>possibilities in whatever rule is being parsed.

...Simpler example of class hierarchy deleted...

>The parsing method of STATEMENT would have to tell whether an initial
>IDENTIFIER starts an ASSIGNMENT or PROCEDURE_CALL.

...CLOS code fragment deleted...

>So by dynamic binding, the correct parsing method would be called.

I may be missing something here, but what I think you are saying is that
the language's grammar for which you are building a parser is implemented
via the class inheritance lattice!?! Maybe I'm just old fashioned but I
think parser generators based on (extended) BNF notations (e.g., bison,
yacc++) are a bit easier to change if all you want to do is tweak the
grammar a bit. I see where this approach might be acceptable if you are
designing a compiler for a language whose definition (i.e., minimally
grammar) is already set in concrete (e.g., C, Ada, etc.) but I don't think
that I would want to use this approach to design a language on-the-fly.
Perhaps it's just my ingrained notion to think of languages in terms of
(BNF) grammars rather than in terms of inheritance. If I did choose to
use your design approach, I certainly would NOT want to do it with a
"compiled" OOPL such as C++. It would take me [personally] all day to get
the inheritance of my classes correct. (Either that or I'd have to have
one heck of an incremental compiler.) I think that you definitely would
need an OOPL that was well suited to prototyping to be able to construct
an "as-yet-undefined" language in this manner. (No doubt one reason you
thought CLOS to be suitable.)

Another thing which I don't understand is how you would do error recovery
using a class inheritance model. (I.e., how does your parser recover when
it sees a syntax error?)

Note: The following comments are NOT aimed at anyone in particular, especially
Steve. They are simply my own $.02 worth.

While I find all this discussion on OO approaches to parsing interesting,
for *most* (i.e., read "[close to] LALR grammars") languages, I think that
the typical approach of separate lexers, syntactic analyzers, semantic
analysis, etc. is sufficient. I think we shouldn't re-invent the wheel.
I.e., use what works. In addition, compiler technology (w/ the exception
of error recovery and code optimization) is a fairly well understood
application. It is taught to all CS (under?)grad students. I think while
this might make for a nice research area, it is of little practical value.
In particular, is there anything that an OO approach can do that the more
traditional approaches CAN'T solve, or even solve just as easily? (Do we
write a FORTH interpreter by changing the inheritance lattice on the fly?
:-)

I for one, find this a case of simply trying to shoe-horn a problem into
an OOD. I always say (sometimes) "use the paradigm that fits the problem;
don't try to fit the problem to the paradigm".

--
In person: Kevin W. Wall AT&T Bell Laboratories
Usenet/UUCP: {att!}cblph!kww 6200 E. Broad St.
Internet: k...@cblph.att.com Columbus, Oh. 43213

Steve Boswell

Aug 30, 1991, 1:28:21 AM

In article <91-0...@comp.compilers> k...@cblph.att.com (Kevin W Wall) writes:
>I may be missing something here, but what I think you are saying is that
>the language's grammar for which you are building a parser is implemented
>via the class inheritance lattice!?!

Mostly, yes. It keeps the grammar implementation abstract and it's
very extendible (which is one of the main benefits of OOP anyways).

>Maybe I'm just old fashioned but I think parser generators based on
>(extended) BNF notations (e.g., bison, yacc++) are a bit easier to change
>if all you want to do is tweak the grammar a bit.

I disagree. Shift/reduce and reduce/reduce errors are a pain in the
arse, especially when the semantics of the language could resolve the
ambiguity. Tweaking the grammar in big ways (e.g. adding a new type
of OPERATION called ONCE for Eiffel) is comparatively easy.

>I think that you definitely would need an OOPL that was well suited to
>prototyping to be able to construct an "as-yet-undefined" language in
>this manner. (No doubt one reason you thought CLOS to be suitable.)

Actually, the only reason I suggested CLOS was that it allowed dynamic
binding on its arguments and not just the object. (I know C++ does
too, but I personally dislike C++.) This approach would also lend
itself to languages being extended/redefined, which languages tend to
do now and then.

>Another thing which I don't understand is how you would do error recovery
>using a class inheritance model. (I.e., how does your parser recover when
>it sees a syntax error?)

I haven't worked that out yet (or that far ahead yet) but I would
guess that being able to do it in a programmed way would be better
than bison's "error" token (which, as I recall, if you place it in the
wrong place, it will cause s/r and/or r/r errors.)

>In particular, is there anything that an OO approach can do that the more
>traditional approaches CAN'T solve, or even solve just as easily? (Do we
>write a FORTH interpreter by changing the inheritance lattice on the fly? :-)

Yes -- the same reasons you use OOP over structural methods apply here
-- extensibility, reusability, dynamic binding. You CAN do it either
way, but which is easier for the programmer?

I also wasn't suggesting changing the inheritance lattice "on the fly" (do
you mean as the compiler is running?) That's highly intriguing though... :-)

ad...@visix.com

Aug 30, 1991, 1:34:04 PM
In article <91-0...@comp.compilers>, k...@cblph.att.com (Kevin W Wall) writes:
|> ... I think while
|> this might make for a nice research area, it is of little practical value.
|> In particular, is there anything that an OO approach can do that the more
|> traditional approaches CAN'T solve, or even solve just as easily?

I agree that it is impractical to fight existing industry trends. But I
believe that there are real technical limitations to the traditional
approaches. I would say that this is research that will stay on the fringe
for at least five years, but it could change the entire field in ten.

|> I for one, find this a case of simply trying to shoe-horn a problem into
|> an OOD. I always say (sometimes) "use the paradigm that fits the problem;
|> don't try to fit the problem to the paradigm".

Sorry, but I'll have to turn the accusation around. Simply because yacc and
lex are so well understood, most language designers assume that they are the
"best" paradigm to use.

Our view of a problem is mostly determined by the paradigms we understand.
I see this discussion as an attempt to get away from the traditional
Dragon Book paradigm; I don't necessarily think the Object-Oriented paradigm
will be better, but I welcome the attempt to try something new.

Adam

Chris Clark

Sep 2, 1991, 9:36:43 AM
Steve Boswell made a claim that the object-oriented parser generators
Yacc++ and the Language Objects Library or Bison++ are not very OOP
because they don't support a particular model of grammar development. I
think of the method he describes as object-oriented recursive descent
parsing. It is a very natural method of merging oop with parsing, and was
an initial design approach we discussed for Yacc++. However, what Steve
asks for can be easily done with (even non-oop) parser generators. Yacc++
has oop features which address other problems. I further argue that oop
recursive-descent parsing is equivalent to [E]LL(k) methods unless 1) the
host language supports non-deterministic or lazy-parallel procedure calls
or something equivalent or 2) the parser writers use an arbitrary
look-ahead technique in their parsing methods.

Kevin Wall (k...@cblph.att.com) wrote text marked with >
Steve Boswell (wha...@gnu.ai.mit.edu) wrote text marked with >>
Adam (ad...@visix.com) wrote text marked with >>>
My notes refer to sections marked with [%d]

>> Lexical/syntactic/semantic style parsing, while affording division of
>> labor, is really only suited to procedural programming.

. . .


>>                 IDENTIFIER
>>
>> LOCAL_VARIABLE   CLASS_ATTRIBUTE
>>         \           /
>>          \         /
>>           V       V
>> READ_ONLY  WRITABLE      FUNCTION  PROCEDURE
>>      \        /              |        /
>>       \      /               |       /
>>        V    V                V      V
>>        VALUE                OPERATION
>>            \                  /
>>             \                /
>>              V              V
>>             IDENTIFIER_BASE  [1]


>>
>> Thus, once an identifier is lexed, the symbol table can be consulted and
>> the proper descendant of IDENTIFIER_BASE can be created and attached to
>> IDENTIFIER (or returned by IDENTIFIER if IDENTIFIER is a pure parsing
>> class.) IDENTIFIER is allowed to know about all the descendants of
>> IDENTIFIER_BASE. [2]
>>
>> Then dynamic binding on the resulting types can be used to weed through the
>> possibilities in whatever rule is being parsed. [3]
. . .


>> The parsing method of STATEMENT would have to tell whether an initial
>> IDENTIFIER starts an ASSIGNMENT or PROCEDURE_CALL. [4]
. . .
>> So by dynamic binding, the correct parsing method would be called. [5]

> I may be missing something here, but what I think you are saying is that
> the language's grammar for which you are building a parser is implemented
> via the class inheritance lattice!?! . . . [6]

>> Mostly, yes. It keeps the grammar implementation abstract and it's
>> very extendible (which is one of the main benefits of OOP anyways). [7]

> Maybe I'm just old fashioned but I think parser generators based on
> (extended) BNF notations (e.g., bison, yacc++) are a bit easier to change
> if all you want to do is tweak the grammar a bit.

>> I disagree. Shift/reduce and reduce/reduce errors are a pain in the
>> arse, especially when the semantics of the language could resolve the
>> ambiguity. [8]
>>
>> Tweaking the grammar in big ways (e.g. adding a new type
>> of OPERATION called ONCE for Eiffel) is comparatively easy. [9]

> I think that you definitely would need an OOPL that was well suited to
> prototyping to be able to construct an "as-yet-undefined" language in
> this manner. (No doubt one reason you thought CLOS to be suitable.)

>> Actually, the only reason I suggested CLOS was that it allowed dynamic
>> binding on its arguments and not just the object. (I know C++ does
>> too, but I personally dislike C++.) This approach would also lend
>> itself to languages being extended/redefined, which languages tend to
>> do now and then. [10]

> I for one, find this a case of simply trying to shoe-horn a problem into
> an OOD. I always say (sometimes) "use the paradigm that fits the problem;
> don't try to fit the problem to the paradigm".

>>> Sorry, but I'll have to turn the accusation around. Simply because yacc
>>> and lex are so well understood, most language designers assume that they
>>> are the "best" paradigm to use. [11]

What Steve Boswell is arguing for [1-7] is what I think of as OOP
recursive-descent parsing. The model is basically this:

Create a class for each terminal and non-terminal.
Create methods for each non-terminal which correspond to each
production that defines the non-terminal.
Use the inheritance hierarchy to provide generalizations of
the non-terminals and terminals.

The basic thrust here is that productions are mapped onto procedures
(methods) which parse the corresponding sentences. This is the
recursive-descent model.

How does one do this with a parser generator? One performs the mapping in
reverse. In particular, one can determine how to support the dynamic
binding Steve desires with the features provided in any parser generator
including yacc. Just define the non-terminals and terminals from the
classes, define the productions that correspond to the parser methods, and
define the inheritance hierarchy using productions. Here is how to do
Steve's example:

First, create a lexer which returns leaf types for the desired inheritance
hierarchy based upon some form of semantic analysis. With Yacc++ and the
Language Objects Library, this can be done with the following semantic
action code in the lexer.

IDENTIFIER : ('a'..'z')* ;
    { my_sym_ptr =
          yy_lookup(yy_lex_token(), yy_lex_len(), IDENTIFIER_);
      yy_lex_rdc() = yy_sym_type_this(my_sym_ptr);
    }

This code says to find the identifier object in the symbol table and cause
the lexer to return the type associated with that object. This is what
Steve asked for in [2]. This precisely matches the runtime lookup of an
object type within a class hierarchy. Different semantic models can be
implemented by changing how the return type is selected.

Second, define the inheritance hierarchy [1,6] as a series of productions
in the parser. The way to read these productions is the class on the LHS
is a base class for the derived classes on the RHS. In this model
multiple inheritance is easy, just list the derived class in the RHS for
each of its parent classes. For Steve's example [1] it would be:

WRITABLE        : LOCAL_VARIABLE | CLASS_ATTRIBUTE;
VALUE           : READ_ONLY | WRITABLE;
OPERATION       : FUNCTION | PROCEDURE;
IDENTIFIER_BASE : VALUE | OPERATION;

The reason this does exactly what Steve wants is certainly worth further
discussion, but not in this posting. It can provide semantic
disambiguation as he requests in [8].
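One way to see why the unit productions above mimic inheritance: each just says the LHS accepts any token that any of its RHS alternatives accepts. A small Python sketch (hypothetical, not Yacc++ output) can compute the set of leaf token types each base "class" covers:

```python
# Unit productions from the posting, written as a dict: base -> alternatives.
productions = {
    "WRITABLE":        ["LOCAL_VARIABLE", "CLASS_ATTRIBUTE"],
    "VALUE":           ["READ_ONLY", "WRITABLE"],
    "OPERATION":       ["FUNCTION", "PROCEDURE"],
    "IDENTIFIER_BASE": ["VALUE", "OPERATION"],
}

def leaf_tokens(symbol):
    """Expand unit productions down to leaf lexer token types."""
    if symbol not in productions:
        return {symbol}          # a leaf: an actual token type
    leaves = set()
    for alt in productions[symbol]:
        leaves |= leaf_tokens(alt)
    return leaves

# VALUE covers exactly the leaves below it in Steve's diagram.
assert leaf_tokens("VALUE") == {"READ_ONLY", "LOCAL_VARIABLE", "CLASS_ATTRIBUTE"}
```

This closure over unit productions is essentially what the parser generator computes when it decides which token types are legal at each parse state.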

Third, define the productions using the appropriate members of the
hierarchy. This is what Steve requires in [4].

PROCEDURE_CALL : PROCEDURE "(" VALUE ("," VALUE)* ")" ";" ;

When this grammar definition is run through a parser generator, the
generator determines which token types are legal at each stage of the
parse. The parser generator takes care of [3] and [5].

Steve argues in [7,9] that his model makes extension easier. Is it easier
to add a new class to an inheritance hierarchy than it is to add a
production to a grammar? He then brings up shift-reduce and reduce-reduce
conflicts and semantic disambiguation [8]. Here lies the crux of his easy
extension argument.

If the host language supports non-determinism, lazy-parallel evaluation
of procedures, unification, dynamic programming, or something similar,
then there may be an advantage to oop recursive-descent parsing.
Otherwise, oop recursive-descent parsing is essentially equivalent to
using an [E]LL(k) parser generator.

How do I justify that claim???

At runtime the code must call a parsing method which will parse the
sentence. In his example [4], he bases the parsing of the sentence upon
its first token (and presumably the current context). If all tokens are
read left-to-right (i.e. you always get tokens via get-next-token), this
is exactly equivalent to [E]LL(1) parsing. If you allow yourself to read
k-tokens of the sentence before determining which production to apply, you
have [E]LL(k) parsing. If you allow yourself even more freedom, such as
reading to the end of the token stream and using a Turing machine to
decide which production to apply, you can get a much larger grammar class,
but that is essentially defining your way out of the problem by using a
more powerful method to begin with. [Since in most lisp dialects you can
construct calls with an arbitrary number of arguments at runtime, I
presume you can make CLOS (or whatever your implementation choice is) do
your arbitrary dispatch. Personally, I find it hard to imagine how you
would describe the parameter bindings for such variadic methods. I would
probably use regular-expressions but that shows my bias :-). But I must
admit, I'm a C++ person and do not know any of the oop extensions to
lisp.] Also, if you start depending on scanning ahead too much, you may
find it difficult explaining exactly what language your system supports.
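The equivalence claim can be made concrete with a toy sketch (token stream and names invented): dispatching on the first token is exactly the LL(1) decision, and Steve's ASSIGNMENT vs. PROCEDURE_CALL ambiguity, absent a symbol table, needs one more token, i.e. LL(2).

```python
def parse_statement(tokens):
    """Pick which production applies from at most two tokens of
    lookahead, [E]LL(k)-style with k = 2."""
    first = tokens[0]
    if first == "IF":
        return "if-statement"
    if first == "SWITCH":
        return "switch-statement"
    if first == "IDENTIFIER":
        # Without a symbol table, one more token settles the
        # ASSIGNMENT vs. PROCEDURE_CALL question: that is LL(2).
        second = tokens[1] if len(tokens) > 1 else None
        if second == ":=":
            return "assignment"
        if second == "(":
            return "procedure-call"
    raise SyntaxError("no production matches %r" % tokens[:2])

assert parse_statement(["IDENTIFIER", ":=", "VALUE"]) == "assignment"
```

Replacing the lookahead conditionals with dynamically bound methods changes the packaging, not the decision procedure.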

Unfortunately, aside from the arbitrary look-ahead via the runtime
variadic call case, Steve's dynamic binding has not bought anything,
except perhaps some runtime efficiency and cleanliness of code.
Recursive-descent compilers based upon switching on the first k-tokens are
a well-known technology. Some compiler writers like them. It is
certainly easy to hack them.

How does an [E][LA]LR(k) parser generator differ? When the generator runs,
it simulates calling the parsing methods in parallel. As long as there is
a token (within k tokens after the end of the production) which allows the
parser generator to make a deterministic decision on which parsing method
should have been called, the parser generator produces a conflict free
parser. Thus, parser generators effectively support a deterministic form
of parallelism and the class of languages they support is larger than the
LL-class. The class of languages supported by LR(k) parser generators is
generally large enough for most syntax descriptions, particularly if you
allow the lexer the privilege of inspecting the symbol table before
determining the type of an object.

The "dreaded" conflicts come from grammars which are ambiguous. Some
methods for handling conflicts are well-known. The precedence directives
of most parser generators do a good job of handling most shift-reduce
conflicts. Kjell's recent proposal improves that for many dynamic cases.
Using regular-expression (ELR) grammars can help eliminate any
reduce-reduce conflicts. I suggest that you will find it is easier to
eliminate conflicts than to rewrite a grammar to LL(1) form, which you
need to do to use simple recursive-descent. Reason: all LL(1) grammars
are LR(1) grammars. [Note, Yacc++ fixes a problem found in most yacc
implementations of mid-production action code where yacc introduces
artificial reductions which can cause spurious conflicts. Yacc++ executes
mid-production action code as part of the shift and this reduces the
number of conflicts.]

I think the point of an abstract representation [7] is actually better
served by a non-procedural representation of the language to be parsed.
That is exactly what a grammar is. A grammar does not specify the method
used to parse it. If the grammar is LL(1), an LL parser, an LR parser, a
Turing machine, or some other mechanism could be used to parse it. A
parser generator is merely a way to convert the non-procedural
representation into a procedural representation. If the parser is written
as a series of procedures in a programming language, it is much harder to
switch to using a different parsing method. This is also my answer to
[11]. Grammars are the best solution because they are non-procedural and
capture exactly what we want to define. We usually find a well written
ELR grammar to be one of the most concise, precise, and readable
definitions of a language.

The issues of extending a grammar [7,9,10] are certainly well worth
consideration. Here is where we feel oop features should be added to
parser generators. In particular, Yacc++ supports inheritance of
productions from one language to another. This makes it easy to build
experimental extensions of a language by simply deriving the extended
language from the base language and then changing the appropriate
productions in the extended grammar.

We feel that the appropriate use of dynamic binding in an object oriented
lexer/parser system is in allowing lexer or parser definitions to be
switched at runtime, and this is the model that Yacc++ supports. This
allows a system to easily support a family of dialects, such as a base
language and its descendents. It can also facilitate parsing streams
which come from more than one source (as in a distributed application).
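A toy sketch of that dialect-switching idea (all names invented here; Yacc++'s actual mechanism differs): a stream object carries a reference to its dialect's parse tables, and reassigning that reference switches grammars at runtime.

```python
# Each "dialect" is just a keyword set here, standing in for a
# full set of generated lexer/parser tables.
base_dialect     = {"keywords": {"if", "switch"}}
extended_dialect = {"keywords": {"if", "switch", "once"}}  # adds ONCE

class Stream:
    def __init__(self, dialect):
        self.dialect = dialect        # switchable at runtime

    def is_keyword(self, word):
        return word in self.dialect["keywords"]

s = Stream(base_dialect)
assert not s.is_keyword("once")
s.dialect = extended_dialect          # switch grammars mid-run
assert s.is_keyword("once")
```

Two streams from different sources can then be parsed concurrently, each bound to its own dialect.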

Chris Clark email : crackers.clearpoint.com!compres!chris
Compiler Resources, Inc. voice : (508) 435-5016
3 Proctor Street fax : (508) 435-4847
Hopkinton, MA 01748

Markku Sakkinen

Sep 4, 1991, 1:09:06 AM
In article <91-0...@comp.compilers> wha...@gnu.ai.mit.edu (Steve Boswell) writes:
>In article <91-0...@comp.compilers> k...@cblph.att.com (Kevin W Wall) writes:
>> ...
> ...

>>I think that you definitely would need an OOPL that was well suited to
>>prototyping to be able to construct an "as-yet-undefined" language in
>>this manner. (No doubt one reason you thought CLOS to be suitable.)
>
>Actually, the only reason I suggested CLOS was that it allowed dynamic
>binding on its arguments and not just the object. (I know C++ does
>too, but I personally dislike C++.) This approach would also lend
>itself to languages being extended/redefined, which languages tend to
>do now and then.

In reality, dynamic binding of virtual member functions (methods) in C++
only considers the owner object (self, "receiver"), so CLOS seems
to be unique among well-known OOPL's in this respect.
It is probably often confusing that the _static_ (compile-time)
overloading in C++ does take the types of all arguments into account,
but dynamic binding does not.
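Since most OOPLs share this receiver-only dispatch, the usual workaround is double dispatch: a brief sketch in Python (which likewise dispatches only on the receiver; all names invented) that bounces the call off the token to recover dispatch on the second argument.

```python
class Statement:
    def parse(self, token):
        # The language dispatches on 'self' only; dispatch on the
        # token is simulated by a second, bounced call.
        return token.parse_as_statement(self)

class Terminal:
    def parse_as_statement(self, stmt):
        raise SyntaxError("unexpected token")

class Writable(Terminal):
    def parse_as_statement(self, stmt):
        return "assignment"

class Procedure(Terminal):
    def parse_as_statement(self, stmt):
        return "procedure-call"

assert Statement().parse(Writable()) == "assignment"
```

CLOS multimethods make this bounce unnecessary, which is Steve's original motivation for choosing it.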

----------------------------------------------------------------------
"All similarities with real persons and events are purely accidental."
official disclaimer of news agency New China

Markku Sakkinen (sakk...@jytko.jyu.fi)
SAKK...@FINJYU.bitnet (alternative network address)
Department of Computer Science and Information Systems
University of Jyvaskyla (a's with umlauts)
PL 35
SF-40351 Jyvaskyla (umlauts again)
Finland
----------------------------------------------------------------------

Desmond Dsouza

Sep 4, 1991, 12:29:20 PM

Just fyi:

One nice reference to using object-oriented approaches to both parsing
and semantic analysis is :

Incremental Attribute Evaluation with Side-effects
Gorel Hedin
Springer Verlag LNCS # 371

She makes a good case for abstract production specification and
production specialization, demand attributes as a declarative form of
dynamic binding, incremental evaluation, and alludes to (imho)
multi-methods in dealing with inherited demand attributes.

Also, "An Object-Oriented Notation for Attribute Grammars", ECOOP 89.

--
Desmond
--

-------------------------------------------------------------------------------
Desmond D'Souza, MCC CAD Program | ARPA: dso...@mcc.com | Phone: [512] 338-3324
Box 200195, Austin, TX 78720 | UUCP: {uunet,harvard,gatech,pyramid}!cs.utexas.edu!milano!cadillac!dsouza
