Is there a parser generator that produces the equivalent of a
hand-coded recursive descent parser? I'm looking for a generator that
doesn't require an engine and doesn't use external libraries... just
plain old C.
I can watch and debug recursive descent code because I can understand
that. I cant imagine trying to debug a table driven parser or having
to rewrite it in BASIC.
The reason for my request is that my compiler will be written in a
BASIC dialect instead of C. I would generate the parser in ANSI C
then rewrite it in BASIC. I'm using a BASIC compiler to bootstrap my
own. It seems to be a good idea to write the compiler in the language
its going to compileso that's what I'm doing.
Also are there any algorithms for AST building. Everything I've
understand tells me that I really want to build an AST and do code
generation from it versus trying to generate code as I go along. I
thought I understood the process but I'm not there yet.
Thank you,
W.
> I recently got my scanner working for my [first] compiler. My
> compiler is for an existing BASIC language.
Can you tell us which one?
> I am seeing the
> complexity of the language at the parsing level. There are
> approximately 600 keyword or keyword combinations which include
> compound words due to intrinsic functions.
Most BASICs are LL(1) and quite easy to parse, even if they have many
different statements/productions.
> Is there a parser generator that produces the equivalent of a
> hand-coded recursive descent parser? I'm looking for a generator that
> doesn't require an engine and doesn't use external libraries... just
> plain old C.
If your BASIC also can be interpreted, the parser can not be that
complicated.
I've written some compilers and decompilers in BASIC, many years ago,
and found it very convenient to use the special built-in functions of my
various BASICs. That's a bit incompatible with the use of parser
generators, in detail for parser code produced for/in C.
If you want to write your compiler in BASIC, you can use more parser
generators, not only for C. CoCo/R may do what you want, with output of
the recursive descent parser in plain old Pascal. But I think that only
expressions will deserve the assistance of an parser generator, until
you get a feeling for how things can be done, then you can proceed with
handcrafting the remaining parts for statements etc. yourself. You also
can start with a very small subset of the language, and extend it later
to the full language. Then you can play with various approaches to the
generation of code and internal data structures very soon, until you
found a well working model. If you start with the full language instead,
you'll have many places in your code, where every modification of your
model has to be reflected, worth nothing but a giant waste of time.
> I can watch and debug recursive descent code because I can understand
> that. I cant imagine trying to debug a table driven parser or having
> to rewrite it in BASIC.
I can't as well ;-)
>
> The reason for my request is that my compiler will be written in a
> BASIC dialect instead of C. I would generate the parser in ANSI C
> then rewrite it in BASIC. I'm using a BASIC compiler to bootstrap my
> own. It seems to be a good idea to write the compiler in the language
> its going to compileso that's what I'm doing.
IMO that's feasible, with your powerful BASIC.
>
> Also are there any algorithms for AST building. Everything I've
> understand tells me that I really want to build an AST and do code
> generation from it versus trying to generate code as I go along. I
> thought I understood the process but I'm not there yet.
In my recent C compiler project I didn't use any parser generator, and
built all data structures manually in the recursive descent parser. IMO
automatic construction of an AST will only defer the problem of
recognizing what has to be done with the parsed input. Since you want to
implement an recursive descent parser, you can collect all required
information during parsing, perhaps, but not necessarily, in a tree-like
structure, from which you can produce code almost immediately. Remember
that an interpreter also interprets the statements one by one, possibly
after translating (scanning) everything into byte code, while loading a
program.
Provided your BASIC allows for procedures, you'll have to think some
time about a decent framework for procedures and their local variables,
so that you can patch the entry code of procedures easily, after
outputting the code for an procedure body. Or you may use a table,
describing the required amount of local memory and other things (line
numbers for ONERROR etc.), that can be inspected at runtime. Most BASIC
dialects do not really compile very well into machine code, so you may
be better off with a virtual machine and an byte code emulator for the
procedure framework, with an escape to compiled machine code for the
evaluation of expressions etc., which really profit from such a compilation.
Feel free to contact me by e-mail, for a more concrete discussion of
your project.
DoDi
> Is there a parser generator that produces the equivalent of a
> hand-coded recursive descent parser? I'm looking for a generator that
> doesn't require an engine and doesn't use external libraries... just
> plain old C.
CoCo/R leaves it as an option to you to start
reading the generated parser:
What do you mean by keyword combinations?
> Is there a parser generator that produces the equivalent of a
> hand-coded recursive descent parser? I'm looking for a generator
> that doesn't require an engine and doesn't use external libraries...
> just plain old C.
I've seen CoCo/R mentioned... ANTLR also produces recursive descent
(LL(*)) parsers, but converting its output to basic may be about the
least fun thing you could do...
>
> I can watch and debug recursive descent code because I can understand
> that. I cant imagine trying to debug a table driven parser or having
> to rewrite it in BASIC.
You could always write one by hand -- BASIC's grammar isn't overly
complex, and it could be a good learning experience :D
> Also are there any algorithms for AST building. Everything I've
> understand tells me that I really want to build an AST and do code
> generation from it versus trying to generate code as I go along. I
> thought I understood the process but I'm not there yet.
Well that one's up to you... you could make a completely abstract tree:
struct ASTNode {
int isLeaf;
void *leaf;
struct {
int numChildren;
ASTNode *children;
} branch;
}
In this you can just store the whole parse tree, alternatively you just
store the semantics of what is being parsed. In which case basically
all you are doing is building a tree that represents the important bits
of your grammar. eg (from http://en.wikipedia.org/wiki/Tiny_BASIC -- i
really don't know basic :D).
statement ::= PRINT expr-list
IF expression relop expression THEN statement
GOTO expression
INPUT var-list
LET var = expression
GOSUB expression
RETURN
CLEAR
LIST
RUN
END
for which you *might* make a pair of types:
enum statement_type {
print_statement, if_statement, goto_statement, ..., end_statement
};
struct statement_s {
statement_type type;
union {
struct {
expression_list *expressions;
} print_stat;
struct {
relop operator;
expression *lhs, *rhs;
struct statement_s *statement;
} if_stat;
struct {
expression *target;
} goto_statement;
...
};
};
Now when you parse a statement you create a statement struct for the
appropriate branch, and fill in the appropriate bits. But note we
aren't storing syntax info -- all we have in the node is the actual
information we *need* to reconstruct the meaning.
Apologies for any issues in the above codei tend to use languages where
subtyping is an option :D
And other people here can probably explain this somewhat better than i
have :(
--Oliver
I will look into further. I have P.D. Terry's book on compiler
generators where I think CoCo/R is referenced. I'll have to pull that
one out of storage.
Thanks,
W.
FutureBasic - Its a dialect of BASIC for the Macintosh.
> If your BASIC also can be interpreted, the parser can not be that
> complicated.
Maybe because it is my first compiler is just seems complicated.
I don't know how many books I've read on the subject and at the time
they made sense. Once I decided that I would write my own it started
to become a whole lot more challenging than the compilers I've read
about.
> I've written some compilers and decompilers in BASIC, many years ago,
> and found it very convenient to use the special built-in functions of my
> various BASICs. That's a bit incompatible with the use of parser
> generators, in detail for parser code produced for/in C.
>
> If you want to write your compiler in BASIC, you can use more parser
> generators, not only for C. CoCo/R may do what you want, with output of
> the recursive descent parser in plain old Pascal. But I think that only
> expressions will deserve the assistance of an parser generator, until
> you get a feeling for how things can be done, then you can proceed with
> handcrafting the remaining parts for statements etc. yourself. You also
> can start with a very small subset of the language, and extend it later
> to the full language. Then you can play with various approaches to the
> generation of code and internal data structures very soon, until you
> found a well working model. If you start with the full language instead,
> you'll have many places in your code, where every modification of your
> model has to be reflected, worth nothing but a giant waste of time.
This is the approach I am taking. The more I sketched how I want want
to accomplish what I want to do, the more problems I found because the
language appears to be really complex at the parsing level. The user
gets an easy to use language while I pull out small chunks of hair
thinking of how to do all the work under the hood.
> > Also are there any algorithms for AST building. Everything I've
> > understand tells me that I really want to build an AST and do code
> > generation from it versus trying to generate code as I go along. I
> > thought I understood the process but I'm not there yet.
>
> In my recent C compiler project I didn't use any parser generator, and
> built all data structures manually in the recursive descent parser. IMO
> automatic construction of an AST will only defer the problem of
> recognizing what has to be done with the parsed input. Since you want to
> implement an recursive descent parser, you can collect all required
> information during parsing, perhaps, but not necessarily, in a tree-like
> structure, from which you can produce code almost immediately. Remember
> that an interpreter also interprets the statements one by one, possibly
> after translating (scanning) everything into byte code, while loading a
> program.
To add more information, it will be a cross-compiler PPC and Intel on
Mac. I plan to keep the front end separated from the back-end with an
AST. Keeping the front and back separated should allow for
easier cross compiling
easier inclusion of a built in debugger
register usage analysis
code optimization
> Provided your BASIC allows for procedures, you'll have to think some
> time about a decent framework for procedures and their local variables,
> so that you can patch the entry code of procedures easily, after
> outputting the code for an procedure body. Or you may use a table,
> describing the required amount of local memory and other things (line
> numbers for ONERROR etc.), that can be inspected at runtime. Most BASIC
> dialects do not really compile very well into machine code, so you may
> be better off with a virtual machine and an byte code emulator for the
> procedure framework, with an escape to compiled machine code for the
> evaluation of expressions etc., which really profit from such a compilation.
The language is a fully procedual language; local functions, local &
global variables
> Feel free to contact me by e-mail, for a more concrete discussion of
> your project.
>
> DoDi
I've never done anything like this before. I've been gleaning
from a number of books on compilers and finally felt I might
be ready to take the plunge into actually writing one.
Theory is one thing, actually writing one is a something
completely different.
Thank you for the life preserver. Trust me, I will take it.
W.
> I recently got my scanner working for my [first] compiler. My
> compiler is for an existing BASIC language. I am seeing the
> complexity of the language at the parsing level. There are
> approximately 600 keyword or keyword combinations which include
> compound words due to intrinsic functions.
>
> Is there a parser generator that produces the equivalent of a
> hand-coded recursive descent parser? I'm looking for a generator that
> doesn't require an engine and doesn't use external libraries... just
> plain old C.
>
> I can watch and debug recursive descent code because I can understand
> that. I cant imagine trying to debug a table driven parser or having
> to rewrite it in BASIC.
>
> The reason for my request is that my compiler will be written in a
> BASIC dialect instead of C. I would generate the parser in ANSI C
> then rewrite it in BASIC. I'm using a BASIC compiler to bootstrap my
> own. It seems to be a good idea to write the compiler in the language
> its going to compileso that's what I'm doing.
Unless you're trapped on a planet light years away, with nothing else
than a BASIC, I see no reason to use that to bootstrap a compiler.
Even if you have to compile the compiler to BASIC for some strange
reason, I see no reason why you shouldn't use more powerful tools to
generate the BASIC code...
So, here is a simple Recursive Descent Parser generator, written in
Common Lisp, that can generate the parser in lisp, or in a
pseudo-basic. You could modify it to generate the code you want, for
your target basic system.
http://www.informatimago.com/develop/lisp/small-cl-pgms/rdp/
For the scanner, it uses a regexp library with a simplistic algorithm
(trying the regexp for each token in turn.
If speed was needed, a smarter algorithm using a fusionned DFA for the
scanner would be in order, and of course, a table based parser too.
> Also are there any algorithms for AST building. Everything I've
> understand tells me that I really want to build an AST and do code
> generation from it versus trying to generate code as I go along. I
> thought I understood the process but I'm not there yet.
There's no specific algorith to build the AST. It's merely built by
the actions associated to the grammar rules, and executed by the
parser when the rule is used.
The reason why it's interesting to build this AST as an intermediate
data structure, is that apart from the most simple cases, some
semantic analysis, and some optimizations are needed, for which we
need a global view of the AST.
For example to check that the declaredtype of the variables matches
the type of the operators with which the variables are used, or to to
type inference.
The remaining phases can be implemented as progressive transformations
of the AST into less and less abstract trees until what remains is a
list of target processor instructions.
Here is an example of the grammar written for my simple recursive
descent parser.
;;; Example taken from: http://en.wikipedia.org/wiki/Recursive_descent_parser
(defgrammar example
:terminals ((ident "[A-Za-z][A-Za-z0-9]*")
;; real must come first to match the longest first.
(real "^\\([-+]\\?[0-9]\\+\\.[0-9]\\+\\([Ee][-+]\\?[0-9]\\+\\)\\?\\)")
(integer "[-+]\\?[0-9]\\+"))
:start program
:rules ((--> factor
(alt ident
number
(seq "(" expression ")" :action $2))
:action $1)
(--> number (alt integer real) :action $1)
(--> term
factor (rep (alt "*" "/") factor)
:action `(,$1 . ,$2))
(--> expression
(opt (alt "+" "-"))
term
(rep (alt "+" "-") term :action `(,$1 ,$2))
:action `(+ ,(if $1 `(,$1 ,$2) $2) . ,$3))
(--> condition
(alt (seq "odd" expression
:action `(oddp ,$2))
(seq expression
(alt "=" "#" "<" "<=" ">" ">=")
expression
:action `(,$2 ,$1 ,$3)))
:action $1)
(--> statement
(opt (alt (seq ident ":=" expression
:action `(setf ,$1 ,$3))
(seq "call" ident
:action `(call ,$2))
(seq "begin" statement
(rep ";" statement
:action $2)
"end"
:action `(,$2 . ,$3))
(seq "if" condition "then" statement
:action `(if ,$2 ,$4))
(seq "while" condition "do" statement
:action `(while ,$2 ,$4))))
:action $1)
(--> block
(opt "const" ident "=" number
(rep "," ident "=" number
:action `(,$2 ,$4))
";"
:action `((,$2 ,$4) . ,$5))
(opt "var" ident
(rep "," ident :action $2)
";"
:action `(,$2 . ,$3))
(rep "procedure" ident ";" block ";"
:action `(procedure ,$2 ,$4))
statement
:action `(block ,$1 ,$2 ,$3 ,$4))
(--> program
block "." :action $1)))
The forms following the :ACTION keyword build the AST nodes from the
nodes returned by the subnodes identified by $1, $2, etc...
And here is an example of (lisp) AST built by the generated parser;
(parse-example
"
const abc = 123,
pi=3.141592e+0;
var a,b,c;
procedure gcd;
begin
while a # b do
begin
if a<b then b:=b-a ;
if a>b then a:=a-b
end
end;
begin
a:=42;
b:=30.0;
call gcd
end.")
-->
(BLOCK
(((IDENT "abc" 11) (INTEGER "123" 17))
((IDENT "pi" 32) (REAL "3.141592e+0" 35)))
((IDENT "a" 57) (IDENT "b" 59) (IDENT "c" 61))
((PROCEDURE (IDENT "gcd" 79)
(BLOCK NIL NIL NIL
((WHILE (("#" "#" 112) (+ ((IDENT "a" 110))) (+ ((IDENT "b" 114))))
((IF (("<" "<" 151) (+ ((IDENT "a" 150))) (+ ((IDENT "b" 152))))
(SETF (IDENT "b" 159)
(+ ((IDENT "b" 162)) (("-" "-" 163) ((IDENT "a" 164))))))
(IF ((">" ">" 186) (+ ((IDENT "a" 185))) (+ ((IDENT "b" 187))))
(SETF (IDENT "a" 194)
(+ ((IDENT "a" 197)) (("-" "-" 198) ((IDENT "b" 199))))))))))))
((SETF (IDENT "a" 235) (+ ((INTEGER "42" 238))))
(SETF (IDENT "b" 246) (+ ((REAL "30.0" 249)))) (CALL (IDENT "gcd" 264))))
(The integers in third position in the sublists are the positions in
the source of the corresponding token.)
--
__Pascal Bourguignon__ http://www.informatimago.com/
Examples off the top of my head
keyword or compound phase, usage
"compile", directive on what compiler options will be used
"compile long if", conditional compile block statement
"compile xelse", what to do if the conditional compile expression fails
"compile end if", end of conditially compiled block
"clear", reinitialize global and main scoped variables for entire
program
"clear local", initialize variables belonging to function before use
"clear local mode" , same as "clear local" but user global variable are
inaccessible ( not visible to the function)
"local", indicates start of local variables scope
"local mode", start of local scope, procedure can not see global
variables
"def fn <function name> [( paramater list)]", prototype a function
"def fn <function name> = expression", simple function
"def fn <function name> USING functionPointer", virtual function
prototype
> > Is there a parser generator that produces the equivalent of a
> > hand-coded recursive descent parser? I'm looking for a generator
> > that doesn't require an engine and doesn't use external libraries...
> > just plain old C.
>
> I've seen CoCo/R mentioned... ANTLR also produces recursive descent
> (LL(*)) parsers, but converting its output to basic may be about the
> least fun thing you could do...
I will check out CoCo/R.
> > I can watch and debug recursive descent code because I can understand
> > that. I cant imagine trying to debug a table driven parser or having
> > to rewrite it in BASIC.
>
> You could always write one by hand -- BASIC's grammar isn't overly
> complex, and it could be a good learning experience :D
I thought it would be character building and and I am finding it
humbling :-)
I've been trying to follow the AST of the LCC compiler ( I own Hanson &
Frasier's book). The idea behind creating DAG's is good, but maybe
problematic for me. I'm not willing to use code or ideas I dont fully
comprehend. The first time it breaks or something near it breaks is
when your in trouble.
> Apologies for any issues in the above codei tend to use languages where
> subtyping is an option :D
No need to apologize. You explained it the way you know how with an
example. Its up to me now to take it in.
Thank you for your assistance,
W.
> Unless you're trapped on a planet light years away, with nothing else
> than a BASIC, I see no reason to use that to bootstrap a compiler.
>
> Even if you have to compile the compiler to BASIC for some strange
> reason, I see no reason why you shouldn't use more powerful tools to
> generate the BASIC code...
From what I've read, many compilers are grown and extended by using
their own language, I like that idea.
> So, here is a simple Recursive Descent Parser generator, written in
> Common Lisp, that can generate the parser in lisp, or in a
> pseudo-basic. You could modify it to generate the code you want, for
> your target basic system.
>
> http://www.informatimago.com/develop/lisp/small-cl-pgms/rdp/
>
> For the scanner, it uses a regexp library with a simplistic algorithm
> (trying the regexp for each token in turn.
>
> If speed was needed, a smarter algorithm using a fusionned DFA for the
> scanner would be in order, and of course, a table based parser too.
I will check it out.
> > Also are there any algorithms for AST building. Everything I've
> > understand tells me that I really want to build an AST and do code
> > generation from it versus trying to generate code as I go along. I
> > thought I understood the process but I'm not there yet.
>
> There's no specific algorith to build the AST. It's merely built by
> the actions associated to the grammar rules, and executed by the
> parser when the rule is used.
Oh darn, an algorithm would give a standard to work with and compare
to.
> The reason why it's interesting to build this AST as an intermediate
> data structure, is that apart from the most simple cases, some
> semantic analysis, and some optimizations are needed, for which we
> need a global view of the AST.
>
> For example to check that the declaredtype of the variables matches
> the type of the operators with which the variables are used, or to to
> type inference.
>
> The remaining phases can be implemented as progressive transformations
> of the AST into less and less abstract trees until what remains is a
> list of target processor instructions.
>
>
>
>
> Here is an example of the grammar written for my simple recursive
> descent parser.
>
I will glean from your example all that I can. I appreciate it.
Thank you,
W.
> olive...@gmail.com wrote:
>> Mr.E wrote:
>> > There are approximately 600 keyword or keyword combinations which
>> > include compound words due to intrinsic functions.
>>
>> What do you mean by keyword combinations?
>
> Examples off the top of my head
>
> keyword or compound phase, usage
> "compile", directive on what compiler options will be used
> "compile long if", conditional compile block statement
> "compile xelse", what to do if the conditional compile expression fails
> "compile end if", end of conditially compiled block
> "clear", reinitialize global and main scoped variables for entire
> program
> "clear local", initialize variables belonging to function before use
> "clear local mode" , same as "clear local" but user global variable are
> inaccessible ( not visible to the function)
> "local", indicates start of local variables scope
> "local mode", start of local scope, procedure can not see global
> variables
> "def fn <function name> [( paramater list)]", prototype a function
> "def fn <function name> = expression", simple function
> "def fn <function name> USING functionPointer", virtual function
> prototype
The main idea of the recursive descent parser, is that you can select
the production rule from the first terminal.
With these rules:
start --> c1|c2|c3|c4|k1|k2|k3|l1|l2|d1|d2|d3
c1 --> "compile"
c2 --> "compile" "long" "if"
c3 --> "compile" "xelse"
c4 --> "compile" "end" "if"
k1 --> "clear"
k2 --> "clear" "local"
k3 --> "clear" "local" "mode"
l1 --> "local"
l2 --> "local" "mode"
d1 --> "def" "fn" <function name> [ "(" <paramater list> ")" ]
d2 --> "def" "fn" <function name> "=" <expression>
d3 --> "def" "fn" <function name> "USING" <functionPointer>
from the parse-start function and the first terminal read (let's say
it's "compile", you cannot know which rule to apply, which function to
call.
So you have to transform the grammar to make sure that for each non
terminal the set of the first symbols derivable from that non terminal
is disjoint from the set for the other non terminals (at least, for
the other non terminals derivable from the same places).
c --> "compile" c1|c2|c3|c4
c1 -->
c2 --> "long" "if"
c3 --> "xelse"
c4 --> "end" "if"
k --> "clear" k1|kl
k1 -->
kl --> "local" kl1|kl2
kl1 -->
kl2 --> "mode"
l --> "local" l1|l2
l1 -->
l2 --> "mode"
d --> "def" "fn" <function name> d0|d1|d2|d3
d0 -->
d1 --> "(" <paramater list> ")"
d2 --> "=" <expression>
d3 --> "USING" <functionPointer>
Then, for each production rule you can write a function that will know
immediately from the current terminal symbol read (token) which
function to call (what non-terminal production rule to invoke).
parse-start:
if token="compile" then c
parse-c:
accept "compile"
if token="long" then c2
else if token="xelse" then c3
else if token="end" then c4
else c1
parse-d:
accept "def"
accept "fn"
parse-function-name
if token="(" then parse-d1
else if token="=" then parse-d2
else if token="USING" then parse-d3
else parse-d0
parse-d1:
accept "("
parse-paremter-list
accept ")"
...
But notice how this creates "artificial" non-terminals and their
corresponding procedures.
My example Recursive Descent Parser Generator doesn't normalize the
grammar in such a way, so you'd have to make sure to write the grammar
such as the first set of each non-terminal group is disjoint.
It would be better to use a parser generator that did this grammar
normalization (there is a simple algorithm to do this normalization).
Then the procedures generated won't match 1-1 the production rules you
write.
Therefore I wouldn't bother with expecting a readable generated
parser. Table driven parsers are well known and work well and
efficiently.
Using the right language will teach you concepts that makes doing this
so much easier. At the very minimum you need product (~ "struct") and
sum (~ "union") types. In (classic) BASIC you'd have to simulate those
making it a very unnatural and messy implementation.
Consider starting with a much simpler example, say an expression
_interpreter_ to handle this language
e := e + e | e * e | ( e ) | k | v
(where * binds stronger than +, k is an integer constant, and v a
variable of some value).
If you manage this in BASIC, then next make a compiler for it.
This exercise will likely teach you much that will be useful for a full
compiler for BASIC.
(building ASTs)
> Oh darn, an algorithm would give a standard to work with and compare
> to.
There isn't any "algorithm" to be had. Building AST nodes is trivial, or
if it isn't, you can bet writing a compiler won't be.
Tommy
PS: Here's a slightly compressed solution in C for the first half.
#include <ctype.h>
#include <stdio.h>
char *s = "1+x*3+4*(5+y)";
int pExp(void), env[256] = { ['x'] = 2, ['y'] = 3 };
int pFactor(void) {
int v = 0;
if (*s == '(')
++s, v = pExp(), ++s;
else if (isdigit(*s))
while (isdigit(*s))
v = 10*v + *s++ - '0';
else if (isalpha(*s))
v = env[(unsigned)*s++];
return v;}
int pTerm(void) {
int v = pFactor();
while (*s == '*') ++s, v *= pFactor();
return v;}
int pExp(void) {
int v = pTerm();
while (*s == '+') ++s, v += pTerm();
return v;}
int main(int argc, char **argv) {printf("%d\n", pExp()); return 0;}
>Mr.E wrote:
>> From what I've read, many compilers are grown and extended by using
>> their own language, I like that idea.
>
>Using the right language will teach you concepts that makes doing this
>so much easier. At the very minimum you need product (~ "struct") and
>sum (~ "union") types. In (classic) BASIC you'd have to simulate those
>making it a very unnatural and messy implementation.
You don't really need a "union" type. It would just make some things
a little easier. I didn't need one for BCET.
<snip>
>If you manage this in BASIC, then next make a compiler for it.
I did. BCET is written mostly in Basic. Some routines are written in
Assembler, mostly for speed(they were originally developed in Basic).
>This exercise will likely teach you much that will be useful for a full
>compiler for BASIC.
It can be useful. As part of constant expression reduction, I had to
scan the expression tree as if I were interpreting it. But that was
added later. The original compiler would actually generate
instructions to add 1 and 1 for a statement like:
LET A = 1 + 1
<snip>
--
ArarghMail609 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html
> The main idea of the recursive descent parser, is that you can select
> the production rule from the first terminal.
Provided you have a LL(1) grammar. What should be right for BASIC
languages...
But I'd structure the grammar in a different way:
> k1 --> "clear"
> k2 --> "clear" "local"
> k3 --> "clear" "local" "mode"
k --> "clear" [ "local" [ "mode" ] ]
> l1 --> "local"
> l2 --> "local" "mode"
l --> "local" [ "mode" ]
> d1 --> "def" "fn" <function name> [ "(" <paramater list> ")" ]
> d2 --> "def" "fn" <function name> "=" <expression>
> d3 --> "def" "fn" <function name> "USING" <functionPointer>
d --> "def" "fn" <function name> [ ... ]
EBNF better supports grammars for LL languages, because it can reflect
the parser structure very well:
if token = "local" then
if nextToken = "mode" then ... else ...
else
Error...
The same for {...} repetitions, which translate into while loops.
This way it's possible to hand-craft an recursive descent parser from a
well-formed EBNF grammar. The use of additional tools can be reduced to
formal grammar checks, and to the generation of the FIRST and FOLLOW
sets, from which the procedure tree can be generated either
automatically or manually.
In so far I don't understand (and can't encourage) the way you bloat a
formal grammar first, only to produce the effectively same procedure
structure that could be expressed in EBNF immediately.
Problems IMO can arise only from an inappropriate grammar design, which
might not be LL(1), where a LL(1) grammar is feasable. Unfortunately I
don't know of a tool, that would take an arbitrary grammar, and would
transform it into a LL(1) grammar if ever possible:
> My example Recursive Descent Parser Generator doesn't normalize the
> grammar in such a way, so you'd have to make sure to write the grammar
> such as the first set of each non-terminal group is disjoint.
>
> It would be better to use a parser generator that did this grammar
> normalization (there is a simple algorithm to do this normalization).
Can you give a reference, please?
CoCo/R will report an LL(1) error for a dangling else or other
constructs, which in fact do not prevent the construction of an correct
recursive descent parser.
> Therefore I wouldn't bother with expecting a readable generated
> parser. Table driven parsers are well known and work well and
> efficiently.
You omit one important step in the development of an parser:
In the first place the user has to supply a grammar that *exactly*
reflects the given language!
In this development stage a table driven parser cannot assist in finding
eventual errors in a grammar. The parser generator will either produce
an parser, for a different language, or will report formal errors, which
typically are not related to the error in the language/grammar
transformation. And debugging an table driven parser for such errors is
a nightmare :-(
Even if e.g. CoCo will not produce parsers for many simple languages,
due to LL(1) errors, it will give pointers to the places, where the
grammar deserves a redesign. Then an according modification of an EBNF
grammar is much easier to perform, because the grammar reflects the
formally or informally given language description much closer, and with
less bloat, than a LR(k) BNF grammar.
That's why I suggest to try, at least, to develop an EBNF LL grammar in
the first place, until that grammar seems to reflect the intended
language. Having reached this milestone, the developer can decide
whether to follow this road, and to produce an recursive descent LL
parser, either manually or automatically, or to switch to an LR parser,
if the language is inappropriate for LL parsing at all.
IMO this point was never honored appropriately in the discussion about
recursive descent and table driven parsers. Table driven parsers are
fine, provided that a grammar already exists, or the developer is free
to design a very new language. But in cases, where a language is not
described by a formal (BNF) grammar, an according existing compiler most
probably was *not* based on an table driven parser. Then I don't see any
reason, to favour an table driven parser for that language. A recursive
descent parser instead will allow to implement exactly the same hacks,
which possibly have been inserted in the original compiler for that
language.
DoDi
> On 10 Sep 2006 23:44:00 -0400, Tommy Thorn <tommy...@gmail.com>
> wrote:
>
>>This exercise will likely teach you much that will be useful for a full
>>compiler for BASIC.
>
> It can be useful. As part of constant expression reduction, I had to
> scan the expression tree as if I were interpreting it. But that was
> added later.
Why? It is an integral part of the recursive descent parser. At some point
the parser expects an expression, so it calls to Get_Expression, which
scans up to the right bound of the expression (generating tree as a side
effect). Of course, it would be silly to design Get_Expression itself this
way, but that is no different from Get_Identifier or Get_Literal
subprograms. Expression is just a "token" for the parser.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
As I've understood your parser, it is similar in design (initially) as
my scanner. I made the scanner do more of the work with recognizing
the compound keywords. I know all the combinations so the scanner
return a single token for the compound word reducing the tokens to a
terminal, simplifying the parser. As far as the parser is concerned
there are no compound keywords.
I see what you are saying about the parsing and will keep it in mind
while I'm re-drafting how I'm going to parse.
I believe I read a post from our moderator once that said that scanning
can be 20% of compilation time because the scanner has to touch every
character. My scanner is pretty efficient and should make life easier
on my parser and me.
W.
>On 11 Sep 2006 08:39:39 -0400, Arargh...@Arargh.com wrote:
<snip>
>> It can be useful. As part of constant expression reduction, I had to
>> scan the expression tree as if I were interpreting it. But that was
>> added later.
>
>Why? It is an integral part of the recursive descent parser. At some point
>the parser expects an expression, so it calls to Get_Expression, which
>scans up to the right bound of the expression (generating tree as a side
>effect). Of course, it would be silly to design Get_Expression itself this
>way, but that is no different from Get_Identifier or Get_Literal
>subprograms. Expression is just a "token" for the parser.
Several reasons. Some of the reasons are here:
http://www.arargh.com/basic/history.html
a) At the time I wrote the expression tree routines, I didn't really
know what I was doing. I was learning. I still am. Also, I wanted
the trees to reflect the source exactly, so that I could find errors.
b) BCET was developed in a limited 16-bit IDE. There simply wasn't
room to get too fancy. Still isn't.
c) Since the code generator has to desend the trees anyway, constant
expression reduction was added there. Although, it was added as a
seperate routine, which allows it to be turned off.
Not attempting to be an apologist for BASIC but our fathers BASIC is
different from todays BASIC. Different dialects have emerged. Some
havent changed much and others are highly improved. The dialect I use
has records (~structs), unions, functions, pointers and much that is
available to a C programmer.
> PS: Here's a slightly compressed solution in C for the first half.
I will study your example and make good use of it.
Thank you very much for your input.
W.
Ah, I show my age again :-)
> I will study your example and make good use of it.
>
> Thank you very much for your input.
I'll readily admit I had much fun hacking it together, though I hope
you realize that this is not production ready code and the style is
aweful (to make it fit).
Our dear moderator be willing, I've made it a tad more realistic and
finish the 2nd half: a JIT binary code compiler (works on x86 Linux in
32-bit mode).
Do feel free to ask questions.
Enjoy,
Tommy
#include <ctype.h>
#include <stdio.h>
char *s;
int lookahead;
int intValue;
char *symbolValue;
unsigned symbolLength;
void nexttoken(void) {
while (isspace(*s)) ++s;
if (isdigit(*s)) {
intValue = 0;
lookahead = '0';
while (isdigit(*s)) intValue = 10*intValue + *s++ - '0';
} else if (isalpha(*s)) {
lookahead = 'a';
symbolValue = s;
while (isalnum(*s)) ++s;
symbolLength = symbolValue - s;
} else lookahead = *s++; }
void match(unsigned expect) {
if (lookahead != expect) lookahead = -1;
nexttoken(); }
typedef struct node *ast_t;
struct node {
int kind;
ast_t l, r;
int intValue;
} nodes[9999];
int next = 0;
ast_t mk(int kind, ast_t l, ast_t r, int k) {
nodes[next].kind = kind;
nodes[next].l = l;
nodes[next].r = r;
nodes[next].intValue = k;
return &nodes[next++]; }
ast_t pExp(void);
ast_t pFactor(void) {
ast_t v;
switch (lookahead) {
case '(': match('('); v = pExp(); match(')'); break;
case 'a': v = mk('a', 0,0, symbolValue[0]); match('a'); break;
case '0': v = mk('0', 0,0, intValue); match('0'); break; }
return v;}
ast_t pTerm(void) {
ast_t v = pFactor();
while (lookahead == '*') match('*'),v=mk('*',v,pFactor(),0);
return v;}
ast_t pExp(void) {
ast_t v = pTerm();
while (lookahead == '+') match('+'),v=mk('+',v,pTerm(),0);
return v;}
void prast(ast_t t) {
if (t->kind == '0') printf("%d", t->intValue);
else if (t->kind == 'a') printf("%c", t->intValue);
else { printf("(");
prast(t->l); printf("%c", t->kind); prast(t->r); printf(")");}}
char code[9999], *cp = code;
int env[256] = { ['x'] = 2, ['y'] = 3 };
void gen(ast_t t) {
if (t->kind == '0') {
*cp++ = 0x50; // push %eax
*cp++ = 0xB8; // movl $<intValue>, %eax
*(int *)cp = t->intValue;
cp += 4;
} else if (t->kind == 'a') {
*cp++ = 0x50; // push %eax
*cp++ = 0xA1; // mov <env[..]>, %eax
*(int **)cp = &env[t->intValue];
cp += 4;
} else {
gen(t->r);
gen(t->l);
*cp++ = 0x5B; // pop %ebx
if (t->kind == '+') *cp++=1, *cp++=0xD8; // add %ebx,%eax
else *cp++=0xF, *cp++=0xAF, *cp++=0xC3; // imul %ebx, %eax
}}
int main(int argc, char **argv) {
ast_t res;
s = "1 + x*3 + 4*(5 + y)";
if (argc > 1) s = argv[1];
nexttoken();
res = pExp();
if (lookahead != 0){printf("Syntax error at:%s\n", s); return -1;}
prast(res);
printf("\n");
gen(res);
*cp++=0x5B; // pop dummy
*cp++=0xC3; // ret
printf("%d\n", ((int(*)()) code)());
return 0;
}
> I'll readily admit I had much fun hacking it together, though I hope
> you realize that this is not production ready code and the style is
> aweful (to make it fit).
>
> Our dear moderator be willing, I've made it a tad more realistic and
> finish the 2nd half: a JIT binary code compiler (works on x86 Linux in
> 32-bit mode).
>
> Do feel free to ask questions.
>
> Enjoy,
> Tommy
>
Thank you very much.
Just looking at the code I almost understand it all. I'm looking
forward to running it.
For me this is a labor of love but at the moment it's just laborious.
I've wanted to learn to write a compiler for years. So many concepts
and intricacies that I never had to think about just using a language
are becoming painfully obvious to me as a first time compiler writer.
Syntax analysis, semantic analysis, error recovery has my head
throbbing. I really want my first compiler to be completely hand
written so I can really appreciate the intricacies of writing one. At
the same time I'm thinking there are tools available that would make
the job so much simpler, probably faster, with fewer errors. I could
build my own house or I can give the specs to a machine and let it
build a chunk of it for me. Which one is better, hmmmm?
W.
Which part is laborious? Hopefully not understanding the snippets I've
shown you.
> I've wanted to learn to write a compiler for years. So many
> concepts and intricacies that I never had to think about just using
> a language are becoming painfully obvious to me as a first time
> compiler writer.
Welcome to the new world. By far the best way to fully learn a
programming language is to write a compiler for it. That is where you
really have to consider all details. For example, before actually doing
to work, most people think that C is an LALR(1) language or simpler (*).
For me compilers are really fun. I can only recommend that you take it
very slow. Focus on compiling a tiny language fully from source to
something executable, and then grow it slowly to a full language. This
is much better than trying to make a parser for the full language right
off the bat.
> Syntax analysis, semantic analysis, error recovery has my head
> throbbing. I really want my first compiler to be completely hand
> written so I can really appreciate the intricacies of writing one.
For a small language, that trivial enough to do. Even Linus shunned
automatic parser generators when we wrote his Semantic C checker
(sparse). I agree with him, for a stable language, automatic tools are
not worth the trouble. You end up wasting time and energy dealing with
and learning about them, rather than working on your compiler.
> At the same time I'm thinking there are tools available that would make
> the job so much simpler, probably faster, with fewer errors.
No tool can replace understanding. Once you understand, you'll see that
the parser generators don't really save you much time.
I have a follow up to my example, but I think I'll post that in a
separate article as this is long enough already.
Tommy
(*) It's not. The lexer has to know which names are types, otherwise it
becomes impossible to tell if the statement
x * y;
is a declation of a pointer to x or an expression multiplying x and y.
> I did. BCET is written mostly in Basic. Some routines are written in
> Assembler, mostly for speed(they were originally developed in Basic).
It looks like you haven't been working on BCET for a while. Have you
considered making it open source?
If anyone knows of an open source Basic compiler that can run on
Windows, please let me know! It doesn't have to be fancy...
Eric
>Arargh...@Arargh.com wrote:
>
>> I did. BCET is written mostly in Basic. Some routines are written in
>> Assembler, mostly for speed(they were originally developed in Basic).
>
>It looks like you haven't been working on BCET for a while.
Still working on it, just haven't put up a new version. It's up to
0.6.3 in internal development.
>Have you considered making it open source?
I have thought of it, however, some parts of the runtime have
restrictions. Without the runtime, the compiler wouldn't be that
useful.
>If anyone knows of an open source Basic compiler that can run on
>Windows, please let me know! It doesn't have to be fancy...
Thanks.
The example I posted was just the basics to get started, but it's enough
to allow me to make a point about the value of an internal representation.
Once you have the AST, it's much easier to add transformation and
optimizations. Fx. to eliminate (some) common subexpressions, all that
is needed is to instrument the tree building function "mk()" to search
for previous occurences, eg.:
ast_t mk(token_t kind, ast_t l, ast_t r, int k)
{
for (p = &nodes[0]; p != &nodes[next]; ++p)
if (p->kind==kind && p->l==l && p->r==r && p->intValue==k) {
p->shared = 1;
return p;
}
...
In addition, the code generation must of course save and reuse the value
of a common subexpression.
Constant folding is also very easy at this point:
// k1 + k2 -> [k1 + k2]
if (kind == '+' && l->kind == INT && r->kind == INT)
return mk(INT, 0, 0, l->intValue + r->intValue);
// k1 * k2 -> [k1 * k2]
if (kind == '*' && l->kind == INT && r->kind == INT)
return mk(INT, 0, 0, l->intValue * r->intValue);
As are some algebraic simplifications:
// x * 0 -> 0
if (kind == '*' && r->kind == INT && r->intValue == 0)
return mk(INT, 0, 0, 0);
// x * 1 -> x
if (kind == '*' && r->kind == INT && r->intValue == 1)
return l;
// x + 0 -> x
if (kind == '+' && r->kind == INT && r->intValue == 0)
return l;
I've put up a cleaner version of the expression compiler with these and
other enhancements on http://not.meko.dk/Hacks/Compiler
Tommy
> to work, most people think that C is an LALR(1) language or simpler (*).
> (*) It's not. The lexer has to know which names are types, otherwise it
> becomes impossible to tell if the statement
> x * y;
> is a declation of a pointer to x or an expression multiplying x and y.
A more practical example, found in this group already:
a = (b)-c;
is (b) a type cast, or an expression?
Sorry, I don't remember who "invented" this fine example :-(
As a solution for this problem in my LL(1.5) C parser, I leave the
distinction to a stage between the lexer and parser, inside the
preprocessor. Only in this stage "words" are mapped into keywords,
typenames and other identifiers. The according "symbol" table is
initialized with the C keywords, the preprocessor adds to it all
#defined names, and the parser adds all encountered typedef names. This
procedure works fine, provided that typedefs always have global scope,
as is required in C. C++ makes it more complicated, with namespaces and
local typedefs :-(
DoDi
> >Have you considered making it open source?
>
> I have thought of it, however, some parts of the runtime have
> restrictions. Without the runtime, the compiler wouldn't be that
> useful.
When you say "restrictions", I assume you mean "proprietary code"?
Maybe we could re-implement the runtime to remove the proprietary code?
Can you explain this a little more...are you generating some kind of
intermediate code and then translating that to native code, or ???
Eric
>Arargh...@Arargh.com wrote:
>
>> >Have you considered making it open source?
>>
>> I have thought of it, however, some parts of the runtime have
>> restrictions. Without the runtime, the compiler wouldn't be that
>> useful.
>
>When you say "restrictions", I assume you mean "proprietary code"?
>Maybe we could re-implement the runtime to remove the proprietary code?
More along the line of: Used and modified some source from another
product, for which I have permission to do so, but never received
rights to distribute that source.
More details here: http://www.arargh.com/basic/download.html
>Can you explain this a little more...are you generating some kind of
>intermediate code and then translating that to native code, or ???
BCET compiles to MASM (ML these days) assembler source. I have a
partially working conversion program that translates ML (MASM) source
to Nasm source for use on FreeBSD or Linux systems, but I never got
all of the runtime to work there.
--
ArarghMail609a at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html
To reply by email, remove the garbage from the reply address.
(fx:snip)
> #defined names, and the parser adds all encountered typedef names. This
> procedure works fine, provided that typedefs always have global scope,
> as is required in C.
Not so far as I'm aware: I couldn't find such a restriction for C90,
and could find implications in the text that it would be allowed.
Experimentally, `gcc -ansi -pedantic` was quite happy with a typedef
inside `main`.
--
Chris "falling further in" Dollin
I'm full of sweetness and light. And I'm /keeping/ it.
>>#defined names, and the parser adds all encountered typedef names. This
>>procedure works fine, provided that typedefs always have global scope,
>>as is required in C.
>
>
> Not so far as I'm aware: I couldn't find such a restriction for C90,
> and could find implications in the text that it would be allowed.
C99 states that type declarations are not allowed in C procedure
definitions.
> Experimentally, `gcc -ansi -pedantic` was quite happy with a typedef
> inside `main`.
The standard may differ between C and C++.
What gcc does, that's another story ;-)
DoDi
> Chris Dollin wrote:
>
>>>#defined names, and the parser adds all encountered typedef names. This
>>>procedure works fine, provided that typedefs always have global scope,
>>>as is required in C.
>>
>> Not so far as I'm aware: I couldn't find such a restriction for C90,
>> and could find implications in the text that it would be allowed.
>
> C99 states that type declarations are not allowed in C procedure
> definitions.
Interesting. The draft has no such restriction in the typedef section,
and an explicit example with a typedef inside a function. Could you
provide a cite for me?
>> Experimentally, `gcc -ansi -pedantic` was quite happy with a typedef
>> inside `main`.
>
> The standard may differ between C and C++.
Don't care about C++; the issue was whether or not typedefs have to
be global /in C/.
> What gcc does, that's another story ;-)
Which is why I specified `-ansi -pedantic`. (And I've just checked with
`-std=c99`.) gcc might of course be wrong, but I don't recall this
case ever having com up in comp.lang.c.
--
Chris "falling further in" Dollin
"I'm still here and I'm holding the answers" - Karnataka, /Love and Affection/
>>C99 states that type declarations are not allowed in C procedure
>>definitions.
>
>
> Interesting. The draft has no such restriction in the typedef section,
> and an explicit example with a typedef inside a function. Could you
> provide a cite for me?
Just from the DRAFT: 2 December 1996:
>>
7.1.3 ... [dcl.typedef]
1. ... The typedef specifier shall not be used in a function-definition
(8.4),
<<
Can somebody lookup this paragraph in the final spec?
(The numbers have changed!)
DoDi
[I can't find that language, but there is a lengthy footnote on page
141 with examples that specifically say that you can typedef a
function type and use it in declarations but not in
definitions. -John]
> Just from the DRAFT: 2 December 1996:
> >>
> 7.1.3 ... [dcl.typedef]
> 1. ... The typedef specifier shall not be used in a function-definition
> (8.4),
> <<
>
> Can somebody lookup this paragraph in the final spec?
> (The numbers have changed!)
The fact that you're quoting a named section "dcl.typedef" is a dead
giveaway you're reading the C++ spec and not a C one.
Neil.
You are right. It looks like I misinterpreted the English wording, by
reading "in function definitions" as "inside function bodies" :-(
DoDi
>Chris Dollin wrote:
>
>>>C99 states that type declarations are not allowed in C procedure
>>>definitions.
>>
>>
>> Interesting. The draft has no such restriction in the typedef section,
>> and an explicit example with a typedef inside a function. Could you
>> provide a cite for me?
>
>Just from the DRAFT: 2 December 1996:
That's a very old draft. Even n869.pdf (the last draft of C99) is
dated January 1999. And there's now n1124.pdf (draft of C0x), which
contains essentially the full text of C99 with a few proposed changes.
I suggest you download it.
>7.1.3 ... [dcl.typedef]
>1. ... The typedef specifier shall not be used in a function-definition
>(8.4),
That doesn't mean typedef can't be used within a function. It means
that you can't use typedef in the function definition itself. That is,
you can't do something like this:
typedef void somefunc(void) {puts("whatever");};
Or perhaps it's badly worded and really means you can't do something
like this:
typedef void somefunc_t(void);
somefunc_t somefunc {puts("whatever");}
In fact, there are some instances in which a typedef declaration
must have block scope.
>Can somebody lookup this paragraph in the final spec?
I don't have the official standard. But I have n1124.pdf. That has
examples of typedef at block scope. And it says (if I've interpreted
it correctly) that typedef for an array with unspecified size, for
example, has to be at block scope. It gives this example:
----------------------------------------------------------------------
One form of initialization that completes array types involves typedef
names. Given the declaration
typedef int A[]; // OK - declared with block scope
the declaration
A a = { 1, 2 }, b = { 3, 4, 5 };
is identical to
int a[] = { 1, 2 }, b[] = { 3, 4, 5 };
due to the rules for incomplete types.
----------------------------------------------------------------------
As the comment says, the typedef is OK at block scope. It also gives
this example, which shows a typedef within a function body (In fact,
this is another one that is required within block scope.):
----------------------------------------------------------------------
extern int n;
int A[n]; // invalid: file scope VLA
extern int (*p2)[n]; // invalid: file scope VM
int B[100]; // valid: file scope but not VM
void fvla(int m, int C[m][m]); // valid: VLA with prototype scope
void fvla(int m, int C[m][m]) // valid: adjusted to auto pointer to
// VLA
{
typedef int VLA[m][m]; // valid: block scope typedef VLA
struct tag {
int (*y)[n]; // invalid: y not ordinary identifier
int z[n]; // invalid: z not ordinary identifier
};
int D[m]; // valid: auto VLA
static int E[m]; // invalid: static block scope VLA
extern int F[m]; // invalid: F has linkage and is VLA
int (*s)[m]; // valid: auto pointer to VLA
extern int (*r)[m]; // invalid: r has linkage and points to VLA
static int (*q)[m] = &B; // valid: q is a static block pointer to
// VLA
}
----------------------------------------------------------------------
I can find nothing else that would ban the use of typedef within a
function body, and the text you quoted above is not there.
>(The numbers have changed!)
>
>DoDi
>[I can't find that language, but there is a lengthy footnote on page
>141 with examples that specifically say that you can typedef a
>function type and use it in declarations but not in
>definitions. -John]
Right.
> to work, most people think that C is an LALR(1) language or simpler (*).
> (*) It's not. The lexer has to know which names are types, otherwise it
> becomes impossible to tell if the statement
> x * y;
> is a declation of a pointer to x or an expression multiplying x and y.
A more practical example, found in this group already:
a = (b)-c;
is (b) a type cast, or an expression?
Sorry, I don't remember who "invented" this fine example :-(
As a solution for this problem in my LL(1.5) C parser, I leave the
distinction to a stage between the lexer and parser, inside the
preprocessor. Only in this stage "words" are mapped into keywords,
typenames and other identifiers. The according "symbol" table is
initialized with the C keywords, the preprocessor adds to it all
#defined names, and the parser adds all encountered typedef
names. This procedure works fine, provided that typedefs always have
global scope, as is required in C. C++ makes it more complicated, with
> Tommy Thorn wrote:
>
>> to work, most people think that C is an LALR(1) language or simpler (*).
>
>> (*) It's not. The lexer has to know which names are types, otherwise it
>> becomes impossible to tell if the statement
>> x * y;
>> is a declation of a pointer to x or an expression multiplying x and y.
>
> A more practical example, found in this group already:
> a = (b)-c;
> is (b) a type cast, or an expression?
>
> Sorry, I don't remember who "invented" this fine example :-(
Has anyone ever tried to solve these C / C++ parsing problems without
"feedback"?
Ada has a seemingly similar problem. For example, "X(Y)" might mean
call function X passing actual parameter Y, or array indexing with X
as the array and Y as the index, or convert expression Y to type X.
It might even mean call parameterless function X, which returns an
array, and use Y as the index.
The typical solution in Ada compilers is for the parser to build a
tree that means "this could be any of ...", with X and Y as subtrees,
and then semantic analysis has symbol tables, so it knows what X is,
and changes that node into a "function call" or "array indexing" or
whatever is appropriate. (Note: X could be an overloaded name.)
My question is: can that technique work for the above C and C++
ambiguities? Does it make the problem easier or harder? Is the
answer different for C++ than for C?
- Bob
P.S. Why do you say that the "a = (b)-c;" ambiguity is more practical
than the "x * y;" ambiguity? They seem like more-or-less the same
thing, to me: the parser wants to know whether each identifier is a type
name.
>Has anyone ever tried to solve these C / C++ parsing problems without
>"feedback"?
Construct analysis never has to be done by the parser, it can always
be deferred to a later stage. Feedback to the parser simply allows
early exclusion of impossible alternatives.
>My question is: can that technique work for the above C and C++
>ambiguities? Does it make the problem easier or harder? Is the
>answer different for C++ than for C?
It can make the parser simpler - or not, depending on how you choose
to construct the IR - but adds a disambiguation step afterward which
must rewrite the generic IR into a specific one.
George
It has certaintly been suggested, though I don't know if anyone
actually do that as it's probably simpler in practice to add a little
feedback to the lexer. Moreover, knowing the difference between types
and identifier enables much better error reporting in case of parse
errors.
The much more interesting part is what happens *after* parsing and
semantic analysis.
Tommy
> As a solution for this problem in my LL(1.5) C parser, I leave the
> distinction to a stage between the lexer and parser, inside the
> preprocessor. Only in this stage "words" are mapped into keywords,
> typenames and other identifiers. The according "symbol" table is
> initialized with the C keywords, the preprocessor adds to it all
> #defined names, and the parser adds all encountered typedef
> names. This procedure works fine, provided that typedefs always have
> global scope, as is required in C.
Didn't we go round this a while ago? C typedefs can have non-global
scope.
--
Chris "HO. HO. HO." Dollin
[Oops, we did, didn't we. -John]
They are the same thing, to the parser. However,
a = (b)-c;
might just possibly crop up in a program written by a human being, in
either of its meanings (if (b) were the result of macro expansion,
perhaps); whereas no human would ever write
x * y;
in its meaning of "multiply x and y, and then discard the result".
(Someone with less respect for the C standard's formal grammar might
say, "Why don't we just make x * y; always declare a new pointer, and
ignore that other case that never comes up in practice?" They can't
say that about a = (b)-c; .)
-Arthur
In C you can write an expression (a)-b*c. If a is a type then(a)-b*c is
equivalent to ((a)-b)*c, but if a is a variable then (a)-b*c is
equivalent to (a)-(b*c). To put it an other way: In Ada you will want
a parse tree of the same structure independently of whether a subtree is
an array indexing or a function call. It is simply a question of
putting the right labels on the nodes. In C and C++ it is not enough to
put some other labels on some nodes. You will want to reorder the tree,
and that reordering will be so complex, that it is easier to use a
context sensitive grammar.
Karsten Nyblad
148f3wg02 at sneakemail dot com
> The typical solution in Ada compilers is for the parser to build a
> tree that means "this could be any of ...", with X and Y as subtrees,
> and then semantic analysis has symbol tables, so it knows what X is,
> and changes that node into a "function call" or "array indexing" or
> whatever is appropriate. (Note: X could be an overloaded name.)
>
> My question is: can that technique work for the above C and C++
> ambiguities?
What benefits you would like with this technique?
> Does it make the problem easier or harder? Is the answer different
> for C++ than for C?
Well, in C it's just not a great problem to determine whether current
identifier is a typedef name, so it's hard to me to see what we would have
from passing that choice to a late phase. That approach would be reasonable
for some precompiling techniques like precompiled headers, but there are
other issues with the preprocessing phase in both C and C++.
In C++, there are too many disambiguation variants possible to track them
all into representation trees.
Robert A Duff wrote:
> P.S. Why do you say that the "a = (b)-c;" ambiguity is more practical
> than the "x * y;" ambiguity? They seem like more-or-less the same
> thing, to me: the parser wants to know whether each identifier is a type
> name.
In the case of "x * y;" the context can be used, to determine whether
declarations or expressions can occur at all. Even if both are allowed,
one might argue that the result of a multiplication should be used
somehow, e.g. assigned to a variable.
Not a very strong point, of course, but it leaves room for discussions.
The ambiguity of "a = (b)-c;" is out of question, both interpretations
of "(b)-c" result in a syntactically valid expression.
DoDi
These operations are very close and they syntax is absolutely same.
And even some optimizations can be performed without distinguishing
them.
> The typical solution in Ada compilers is for the parser to build a
> tree that means "this could be any of ..."...
>
> My question is: can that technique work for the above C and C++
> ambiguities? Does it make the problem easier or harder? Is the
> answer different for C++ than for C?
Type declarations and arithmetic expressions are different: you can
use +, -, /, && etc in arithmetic expressions but you cannot do it in
type declarations. If we just "glue" declarations and expressions, then
parser will accept programs like this:
typedef int b;
...
a = (b+k)-c;
Semantic analysis, of course, can detect such errors, but it I don't
think it is easier than "feedback" parsing.
--
Ivan Boldyrev
Чуешь "Тайд", чуешь?
> whereas no human would ever write
>
> x * y;
>
> in its meaning of "multiply x and y, and then discard the result".
In the general case I would agree. As I read this message I was
writing a multibyte multiply routine for a 8 bit processor. This
routine uses the sideffects of the x*y; namely the 16 bit result that
is returned in a pair of registers which I go on to use.
w..
[Sounds like fun, but it also is most definitely not C. -John]
Although you've gotten lots of feedback, there still seems to be 2
points worth addressing.
1) James Roskind did build a C grammar which attempts to eliminate the
need for feedback. (We were consultants to Honeywell building a C
compiler at the time.) He analyzed all the cases where the use of
an identifier and a typename could be confusing (many of them had
to do with function prototypes). As I recall, he had some success.
However, the truly ambiguous examples being discussed in this
thread, make me doubt exactly what I am remembering. Perhaps, it
was just that he had success identifying places, where a new type
or variable *could* be introduced.
In any case, he then tried to apply the same analysis to C++, and
successfully proved that the same technique could not be used to
distinguish the ambiguities in C++. All that could be done in C++
is to push the ambiguities far-enough away that the amount of
lookahead required made them impractical to distinguish.
2) As to "feedback" (and this is in fact the normal term used for one
solution to this problem), the term is somewhat misleading. The
term comes from the fact that the parser would like certain tokens
recategorized based upon the parsers state. One easy way to do
that is to track the parsers state in the lexer (often done by
setting variables shared between the lexer and parser when certain
parser states are entered) and using that to guide which tokens the
lexer emits to the parser. Because, the flow of tokens from the
lexer to the parser is generally considered one-way, this passing
of information back from the parser to the lexer is considered a
"feedback loop", and thus the term.
One thing of particular interest in this feedback, it is rare for
the feedback to change the token boundaries (i.e. which regular
expressions are grouped into tokens). Thus, Frank Deremer long ago
suggessted separating the lexer into two parts, the scanner and the
screener. The scanner is the part of the lexer which separates the
incoming character stream into lexemes and does some preliminary
classification based upon which regular expression was matched.
The screener then looks at some (or all) of those lexeme and
recategorizes them based upon symbol table (dictionary)
information.
Notably, the timing of those two phases can be separated. You can
delay the screening process until the information is actually
needed. This resolves the key problem of "feedback". Generally,
only the screener needs the feedback, although that is not
universally true (for example, in Verilog parsing context is
required to distinguish hex numbers from identifiers and that can
change token boundaries). Thus, in cases where the feedback does
not affect the scanner, the scanner is decoupled from the parser,
and the feedback loop is broken, which allows the scanner to run
arbitrarily ahead of the parser. And that ability to run the
scanner ahead of the parser defuses the "feedback" issue, whicdh is
that the two need to be kept in lock-step, which can be
particularly difficult if your parser needs lookahead tokens or
uses backtracking, since the state necessary for determining the
exact token type can be unavailable at the time the token needs to
be scanned.
However, in the C/C++ case, screening can be delayed until parsing.
In your Ada example, it sounds like some parts of screening can be
delayed until after parsing. The key issue of screening is that it
needs a symbol table (dictionary) and not just a simple stack.
Thus, you cannot convert a screener directly into a context-free
grammar. My favorite solution to this problem is the one Terence
Parr came up with "semantic predicates", which are bits of code
which run at specific parts of the parsing process (i.e. when the
parser gets to a specific point in the grammar) and provide a
go/no-go answer. So, one could write a semantic predicate, that
queries the symbol table for, "in the current context is this
identifier a typedef" and then depending on whether you wanted
typedefs to be legal or not, the parse would either continue with
the current rule or chose another alternative. Note, semantic
predicates are essentially equivalent to attribute grammars, so
that is an alternate solution.
Hope this helps,
-Chris
*****************************************************************************
Chris Clark Internet : com...@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
>1) James Roskind did build a C grammar which attempts to eliminate the
> need for feedback. (We were consultants to Honeywell building a C
> compiler at the time.) He analyzed all the cases where the use of
> an identifier and a typename could be confusing (many of them had
> to do with function prototypes). As I recall, he had some success.
> However, the truly ambiguous examples being discussed in this
> thread, make me doubt exactly what I am remembering. Perhaps, it
> was just that he had success identifying places, where a new type
> or variable *could* be introduced.
>
> In any case, he then tried to apply the same analysis to C++, and
> successfully proved that the same technique could not be used to
> distinguish the ambiguities in C++. All that could be done in C++
> is to push the ambiguities far-enough away that the amount of
> lookahead required made them impractical to distinguish.
Do you recall some of the problem cases?
I would think that it is always possible to delay categorizing an
identifier until after parsing. The difficulty lies in designing an
initial IR which incorporates the ambiguities in addition to the
unambiguous canon IR.
For example, in the case of "a = (b)-c", the parser could construct an
AST like the following
OP:=
IDENT:a
OP:-
EXPR
IDENT:b
IDENT:c
and after qualification of 'b' as a type expression or a variable the
AST can be rewritten to reflect the cast or variable access.
Similarly "x * y" can be parsed as simply
OP:*
IDENT:x
IDENT:y
and figured out afterward.
Of course it is much more work to deliberately construct an ambiguous
IR, and then analyze and rewrite it as the identifiers are qualified
and the ambiguities are resolved. It also delays issuing syntax
errors until the resolution pass.
Note that I am in no way championing this style of compiler design,
and I have never written a C or C++ compiler ... most of my work has
been in scripting DSLs, Lisp and Pascal derivatives ... but, to date,
I haven't encountered any language syntax ambiguities that I believed
could not be resolved after parsing was complete.
George
George Neuner <gneuner2/@comcast.net> wrote:
> Do you recall some of the problem cases?
I don't recall all the exact details.
I do recall the basic issue was with the C++ rule that if some text
could be parsed as an declaration, it was an declaration, and
otherwise it was a expression (or do I have that reversed?). This was
compounded by the "function prototype" syntax. So, if one saw
something like:
a( b (c), d (e), f (g) );
depending on whether b, d, and f were typedefs or not, this might be a
function prototype or it might be a function call. Notice, that, the
pattern is repeating and can be extended indefintely, and if f isn't a
typedef, then b(c) is a cast expression and not a parameter
declaration. Moreover, there are even more complicated examples,
e.g. parenthesis can be aribrarily deep, and I think some operators
like & and * can be used.
There are some interesting cases, like
a (b * c, d (e));
If "b" is a typedef, then "b * c" must be a declaration and not an
expression, because the left hand side of a binary * (i.e. "b") cannot
be a type, and if it is a unary * then "b * c" must be a cast and a
cast would require some parenthesis, either around "b" or around or
"* c". In that case, the whole a(...) clause must be a function
prototype (or a syntax error) and "d" must be a typedef.
> Similarly "x * y" can be parsed as simply
>
> OP:*
> IDENT:x
> IDENT:y
>
> and figured out afterward.
The problem in C derivatives, is you want the expressions to have a
different "shape" if it is a declaration and not an expression. In
the expression case, * is the root and the trees are divided below it.
In the declaration case, the * is a unary operator that binds tightly
to the 2nd tree, and there is no "root operator" that joins the
type-specification (left tree) and the declarator (right tree).
Getting a tree of the right shape, should be something that the parser
does, even if it leaves out a few details (i.e. are the expressions
parameters to a function or substripts of an array is a detail that
wouldn't change the shape of the list of expressions, just the details
of their meaning).
C (and most of its derivatives) are awful languages to parse because
of this ambiguous declaration syntax. The core ideas behind it are
simple, elegant, and marvelously brilliant. Unfortunately, as the
language grew and got extended, the combination didn't scale well.
However, until someone attempted the extension, I'm not sure the fact
that the syntax wouldn't scale would have been obvious. In fact, it
wasn't until we got experience writing C++ that the flaws became
noticeable at all, and even then it looked like a few special cases
could patch the problem.
Unfortunately, these problems are not just issues for compiler
writers, which if that were the only flaw, it would have been fine.
It can actually give people writing code in derivatives like C++
problems. The rules between constructors, function prototpyes,
function calls, and casts are not immediately obvious. Moreover, when
one forgets, one can get a very natural looking program, that doesn't
mean what one intended.
> Note that I am in no way championing this style of compiler design,
> and I have never written a C or C++ compiler ... most of my work has
> been in scripting DSLs, Lisp and Pascal derivatives ... but, to date,
> I haven't encountered any language syntax ambiguities that I believed
> could not be resolved after parsing was complete.
If you are willing to accept multiple parse trees for the same
token sequence it is always possible to 'successfully' parse
a language.
I have a parser for C that does not use any symbol table. The most
common ambiguous parse is:
f(i); // function taking one argument, or perhaps
// a declaration of i to have type f with redundant ()
I am currently working on a C++ parser. The main problem with
C++ is that there are a lot more constructs that lead to an
ambiguous parse.
I know of at least one other tool vendor who fully parses C/C++ source
using syntax information only, before building a symbol table.
> I would think that it is always possible to delay categorizing an
> identifier until after parsing. The difficulty lies in designing an
> initial IR which incorporates the ambiguities in addition to the
> unambiguous canon IR.
>
> For example, in the case of "a = (b)-c", the parser could construct an
> AST like the following
[...]
IMO you disregard the amount of postprocessing, or error recovery
(backtracking), when the first assumption was wrong. Argument binding,
operator precedence, and more, might be different for typecasts and
binary "-" or other operators like "(". When a tree has to be
reorganized later, why construct an tree in the first pass at all?
Weren't a multi-level grammar the simpler approach to (such) ambiguities?
DoDi
> Robert A Duff asks if anybody has tried parsing C or C++ without using
> a context sensitive grammar, and mentions that in Ada the grammar for
> array indexing and function calls are similar.
Thanks to all who answered!
> In C you can write an expression (a)-b*c. If a is a type then(a)-b*c is
> equivalent to ((a)-b)*c, but if a is a variable then (a)-b*c is
> equivalent to (a)-(b*c). To put it an other way: In Ada you will want
> a parse tree of the same structure independently of whether a subtree is
> an array indexing or a function call. It is simply a question of
> putting the right labels on the nodes.
Actually, that's not true of Ada, although the examples I gave did not
show it. In particular, X(Y) could mean a call to X passing Y as argument,
or an array indexing (among other things). If it's an array indexing,
X could be an array, or a call to a zero-argument function returning an
array, or the dereference of a call to a zero-argument function
returning a pointer to an array. Therefore, we don't know (based purely
on a context-free grammar) whether Y is a subtree of the call or not.
We don't even know how many nodes there are (call, indexing, call and
indexing, dereference and call and indexing). Therefore, we must
restructure the tree (a little) based on semantic information.
For Ada, I can't imagine doing all that with feedback from semantic
analysis to parser.
>...In C and C++ it is not enough to put some other labels on some
>nodes. You will want to reorder the tree, and that reordering will
>be so complex, that it is easier to use a context sensitive grammar.
That I can believe.
- Bob
> I know of at least one other tool vendor who fully parses C/C++ source
> using syntax information only, before building a symbol table.
Again, are there any practical benefits with this approach?
>> I know of at least one other tool vendor who fully parses C/C++ source
>> using syntax information only, before building a symbol table.
>
> Again, are there any practical benefits with this approach?
I cannot speak for why others have taken this approach, but
possibilities include:
o gets a parser up and running quickly. This makes it possible
to see what kind of characteristics the source has and tune
subsequent processing appropriately (ie, in this situation people
are not interested in handling 100% of source 100% correctly).
In my case I am interested in measuring source code usage and
so don't need a full semantic analysis of every construct.
o improved error recovery. Syntax errors are often difficult
to recover from, ie there tend to be lots of cascading errors.
Semantic errors tend to be localized and easier to attach meaningful
messages to.
In my case I often have to analyze source where I might not have any
information on where the headers that need to be included are located
(ie, include path information), or in some cases I might not even have
the headers :-(
o it is possible to analyze fragments of code, eg single statements
or declarations (perhaps typed on the command line).
This sounds like the GLR "parse forest" approach. It may be the best
one can do.
However, if you look at the function prototype examples I supplied,
one can easily see that one has nested ambiguities. That is: not only
is the outer function prototype ambiguous, but nested in each
parameter/argument is another set of ambiguities and potentially
nested within them a third, fourth, and so on level of ambiguities.
Thus, you get effectively an n-squared problem.
This is partially why Jim Roskind gave up. It isn't just unrolling a
loop, and doing a linear exploration of the ambiguities. There is a
whole matrix of ambiguous cases which potentially interact. Or to
borrow from Adventure, "You are in a twisty little maze of ambiguous
productions; some of them may be slightly different."
Perhaps, I've just gotten old and crusty, but it doesn't seem like
that would inspire much confidence in how the rules interact. I think
Dr Roskind found some cases where he thought the rules did not
interact well and wouldn't do want one would expect.
Of course, if you are given an existing language and told to parse it,
it is a little late to revisit that decision. The good news is that
C++ and it's friends have already caused the parser generator writers
to attempt to extend their reach with techniques like backtracking,
PEGs, GLR, predicated parsing, etc. As a result, some of those
languages can be parsed reliably.
Still looking at the title of this thread, I wouldn't hold out much
hope for a simple hand-written recursive descent solution to such
languages. Moreover, those who think they have such solutions are
likely deluding themselves. I doubt they could prove their solution
is correct to my satisfaction.
Just my opinions,
> Robert A Duff wrote:
>
>> The typical solution in Ada compilers is for the parser to build a
>> tree that means "this could be any of ...", with X and Y as subtrees,
>> and then semantic analysis has symbol tables, so it knows what X is,
>> and changes that node into a "function call" or "array indexing" or
>> whatever is appropriate. (Note: X could be an overloaded name.)
>>
>> My question is: can that technique work for the above C and C++
>> ambiguities?
>
> What benefits you would like with this technique?
If two or more modules depend on each other for their proper operation,
then they are tightly coupled. Cyclic dependencies make software
harder to understand than non-cyclic, other things being equal, because
you can't understand each module independently.
Can I use lookahead and/or backtracking in my parser? I'd like to be
able to answer that question without understanding semantic analysis
(whether the parser is hand written or tool generated). And when
understanding semantic analysis, I don't want to have to understand
subtle timing issues about when exactly the parser calls
"get_next_token" relative to when typedefs are inserted in the symbol
table -- I want to assume that I'm handed a tree on a silver platter,
and I don't care how it was constructed.
As someone else in this thread mentioned, if you're writing a tool
that depends only on syntactic issues (e.g. a pretty-printer), it
would be nice to be able to write a parser without writing a semantic
analyzer.
- Bob
> Therefore, we don't know (based purely on a context-free grammar)
> whether Y is a subtree of the call or not. We don't even know how
> many nodes there are (call, indexing, call and indexing, dereference
> and call and indexing). Therefore, we must restructure the tree (a
> little) based on semantic information.
When in Ada *intentionally* the syntax for calls and array indexing is
the same, a parser IMO should produce the same tree structure for both
cases. Different handling is required only in further compiler stages
(semantical checks, code generation...), which could work with tree
node attributes, instead of restructuring the parse tree.
DoDi
Array access may be considered special case of function call. IMHO,
distinction is often not that important until code generation phase.
But at this phase you should know everything about compiled program :)
It is same sort of problem as analyzing a=b+c. Are b and c numbers or
strings (in a language where you can add strings).
One can say that function call expression is overloaded for function
calls and array access. I don't see any problem here.
> For Ada, I can't imagine doing all that with feedback from semantic
> analysis to parser.
You don't need it. Anyway, unlike C, you can parse Y same way each
time.
--
Ivan Boldyrev
> However, if you look at the function prototype examples I supplied,
> one can easily see that one has nested ambiguities. That is: not only
> is the outer function prototype ambiguous, but nested in each
> parameter/argument is another set of ambiguities and potentially
> nested within them a third, fourth, and so on level of ambiguities.
> Thus, you get effectively an n-squared problem.
True, but not very common in practice.
> Of course, if you are given an existing language and told to parse it,
> it is a little late to revisit that decision. The good news is that
> C++ and it's friends have already caused the parser generator writers
> to attempt to extend their reach with techniques like backtracking,
> PEGs, GLR, predicated parsing, etc. As a result, some of those
> languages can be parsed reliably.
I thought the 'good' news about C++ parsing was that language designers
have (re)learned the importance of having a grammar that is easy to
parse using existing tools.
I think the interest in GLR and its ilk is driven by researchers
interest in doing something new and now having the computing resources
to do it.
> Still looking at the title of this thread, I wouldn't hold out much
> hope for a simple hand-written recursive descent solution to such
> languages. Moreover, those who think they have such solutions are
> likely deluding themselves. I doubt they could prove their solution
> is correct to my satisfaction.
The latest version GNU C & C++ compilers both use recursive descent. In
my view this was a big mistake for C which will make it much harder for
third parties to figure out what syntax gcc actually supports. I know
at least one other major C++ vendor uses a recursive descent parser, so
perhaps there are some overriding advantages of parsing this way in C++.
>> Thus, you get effectively an n-squared problem.
>
> True, but not very common in practice.
Unless one is writing obfuscated code, either intentionally (or
accidentally!)
> I thought the 'good' news about C++ parsing was that language designers
> have (re)learned the importance of having a grammar that is easy to
> parse using existing tools.
I hope you are right.
> I think the interest in GLR and its ilk is driven by researchers
> interest in doing something new and now having the computing resources
> to do it.
Actually, I think it is a mix. Researchers are always looking for
something new--that's their job. However, the problems of the day
show them which avenues to look down. GLR and backtracking got
popular because some breaktrhoughs were made in their implementation,
which allowed reasearchers to attack some previously ambiguous
problems and C languages rewarded them for doing so. You could get
some nice results that were interesting to others by prusing that
research. My take is that the problems guided the research as that's
the way I viewed it--I wanted solutions that would enable parsing of
C++ in a rigorous manner, so picked up hints that looked likie they
woulld lead that way. However, I can see perceiving it in reverse.
A worthy aside on this, it looks like PERL regexps are close to
falling that same way. There are several researchers that are
pursuing lines of reasearch all aimed at cracking that problem. For
example, Villi Laurikiri's work on tagged regular expressions and
Cezar Champeanu and Sheng Yu's work on pattern expressions. My guess
is that within 5 years, we will see PERL regexps on a fairly sound
basis and several tools that generate new DFA variants that can
process them.
>> Still looking at the title of this thread, I wouldn't hold out much
>> hope for a simple hand-written recursive descent solution to such
>> languages. Moreover, those who think they have such solutions are
>> likely deluding themselves. I doubt they could prove their solution
>> is correct to my satisfaction.
>
> The latest version GNU C & C++ compilers both use recursive descent. In
> my view this was a big mistake for C which will make it much harder for
> third parties to figure out what syntax gcc actually supports. I know
> at least one other major C++ vendor uses a recursive descent parser, so
> perhaps there are some overriding advantages of parsing this way in C++.
Well, hand-written recursive descent does expose one to the full Turing
complete power of one's programming language, but as you say (and I
definitely agree with), that can make it real difficult to be certain
what syntax is actually supported.
I don't think one needs a full Turing machine to parse C++, and if one
is smart, one picks a smaller more constrained and thus more
accessible, provable, and reliable notation for solving the problem
(e.g. an LL or LR grammar with some extensions to deal with specific
difficult spots). However, I think that course is too difficult for
most C++ compiler writers, as it takes dilligence and persistence to
find the solution in the constrained form.(*) Where as with a recusrive
descent parser, one can just hack in a solution and the [side-]effects
on the rest of the language, when something unexpected falls into that
hack, be damned--one can always add another layer or two or three or
four of hacks that patch the hack and attempt to constrain it to what
one actually meant.
*: the dilligence and persistence mentioned above are further
strained, because many of the tools which extend LL or LR parsing are
not entirely mature. Thus, one has to fight with not only getting
something theoretically correct, but also acceptable to a tool which
may implement the theory quite imperfectly. Perhaps, the RD compiler
writers are smarter than I give them credit for. I'm just not
convinced yet.
Just some more opinions,
-Chris
*****************************************************************************
Chris Clark Internet : com...@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------