Just Curious,
Mark
hj...@cs.cmu.edu
--
Send compilers articles to comp...@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
The only uncertainty here is what determining what happens while a
name is being declared.
If I understand aright, you're defaulting an alphanumeric token to be
an identifier unless it first appears in a typedef as one of the
declared names. So I expect you'll have to have a production which
will declare an identifier as a typedef name. For example
(oversimplifying away things like pointer and array specifiers):
typedef_decl
: typedef_prelude identifier {decl_typedef($1, $2); $$= $1;}
| typedef_decl ',' identifier {decl_typedef($1, $3); $$= $1;}
;
typedef : typedef_decl ';' ;
where decl_typedef() is responsible for the paranoia checks and symbol
table manipulations. After being hit by the above production, that
specific identifier will be recognized by LEX as an identifier.
What about lookahead and backtracking? The terminating ';' is
guaranteed to eat YACC's one token lookahead, and backtracking only
occurs when an error is encountered -- and I hope you're not expecting
perfection in the face of a syntax error?
Also, note that an attempt to declare the same name twice in the same
typedef will be a syntax error. For example:
typedef int foo, foo;
Here, when "typedef int foo" is reduced (as
typedef_prelude identifier_), you are guaranteed that the input has
not been lexed any farther than "typedef int foo," so when the second
"foo" is lexed, it will be recognized as a typedef name, even with
array specifier stuff in the grammar.
--
Raul Deluth Miller-Rockwell <rock...@socrates.umd.edu>
One place where every yacc/lex based C compiler I know of is
broken is on a typedef name redefined in an inner scope:
typedef int foo;
main ()
{
char foo;
}
is legal ANSI C. I think there was a thread on this a while back.
bb
We use a different approach:
As we use the GLR parsing technique, which is an LR parser which is able
to split up in several LR parsers on a conflict, we are able to solve this
problem in a very general way. The scanner just returns all possible
interpretations of a token, the parser splits up to pursuit the different
possibilities; some of these will die in the sequel and the correct choice
survives. This leaves all information about the grammar out of the lexical
scanner and is guaranteed to work for any case without any inspection of
the grammar by a programmer.
Jan Rekers (Jan.R...@cwi.nl) Centrum voor Wiskunde en Informatica (CWI)
P.O. Box 4079, 1009 AB Amsterdam, The Netherlands
Microsoft C gets it right, strangely enough. pcc doesn't. Therefore, it's
not a problem with yacc, but with the compiler implementation. (Yes, msc is
built using yacc, and I've even used pure AT&T yacc to build it.)
--
Sean Eric Fagan
s...@kithrup.COM
One place where every yacc/lex based C compiler I know of is
broken is on a typedef name redefined in an inner scope:
typedef int foo;
main ()
{
char foo;
}
is legal ANSI C. I think there was a thread on this a while back.
Well, take GNU C (versions 1.something and 2) off your list of broken
yacc/lex-based compilers. (Ok, it's not lex-based and it is bison-based,
but I don't think that matters.) I just tried this example on both
versions and it works fine.
--
James Craig Burley, Software Craftsperson bur...@gnu.ai.mit.edu
Well, I assume that you're being careful about redeclarations. For
instance,
typedef int a;
{
float a;
a++;
}
is legal in ANSI C. What you need, of course, is to have a few places
in the grammar that allow either an identifier or a typedef name.
In general, I know of no way of verifying that your parser has
everything right, although meticulous comparison of the ANSI rules and
your parser will help, and thorough exercising with C validation
suites will help.
Dale Worley Dept. of Math., MIT d...@math.mit.edu
> One place where every yacc/lex based C compiler I know of is
> broken is on a typedef name redefined in an inner scope:
>
> typedef int foo;
>
> main ()
> {
> char foo;
> }
>
> is legal ANSI C. I think there was a thread on this a while back.
I missed this thread, but anyway:
The above shouldn't be a problem, because this is not really an ambiguous
occurrence. You can deal with that by having a production
any_ident : ident | type_ident;
and using any_ident for the identifier in a declarator (and several other
places). This should be possible without introducing any ambiguities.
But for some parts of the C syntax this is not so easy, for labels you
probably have to expand the any_ident production to allow programs like
typef int foo;
main ()
{ foo: ;
}
because otherwise there is a shift-reduce conflict
(reduce type_ident to any_ident for labels, shift for declarations).
[It's not impossible, but it's tricky and messy to get right. -John]
It works fine with my yacc/lex based parser. The trick is that when you
see the char, you have to make sure that the lexer is told of the change
of state soon enough. But thanks for noting the potential problem; I'll
add it to my list of test cases.
In answer to the original question: you don't know if the parser will or
will not get a lookahead. That depends on the parser implementation. For
yacc and bison, you can study the -v output to find out.
In general, for an LR parser, you can tell if the parser MUST get a
lookahead by examining the action table: if all the non error entries are
identical, you don't need a lookahead token; otherwise, you do.
---
Bill { uunet | decwrl | telesci }!twwells!bill
bi...@twwells.com
/* start politically incorrect statement */
I may misunderstand, but it seems to me the problem is aggravated by
having the scanner read the symbol table to decide what kind of token foo
is. If so, that seems to me to violate software engineering principles -
which in several cases leads to problems. The scanner should only know
about lexical information - and lexically foo is just an identifier. The
parser knows about levels of scope, context etc and should look up
identifiers in the symbol table(s) to determine what they denote. I know
the dragon book suggests a design where the scanner reads the symbol
table, but that seems to me to be fraught with difficulty in block
structured languages with nesting etc - and even more so in languages like
Ada with multiple imported symbol tables (or ML with its different name
spaces which depend on the context.) For a simple language like C with an
almost flat name space - it probably works fine to follow the dragon
design.
So foo should be sent to the parser as an identifier token (with a way of
recovering the string (or a code for the string) from the token, and not
as type_identifier token or a label_identifier token
I have used this approach several times since it was taught to me at school
(we used another text)
What do real compilers do, esp for other languages besides C ?
P.S. The dragon book is a fine text in many ways, I just wonder about this
one detail.
--
Alex Blakemore blak...@software.org (703) 742-7125
Software Productivity Consortium 2214 Rock Hill Rd, Herndon VA 22070
O.K. I haven't got out the grammar and done the actual table construction
(read: disclaimer), but declarations ARE the one place where you do need
the separate tokens for ident and type_ident. any other place, the
any_ident->ident|type_ident rule works fine (On labels, for instance, the
: in the lookahead stream resolves the ambiguity. I have also sucessfully
used the above productions to allow a typedef name to also be a tag name).
Consider the code fragment:
typedef int z;
main() {
long z;
}
is z being redeclared as a local variable in main(), or are you just
specifying the empty declaration for a long int type? The ambiguity
depends upon which token you return from the lexical analyzer when a is
encountered for the second time. The ANSI C grammar in the back of K&RII
is not ambiguous: it assumes that the lexer resolves the ambiguity, not
the parser.
The fix to this problem is much easier than I first thought: Just use
lex's right-context sensitivity operator (/) to search ahead in the input
stream for one of [,{;] (preceeded by optional whitespace) when an
identifier is encountered. In cases that match, always return the IDENT
token; on cases that don't, lookup the name and return TYPE_NAME if the
identifier is a typedef name, return IDENT otherwise.
As for my original statement
>One place where every yacc/lex based C compiler I know of is broken
I knew sun's cc was broken & any C compiler I had work on was too,
couldn't figure out a way to easily fix the problem, and over-generalized :-)
bb
>We use a different approach:
>As we use the GLR parsing technique, which is an LR parser which is
>able to split up in several LR parsers on a conflict, we are able to
>solve this problem in a very general way. The scanner just returns
>all possible interpretations of a token, the parser splits up to
>pursuit the different possibilities; some of these will die in the
>sequel and the correct choice survives. This leaves all information
>about the grammar out of the lexical scanner and is guaranteed to
>work for any case without any inspection of the grammar by a programmer.
I hate to disagree with my friend Jan, but this idea of splitting
off alternative parses introduces other nasty problems and it is
not clear that you would be any better off. Consider the
following C expression:
(foo)-3
This has two different parses depending on whether foo is declared
as a variable or as a typedef name. I.e., if foo is an int variable
then the expression is equivalent to foo-3; but if foo is a typedef
name, then the expression is equivalent to (foo)(-3) where a type
cast is being applied.
If we feed a program containing the k consecutive statements
v1 = (foo)-1;
v2 = (foo)-2;
v3 = (foo)-3;
.
.
vk = (foo)-k;
into Rekers' parser, we should get back at least 2**k different
parses.
The only way to prevent this combinatorial explosion of
possibilities is to kill off the alternative parse possibilities
as they are created. For example, the C grammar rule
<Expression> ::= ( <type-name> ) <Expression>
would need to have an associated semantic test to check whether
the type-name has actually been declared as a type name and, if
not, suppress this particular parsing possibility.
Thus there has to be feedback from semantic analysis into the
parser. I don't believe this is any great improvement over the
conventional approach of having feedback from semantic analysis to
the lexer so that typedef names can be returned as a different kind
of token.
Nigel Horspool
University of Victoria
>How do I KNOW that there are no situations
>which, due to parser lookahead, would cause the scanner to return an
>incorrect token type for an identifier (i.e. return "identifier", even
>though the identifier was just/will be made into a "typedef name")?
The way to KNOW would be to prove that the scanner and parser conform to
specifications (not easy, since we have neither an exact spec or a
semantics of C :-), and to prove properties of one's microsyntax or
grammar.
Another way to handle the identifier/keyword problem is to introduce a
token class <name> = <letter> {<letter> | <digit>} or whatever, and then
either insert a "filtering pass" that checks each name against a list of
reserved words and replaces the token with <identifier> or the appropriate
keyword token (I feel the "but-that's-grossly-inefficient" flames already,
but I'd rather be right than fast :-), or else to insert a blub of code at
the point that the name token has been recognized to differentiate between
the two.
--
Debora Weber-Wulff d...@inf.fu-berlin.de
Institut fuer Informatik +49 30 89691 124
Nestorstr. 8-9
D-W-1000 Berlin 31
>Just Curious,
>Mark
Mark,
I believe I can accurately address your question. You ask:
How do I KNOW that there are no situations which, due to
parser lookahead, would cause the scanner to return an
incorrect token type for an identifier?
You can definitely KNOW this answer by either constructing the Item sets
by hand or better yet, obtaining the set yacc has constructed (by invoking
yacc with the -v option) during parser generation. Since you are using
the ANSI specification, you will see that "tIdentifer" and "tTypedefName"
both appear in the same Item set. Since a typedef name is lexically
identical to an identifier, this places the burden of resolving this
ambiguity on the scanner. You now KNOW that you do have a situation (ie.
a parser configuration exists that places the burden of resolving lexical
ambiguity on the scanner) where the scanner is unable to return the
*correct* token type without some external help.
Brian Bliss also points this out:
The ANSI C grammar in the back of K&RII is not ambiguous: it
assumes that the lexer resolves the ambiguity, not the parser.
It's not clear to me why you qualify your question with "..., due to
parser lookahead, ...." The parser will only look ahead when it is unable
to make a parsing decision, such as shift, reduce or goto. Since the
ambiguity between typedef name and identifier is lexical rather than
syntactical, I don't understand your reason for qualifying your question
by parser lookahead.
Assuming I may continue, I would like to offer a strategy for handling
such a conflict.
Instantiate a flag, such as "typedefRecognition," and initialize it to
"ON" prior to parser invocation, after completed declarations and "OFF"
after recognition of typedef names. The scanner will consult the setting
of this flag after recognizing the lexeme as an "identifier" but before
returning the token ID. If "typedefRecognition" is "ON," the scanner
searching the name space for a symbol with an attribute matching
"typedefName."
Something like this might be in order:
parser code:
typedefRecognition = ON ;
status = yyparse() ;
scanner code:
... identifier scanning ...
tokenToReturn = tIdentifer ;
if(typedefRecognition == ON && isTypedefName(ActiveScope, lexeme)) {
typedefRecognition = OFF ;
tokenToReturn = tTypedefName ;
}
grammar file:
type-specifier:
tVoid typedefRecOn
| tChar typedefRecOn
.
.
.
| tTypedefName typedefRecOn
;
typedefRecOn:
{
typedefRecognition = ON ;
}
;
Hope this is helpful,
Jeff
----------------------------------------------------------------------
Jeff Woods
Software Engineer (214) 497-4501
Software Development Tools Group (214) 497-4500 FAX
CONVEX Computer Corporation jwo...@convex.COM
typedef int foo;
a new rule
int : foo; // 1
will be introduced together with the new terminal 'foo'. Also, the basic
set of rules should contain a rule looking something like
declaration : int lis_of_identifiers ';' ; // 2
list_of_identifiers : identifier // 3
| list_of_identifiers ',' identifier // 4
Then, the text
foo var;
will be reduced by the sequence of rules 1, 3, 2. The example presented
does not address issues of scope, i.e. what happens when typedefs are
redefined in the inner scope and then become visible again upon exiting
the scope. This relates to the way new rules are added/deleted to/from
the basic set. I have submitted to SIGPLAN NOTICES a description of a
metalanguage for the USSA system, based on this approach. The
metalanguage is powerful enoupgh to handle various scenarious of rules
modification, so that solutions to visibility and scope problems may be
expressed.
There are also papers printed in SIGPLAN NOTICES, regarding this approach:
Henning Christiansen. "A Survey of Adaptable Grammars," Sigplan Notices,
V.25 N.11, 35-44, 1990.
Boris Burshteyn. "On the Modification of the Formal Grammar at Parse
Time," Sigplan notices, V.25 N.5, 117-123, 1990.
Boris Burshteyn. "Generation and Recognition of Formal Languages by
Modifiable Grammars," Sigplan Notices, V.25 N.12, 45-53, 1990.
Thanks, Boris Burshteyn - a man who can spell 'automaton'.
Under "grammar file:", I introduced the nonterminal typedefRecOn erroneously.
It should have been typedefRecOff and should have been re-initialized with
"OFF" instead of "ON."
I should also mention that I am only trying to offer Mark some help.
There are many other places this flag must be "turned on/off" (such as
before/after cast-expressions, declarations, etc...). I have
omitted showing these because I felt they weren't needed to make my
point.
If you feel this example is not clear enough, please send me email and
I will be glad to address it in more detail.
Thanks again Gene,
Jeff
This approach bombs on
/* empty declaration with only a typedef name */
typedef int foo;
foo;
and on
/* tag names */
typedef int foo;
struct foo var;
the latter can be solved as suggested earlier:
In article <92-0...@comp.compilers>, s...@dcs.edinburgh.ac.uk writes:
|> [Reusing a typedef name] shouldn't be a problem, because this is not really
|> an ambiguous occurrence. You can deal with that by having a production
|>
|> any_ident : ident | type_ident;
|>
|> and using any_ident for the identifier in a declarator (and several other
|> places). This should be possible without introducing any ambiguities.
a scheme which will work for tag names and labels just fine,
but not names in the identifier space.
blak...@software.org writes:
|>So foo should be sent to the parser as an identifier token (with a way of
|>recovering the string (or a code for the string) from the token, and not
|>as type_identifier token or a label_identifier token
again, This can't be done without introducing ambiugities into the grammar.
(at least the grammar given in the back of K&RII) try introducting a
production typedef_name->T_IDENT and see how many shift reduce conflicts
you get.
bb
>Instantiate a flag, such as "typedefRecognition," and initialize it to
>"ON" prior to parser invocation, after completed declarations and "OFF"
>after recognition of typedef names. The scanner will consult the setting
>of this flag after recognizing the lexeme as an "identifier" but before
>returning the token ID. If "typedefRecognition" is "ON," the scanner
>searching the name space for a symbol with an attribute matching
>"typedefName."
again, consider the two code fragments
typedef int foo;
main () {
long foo;
}
and
typedef int foo;
main () {
long foo x;
}
in the latter, we definitely want to return T_TYPEDEF_NAME as the token
for foo. in the former, we want to return T_IDENT for foo, as it is being
redecalred as an identifier in an inner scope. The two programs are
identical up to "foo", so there is no amount flag twiddling that we can do
to resolve the problem correctly; we must use some sort of lookahead.
The lookahead can either be done by the lexical analyzer, as I've been
attempting to do, or by the parser, which means we return the same token
for both typedef names and other identifiers. The latter introduces many
ambiguities in the grammar in K&RII, in declaration statements, in
expression in the form of a single identifier appearing after declarations
and before the executable statements within a block, and within function
prototypes, and probably other places as well.
bb
So does this one :-)
[description of modifiable grammars deleted]
The following might seem like a flippant answer, but I am serious :-
One solution to the typedef problem, is to avoid it completely i.e. don't
bother trying to parse C and don't create any extensions of C. If you
really must parse C, use one of the hacks mentioned already i.e.
interogating the symbol table.
It seems to me that a lot of time is wasted trying to provide a solution
to something that is essentially broken. I'm disgusted at language
definitions that require hacks like symbol table interogation, or even
worse a modifiable grammar, just to be able to parse the syntax. C at
least has the excuse of being nearly 20 years old, but for newer
languages, there is no excuse at all. IMHO any person who designs a
grammar that needs additional help to be parsed, should have a _very_ good
reason for doing so.
For example, the problems caused by the same syntax for function calls and
array access in Ada is at least in part made up for by the fact that an
array can be swapped for a function very easily without having to change a
lot of code (an array is a discrete function after all). Having said that
I personally don't find this argument compelling, but at least it _is_ an
argument. I mean, just what is BK's or DMR's excuse for the typedef hack?
There have been (and are) too many languages that don't seem to have had
any real thought put to the design of the syntax. The designers seem to
start with "it should look like Pascal" or "it should look like C" and
blindly add syntax without giving it due consideration.
So please, any budding language designers/extenders out there, think long
and hard about about the syntax of the language and its impact on checking
context conditions (or static semantics if you prefer)
This is a heartfelt plea from somebody who has spent way too much time
poring over the definitions of languages like :- C, Pascal, Ada,
Oberon-2, ML, Haskell, ... etc. trying to make sense of their syntax and
context conditions.
Stephen J. Bevan be...@cs.man.ac.uk
PS Q: which language has the best syntax?
A: Lisp/Scheme, as there isn't any concrete syntax.
You program directly in the abstract syntax.
Makes writing the concrete -> abstract converter easy :-)
[On the other hand, some of us would put "easy to parse using yacc" fairly
low on our list of criteria for good language design. -John]
So how about a FAQ? Volunteers anyone?
[If someone wants to write the definitive treatise on this topic, I'd be
happy to add it to the archives. Otherwise I'm declaring this topic closed.
-John]
) Something like this might be in order:
)
) parser code:
)
) typedefRecognition = ON ;
) status = yyparse() ;
)
...
) grammar file:
)
) type-specifier:
) tVoid typedefRecOn
) | tChar typedefRecOn
) .
) .
) .
) | tTypedefName typedefRecOn
) ;
)
) typedefRecOn:
) {
) typedefRecognition = ON ;
) }
) ;
)
) Hope this is helpful,
)
) Jeff
Yes, something *like* that is possible, but it is rather tricky. The
switches in grammar rules that toggle typedef- recognition must be
situated very carefully, and in places that some may find
counter-intuitive. In the example above, "typedefRecOn" should be located
*after* "tTypedefName" and "tVoid", not before. (!)
Also (for purposes of maintainability if for no other reason), you want to
be sure that the scanner is always "eager" to do lookahead, not "lazy" in
states where a lookahead is not urgently needed to select a state-change.
A few months ago I posted a YACC/LEX program which, to the best of my
knowledge, does everything correctly, although given the trickiness of the
problem, I would not be astonished to hear otherwise.
Also I was under the false impression that ANSI C was meant to be a strict
extension of C, and therefore implemented some features of C-Classic which
ANSI C dropped, thinking the formal grammar I started from was "buggy".
Thus my statement that I thought the parser was "ANSI C" was technically
incorrect, although probably not enough so to cause any problems. It is
"ANSI C with C-Classic Extensions", I guess.
This is an excellent and interesting question, and pulled out a very
nice line of responses. My answers are:
1) I have no "general method" of proving correctness.
2) In this case, correctness can be proved.
3) If your grammar came from the standard, the proof will
identify an error in your grammar.
4) A correct grammar exists, and I distribute it for free.
For folks that can't figure out why there is so much discussion about this
problem, consider the code fragment:
T(*b)[4];
Notice how the meaning is very different depending on whether T is a
typedef-name, or an identifier (and hopefully a function in this case).
For this reason most C lexers distinguish between typedef names and other
identifiers. Alas, as the poster points out, there is concern that the
feedback loop (parser annotates symbol table, which lexer interrogates,
and then feeds tokens to the parser) will fail in some complex scenario.
Other threads of response included the use of a GLR parser, which tries
all alternatives. Alas, as will be shown, the ambiguity in C is too
great, and such an approach will not provide a unique syntactically
correct parse when it does exist.
Other respondees mentioned that almost no yacc based parsers get this area
of the standard right. Since the grammar included in the standard is
misleading, the cause for this problem will become abundantly clear. I
will start with a description of how a proof of correctness could proceed,
take a slight aside to show why the traditional grammar tends to cause
this proof to fail, and explain how the grammar can be written (provably)
correctly. Since I am typing this on the fly, I will surely make some
errors, but I hope that most of the readers will find the content
interesting. Note that my grammar distribution includes a very large
paper that deals with many C/C++ ambiguities in much greater detail, and
with many more examples.
The fact that yacc uses LR(1) scanning is pivotal to the argument in this
special case. In this special case you can note that the typedef-ness of
an identifier can *only* change after a declarator is complete.
Specifically, the standard states that an identifier enters scope only
after its declarator is complete. The completion of a declarator is
always marked by one of three tokens:
";" end of declaration statement
"," end of individual declarator (another is comming)
"=" end of declarator, initializer is comming.
To prove correctness, you must show that reductions involved with
annotation of your symbol table take place while one of the three tokens I
just mentioned is waiting in the lookahead buffer. The argument notes
that since none of the three tokens I identified relies on the
typedef-ness of an identifier, everything will work out (i.e., the
identifier vs typedef-name separation by the lexer can always be done in a
correct and timely manner). Basically, all you have to do is look at the
position of the action code that is placed in the grammar to annotate the
symbol table, and be sure that this happens before one of the above three
terminating tokens (context counts) has being processed.
As an example, suppose you had a rule that looked like:
declaration: storage-class INT identifier ';' { /* annotate sym table */ }
Please don't bother to argue that you would never have such a specific
rule in the grammar, rather note that the action is being done *after*
reading the ';'. This is "bad." If you had such a rule, then there is a
chance that yacc would have read the next token (it is allowed to if it
needs to), and there is a chance that the next token *might* be an
identifier. Hence, with the above rule, there is a chance that the lexer
would process an upcoming identifier *prior* to the annotation of the
symbol table, and hence incorrect results could arise. For example:
typedef int T ;
T a;
might be incorrectly parsed (the second "T" might be read *and* tokenized
by the lexer before the symbol table is annotated). A guaranteed correct
rule could be written:
declaration: storage-class INT identifier { /* annotate sym table*/ } ';'
Note that here the action takes place with the parser at *most* looking
ahead at the ';', and hence the lexer would not have had to process future
identifier/typedef-name distinctions.
Getting back to your example, since you copied the grammar from the ANSI
Standard, your grammar is probably wrong. To expose the problem, try
parsing:
typedef int A, B(A); /* my bet is you get a syntax error*/
When correctly parsed, the above is equivalent to:
typedef int A;
typedef int B(A);
or simply:
typedef int A;
typedef int B(int); /* B's type is function taking an int
returning and int */
Again, to verify my interpretation, look at the verbose discussion of
declarations in the ANSI standard and note that an identifier enters scope
when its declarator is complete. Despite this assertion in prose, most
grammars are based on the K&R or ANSI grammars, which gathers an
"init-declarator-list" before combining the declarators with the
"declaration-specifiers." In the above example, gathering the entire list
"A, B(A)" before annotating the symbol table for "A" would allow the lexer
to incorrectly tokenize the second "A".
You might wonder how GNU GCC (or others related compilers) manage to get
the above parse right, when their grammar looks a lot like this standard
(and I assert, faulty) grammar. In the case of GCC, this area of the
action code includes a hack wherein the "decl-specifiers" are fetched from
lower points in the stack using $-1 (using negative numbers allows action
code to reach down into questionable areas of the yacc stack, whereas
positive numbers correspond to the tokens in the productions with the
action). Basically, by reaching backwards when they see a ',' (or '='),
they are able to annotate the symbol table even though they use the
standard grammar. Once again, it is provable that they get the correct
parse (if they can prove that $-1 always contains the desired
decl-specifier value. This latter assertion is far from trivial to prove,
as they must correctly process function declarations, which occasionally
have no decl-specifiers! They get it right, but a second hack is required
atop the first hack, and these two hack unite to make C++ parsing a horror
for them).
For those compiler hackers out there, it is worth noting that the problems
are greater than what was just given in the above example. Not only can
"typdef-ness" be established by a declarator, it can be taken away. For
example, consider the following:
#include <assert.h>
typedef char F;
main() {
long a=sizeof(F), F, b=sizeof(F);
assert(1 == a);
assert(sizeof(long) == b);
}
With proper rewriting, a grammar can be made to correctly process such
declaration without "cheating" and looking at $-1 (or where ever else this
info might be socked away). The trick is to combine the declarator with
the declaration specifier (and annotate the symbol table), then combine
with the optional initializer (rather than forming an init-declarator),
and then if there is a ',', form a token that once again acts just like
the decl-specs to process the next declarator. If you want to see the
correct grammar, take a look at my distribution (which includes a C and a
C++ grammar, along with a cheap FLEX input file).
Since one of the respondees included suggesting that a GLR parser could
just try all interpretations, and that only one would be valid, the
following is a rather interesting example:
typedef int T;
main() {
int T=100, a=(T)+1;
assert(101 == T);
}
Notice that the text "(T)+1" could incorrectly be interpreted, if "T" were
still a typedef-name, as a cast to type T of the expression "+1". Note
that the GLR approach would yield two distinct and valid parses, and hence
there would be no single winner (without additional heuristics).
Another person focused on the issues involving redeclaration of
typedefnames at inner scopes. Indeed, this was something of a bear to
tackle (without kludging). Examples of the code that usually causes
problems include:
typedef int T1 ;
typedef int T2 ;
typedef int T3;
main() {
const T1 T1; /* redeclares T1 to be an int */
const T2 (T2); /* redeclares T2 to be an int */
const T3; /* syntax error : missing declarator */
}
Bruce Blodgett (formerly of Honeywell Corporation) and I independently
implemented clean resolutions of this within a YACC grammar, but the
results can be seen within my distribution. Interestingly, this careful
analysis of C identified some unaddressed scenarios in the ANSI standard.
Since these results go beyond the scope of this posting, interested
readers are encouraged to look at my paper which is included with my
distribution.
Doug Lea and Doug Schmidt have graciously offered to provide anonymous ftp
sites for the 8 files, as well as the Berkeley YACC source (if you need
it).
ics.uci.edu (128.195.1.1) in the ftp/gnu directory (even though neither of
the archives are GNU related) as:
c++grammar2.0.tar.Z
byacc1.8.tar.Z
mach1.npac.syr.edu (128.230.7.14) in the ftp/pub/C++ directory as:
c++grammar2.0.tar.Z
byacc1.8.tar.z
Jim Roskind
Independent Consultant
(407)729-4348 or (617)290-4990 x5570
j...@hq.ileaf.com or ...!uunet!leafusa!jar
p.s.: I'm always looking for interesting compiler and/or grammar work.
On the other other hand, I would expect "easy for humans to parse quickly"
should be very high on everyone's list. The worst problem with these
kinds of syntax design mistakes is not so much the inconvenience that it
causes compiler implementors; rather, it is the fact that if the lexer
and/or parser have to jump through hoops to classify a construct, then so
does the *person* trying to read this code. The existence of these kinds
of not-quite-ambiguities make the language itself (i.e., *any* program
written in that language) inherently less readable. This is bad but sorta
manageable in C; it's just scary in C++.
> Q: which language has the best syntax?
> A: Lisp/Scheme, as there isn't any concrete syntax.
> You program directly in the abstract syntax.
I'm not quite sure what you mean here; I do believe that it is the
difficulty that most humans have parsing LISP that is one of the major
factors that has prevented that language from becoming as popular as its
proponents might like for it to be.
(Is this thread more suitable for comp.lang.misc now?)
--
Doug Landauer -- land...@eng.sun.com _
Sun Microsystems, Inc. -- STE, SunPro, Languages
I worked on the parser for the Compass C compiler. The solution we used
to the typedef vs. identifier problem was to have the scanner return two
different token types, based on the symbol table. In order to make this
work correctly, a few things have to be done carefully:
An identifier has to be inserted into the symbol table when the
punctuation following its declarator (',', ';', or '=') is read. This
is not particularly tricky, since an LALR(1) parser never has to look
ahead at that point, and our parser looks ahead only when needed.
Making sure that tricky declarations like
typedef int foo;
foo;
foo x;
volatile foo;
int foo;
are handled correctly is done by having the grammar of declarations
enforce the rule that "if a typedef name is used as a declarator, the
declaration-specifiers of the declaration must contain a type-specifier".
Thus, if "foo" is a typedef, then
volatile foo;
is an empty declaration (and is illegal), while
int foo;
is a redeclaration of "foo" as an integer variable. To do this, the
grammar must keep track of whether a type-specifier has been seen in the
declaration-specifiers as it reads the next token and tries to decide
whether it is another declaration-specifier or a declarator. Thus you
have grammar rules like:
declaration-specifiers-without-type-specifier ::=
declaration-specifiers-without-type-specifier type-modifier
declaration-specifiers-with-type-specifier ::=
declaration-specifiers-with-type-specifier type-modifier
| declaration-specifiers-without-type-specifier type-specifier
| declaration-specifiers-with-type-specifier type-specifier
declaration ::=
declaration-specifiers-without-type-specifier
non-typedef-declarator
| declaration-specifiers-with-type-specifier
non-typedef-declarator
| declaration-specifiers-with-type-specifier
typedef-declarator
If the grammar is aware of these distinctions, it can be made LALR(1).
This multiples the grammar rules, of course, but this is not the only
place that a practical C grammar has several clones of a single rule in
the Standard's grammar.
Dale Worley Dept. of Math., MIT d...@math.mit.edu
Maybe that's why there are so many (syntactically) badly designed
languages :-)
IMHO by definition the "syntax" should be parsable by a context free
grammar. As I said previously if you don't design it like this you should
have a good reason. I have nothing against languages that deviate from
the rule as long as there is some real benefit from it. The Ada example I
gave is one example of this. I'm at a loss to think of another one.
Also, I don't consider "yacc" to be the last word in syntax analysis.
Like C it's old and past it's prime. There are better tools freely
available (e.g. GMD toolbox), if only people would use them.
Stephen J. Bevan be...@cs.man.ac.uk
>> [On the other hand, some of us would put "easy to parse using yacc" fairly
>> low on our list of criteria for good language design. -John]
I'm not too worried about Yacc in particular, but I think "easy to
parse automatically" needs to come fairly high up -- see below.
>On the other other hand, I would expect "easy for humans to parse quickly"
>should be very high on everyone's list.
A successful language must be easy to (a) write [else no-one will
use it]; (b) read [else it will be unmaintainable]; and (c) parse [else
implementations will be expensive and rare]. Anyone [poetic licence!]
can write a Pascal parser -- so Pascal compilers are ten-a-penny, every CS
student studies them, and Pascal is (or was until recently) popular well
beyond the intrinsic merits of the language. Contrariwise, Algol is a pig
for amateur compilation, and languishes despite its merits.
>[...] if the lexer
>and/or parser have to jump through hoops to classify a construct, then so
>does the *person* trying to read this code.
This doesn't follow. Traditional lexers and parsers take a
microscopic view of the code (typically based on "what is the next
symbol?"). Human readers typically take a more global view, even on first
reading. I don't carry a symbol table in my head, so I expect to have to
refer back *or ahead* in the code to discover what some variable "is"; I
can make use of comments or the names of variables to intuit what some
code is trying to do; layout often helps to give the global structure of
the program. Of course, if the programmer writes obfuscated code, *then*
the human may have to jump through the same hoops.
--
Andy Walker, Maths Dept., Nott'm Univ., UK.
a...@maths.nott.ac.uk
[I'd expect that the obstacles to Algol60 implementations are call-by-name
and the lack of standardized I/O. On the other hand, Algol68 is indeed
difficulto to parse. -John]
Consider the production for a C declaration statement, ala K&RII:
declaration -> dec_specifiers init_dec_list
^ by requiring a colon (like pascal) or some
other token right there, the entire problem
could have been avoided.
bb