I have a parser, written using Yacc and Lex, for ANSI C. The grammar is taken pretty much verbatim from the standard. The scanner uses the symbol table to decide whether to return "identifier" or "typedef name" as the token type for an identifier. How do I KNOW that there are no situations which, due to parser lookahead, would cause the scanner to return an incorrect token type for an identifier (i.e. return "identifier", even though the identifier was just/will be made into a "typedef name")? Is there a general answer to this question for other parsing strategies (possibly with other amounts of lookahead) and other grammars (languages)?
Just Curious, Mark
hj...@cs.cmu.edu -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
Mark Hjelm: I have a parser, written using Yacc and Lex, for ANSI C. The grammar is taken pretty much verbatim from the standard. The scanner uses the symbol table to decide whether to return "identifier" or "typedef name" as the token type for an identifier. How do I KNOW that there are no situations which, due to parser lookahead, would cause the scanner to return an incorrect token type for an identifier (i.e. return "identifier", even though the
The only uncertainty here is what determining what happens while a name is being declared.
If I understand aright, you're defaulting an alphanumeric token to be an identifier unless it first appears in a typedef as one of the declared names. So I expect you'll have to have a production which will declare an identifier as a typedef name. For example (oversimplifying away things like pointer and array specifiers):
where decl_typedef() is responsible for the paranoia checks and symbol table manipulations. After being hit by the above production, that specific identifier will be recognized by LEX as an identifier.
What about lookahead and backtracking? The terminating ';' is guaranteed to eat YACC's one token lookahead, and backtracking only occurs when an error is encountered -- and I hope you're not expecting perfection in the face of a syntax error?
Also, note that an attempt to declare the same name twice in the same typedef will be a syntax error. For example:
typedef int foo, foo;
Here, when "typedef int foo" is reduced (as typedef_prelude identifier_), you are guaranteed that the input has not been lexed any farther than "typedef int foo," so when the second "foo" is lexed, it will be recognized as a typedef name, even with array specifier stuff in the grammar.
-- Raul Deluth Miller-Rockwell <rockw...@socrates.umd.edu> -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
[after a sample production intended to recognize new "typedef names".] After being hit by the above production, that specific identifier will be recognized by LEX as an identifier. ^^^^^^^^^^ That should read: After being hit by the above production, that specific alphanumeric token will be recognized as a typedef name. ^^^^^^^^^^^^ My apologies. I hope I didn't confuse anyone. -- Raul Deluth Miller-Rockwell <rockw...@socrates.umd.edu> [Sorry, I should have caught that. -John] -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
In article <92-01-...@comp.compilers>, hje...@cs.cmu.edu (Mark Hjelm) writes:
|> I have a parser, written using Yacc and Lex, for ANSI C. The grammar is |> taken pretty much verbatim from the standard. The scanner uses the symbol |> table to decide whether to return "identifier" or "typedef name" as the |> token type for an identifier. How do I KNOW that there are no situations |> which, due to parser lookahead, would cause the scanner to return an |> incorrect token type for an identifier ... ?
One place where every yacc/lex based C compiler I know of is broken is on a typedef name redefined in an inner scope:
typedef int foo;
main () { char foo; }
is legal ANSI C. I think there was a thread on this a while back.
bb -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
In article <92-01-...@comp.compilers>, hje...@cs.cmu.edu (Mark Hjelm) writes:
|> I have a parser, written using Yacc and Lex, for ANSI C. The grammar is |> taken pretty much verbatim from the standard. The scanner uses the symbol |> table to decide whether to return "identifier" or "typedef name" as the |> token type for an identifier. [How do I reliably get it right?]
We use a different approach: As we use the GLR parsing technique, which is an LR parser which is able to split up in several LR parsers on a conflict, we are able to solve this problem in a very general way. The scanner just returns all possible interpretations of a token, the parser splits up to pursuit the different possibilities; some of these will die in the sequel and the correct choice survives. This leaves all information about the grammar out of the lexical scanner and is guaranteed to work for any case without any inspection of the grammar by a programmer.
Jan Rekers (Jan.Rek...@cwi.nl) Centrum voor Wiskunde en Informatica (CWI) P.O. Box 4079, 1009 AB Amsterdam, The Netherlands -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
In article <92-01-...@comp.compilers> bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes: >One place where every yacc/lex based C compiler I know of is >broken is on a typedef name redefined in an inner scope:
Microsoft C gets it right, strangely enough. pcc doesn't. Therefore, it's not a problem with yacc, but with the compiler implementation. (Yes, msc is built using yacc, and I've even used pure AT&T yacc to build it.)
-- Sean Eric Fagan s...@kithrup.COM -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
In article <92-01-...@comp.compilers> bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes:
One place where every yacc/lex based C compiler I know of is broken is on a typedef name redefined in an inner scope:
typedef int foo;
main () { char foo; }
is legal ANSI C. I think there was a thread on this a while back.
Well, take GNU C (versions 1.something and 2) off your list of broken yacc/lex-based compilers. (Ok, it's not lex-based and it is bison-based, but I don't think that matters.) I just tried this example on both versions and it works fine. -- James Craig Burley, Software Craftsperson bur...@gnu.ai.mit.edu -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
In article <92-01-...@comp.compilers> hje...@cs.cmu.edu (Mark Hjelm) writes:
How do I KNOW that there are no situations which, due to parser lookahead, would cause the scanner to return an incorrect token type for an identifier (i.e. return "identifier", even though the identifier was just/will be made into a "typedef name")?
Well, I assume that you're being careful about redeclarations. For instance,
typedef int a; { float a; a++; }
is legal in ANSI C. What you need, of course, is to have a few places in the grammar that allow either an identifier or a typedef name.
In general, I know of no way of verifying that your parser has everything right, although meticulous comparison of the ANSI rules and your parser will help, and thorough exercising with C validation suites will help.
Dale Worley Dept. of Math., MIT d...@math.mit.edu -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes: > One place where every yacc/lex based C compiler I know of is > broken is on a typedef name redefined in an inner scope:
> typedef int foo;
> main () > { > char foo; > }
> is legal ANSI C. I think there was a thread on this a while back.
I missed this thread, but anyway:
The above shouldn't be a problem, because this is not really an ambiguous occurrence. You can deal with that by having a production
any_ident : ident | type_ident;
and using any_ident for the identifier in a declarator (and several other places). This should be possible without introducing any ambiguities.
But for some parts of the C syntax this is not so easy, for labels you probably have to expand the any_ident production to allow programs like
typef int foo; main () { foo: ;
}
because otherwise there is a shift-reduce conflict (reduce type_ident to any_ident for labels, shift for declarations). [It's not impossible, but it's tricky and messy to get right. -John] -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
In article <92-01-...@comp.compilers> bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes:
: One place where every yacc/lex based C compiler I know of is : broken is on a typedef name redefined in an inner scope: : : typedef int foo; : main () : { : char foo; : } : : is legal ANSI C. I think there was a thread on this a while back.
It works fine with my yacc/lex based parser. The trick is that when you see the char, you have to make sure that the lexer is told of the change of state soon enough. But thanks for noting the potential problem; I'll add it to my list of test cases.
In answer to the original question: you don't know if the parser will or will not get a lookahead. That depends on the parser implementation. For yacc and bison, you can study the -v output to find out.
In general, for an LR parser, you can tell if the parser MUST get a lookahead by examining the action table: if all the non error entries are identical, you don't need a lookahead token; otherwise, you do.
--- Bill { uunet | decwrl | telesci }!twwells!bill b...@twwells.com -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes: > One place where every yacc/lex based C compiler I know of is > broken is on a typedef name redefined in an inner scope:
> typedef int foo;
> main () > { > char foo; > }
/* start politically incorrect statement */ I may misunderstand, but it seems to me the problem is aggravated by having the scanner read the symbol table to decide what kind of token foo is. If so, that seems to me to violate software engineering principles - which in several cases leads to problems. The scanner should only know about lexical information - and lexically foo is just an identifier. The parser knows about levels of scope, context etc and should look up identifiers in the symbol table(s) to determine what they denote. I know the dragon book suggests a design where the scanner reads the symbol table, but that seems to me to be fraught with difficulty in block structured languages with nesting etc - and even more so in languages like Ada with multiple imported symbol tables (or ML with its different name spaces which depend on the context.) For a simple language like C with an almost flat name space - it probably works fine to follow the dragon design.
So foo should be sent to the parser as an identifier token (with a way of recovering the string (or a code for the string) from the token, and not as type_identifier token or a label_identifier token
I have used this approach several times since it was taught to me at school (we used another text)
What do real compilers do, esp for other languages besides C ?
P.S. The dragon book is a fine text in many ways, I just wonder about this one detail. -- Alex Blakemore blakem...@software.org (703) 742-7125 Software Productivity Consortium 2214 Rock Hill Rd, Herndon VA 22070 -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
In article <92-01-...@comp.compilers>, s...@dcs.edinburgh.ac.uk writes:
|> [Reusing a typedef name] shouldn't be a problem, because this is not really |> an ambiguous occurrence. You can deal with that by having a production |> |> any_ident : ident | type_ident; |> |> and using any_ident for the identifier in a declarator (and several other |> places). This should be possible without introducing any ambiguities. |> |> But for some parts of the C syntax this is not so easy, for labels you |> probably have to expand the any_ident production to allow programs like |> |> typef int foo; |> main () |> { foo: ; |> } |> |> because otherwise there is a shift-reduce conflict |> (reduce type_ident to any_ident for labels, shift for declarations). [It's not impossible, but it's tricky and messy to get right. -John]
O.K. I haven't got out the grammar and done the actual table construction (read: disclaimer), but declarations ARE the one place where you do need the separate tokens for ident and type_ident. any other place, the any_ident->ident|type_ident rule works fine (On labels, for instance, the : in the lookahead stream resolves the ambiguity. I have also sucessfully used the above productions to allow a typedef name to also be a tag name). Consider the code fragment:
typedef int z; main() { long z;
}
is z being redeclared as a local variable in main(), or are you just specifying the empty declaration for a long int type? The ambiguity depends upon which token you return from the lexical analyzer when a is encountered for the second time. The ANSI C grammar in the back of K&RII is not ambiguous: it assumes that the lexer resolves the ambiguity, not the parser.
The fix to this problem is much easier than I first thought: Just use lex's right-context sensitivity operator (/) to search ahead in the input stream for one of [,{;] (preceeded by optional whitespace) when an identifier is encountered. In cases that match, always return the IDENT token; on cases that don't, lookup the name and return TYPE_NAME if the identifier is a typedef name, return IDENT otherwise.
As for my original statement
>One place where every yacc/lex based C compiler I know of is broken
I knew sun's cc was broken & any C compiler I had work on was too, couldn't figure out a way to easily fix the problem, and over-generalized :-)
bb -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
Jan.Rek...@cwi.nl (Jan Rekers) writes: >We use a different approach: >As we use the GLR parsing technique, which is an LR parser which is >able to split up in several LR parsers on a conflict, we are able to >solve this problem in a very general way. The scanner just returns >all possible interpretations of a token, the parser splits up to >pursuit the different possibilities; some of these will die in the >sequel and the correct choice survives. This leaves all information >about the grammar out of the lexical scanner and is guaranteed to >work for any case without any inspection of the grammar by a programmer.
I hate to disagree with my friend Jan, but this idea of splitting off alternative parses introduces other nasty problems and it is not clear that you would be any better off. Consider the following C expression: (foo)-3 This has two different parses depending on whether foo is declared as a variable or as a typedef name. I.e., if foo is an int variable then the expression is equivalent to foo-3; but if foo is a typedef name, then the expression is equivalent to (foo)(-3) where a type cast is being applied.
If we feed a program containing the k consecutive statements v1 = (foo)-1; v2 = (foo)-2; v3 = (foo)-3; . . vk = (foo)-k; into Rekers' parser, we should get back at least 2**k different parses.
The only way to prevent this combinatorial explosion of possibilities is to kill off the alternative parse possibilities as they are created. For example, the C grammar rule <Expression> ::= ( <type-name> ) <Expression> would need to have an associated semantic test to check whether the type-name has actually been declared as a type name and, if not, suppress this particular parsing possibility.
Thus there has to be feedback from semantic analysis into the parser. I don't believe this is any great improvement over the conventional approach of having feedback from semantic analysis to the lexer so that typedef names can be returned as a different kind of token.
Nigel Horspool University of Victoria -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
hje...@cs.cmu.edu (Mark Hjelm) writes: >How do I KNOW that there are no situations >which, due to parser lookahead, would cause the scanner to return an >incorrect token type for an identifier (i.e. return "identifier", even >though the identifier was just/will be made into a "typedef name")?
The way to KNOW would be to prove that the scanner and parser conform to specifications (not easy, since we have neither an exact spec or a semantics of C :-), and to prove properties of one's microsyntax or grammar.
Another way to handle the identifier/keyword problem is to introduce a token class <name> = <letter> {<letter> | <digit>} or whatever, and then either insert a "filtering pass" that checks each name against a list of reserved words and replaces the token with <identifier> or the appropriate keyword token (I feel the "but-that's-grossly-inefficient" flames already, but I'd rather be right than fast :-), or else to insert a blub of code at the point that the name token has been recognized to differentiate between the two.
-- Debora Weber-Wulff d...@inf.fu-berlin.de Institut fuer Informatik +49 30 89691 124 Nestorstr. 8-9 D-W-1000 Berlin 31 -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
hje...@cs.cmu.edu (Mark Hjelm) writes: >I have a parser, written using Yacc and Lex, for ANSI C. The grammar is >taken pretty much verbatim from the standard. The scanner uses the symbol >table to decide whether to return "identifier" or "typedef name" as the >token type for an identifier. How do I KNOW that there are no situations >which, due to parser lookahead, would cause the scanner to return an >incorrect token type for an identifier (i.e. return "identifier", even >though the identifier was just/will be made into a "typedef name")? Is >there a general answer to this question for other parsing strategies >(possibly with other amounts of lookahead) and other grammars (languages)? >Just Curious, >Mark
Mark,
I believe I can accurately address your question. You ask:
How do I KNOW that there are no situations which, due to parser lookahead, would cause the scanner to return an incorrect token type for an identifier?
You can definitely KNOW this answer by either constructing the Item sets by hand or better yet, obtaining the set yacc has constructed (by invoking yacc with the -v option) during parser generation. Since you are using the ANSI specification, you will see that "tIdentifer" and "tTypedefName" both appear in the same Item set. Since a typedef name is lexically identical to an identifier, this places the burden of resolving this ambiguity on the scanner. You now KNOW that you do have a situation (ie. a parser configuration exists that places the burden of resolving lexical ambiguity on the scanner) where the scanner is unable to return the *correct* token type without some external help.
Brian Bliss also points this out:
The ANSI C grammar in the back of K&RII is not ambiguous: it assumes that the lexer resolves the ambiguity, not the parser.
It's not clear to me why you qualify your question with "..., due to parser lookahead, ...." The parser will only look ahead when it is unable to make a parsing decision, such as shift, reduce or goto. Since the ambiguity between typedef name and identifier is lexical rather than syntactical, I don't understand your reason for qualifying your question by parser lookahead.
Assuming I may continue, I would like to offer a strategy for handling such a conflict.
Instantiate a flag, such as "typedefRecognition," and initialize it to "ON" prior to parser invocation, after completed declarations and "OFF" after recognition of typedef names. The scanner will consult the setting of this flag after recognizing the lexeme as an "identifier" but before returning the token ID. If "typedefRecognition" is "ON," the scanner searching the name space for a symbol with an attribute matching "typedefName."
Something like this might be in order:
parser code:
typedefRecognition = ON ; status = yyparse() ;
scanner code:
... identifier scanning ...
tokenToReturn = tIdentifer ;
if(typedefRecognition == ON && isTypedefName(ActiveScope, lexeme)) { typedefRecognition = OFF ; tokenToReturn = tTypedefName ; }
This message regards typedef names recognition problem. It can be solved in the modifiable gramars approach. Modifiable grammars allow to add/delete rules from the basic set of rules while scanning the source text. When reducing statement
typedef int foo;
a new rule
int : foo; // 1
will be introduced together with the new terminal 'foo'. Also, the basic set of rules should contain a rule looking something like
will be reduced by the sequence of rules 1, 3, 2. The example presented does not address issues of scope, i.e. what happens when typedefs are redefined in the inner scope and then become visible again upon exiting the scope. This relates to the way new rules are added/deleted to/from the basic set. I have submitted to SIGPLAN NOTICES a description of a metalanguage for the USSA system, based on this approach. The metalanguage is powerful enoupgh to handle various scenarious of rules modification, so that solutions to visibility and scope problems may be expressed.
There are also papers printed in SIGPLAN NOTICES, regarding this approach:
Henning Christiansen. "A Survey of Adaptable Grammars," Sigplan Notices, V.25 N.11, 35-44, 1990.
Boris Burshteyn. "On the Modification of the Formal Grammar at Parse Time," Sigplan notices, V.25 N.5, 117-123, 1990.
Boris Burshteyn. "Generation and Recognition of Formal Languages by Modifiable Grammars," Sigplan Notices, V.25 N.12, 45-53, 1990.
Thanks, Boris Burshteyn - a man who can spell 'automaton'. -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
Thankyou to Gene Ressler for pointing out a type-o in my earlier note.
Under "grammar file:", I introduced the nonterminal typedefRecOn erroneously. It should have been typedefRecOff and should have been re-initialized with "OFF" instead of "ON."
I should also mention that I am only trying to offer Mark some help. There are many other places this flag must be "turned on/off" (such as before/after cast-expressions, declarations, etc...). I have omitted showing these because I felt they weren't needed to make my point.
If you feel this example is not clear enough, please send me email and I will be glad to address it in more detail.
Thanks again Gene,
Jeff -- Jeff Woods Software Engineer (214) 497-4501 Software Development Tools Group (214) 497-4500 FAX CONVEX Computer Corporation jwo...@convex.COM -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
earlier I wrote: >The fix to this problem is much easier than I first thought: Just use >lex's right-context sensitivity operator (/) to search ahead in the input >stream for one of [,{;] (preceeded by optional whitespace) when an >identifier is encountered. In cases that match, always return the IDENT >token; on cases that don't, lookup the name and return TYPE_NAME if the >identifier is a typedef name, return IDENT otherwise.
This approach bombs on
/* empty declaration with only a typedef name */ typedef int foo; foo;
and on
/* tag names */ typedef int foo; struct foo var;
the latter can be solved as suggested earlier:
In article <92-01-...@comp.compilers>, s...@dcs.edinburgh.ac.uk writes:
|> [Reusing a typedef name] shouldn't be a problem, because this is not really |> an ambiguous occurrence. You can deal with that by having a production |> |> any_ident : ident | type_ident; |> |> and using any_ident for the identifier in a declarator (and several other |> places). This should be possible without introducing any ambiguities.
a scheme which will work for tag names and labels just fine, but not names in the identifier space.
|>So foo should be sent to the parser as an identifier token (with a way of |>recovering the string (or a code for the string) from the token, and not |>as type_identifier token or a label_identifier token
again, This can't be done without introducing ambiugities into the grammar. (at least the grammar given in the back of K&RII) try introducting a production typedef_name->T_IDENT and see how many shift reduce conflicts you get.
bb -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
jwo...@convex.com writes: >Instantiate a flag, such as "typedefRecognition," and initialize it to >"ON" prior to parser invocation, after completed declarations and "OFF" >after recognition of typedef names. The scanner will consult the setting >of this flag after recognizing the lexeme as an "identifier" but before >returning the token ID. If "typedefRecognition" is "ON," the scanner >searching the name space for a symbol with an attribute matching >"typedefName."
again, consider the two code fragments
typedef int foo; main () { long foo;
}
and
typedef int foo; main () { long foo x;
}
in the latter, we definitely want to return T_TYPEDEF_NAME as the token for foo. in the former, we want to return T_IDENT for foo, as it is being redecalred as an identifier in an inner scope. The two programs are identical up to "foo", so there is no amount flag twiddling that we can do to resolve the problem correctly; we must use some sort of lookahead.
The lookahead can either be done by the lexical analyzer, as I've been attempting to do, or by the parser, which means we return the same token for both typedef names and other identifiers. The latter introduces many ambiguities in the grammar in K&RII, in declaration statements, in expression in the form of a single identifier appearing after declarations and before the executable statements within a block, and within function prototypes, and probably other places as well.
bb -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
This message regards typedef names recognition problem.
So does this one :-)
[description of modifiable grammars deleted]
The following might seem like a flippant answer, but I am serious :-
One solution to the typedef problem, is to avoid it completely i.e. don't bother trying to parse C and don't create any extensions of C. If you really must parse C, use one of the hacks mentioned already i.e. interogating the symbol table.
It seems to me that a lot of time is wasted trying to provide a solution to something that is essentially broken. I'm disgusted at language definitions that require hacks like symbol table interogation, or even worse a modifiable grammar, just to be able to parse the syntax. C at least has the excuse of being nearly 20 years old, but for newer languages, there is no excuse at all. IMHO any person who designs a grammar that needs additional help to be parsed, should have a _very_ good reason for doing so.
For example, the problems caused by the same syntax for function calls and array access in Ada is at least in part made up for by the fact that an array can be swapped for a function very easily without having to change a lot of code (an array is a discrete function after all). Having said that I personally don't find this argument compelling, but at least it _is_ an argument. I mean, just what is BK's or DMR's excuse for the typedef hack?
There have been (and are) too many languages that don't seem to have had any real thought put to the design of the syntax. The designers seem to start with "it should look like Pascal" or "it should look like C" and blindly add syntax without giving it due consideration.
So please, any budding language designers/extenders out there, think long and hard about about the syntax of the language and its impact on checking context conditions (or static semantics if you prefer)
This is a heartfelt plea from somebody who has spent way too much time poring over the definitions of languages like :- C, Pascal, Ada, Oberon-2, ML, Haskell, ... etc. trying to make sense of their syntax and context conditions.
PS Q: which language has the best syntax? A: Lisp/Scheme, as there isn't any concrete syntax. You program directly in the abstract syntax. Makes writing the concrete -> abstract converter easy :-) [On the other hand, some of us would put "easy to parse using yacc" fairly low on our list of criteria for good language design. -John] -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
It looks like we could use a FAQ file on the subject of typedef identifiers in C parsers. Folks are coming to the discussion late, and starting the discussion over from square one. One well-meaning fellow recently said that having the lexer look at the symbol-table violates "engineering principles" and thus should not be done. He has obviously never tried to implement the C grammar with a standard one-token-lookahead parser generator, or he would know that it is not quite that simple.
So how about a FAQ? Volunteers anyone? [If someone wants to write the definitive treatise on this topic, I'd be happy to add it to the archives. Otherwise I'm declaring this topic closed. -John] -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
) Something like this might be in order: ) ) parser code: ) ) typedefRecognition = ON ; ) status = yyparse() ; ) ... ) grammar file: ) ) type-specifier: ) tVoid typedefRecOn ) | tChar typedefRecOn ) . ) . ) . ) | tTypedefName typedefRecOn ) ; ) ) typedefRecOn: ) { ) typedefRecognition = ON ; ) } ) ; ) ) Hope this is helpful, ) ) Jeff
Yes, something *like* that is possible, but it is rather tricky. The switches in grammar rules that toggle typedef- recognition must be situated very carefully, and in places that some may find counter-intuitive. In the example above, "typedefRecOn" should be located *after* "tTypedefName" and "tVoid", not before. (!)
Also (for purposes of maintainability if for no other reason), you want to be sure that the scanner is always "eager" to do lookahead, not "lazy" in states where a lookahead is not urgently needed to select a state-change.
A few months ago I posted a YACC/LEX program which, to the best of my knowledge, does everything correctly, although given the trickiness of the problem, I would not be astonished to hear otherwise.
Also I was under the false impression that ANSI C was meant to be a strict extension of C, and therefore implemented some features of C-Classic which ANSI C dropped, thinking the formal grammar I started from was "buggy". Thus my statement that I thought the parser was "ANSI C" was technically incorrect, although probably not enough so to cause any problems. It is "ANSI C with C-Classic Extensions", I guess. -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
> From: hje...@cs.cmu.edu (Mark Hjelm) > Organization: School of Computer Science, Carnegie Mellon
> I have a parser, written using Yacc and Lex, for ANSI C. The grammar is > taken pretty much verbatim from the standard. The scanner uses the symbol > table to decide whether to return "identifier" or "typedef name" as the > token type for an identifier. How do I KNOW that there are no situations > which, due to parser lookahead, would cause the scanner to return an > incorrect token type for an identifier (i.e. return "identifier", even > though the identifier was just/will be made into a "typedef name")? Is > there a general answer to this question for other parsing strategies > (possibly with other amounts of lookahead) and other grammars (languages)?
This is an excellent and interesting question, and pulled out a very nice line of responses. My answers are:
1) I have no "general method" of proving correctness. 2) In this case, correctness can be proved. 3) If your grammar came from the standard, the proof will identify an error in your grammar. 4) A correct grammar exists, and I distribute it for free.
For folks that can't figure out why there is so much discussion about this problem, consider the code fragment:
T(*b)[4];
Notice how the meaning is very different depending on whether T is a typedef-name, or an identifier (and hopefully a function in this case). For this reason most C lexers distinguish between typedef names and other identifiers. Alas, as the poster points out, there is concern that the feedback loop (parser annotates symbol table, which lexer interrogates, and then feeds tokens to the parser) will fail in some complex scenario.
Other threads of response included the use of a GLR parser, which tries all alternatives. Alas, as will be shown, the ambiguity in C is too great, and such an approach will not provide a unique syntactically correct parse when it does exist.
Other respondees mentioned that almost no yacc based parsers get this area of the standard right. Since the grammar included in the standard is misleading, the cause for this problem will become abundantly clear. I will start with a description of how a proof of correctness could proceed, take a slight aside to show why the traditional grammar tends to cause this proof to fail, and explain how the grammar can be written (provably) correctly. Since I am typing this on the fly, I will surely make some errors, but I hope that most of the readers will find the content interesting. Note that my grammar distribution includes a very large paper that deals with many C/C++ ambiguities in much greater detail, and with many more examples.
The fact that yacc uses LR(1) scanning is pivotal to the argument in this special case. In this special case you can note that the typedef-ness of an identifier can *only* change after a declarator is complete. Specifically, the standard states that an identifier enters scope only after its declarator is complete. The completion of a declarator is always marked by one of three tokens:
";" end of declaration statement "," end of individual declarator (another is comming) "=" end of declarator, initializer is comming.
To prove correctness, you must show that reductions involved with annotation of your symbol table take place while one of the three tokens I just mentioned is waiting in the lookahead buffer. The argument notes that since none of the three tokens I identified relies on the typedef-ness of an identifier, everything will work out (i.e., the identifier vs typedef-name separation by the lexer can always be done in a correct and timely manner). Basically, all you have to do is look at the position of the action code that is placed in the grammar to annotate the symbol table, and be sure that this happens before one of the above three terminating tokens (context counts) has being processed.
As an example, suppose you had a rule that looked like:
Please don't bother to argue that you would never have such a specific rule in the grammar, rather note that the action is being done *after* reading the ';'. This is "bad." If you had such a rule, then there is a chance that yacc would have read the next token (it is allowed to if it needs to), and there is a chance that the next token *might* be an identifier. Hence, with the above rule, there is a chance that the lexer would process an upcoming identifier *prior* to the annotation of the symbol table, and hence incorrect results could arise. For example:
typedef int T ; T a;
might be incorrectly parsed (the second "T" might be read *and* tokenized by the lexer before the symbol table is annotated). A guaranteed correct rule could be written:
Note that here the action takes place with the parser at *most* looking ahead at the ';', and hence the lexer would not have had to process future identifier/typedef-name distinctions.
Getting back to your example, since you copied the grammar from the ANSI Standard, your grammar is probably wrong. To expose the problem, try parsing:
typedef int A, B(A); /* my bet is you get a syntax error*/
When correctly parsed, the above is equivalent to:
typedef int A; typedef int B(A);
or simply:
typedef int A; typedef int B(int); /* B's type is function taking an int returning and int */
Again, to verify my interpretation, look at the verbose discussion of declarations in the ANSI standard and note that an identifier enters scope when its declarator is complete. Despite this assertion in prose, most grammars are based on the K&R or ANSI grammars, which gathers an "init-declarator-list" before combining the declarators with the "declaration-specifiers." In the above example, gathering the entire list "A, B(A)" before annotating the symbol table for "A" would allow the lexer to incorrectly tokenize the second "A".
You might wonder how GNU GCC (or others related compilers) manage to get the above parse right, when their grammar looks a lot like this standard (and I assert, faulty) grammar. In the case of GCC, this area of the action code includes a hack wherein the "decl-specifiers" are fetched from lower points in the stack using $-1 (using negative numbers allows action code to reach down into questionable areas of the yacc stack, whereas positive numbers correspond to the tokens in the productions with the action). Basically, by reaching backwards when they see a ',' (or '='), they are able to annotate the symbol table even though they use the standard grammar. Once again, it is provable that they get the correct parse (if they can prove that $-1 always contains the desired decl-specifier value. This latter assertion is far from trivial to prove, as they must correctly process function declarations, which occasionally have no decl-specifiers! They get it right, but a second hack is required atop the first hack, and these two hack unite to make C++ parsing a horror for them).
For those compiler hackers out there, it is worth noting that the problems are greater than what was just given in the above example. Not only can "typdef-ness" be established by a declarator, it can be taken away. For example, consider the following:
With proper rewriting, a grammar can be made to correctly process such declaration without "cheating" and looking at $-1 (or where ever else this info might be socked away). The trick is to combine the declarator with the declaration specifier (and annotate the symbol table), then combine with the optional initializer (rather than forming an init-declarator), and then if there is a ',', form a token that once again acts just like the decl-specs to process the next declarator. If you want to see the correct grammar, take a look at my distribution (which includes a C and a C++ grammar, along with a cheap FLEX input file).
Since one of the respondees included suggesting that a GLR parser could just try all interpretations, and that only one would be valid, the following is a rather interesting example:
typedef int T; main() { int T=100, a=(T)+1; assert(101 == T); }
Notice that the text "(T)+1" could incorrectly be interpreted, if "T" were still a typedef-name, as a cast to type T of the expression "+1". Note that the GLR approach would yield two distinct and valid parses, and hence there would be no single winner (without additional heuristics).
Another person focused on the issues involving redeclaration of typedefnames at inner scopes. Indeed, this was something of a bear to tackle (without kludging). Examples of the code that usually causes problems include:
typedef int T1 ; typedef int T2 ; typedef int T3; main() { const T1 T1; /* redeclares T1 to be an int */ const T2 (T2); /* redeclares T2 to be an int */ const T3; /* syntax error : missing declarator */ }
Bruce Blodgett (formerly of Honeywell Corporation) and I independently implemented clean resolutions of this within a YACC grammar, but the results can be seen within my distribution. Interestingly, this careful analysis of C identified some unaddressed scenarios in the ANSI standard. Since these results go beyond the scope of this posting, interested readers are encouraged to look at my paper which is included with my distribution.
Doug Lea and Doug Schmidt have graciously offered to provide anonymous ftp sites for the 8 files, as well as the Berkeley YACC source (if you need it).
ics.uci.edu (128.195.1.1) in the ftp/gnu directory (even though neither of the archives are GNU related) as:
c++grammar2.0.tar.Z byacc1.8.tar.Z
mach1.npac.syr.edu (128.230.7.14) in the ftp/pub/C++ directory as:
> So please, any budding language designers/extenders out there, think long > and hard about about the syntax of the language and its impact on checking > context conditions (or static semantics if you prefer)
> [On the other hand, some of us would put "easy to parse using yacc" fairly > low on our list of criteria for good language design. -John]
On the other other hand, I would expect "easy for humans to parse quickly" should be very high on everyone's list. The worst problem with these kinds of syntax design mistakes is not so much the inconvenience that it causes compiler implementors; rather, it is the fact that if the lexer and/or parser have to jump through hoops to classify a construct, then so does the *person* trying to read this code. The existence of these kinds of not-quite-ambiguities make the language itself (i.e., *any* program written in that language) inherently less readable. This is bad but sorta manageable in C; it's just scary in C++.
> Q: which language has the best syntax? > A: Lisp/Scheme, as there isn't any concrete syntax. > You program directly in the abstract syntax.
I'm not quite sure what you mean here; I do believe that it is the difficulty that most humans have parsing LISP that is one of the major factors that has prevented that language from becoming as popular as its proponents might like for it to be.
(Is this thread more suitable for comp.lang.misc now?) -- Doug Landauer -- landa...@eng.sun.com _ Sun Microsystems, Inc. -- STE, SunPro, Languages -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
[To the moderator: Even though the topic has been closed, I hope that this article will be accepted anyway, since it describes how these problems were solved in a production compiler that handled all the tricky cases of ANSI C.] [This is the absolute definite last word on parsing typedefs. Really. -John]
I worked on the parser for the Compass C compiler. The solution we used to the typedef vs. identifier problem was to have the scanner return two different token types, based on the symbol table. In order to make this work correctly, a few things have to be done carefully:
An identifier has to be inserted into the symbol table when the punctuation following its declarator (',', ';', or '=') is read. This is not particularly tricky, since an LALR(1) parser never has to look ahead at that point, and our parser looks ahead only when needed.
Making sure that tricky declarations like
typedef int foo; foo; foo x; volatile foo; int foo;
are handled correctly is done by having the grammar of declarations enforce the rule that "if a typedef name is used as a declarator, the declaration-specifiers of the declaration must contain a type-specifier". Thus, if "foo" is a typedef, then
volatile foo;
is an empty declaration (and is illegal), while
int foo;
is a redeclaration of "foo" as an integer variable. To do this, the grammar must keep track of whether a type-specifier has been seen in the declaration-specifiers as it reads the next token and tries to decide whether it is another declaration-specifier or a declarator. Thus you have grammar rules like:
If the grammar is aware of these distinctions, it can be made LALR(1). This multiples the grammar rules, of course, but this is not the only place that a practical C grammar has several clones of a single rule in the Standard's grammar.
Dale Worley Dept. of Math., MIT d...@math.mit.edu -- Send compilers articles to compil...@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.