Lookahead vs. Scanner Feedback

1291 views
Skip to first unread message

Mark Hjelm

unread,
Jan 3, 1992, 1:16:32 PM1/3/92
to
I have a parser, written using Yacc and Lex, for ANSI C. The grammar is
taken pretty much verbatim from the standard. The scanner uses the symbol
table to decide whether to return "identifier" or "typedef name" as the
token type for an identifier. How do I KNOW that there are no situations
which, due to parser lookahead, would cause the scanner to return an
incorrect token type for an identifier (i.e. return "identifier", even
though the identifier was just/will be made into a "typedef name")? Is
there a general answer to this question for other parsing strategies
(possibly with other amounts of lookahead) and other grammars (languages)?

Just Curious,
Mark

hj...@cs.cmu.edu
--
Send compilers articles to comp...@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.

Raul Deluth Miller-Rockwell

unread,
Jan 4, 1992, 7:40:47 PM1/4/92
to
Mark Hjelm:

I have a parser, written using Yacc and Lex, for ANSI C. The
grammar is taken pretty much verbatim from the standard. The
scanner uses the symbol table to decide whether to return
"identifier" or "typedef name" as the token type for an identifier.
How do I KNOW that there are no situations which, due to parser
lookahead, would cause the scanner to return an incorrect token
type for an identifier (i.e. return "identifier", even though the

The only uncertainty here is what determining what happens while a
name is being declared.

If I understand aright, you're defaulting an alphanumeric token to be
an identifier unless it first appears in a typedef as one of the
declared names. So I expect you'll have to have a production which
will declare an identifier as a typedef name. For example
(oversimplifying away things like pointer and array specifiers):

typedef_decl
: typedef_prelude identifier {decl_typedef($1, $2); $$= $1;}
| typedef_decl ',' identifier {decl_typedef($1, $3); $$= $1;}
;

typedef : typedef_decl ';' ;

where decl_typedef() is responsible for the paranoia checks and symbol
table manipulations. After being hit by the above production, that
specific identifier will be recognized by LEX as an identifier.

What about lookahead and backtracking? The terminating ';' is
guaranteed to eat YACC's one token lookahead, and backtracking only
occurs when an error is encountered -- and I hope you're not expecting
perfection in the face of a syntax error?

Also, note that an attempt to declare the same name twice in the same
typedef will be a syntax error. For example:

typedef int foo, foo;

Here, when "typedef int foo" is reduced (as
typedef_prelude identifier_), you are guaranteed that the input has
not been lexed any farther than "typedef int foo," so when the second
"foo" is lexed, it will be recognized as a typedef name, even with
array specifier stuff in the grammar.

--
Raul Deluth Miller-Rockwell <rock...@socrates.umd.edu>

Raul Deluth Miller-Rockwell

unread,
Jan 4, 1992, 10:14:48 PM1/4/92
to
I wrote:
[after a sample production intended to recognize new "typedef
names".]
After being hit by the above production, that specific identifier
will be recognized by LEX as an identifier.
^^^^^^^^^^
That should read: After being hit by the above production, that
specific alphanumeric token will be recognized as a typedef name.
^^^^^^^^^^^^
My apologies. I hope I didn't confuse anyone.

--
Raul Deluth Miller-Rockwell <rock...@socrates.umd.edu>
[Sorry, I should have caught that. -John]

Brian Bliss

unread,
Jan 6, 1992, 7:00:15 PM1/6/92
to
In article <92-0...@comp.compilers>, hje...@cs.cmu.edu (Mark Hjelm) writes:
|> I have a parser, written using Yacc and Lex, for ANSI C. The grammar is
|> taken pretty much verbatim from the standard. The scanner uses the symbol
|> table to decide whether to return "identifier" or "typedef name" as the
|> token type for an identifier. How do I KNOW that there are no situations
|> which, due to parser lookahead, would cause the scanner to return an
|> incorrect token type for an identifier ... ?

One place where every yacc/lex based C compiler I know of is
broken is on a typedef name redefined in an inner scope:

typedef int foo;

main ()
{
char foo;
}

is legal ANSI C. I think there was a thread on this a while back.

bb

Jan Rekers

unread,
Jan 7, 1992, 8:20:45 AM1/7/92
to
In article <92-0...@comp.compilers>, hje...@cs.cmu.edu (Mark Hjelm) writes:
|> I have a parser, written using Yacc and Lex, for ANSI C. The grammar is
|> taken pretty much verbatim from the standard. The scanner uses the symbol
|> table to decide whether to return "identifier" or "typedef name" as the
|> token type for an identifier. [How do I reliably get it right?]

We use a different approach:
As we use the GLR parsing technique, which is an LR parser which is able
to split up in several LR parsers on a conflict, we are able to solve this
problem in a very general way. The scanner just returns all possible
interpretations of a token, the parser splits up to pursuit the different
possibilities; some of these will die in the sequel and the correct choice
survives. This leaves all information about the grammar out of the lexical
scanner and is guaranteed to work for any case without any inspection of
the grammar by a programmer.

Jan Rekers (Jan.R...@cwi.nl) Centrum voor Wiskunde en Informatica (CWI)
P.O. Box 4079, 1009 AB Amsterdam, The Netherlands

Sean Eric Fagan

unread,
Jan 7, 1992, 1:48:51 AM1/7/92
to
In article <92-0...@comp.compilers> bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes:
>One place where every yacc/lex based C compiler I know of is
>broken is on a typedef name redefined in an inner scope:

Microsoft C gets it right, strangely enough. pcc doesn't. Therefore, it's
not a problem with yacc, but with the compiler implementation. (Yes, msc is
built using yacc, and I've even used pure AT&T yacc to build it.)

--
Sean Eric Fagan
s...@kithrup.COM

Craig Burley

unread,
Jan 7, 1992, 2:22:18 PM1/7/92
to
In article <92-0...@comp.compilers> bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes:

One place where every yacc/lex based C compiler I know of is
broken is on a typedef name redefined in an inner scope:

typedef int foo;

main ()
{
char foo;
}

is legal ANSI C. I think there was a thread on this a while back.

Well, take GNU C (versions 1.something and 2) off your list of broken
yacc/lex-based compilers. (Ok, it's not lex-based and it is bison-based,
but I don't think that matters.) I just tried this example on both
versions and it works fine.
--
James Craig Burley, Software Craftsperson bur...@gnu.ai.mit.edu

Dale R. Worley

unread,
Jan 7, 1992, 5:00:20 PM1/7/92
to
In article <92-0...@comp.compilers> hje...@cs.cmu.edu (Mark Hjelm) writes:
How do I KNOW that there are no situations
which, due to parser lookahead, would cause the scanner to return an
incorrect token type for an identifier (i.e. return "identifier", even
though the identifier was just/will be made into a "typedef name")?

Well, I assume that you're being careful about redeclarations. For
instance,

typedef int a;
{
float a;
a++;
}

is legal in ANSI C. What you need, of course, is to have a few places
in the grammar that allow either an identifier or a typedef name.

In general, I know of no way of verifying that your parser has
everything right, although meticulous comparison of the ANSI rules and
your parser will help, and thorough exercising with C validation
suites will help.

Dale Worley Dept. of Math., MIT d...@math.mit.edu

s...@dcs.edinburgh.ac.uk

unread,
Jan 7, 1992, 7:24:16 AM1/7/92
to
bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes:

> One place where every yacc/lex based C compiler I know of is
> broken is on a typedef name redefined in an inner scope:
>
> typedef int foo;
>
> main ()
> {
> char foo;
> }
>
> is legal ANSI C. I think there was a thread on this a while back.

I missed this thread, but anyway:

The above shouldn't be a problem, because this is not really an ambiguous
occurrence. You can deal with that by having a production

any_ident : ident | type_ident;

and using any_ident for the identifier in a declarator (and several other
places). This should be possible without introducing any ambiguities.

But for some parts of the C syntax this is not so easy, for labels you
probably have to expand the any_ident production to allow programs like

typef int foo;
main ()
{ foo: ;
}

because otherwise there is a shift-reduce conflict
(reduce type_ident to any_ident for labels, shift for declarations).
[It's not impossible, but it's tricky and messy to get right. -John]

T. William Wells

unread,
Jan 7, 1992, 9:59:43 PM1/7/92
to
In article <92-0...@comp.compilers> bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes:
: One place where every yacc/lex based C compiler I know of is

: broken is on a typedef name redefined in an inner scope:
:
: typedef int foo;
: main ()
: {
: char foo;
: }
:
: is legal ANSI C. I think there was a thread on this a while back.

It works fine with my yacc/lex based parser. The trick is that when you
see the char, you have to make sure that the lexer is told of the change
of state soon enough. But thanks for noting the potential problem; I'll
add it to my list of test cases.

In answer to the original question: you don't know if the parser will or
will not get a lookahead. That depends on the parser implementation. For
yacc and bison, you can study the -v output to find out.

In general, for an LR parser, you can tell if the parser MUST get a
lookahead by examining the action table: if all the non error entries are
identical, you don't need a lookahead token; otherwise, you do.

---
Bill { uunet | decwrl | telesci }!twwells!bill
bi...@twwells.com

Alex Blakemore

unread,
Jan 8, 1992, 12:20:02 PM1/8/92
to
bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes:
> One place where every yacc/lex based C compiler I know of is
> broken is on a typedef name redefined in an inner scope:
>
> typedef int foo;
>
> main ()
> {
> char foo;
> }

/* start politically incorrect statement */
I may misunderstand, but it seems to me the problem is aggravated by
having the scanner read the symbol table to decide what kind of token foo
is. If so, that seems to me to violate software engineering principles -
which in several cases leads to problems. The scanner should only know
about lexical information - and lexically foo is just an identifier. The
parser knows about levels of scope, context etc and should look up
identifiers in the symbol table(s) to determine what they denote. I know
the dragon book suggests a design where the scanner reads the symbol
table, but that seems to me to be fraught with difficulty in block
structured languages with nesting etc - and even more so in languages like
Ada with multiple imported symbol tables (or ML with its different name
spaces which depend on the context.) For a simple language like C with an
almost flat name space - it probably works fine to follow the dragon
design.

So foo should be sent to the parser as an identifier token (with a way of
recovering the string (or a code for the string) from the token, and not
as type_identifier token or a label_identifier token

I have used this approach several times since it was taught to me at school
(we used another text)

What do real compilers do, esp for other languages besides C ?

P.S. The dragon book is a fine text in many ways, I just wonder about this
one detail.
--
Alex Blakemore blak...@software.org (703) 742-7125
Software Productivity Consortium 2214 Rock Hill Rd, Herndon VA 22070

Brian Bliss

unread,
Jan 8, 1992, 12:55:13 PM1/8/92
to
In article <92-0...@comp.compilers>, s...@dcs.edinburgh.ac.uk writes:
|> [Reusing a typedef name] shouldn't be a problem, because this is not really

|> an ambiguous occurrence. You can deal with that by having a production
|>
|> any_ident : ident | type_ident;
|>
|> and using any_ident for the identifier in a declarator (and several other
|> places). This should be possible without introducing any ambiguities.
|>
|> But for some parts of the C syntax this is not so easy, for labels you
|> probably have to expand the any_ident production to allow programs like
|>
|> typef int foo;
|> main ()
|> { foo: ;
|> }
|>
|> because otherwise there is a shift-reduce conflict
|> (reduce type_ident to any_ident for labels, shift for declarations).
[It's not impossible, but it's tricky and messy to get right. -John]

O.K. I haven't got out the grammar and done the actual table construction
(read: disclaimer), but declarations ARE the one place where you do need
the separate tokens for ident and type_ident. any other place, the
any_ident->ident|type_ident rule works fine (On labels, for instance, the
: in the lookahead stream resolves the ambiguity. I have also sucessfully
used the above productions to allow a typedef name to also be a tag name).
Consider the code fragment:

typedef int z;
main() {
long z;
}

is z being redeclared as a local variable in main(), or are you just
specifying the empty declaration for a long int type? The ambiguity
depends upon which token you return from the lexical analyzer when a is
encountered for the second time. The ANSI C grammar in the back of K&RII
is not ambiguous: it assumes that the lexer resolves the ambiguity, not
the parser.

The fix to this problem is much easier than I first thought: Just use
lex's right-context sensitivity operator (/) to search ahead in the input
stream for one of [,{;] (preceeded by optional whitespace) when an
identifier is encountered. In cases that match, always return the IDENT
token; on cases that don't, lookup the name and return TYPE_NAME if the
identifier is a typedef name, return IDENT otherwise.

As for my original statement

>One place where every yacc/lex based C compiler I know of is broken

I knew sun's cc was broken & any C compiler I had work on was too,
couldn't figure out a way to easily fix the problem, and over-generalized :-)

bb

R. Nigel Horspool

unread,
Jan 8, 1992, 4:40:04 PM1/8/92
to
Jan.R...@cwi.nl (Jan Rekers) writes:

>We use a different approach:
>As we use the GLR parsing technique, which is an LR parser which is
>able to split up in several LR parsers on a conflict, we are able to
>solve this problem in a very general way. The scanner just returns
>all possible interpretations of a token, the parser splits up to
>pursuit the different possibilities; some of these will die in the
>sequel and the correct choice survives. This leaves all information
>about the grammar out of the lexical scanner and is guaranteed to
>work for any case without any inspection of the grammar by a programmer.

I hate to disagree with my friend Jan, but this idea of splitting
off alternative parses introduces other nasty problems and it is
not clear that you would be any better off. Consider the
following C expression:
(foo)-3
This has two different parses depending on whether foo is declared
as a variable or as a typedef name. I.e., if foo is an int variable
then the expression is equivalent to foo-3; but if foo is a typedef
name, then the expression is equivalent to (foo)(-3) where a type
cast is being applied.

If we feed a program containing the k consecutive statements
v1 = (foo)-1;
v2 = (foo)-2;
v3 = (foo)-3;
.
.
vk = (foo)-k;
into Rekers' parser, we should get back at least 2**k different
parses.

The only way to prevent this combinatorial explosion of
possibilities is to kill off the alternative parse possibilities
as they are created. For example, the C grammar rule
<Expression> ::= ( <type-name> ) <Expression>
would need to have an associated semantic test to check whether
the type-name has actually been declared as a type name and, if
not, suppress this particular parsing possibility.

Thus there has to be feedback from semantic analysis into the
parser. I don't believe this is any great improvement over the
conventional approach of having feedback from semantic analysis to
the lexer so that typedef names can be returned as a different kind
of token.

Nigel Horspool
University of Victoria

Debora Weber-Wulff

unread,
Jan 8, 1992, 10:41:57 AM1/8/92
to
hje...@cs.cmu.edu (Mark Hjelm) writes:

>How do I KNOW that there are no situations
>which, due to parser lookahead, would cause the scanner to return an
>incorrect token type for an identifier (i.e. return "identifier", even
>though the identifier was just/will be made into a "typedef name")?

The way to KNOW would be to prove that the scanner and parser conform to
specifications (not easy, since we have neither an exact spec or a
semantics of C :-), and to prove properties of one's microsyntax or
grammar.

Another way to handle the identifier/keyword problem is to introduce a
token class <name> = <letter> {<letter> | <digit>} or whatever, and then
either insert a "filtering pass" that checks each name against a list of
reserved words and replaces the token with <identifier> or the appropriate
keyword token (I feel the "but-that's-grossly-inefficient" flames already,
but I'd rather be right than fast :-), or else to insert a blub of code at
the point that the name token has been recognized to differentiate between
the two.

--
Debora Weber-Wulff d...@inf.fu-berlin.de
Institut fuer Informatik +49 30 89691 124
Nestorstr. 8-9
D-W-1000 Berlin 31

Jeff Woods

unread,
Jan 9, 1992, 11:34:06 AM1/9/92
to
hje...@cs.cmu.edu (Mark Hjelm) writes:
>I have a parser, written using Yacc and Lex, for ANSI C. The grammar is
>taken pretty much verbatim from the standard. The scanner uses the symbol
>table to decide whether to return "identifier" or "typedef name" as the
>token type for an identifier. How do I KNOW that there are no situations
>which, due to parser lookahead, would cause the scanner to return an
>incorrect token type for an identifier (i.e. return "identifier", even
>though the identifier was just/will be made into a "typedef name")? Is
>there a general answer to this question for other parsing strategies
>(possibly with other amounts of lookahead) and other grammars (languages)?

>Just Curious,
>Mark

Mark,

I believe I can accurately address your question. You ask:

How do I KNOW that there are no situations which, due to
parser lookahead, would cause the scanner to return an

incorrect token type for an identifier?

You can definitely KNOW this answer by either constructing the Item sets
by hand or better yet, obtaining the set yacc has constructed (by invoking
yacc with the -v option) during parser generation. Since you are using
the ANSI specification, you will see that "tIdentifer" and "tTypedefName"
both appear in the same Item set. Since a typedef name is lexically
identical to an identifier, this places the burden of resolving this
ambiguity on the scanner. You now KNOW that you do have a situation (ie.
a parser configuration exists that places the burden of resolving lexical
ambiguity on the scanner) where the scanner is unable to return the
*correct* token type without some external help.

Brian Bliss also points this out:

The ANSI C grammar in the back of K&RII is not ambiguous: it
assumes that the lexer resolves the ambiguity, not the parser.

It's not clear to me why you qualify your question with "..., due to
parser lookahead, ...." The parser will only look ahead when it is unable
to make a parsing decision, such as shift, reduce or goto. Since the
ambiguity between typedef name and identifier is lexical rather than
syntactical, I don't understand your reason for qualifying your question
by parser lookahead.

Assuming I may continue, I would like to offer a strategy for handling
such a conflict.

Instantiate a flag, such as "typedefRecognition," and initialize it to
"ON" prior to parser invocation, after completed declarations and "OFF"
after recognition of typedef names. The scanner will consult the setting
of this flag after recognizing the lexeme as an "identifier" but before
returning the token ID. If "typedefRecognition" is "ON," the scanner
searching the name space for a symbol with an attribute matching
"typedefName."

Something like this might be in order:

parser code:

typedefRecognition = ON ;
status = yyparse() ;

scanner code:

... identifier scanning ...

tokenToReturn = tIdentifer ;

if(typedefRecognition == ON && isTypedefName(ActiveScope, lexeme)) {
typedefRecognition = OFF ;
tokenToReturn = tTypedefName ;
}

grammar file:

type-specifier:
tVoid typedefRecOn
| tChar typedefRecOn
.
.
.
| tTypedefName typedefRecOn
;

typedefRecOn:
{
typedefRecognition = ON ;
}
;

Hope this is helpful,

Jeff


----------------------------------------------------------------------
Jeff Woods
Software Engineer (214) 497-4501
Software Development Tools Group (214) 497-4500 FAX
CONVEX Computer Corporation jwo...@convex.COM

Boris Burshteyn

unread,
Jan 9, 1992, 1:33:28 PM1/9/92
to
This message regards typedef names recognition problem. It can be solved
in the modifiable gramars approach. Modifiable grammars allow to
add/delete rules from the basic set of rules while scanning the source
text. When reducing statement

typedef int foo;

a new rule

int : foo; // 1

will be introduced together with the new terminal 'foo'. Also, the basic
set of rules should contain a rule looking something like

declaration : int lis_of_identifiers ';' ; // 2

list_of_identifiers : identifier // 3
| list_of_identifiers ',' identifier // 4

Then, the text

foo var;

will be reduced by the sequence of rules 1, 3, 2. The example presented
does not address issues of scope, i.e. what happens when typedefs are
redefined in the inner scope and then become visible again upon exiting
the scope. This relates to the way new rules are added/deleted to/from
the basic set. I have submitted to SIGPLAN NOTICES a description of a
metalanguage for the USSA system, based on this approach. The
metalanguage is powerful enoupgh to handle various scenarious of rules
modification, so that solutions to visibility and scope problems may be
expressed.

There are also papers printed in SIGPLAN NOTICES, regarding this approach:

Henning Christiansen. "A Survey of Adaptable Grammars," Sigplan Notices,
V.25 N.11, 35-44, 1990.

Boris Burshteyn. "On the Modification of the Formal Grammar at Parse
Time," Sigplan notices, V.25 N.5, 117-123, 1990.

Boris Burshteyn. "Generation and Recognition of Formal Languages by
Modifiable Grammars," Sigplan Notices, V.25 N.12, 45-53, 1990.

Thanks, Boris Burshteyn - a man who can spell 'automaton'.

Jeff Woods

unread,
Jan 10, 1992, 11:51:59 AM1/10/92
to
Thankyou to Gene Ressler for pointing out a type-o in my earlier note.

Under "grammar file:", I introduced the nonterminal typedefRecOn erroneously.
It should have been typedefRecOff and should have been re-initialized with
"OFF" instead of "ON."

I should also mention that I am only trying to offer Mark some help.
There are many other places this flag must be "turned on/off" (such as
before/after cast-expressions, declarations, etc...). I have
omitted showing these because I felt they weren't needed to make my
point.

If you feel this example is not clear enough, please send me email and
I will be glad to address it in more detail.

Thanks again Gene,

Jeff

Brian Bliss

unread,
Jan 10, 1992, 2:15:17 PM1/10/92
to

earlier I wrote:
>The fix to this problem is much easier than I first thought: Just use
>lex's right-context sensitivity operator (/) to search ahead in the input
>stream for one of [,{;] (preceeded by optional whitespace) when an
>identifier is encountered. In cases that match, always return the IDENT
>token; on cases that don't, lookup the name and return TYPE_NAME if the
>identifier is a typedef name, return IDENT otherwise.

This approach bombs on

/* empty declaration with only a typedef name */
typedef int foo;
foo;

and on

/* tag names */
typedef int foo;
struct foo var;

the latter can be solved as suggested earlier:

In article <92-0...@comp.compilers>, s...@dcs.edinburgh.ac.uk writes:

|> [Reusing a typedef name] shouldn't be a problem, because this is not really


|> an ambiguous occurrence. You can deal with that by having a production
|>
|> any_ident : ident | type_ident;
|>
|> and using any_ident for the identifier in a declarator (and several other
|> places). This should be possible without introducing any ambiguities.

a scheme which will work for tag names and labels just fine,
but not names in the identifier space.

blak...@software.org writes:
|>So foo should be sent to the parser as an identifier token (with a way of
|>recovering the string (or a code for the string) from the token, and not
|>as type_identifier token or a label_identifier token

again, This can't be done without introducing ambiugities into the grammar.
(at least the grammar given in the back of K&RII) try introducting a
production typedef_name->T_IDENT and see how many shift reduce conflicts
you get.

bb

Brian Bliss

unread,
Jan 13, 1992, 12:00:35 PM1/13/92
to
jwo...@convex.com writes:

>Instantiate a flag, such as "typedefRecognition," and initialize it to
>"ON" prior to parser invocation, after completed declarations and "OFF"
>after recognition of typedef names. The scanner will consult the setting
>of this flag after recognizing the lexeme as an "identifier" but before
>returning the token ID. If "typedefRecognition" is "ON," the scanner
>searching the name space for a symbol with an attribute matching
>"typedefName."

again, consider the two code fragments


typedef int foo;
main () {

long foo;
}


and

typedef int foo;
main () {

long foo x;
}


in the latter, we definitely want to return T_TYPEDEF_NAME as the token
for foo. in the former, we want to return T_IDENT for foo, as it is being
redecalred as an identifier in an inner scope. The two programs are
identical up to "foo", so there is no amount flag twiddling that we can do
to resolve the problem correctly; we must use some sort of lookahead.

The lookahead can either be done by the lexical analyzer, as I've been
attempting to do, or by the parser, which means we return the same token
for both typedef names and other identifiers. The latter introduces many
ambiguities in the grammar in K&RII, in declaration statements, in
expression in the form of a single identifier appearing after declarations
and before the executable statements within a block, and within function
prototypes, and probably other places as well.

bb

Stephen J Bevan

unread,
Jan 13, 1992, 11:12:30 AM1/13/92
to
This message regards typedef names recognition problem.

So does this one :-)

[description of modifiable grammars deleted]

The following might seem like a flippant answer, but I am serious :-

One solution to the typedef problem, is to avoid it completely i.e. don't
bother trying to parse C and don't create any extensions of C. If you
really must parse C, use one of the hacks mentioned already i.e.
interogating the symbol table.

It seems to me that a lot of time is wasted trying to provide a solution
to something that is essentially broken. I'm disgusted at language
definitions that require hacks like symbol table interogation, or even
worse a modifiable grammar, just to be able to parse the syntax. C at
least has the excuse of being nearly 20 years old, but for newer
languages, there is no excuse at all. IMHO any person who designs a
grammar that needs additional help to be parsed, should have a _very_ good
reason for doing so.

For example, the problems caused by the same syntax for function calls and
array access in Ada is at least in part made up for by the fact that an
array can be swapped for a function very easily without having to change a
lot of code (an array is a discrete function after all). Having said that
I personally don't find this argument compelling, but at least it _is_ an
argument. I mean, just what is BK's or DMR's excuse for the typedef hack?

There have been (and are) too many languages that don't seem to have had
any real thought put to the design of the syntax. The designers seem to
start with "it should look like Pascal" or "it should look like C" and
blindly add syntax without giving it due consideration.

So please, any budding language designers/extenders out there, think long
and hard about about the syntax of the language and its impact on checking
context conditions (or static semantics if you prefer)

This is a heartfelt plea from somebody who has spent way too much time
poring over the definitions of languages like :- C, Pascal, Ada,
Oberon-2, ML, Haskell, ... etc. trying to make sense of their syntax and
context conditions.


Stephen J. Bevan be...@cs.man.ac.uk

PS Q: which language has the best syntax?
A: Lisp/Scheme, as there isn't any concrete syntax.
You program directly in the abstract syntax.
Makes writing the concrete -> abstract converter easy :-)
[On the other hand, some of us would put "easy to parse using yacc" fairly
low on our list of criteria for good language design. -John]

Dave Jones

unread,
Jan 13, 1992, 5:24:24 PM1/13/92
to
It looks like we could use a FAQ file on the subject of typedef
identifiers in C parsers. Folks are coming to the discussion late, and
starting the discussion over from square one. One well-meaning fellow
recently said that having the lexer look at the symbol-table violates
"engineering principles" and thus should not be done. He has obviously
never tried to implement the C grammar with a standard one-token-lookahead
parser generator, or he would know that it is not quite that simple.

So how about a FAQ? Volunteers anyone?
[If someone wants to write the definitive treatise on this topic, I'd be
happy to add it to the archives. Otherwise I'm declaring this topic closed.
-John]

Dave Jones

unread,
Jan 13, 1992, 5:51:05 PM1/13/92
to
>From article <92-0...@comp.compilers), by jwo...@convex.com (Jeff Woods):
) hje...@cs.cmu.edu (Mark Hjelm) writes:

) Something like this might be in order:
)
) parser code:
)
) typedefRecognition = ON ;
) status = yyparse() ;
)
...
) grammar file:
)
) type-specifier:
) tVoid typedefRecOn
) | tChar typedefRecOn
) .
) .
) .
) | tTypedefName typedefRecOn
) ;
)
) typedefRecOn:
) {
) typedefRecognition = ON ;
) }
) ;
)
) Hope this is helpful,
)
) Jeff


Yes, something *like* that is possible, but it is rather tricky. The
switches in grammar rules that toggle typedef- recognition must be
situated very carefully, and in places that some may find
counter-intuitive. In the example above, "typedefRecOn" should be located
*after* "tTypedefName" and "tVoid", not before. (!)

Also (for purposes of maintainability if for no other reason), you want to
be sure that the scanner is always "eager" to do lookahead, not "lazy" in
states where a lookahead is not urgently needed to select a state-change.

A few months ago I posted a YACC/LEX program which, to the best of my
knowledge, does everything correctly, although given the trickiness of the
problem, I would not be astonished to hear otherwise.

Also I was under the false impression that ANSI C was meant to be a strict
extension of C, and therefore implemented some features of C-Classic which
ANSI C dropped, thinking the formal grammar I started from was "buggy".
Thus my statement that I thought the parser was "ANSI C" was technically
incorrect, although probably not enough so to cause any problems. It is
"ANSI C with C-Classic Extensions", I guess.

Jim Roskind x5570

unread,
Jan 14, 1992, 4:23:39 PM1/14/92
to
> From: hje...@cs.cmu.edu (Mark Hjelm)
> Organization: School of Computer Science, Carnegie Mellon

>
> I have a parser, written using Yacc and Lex, for ANSI C. The grammar is
> taken pretty much verbatim from the standard. The scanner uses the symbol
> table to decide whether to return "identifier" or "typedef name" as the
> token type for an identifier. How do I KNOW that there are no situations
> which, due to parser lookahead, would cause the scanner to return an
> incorrect token type for an identifier (i.e. return "identifier", even
> though the identifier was just/will be made into a "typedef name")? Is
> there a general answer to this question for other parsing strategies
> (possibly with other amounts of lookahead) and other grammars (languages)?
>

This is an excellent and interesting question, and pulled out a very
nice line of responses. My answers are:

1) I have no "general method" of proving correctness.
2) In this case, correctness can be proved.
3) If your grammar came from the standard, the proof will
identify an error in your grammar.
4) A correct grammar exists, and I distribute it for free.


For folks that can't figure out why there is so much discussion about this
problem, consider the code fragment:

T(*b)[4];

Notice how the meaning is very different depending on whether T is a
typedef-name, or an identifier (and hopefully a function in this case).
For this reason most C lexers distinguish between typedef names and other
identifiers. Alas, as the poster points out, there is concern that the
feedback loop (parser annotates symbol table, which lexer interrogates,
and then feeds tokens to the parser) will fail in some complex scenario.

Other threads of response included the use of a GLR parser, which tries
all alternatives. Alas, as will be shown, the ambiguity in C is too
great, and such an approach will not provide a unique syntactically
correct parse when it does exist.

Other respondees mentioned that almost no yacc based parsers get this area
of the standard right. Since the grammar included in the standard is
misleading, the cause for this problem will become abundantly clear. I
will start with a description of how a proof of correctness could proceed,
take a slight aside to show why the traditional grammar tends to cause
this proof to fail, and explain how the grammar can be written (provably)
correctly. Since I am typing this on the fly, I will surely make some
errors, but I hope that most of the readers will find the content
interesting. Note that my grammar distribution includes a very large
paper that deals with many C/C++ ambiguities in much greater detail, and
with many more examples.

The fact that yacc uses LR(1) scanning is pivotal to the argument in this
special case. In this special case you can note that the typedef-ness of
an identifier can *only* change after a declarator is complete.
Specifically, the standard states that an identifier enters scope only
after its declarator is complete. The completion of a declarator is
always marked by one of three tokens:

";" end of declaration statement
"," end of individual declarator (another is comming)
"=" end of declarator, initializer is comming.

To prove correctness, you must show that reductions involved with
annotation of your symbol table take place while one of the three tokens I
just mentioned is waiting in the lookahead buffer. The argument notes
that since none of the three tokens I identified relies on the
typedef-ness of an identifier, everything will work out (i.e., the
identifier vs typedef-name separation by the lexer can always be done in a
correct and timely manner). Basically, all you have to do is look at the
position of the action code that is placed in the grammar to annotate the
symbol table, and be sure that this happens before one of the above three
terminating tokens (context counts) has being processed.

As an example, suppose you had a rule that looked like:

declaration: storage-class INT identifier ';' { /* annotate sym table */ }

Please don't bother to argue that you would never have such a specific
rule in the grammar, rather note that the action is being done *after*
reading the ';'. This is "bad." If you had such a rule, then there is a
chance that yacc would have read the next token (it is allowed to if it
needs to), and there is a chance that the next token *might* be an
identifier. Hence, with the above rule, there is a chance that the lexer
would process an upcoming identifier *prior* to the annotation of the
symbol table, and hence incorrect results could arise. For example:

typedef int T ;
T a;

might be incorrectly parsed (the second "T" might be read *and* tokenized
by the lexer before the symbol table is annotated). A guaranteed correct
rule could be written:

declaration: storage-class INT identifier { /* annotate sym table*/ } ';'

Note that here the action takes place with the parser at *most* looking
ahead at the ';', and hence the lexer would not have had to process future
identifier/typedef-name distinctions.

Getting back to your example, since you copied the grammar from the ANSI
Standard, your grammar is probably wrong. To expose the problem, try
parsing:

typedef int A, B(A); /* my bet is you get a syntax error*/

When correctly parsed, the above is equivalent to:

typedef int A;
typedef int B(A);

or simply:

typedef int A;
typedef int B(int); /* B's type is function taking an int
returning and int */


Again, to verify my interpretation, look at the verbose discussion of
declarations in the ANSI standard and note that an identifier enters scope
when its declarator is complete. Despite this assertion in prose, most
grammars are based on the K&R or ANSI grammars, which gathers an
"init-declarator-list" before combining the declarators with the
"declaration-specifiers." In the above example, gathering the entire list
"A, B(A)" before annotating the symbol table for "A" would allow the lexer
to incorrectly tokenize the second "A".

You might wonder how GNU GCC (or others related compilers) manage to get
the above parse right, when their grammar looks a lot like this standard
(and I assert, faulty) grammar. In the case of GCC, this area of the
action code includes a hack wherein the "decl-specifiers" are fetched from
lower points in the stack using $-1 (using negative numbers allows action
code to reach down into questionable areas of the yacc stack, whereas
positive numbers correspond to the tokens in the productions with the
action). Basically, by reaching backwards when they see a ',' (or '='),
they are able to annotate the symbol table even though they use the
standard grammar. Once again, it is provable that they get the correct
parse (if they can prove that $-1 always contains the desired
decl-specifier value. This latter assertion is far from trivial to prove,
as they must correctly process function declarations, which occasionally
have no decl-specifiers! They get it right, but a second hack is required
atop the first hack, and these two hack unite to make C++ parsing a horror
for them).

For those compiler hackers out there, it is worth noting that the problems
are greater than what was just given in the above example. Not only can
"typdef-ness" be established by a declarator, it can be taken away. For
example, consider the following:

#include <assert.h>
typedef char F;
main() {
long a=sizeof(F), F, b=sizeof(F);
assert(1 == a);
assert(sizeof(long) == b);
}

With proper rewriting, a grammar can be made to correctly process such
declaration without "cheating" and looking at $-1 (or where ever else this
info might be socked away). The trick is to combine the declarator with
the declaration specifier (and annotate the symbol table), then combine
with the optional initializer (rather than forming an init-declarator),
and then if there is a ',', form a token that once again acts just like
the decl-specs to process the next declarator. If you want to see the
correct grammar, take a look at my distribution (which includes a C and a
C++ grammar, along with a cheap FLEX input file).

Since one of the respondees included suggesting that a GLR parser could
just try all interpretations, and that only one would be valid, the
following is a rather interesting example:

typedef int T;
main() {
int T=100, a=(T)+1;
assert(101 == T);
}

Notice that the text "(T)+1" could incorrectly be interpreted, if "T" were
still a typedef-name, as a cast to type T of the expression "+1". Note
that the GLR approach would yield two distinct and valid parses, and hence
there would be no single winner (without additional heuristics).

Another person focused on the issues involving redeclaration of
typedefnames at inner scopes. Indeed, this was something of a bear to
tackle (without kludging). Examples of the code that usually causes
problems include:

typedef int T1 ;
typedef int T2 ;
typedef int T3;
main() {
const T1 T1; /* redeclares T1 to be an int */
const T2 (T2); /* redeclares T2 to be an int */
const T3; /* syntax error : missing declarator */
}

Bruce Blodgett (formerly of Honeywell Corporation) and I independently
implemented clean resolutions of this within a YACC grammar, but the
results can be seen within my distribution. Interestingly, this careful
analysis of C identified some unaddressed scenarios in the ANSI standard.
Since these results go beyond the scope of this posting, interested
readers are encouraged to look at my paper which is included with my
distribution.

Doug Lea and Doug Schmidt have graciously offered to provide anonymous ftp
sites for the 8 files, as well as the Berkeley YACC source (if you need
it).

ics.uci.edu (128.195.1.1) in the ftp/gnu directory (even though neither of
the archives are GNU related) as:

c++grammar2.0.tar.Z
byacc1.8.tar.Z

mach1.npac.syr.edu (128.230.7.14) in the ftp/pub/C++ directory as:

c++grammar2.0.tar.Z
byacc1.8.tar.z


Jim Roskind
Independent Consultant
(407)729-4348 or (617)290-4990 x5570
j...@hq.ileaf.com or ...!uunet!leafusa!jar

p.s.: I'm always looking for interesting compiler and/or grammar work.

-8 Doug Landauer 8-

unread,
Jan 14, 1992, 6:58:05 PM1/14/92
to
> So please, any budding language designers/extenders out there, think long
> and hard about about the syntax of the language and its impact on checking
> context conditions (or static semantics if you prefer)
>
> [On the other hand, some of us would put "easy to parse using yacc" fairly
> low on our list of criteria for good language design. -John]

On the other other hand, I would expect "easy for humans to parse quickly"
should be very high on everyone's list. The worst problem with these
kinds of syntax design mistakes is not so much the inconvenience that it
causes compiler implementors; rather, it is the fact that if the lexer
and/or parser have to jump through hoops to classify a construct, then so
does the *person* trying to read this code. The existence of these kinds
of not-quite-ambiguities make the language itself (i.e., *any* program
written in that language) inherently less readable. This is bad but sorta
manageable in C; it's just scary in C++.

> Q: which language has the best syntax?
> A: Lisp/Scheme, as there isn't any concrete syntax.
> You program directly in the abstract syntax.

I'm not quite sure what you mean here; I do believe that it is the
difficulty that most humans have parsing LISP that is one of the major
factors that has prevented that language from becoming as popular as its
proponents might like for it to be.

(Is this thread more suitable for comp.lang.misc now?)
--
Doug Landauer -- land...@eng.sun.com _
Sun Microsystems, Inc. -- STE, SunPro, Languages

Dale R. Worley

unread,
Jan 15, 1992, 1:14:39 PM1/15/92
to
[To the moderator: Even though the topic has been closed, I hope that
this article will be accepted anyway, since it describes how these
problems were solved in a production compiler that handled all the
tricky cases of ANSI C.]
[This is the absolute definite last word on parsing typedefs. Really. -John]

I worked on the parser for the Compass C compiler. The solution we used
to the typedef vs. identifier problem was to have the scanner return two
different token types, based on the symbol table. In order to make this
work correctly, a few things have to be done carefully:

An identifier has to be inserted into the symbol table when the
punctuation following its declarator (',', ';', or '=') is read. This
is not particularly tricky, since an LALR(1) parser never has to look
ahead at that point, and our parser looks ahead only when needed.

Making sure that tricky declarations like

typedef int foo;
foo;
foo x;
volatile foo;
int foo;

are handled correctly is done by having the grammar of declarations
enforce the rule that "if a typedef name is used as a declarator, the
declaration-specifiers of the declaration must contain a type-specifier".
Thus, if "foo" is a typedef, then

volatile foo;

is an empty declaration (and is illegal), while

int foo;

is a redeclaration of "foo" as an integer variable. To do this, the
grammar must keep track of whether a type-specifier has been seen in the
declaration-specifiers as it reads the next token and tries to decide
whether it is another declaration-specifier or a declarator. Thus you
have grammar rules like:

declaration-specifiers-without-type-specifier ::=
declaration-specifiers-without-type-specifier type-modifier

declaration-specifiers-with-type-specifier ::=
declaration-specifiers-with-type-specifier type-modifier
| declaration-specifiers-without-type-specifier type-specifier
| declaration-specifiers-with-type-specifier type-specifier

declaration ::=
declaration-specifiers-without-type-specifier
non-typedef-declarator
| declaration-specifiers-with-type-specifier
non-typedef-declarator
| declaration-specifiers-with-type-specifier
typedef-declarator

If the grammar is aware of these distinctions, it can be made LALR(1).
This multiples the grammar rules, of course, but this is not the only
place that a practical C grammar has several clones of a single rule in
the Standard's grammar.

Dale Worley Dept. of Math., MIT d...@math.mit.edu

Stephen J Bevan

unread,
Jan 15, 1992, 8:22:28 AM1/15/92
to
[On the other hand, some of us would put "easy to parse using yacc" fairly
low on our list of criteria for good language design. -John]

Maybe that's why there are so many (syntactically) badly designed
languages :-)

IMHO by definition the "syntax" should be parsable by a context free
grammar. As I said previously if you don't design it like this you should
have a good reason. I have nothing against languages that deviate from
the rule as long as there is some real benefit from it. The Ada example I
gave is one example of this. I'm at a loss to think of another one.

Also, I don't consider "yacc" to be the last word in syntax analysis.
Like C it's old and past it's prime. There are better tools freely
available (e.g. GMD toolbox), if only people would use them.

Stephen J. Bevan be...@cs.man.ac.uk

Dr A. N. Walker

unread,
Jan 17, 1992, 12:34:53 PM1/17/92
to
In article <92-0...@comp.compilers> land...@morocco.Eng.Sun.COM

(-8 Doug Landauer 8-) writes:

>> [On the other hand, some of us would put "easy to parse using yacc" fairly
>> low on our list of criteria for good language design. -John]

I'm not too worried about Yacc in particular, but I think "easy to
parse automatically" needs to come fairly high up -- see below.

>On the other other hand, I would expect "easy for humans to parse quickly"
>should be very high on everyone's list.

A successful language must be easy to (a) write [else no-one will
use it]; (b) read [else it will be unmaintainable]; and (c) parse [else
implementations will be expensive and rare]. Anyone [poetic licence!]
can write a Pascal parser -- so Pascal compilers are ten-a-penny, every CS
student studies them, and Pascal is (or was until recently) popular well
beyond the intrinsic merits of the language. Contrariwise, Algol is a pig
for amateur compilation, and languishes despite its merits.

>[...] if the lexer


>and/or parser have to jump through hoops to classify a construct, then so
>does the *person* trying to read this code.

This doesn't follow. Traditional lexers and parsers take a
microscopic view of the code (typically based on "what is the next
symbol?"). Human readers typically take a more global view, even on first
reading. I don't carry a symbol table in my head, so I expect to have to
refer back *or ahead* in the code to discover what some variable "is"; I
can make use of comments or the names of variables to intuit what some
code is trying to do; layout often helps to give the global structure of
the program. Of course, if the programmer writes obfuscated code, *then*
the human may have to jump through the same hoops.
--
Andy Walker, Maths Dept., Nott'm Univ., UK.
a...@maths.nott.ac.uk
[I'd expect that the obstacles to Algol60 implementations are call-by-name
and the lack of standardized I/O. On the other hand, Algol68 is indeed
difficulto to parse. -John]

Brian Bliss

unread,
Jan 20, 1992, 4:04:32 PM1/20/92
to
I don't want to start up the thread on the typedef problem all over again,
but as far as it is concerned with language design...

Consider the production for a C declaration statement, ala K&RII:

declaration -> dec_specifiers init_dec_list
^ by requiring a colon (like pascal) or some
other token right there, the entire problem
could have been avoided.

bb

Reply all
Reply to author
Forward
0 new messages