Maintaining scope while parsing C with a YACC grammar

eliben

unread,

Apr 25, 2011, 8:14:42 AM4/25/11

to

Hello,

Suppose I'm parsing C with a YACC-like grammar. While parsing this
code:

int main()
{
int a = 1;
{ /* internal scope */
int b = 2;
}
}

I want to know that the declaration "int b = 2" happens inside an
internal scope. My problem is that the way YACC grammars (and bottom-
up parsing in general) are structured, "int b = 2" is analyzed
*before* the whole block { int b = 2;} so I've seen "int b = 2" before
I've seen the complete block. What is the usual solution to this
problem?

Thanks in advance

[There's a couple of possibilities. One is to add separate rules for
the open and close braces with action code that increments and
decrements the nesting level, so you know when you reduce the
declaration what scope it is in. Another is to parse the whole thing
into an AST without trying to interpret the symbols other than making
pointers to a generic symbol table entry per name, then walk the AST
and add the scope and type info. Perhaps people will have other
suggestions. -John]

Robert A Duff

unread,

Apr 26, 2011, 12:22:45 PM4/26/11

to

eliben <eli...@gmail.com> writes:

> [There's a couple of possibilities. One is to add separate rules for
> the open and close braces with action code that increments and
> decrements the nesting level, so you know when you reduce the
> declaration what scope it is in. Another is to parse the whole thing
> into an AST without trying to interpret the symbols other than making
> pointers to a generic symbol table entry per name, then walk the AST
> and add the scope and type info. Perhaps people will have other
> suggestions. -John]

I strongly recommend the "build a tree" solution. It might seem like
a lot of trouble at first, but it will simplify things in the long
run.

Do all the interesting work during a subsequent walk of the tree. Or
multiple walks.

I'd do the "making pointers to a generic symbol table..." part during
the tree walk, too, except that C requires some sort of kludgery to
deal with typedefs.

- Bob
[The point of the generic symbol table stuff is that you have to remember the
names somehow, and that seems less awful than doing a strdup() for each
name and hanging the strings off the AST. -John]

Robert A Duff

unread,

Apr 26, 2011, 2:08:38 PM4/26/11

to

Robert A Duff <bob...@shell01.TheWorld.com> writes:

> [The point of the generic symbol table stuff is that you have to
remember the names somehow, and that seems less awful than doing a
> strdup() for each name and hanging the strings off the AST. -John]

Oh, yeah, sure -- I didn't understand what you meant.

You want a table of identifiers, so multiple occurrences are just an
index into that table or whatever. But no particular semantic
information attached to those identifiers.

I once wrote a compiler-like tool that represented identifiers
as an index into the source code. But I later decided that was
a dumb idea.

- Bob

eliben

unread,

Apr 29, 2011, 2:20:09 AM4/29/11

to

On Apr 26, 7:22 pm, Robert A Duff <bobd...@shell01.TheWorld.com>
wrote:

> eliben <eli...@gmail.com> writes:
> > [There's a couple of possibilities. One is to add separate rules for
> > the open and close braces with action code that increments and
> > decrements the nesting level, so you know when you reduce the
> > declaration what scope it is in. Another is to parse the whole thing
> > into an AST without trying to interpret the symbols other than making
> > pointers to a generic symbol table entry per name, then walk the AST
> > and add the scope and type info. Perhaps people will have other
> > suggestions. -John]
>
> I strongly recommend the "build a tree" solution. It might seem like
> a lot of trouble at first, but it will simplify things in the long
> run.
>
> Do all the interesting work during a subsequent walk of the tree. Or
> multiple walks.

Since it's parsing of C I'm talking about, this approach will have to
somehow handle ambiguity of this kind:

T * x;

This can be either a declaration or a multiplication, depending on
earlier symbol table information (whether T is a type or not).

So are you proposing to build an ambiguous AST that contains *both*
parses and resolve between them in later passes? Or just pick one and
maybe modify it later? Do you have references (papers, books, etc.)
explaining this technique?

Thanks in advance

Robert A Duff

unread,

May 2, 2011, 8:19:58 PM5/2/11

to

eliben <eli...@gmail.com> writes:

> On Apr 26, 7:22 pm, Robert A Duff <bobd...@shell01.TheWorld.com>
> wrote:
>> I strongly recommend the "build a tree" solution. It might seem like
>> a lot of trouble at first, but it will simplify things in the long
>> run.
>>
>> Do all the interesting work during a subsequent walk of the tree. Or
>> multiple walks.
>
> Since it's parsing of C I'm talking about, this approach will have to
> somehow handle ambiguity of this kind:
>
> T * x;

Right, that's what I meant by:

...except that C requires some sort of kludgery to
deal with typedefs.

in my earlier post.

> This can be either a declaration or a multiplication, depending on
> earlier symbol table information (whether T is a type or not).
>
> So are you proposing to build an ambiguous AST that contains *both*
> parses and resolve between them in later passes?

I have never seen a C parser that does that. I've seen Ada parsers
that do that (Ada has an ambiguous grammar, too). The idea is that
you create a single tree node that represents "this or that", and
during semantic analysis, you transform it into a "this" or a "that"
tree. It works well for Ada. I don't know how well it would work for
C.

The typical approach for C is to do as you say -- put typedefs in the
symbol table, and have them affect the parse (or really, the lexer).

In any case, I stand by my statement, 'I strongly recommend the "build a
tree" solution.' even for C. Defer as much as possible of semantic
analysis until after parsing, where "as much as possible" is not
necessarily "all of it".

>... Or just pick one and maybe modify it later?

That's possible, but kind of kludgy.

>...Do you have references (papers, books, etc.)
> explaining this technique?

No, sorry. I'd be interested to hear if anybody built a C parser that
didn't use any feedback into the parser from a symbol table to deal
with the typedef issue.

- Bob

Torben Ægidius Mogensen

unread,

May 3, 2011, 3:51:14 AM5/3/11

to

eliben <eli...@gmail.com> writes:
> Since it's parsing of C I'm talking about, this approach will have to
> somehow handle ambiguity of this kind:
>
> T * x;
>
> This can be either a declaration or a multiplication, depending on
> earlier symbol table information (whether T is a type or not).

One technique for handling this is to let the lexer access the symbol
table and determine if T is a type name or not and generate different
tokens for these. The grammar would then have productions somewhat
like

Declaration -> Type non-type-id
| ...

Type -> type-id
| Type *
| ...

Expression -> Expression * Expression
| non-type-id
| ...

It becomes much more complicated for real C, but the idea should be
clear enough.

This requires the parser to keep a symbol table for the current scope
available to the lexer. This table needs not contain full information
for each identifier, just enough to distinguish type names from other
names.

That said, I consider this kind of ambiguity bad language design, as
it is not only hard for a parser to handle, but also hard for a human
reader. Possible fixes are to make declarations and expressions /
statements non-overlapping syntactically (as in Pascal) or to keep
type names syntactically distinct from variable names, e.g. by making
type names start with upper case letter and variable names start with
lower case letters (as in Haskell).

Torben
[As Dennis said, "the ice is thin here." -John]

Paul B Mann

unread,

May 6, 2011, 1:43:51 PM5/6/11

to

There are two independent topics being discussed here.

(1) Scope of variables.
(2) typedef variables.

(1) Scope of variables is solved easily. In your symbol table you
have to keep track of the level for all variables. Every time you
see a '{' you have to increment the level. So the 'a' will be put
into the symbol table as a level 1 variable and the 'b' will be a
level 2. The bottom-up quality of LALR will not affect this. The
'a' will be seen first and the 'b' later, so no problem here.

(2) The infamous 'typedef' problem continues to plague the newbie or
part-time LALR grammar hacker, but it was solved way back in 1987 by
an LALR parser generator which used an integrated symbol table and
semantic grammar symbols (e.i. {typedef}). The state-of-the-art
simple solution works fine with the LRSTAR parser generator (see
http://highperware.com). Here is the simple LALR(1) grammar which
solves this 'typedef' problem:

/* C Typedef Solution. */

<error> => error();
<identifier> => lookup(); /* Symbol table lookup. */

/* Rules. */

Input -> [Declaration]... <eof> +> input_

Declaration
-> VarDecl [',' Var ]... ';' +> decl_
-> typedef VarDecl2 [',' Var2]... ';' +> typedefdecl_

VarDecl -> Type... Ident

Var -> [Ptr]... Ident

VarDecl2 -> Type... Ident2

Var2 -> [Ptr]... Ident2

Ident -> <identifier> +> ident_(1)

Ident2 -> <identifier> +> ident_(1,{typedef})

Type
-> SimpleType +> type_(1)
-> Type Ptr

Ptr -> '*' +> ptr_

SimpleType
-> char
-> int
-> short
-> unsigned
-> {typedef}

It handles the input file:

typedef unsigned int UINT, * UINTPTR;
UNIT a, b, c;
UINTPTR x, y, z;

Note, no hacks or kludges are required, just a state-of-the-art parser
generator.

Paul B Mann

Ira Baxter

unread,

May 13, 2011, 6:46:39 PM5/13/11

to

"Robert A Duff" <bob...@shell01.TheWorld.com> wrote in message

> No, sorry. I'd be interested to hear if anybody built a C parser that
> didn't use any feedback into the parser from a symbol table to deal
> with the typedef issue.

Our DMS Software Reengineering Toolkit does exactly this.
(www.semanticdesigns.com/Products/DMS/DMSToolkit.html).

We use GLR parsing machinery, which means we can decouple parsing from
semantic hackery. The grammar for this is just as you'd expect, with
productions for expression statements and declarations. The GLR
parser builds an ambiguous tree with maximal sharing of the subtrees.
At the end of the parse, you have a parse DAG with ambiguity nodes.

Name resolution is handled by an attribute grammar evaluator with some
special effects. The AGE is pretty general and we use it for lots of
tree-structured analyses; this is pretty nice for nested scopes,
building control flow graphs, etc. For name and type resolution, the
attributes passed up and down the tree are, as you might expect,
reference to symbol scopes and/or type declarations. The special
effects of interest here are the ability to declare an aborted
attribute evaluation at any tree node; this will kill off any
dependent attribute computation and delete the offending tree node and
its otherwise-unreferenced children (remember, there may be sharing,
so a node may have more than one reference). What we do to handle the
T* x issue is check for consistency of the types; if T is a type, then
T*x as an expression makes no sense and that part of the AGE dies off.
If T is numeric type, then T* x makes no sense as a declartion, and
that part of the AGE dies off. Destroying the "bad" subtree removes
the incorrect parse from the tree, and thus the final tree has no
ambiguity nodes and a completed name/type resoultion with symbol
tables that is correct. The C parser is question has been used in
anger to carry out static analysis of software systems with usome
19,000 compilation units. We think it is quite robust.

This scheme works really nicely for C, C++, COBOL and other languages,
and it means we can build grammars without worrying about semantic
constraints. The grammars are thus nice and clean, and the AGEs are
pretty clean too because of the separation of concerns.

You can do this by mixing symbol lookup and parsing, but then it
generally gets a lot messier to build both.

--
Ira Baxter, CTO
www.semanticdesigns.com

Chris F Clark

unread,

Jun 12, 2011, 8:43:13 PM6/12/11

to

I'm sorry I missed this discussion. I did see the comments by Ira
Baxter and Paul Mann though and found them very informative. The one
thing that it prompted me to add is that there results are not
contradictory.

Paul Mann's grammar additions allow one to describe scoping as a
S-attributed grammar problem--that's the class of calculations one can
perform in one pass as one parses with an LR grammar. This matches
the fact that many languages such as C & Pascal are defined to allow
their declarations to be processed in one pass with define before use
rules.

The system Ira Baxter described was a general attribute solving
system. Thus, it includes Paul's model, but being more general
requires more work to do so. The general system does allow parsing
languages like PL/I where the declaration can appear after some (or
all) of the uses.

Hope this helps,
-Chris

******************************************************************************
Chris Clark email: christoph...@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris