By intermediate representation, I mean the sort of information which a
compiler front-end passes to the back-end. Such a definition could
obviously be in use by compiler builders, or could be used to pass
information about programs between tools in an integrated environment.
The best example I know of this is the DIANA standard for Ada.
Any information about this greatly appreciated.
Thanks.
- Dave Martin
[There's certainly no widely used standard. The RTL used in GCC is fairly
well documented in the files that come with the distribution. -John]
--
Send compilers articles to comp...@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
>Is there a standard or widely accepted definition of an intermediate
>representation for ANSI C or for C++? (Or even one that's well
>documented, even thought it may not be in wide use.)
>
>By intermediate representation, I mean the sort of information which a
>compiler front-end passes to the back-end. Such a definition could
>obviously be in use by compiler builders, or could be used to pass
>information about programs between tools in an integrated environment.
>The best example I know of this is the DIANA standard for Ada.
A couple of responses have pointed me to the RTL (Register Transfer
Language) used in GNU's gcc. However, unless I completely misunderstand
RTL, it's not really an example of what I was asking for.
According to RTL documentation, RTL describes the instructions to be
output by the compiler; that is, the instruction set of a particular
machine's architecture. What I'm seeking is a description of the syntax
and semantics of the C program being compiled (which then has to be
translated into an appropriate machine language using the RTL).
Does RTL also somehow describe the syntax and semantics of C programs? If
not, how are the syntax and semantics of C programs represented in gcc, or
other compilers? Is it all terribly ad-hoc, or does it conform to any
well-documented representation format?
Any enlightening comments would be greatly appreciated.
Thanks.
- Dave Martin
[RTL is indeed what GCC uses between the parser and the code generator, but
you correctly surmise that it's much lower level than source code. If what
you really want is something to use in source tools, RTL isn't it. -John]
It is highly likely that this topic -- that of a high-level intermediate
representation for C++ programs -- will be the primary focus of the
Advanced Topics Workshop that will follow the upcoming USENIX C++
conference. When details are available, I will post them.
Scott
It's a shot in the dark, but what about OSF's ANDF format? It is much
higher level than GNU RTL, but does it keep enough of the source code
struture to be useful? Possibly. I've been out of this loop for some time
but I looked into it a year and a half ago. It should be fairly standard
by now.
Dave Howell
[I believe that ANDF is quite low level, considering that one of its goals
is to make it difficult to reverse-engineer compiled programs. And does
anyone actually use it? -John]
Perhaps it should. Can anybody tell me where to find documents or papers
defining ANDF or talking about its design or talking about using it or ...
So far, it's just a big rumor to me. I'd really like to read some facts.
Thanks,
Preston Briggs
[I haven't seen any public descriptions of ANDF either, and would also like
to see one. -John]
DIANA was supposed to work as the accepted intermediate representation for
Ada. However, in practice, for a number of different reasons, all of the
Ada compilers use a proprietary IR, even those based in part on DIANA.
Instead of a common IR, the Ada world (?) is moving towards a standard
method of accessing different IRs using a procedural interface, dubbed
ASIS. The ASIS spec defines an interface that is constant between Ada
compilers. But implementation of the interface is very different from
compiler to compiler.
The ASIS spec is intended to provide a single interface for toolsmiths and
CASE tools (etc.), not for compiler writers.
Steve Scalpone
Verdix Western Operations
I think the problem may be that many people don't see the point of making
a compiler generate this special AST if all they are going to do is then
generate code.
However, IMHO this ignores some of the good uses that a standard AST can
be put to. Some examples, that spring to mind are :-
+ pretty printers.
Tools like indent for C are hopless as they consistently fail on
source that doesn't match its limited input format.
+ documentation extractors
The common solution is to awk/sed/perl/grep sections out of the
source, but as with the pretty printing, they are highly reliant on
particular formats
+ tags generators
These can be made independent of the input format.
e.g. etags fails on some C I have because the declarations don't
match the special format it uses to recognise functions.
+ code analysers
e.g. somebody has written a program that trawls over the standard AST
produced by the SRC Modula III compiler and finds which procedures
are never called.
All the above can be written without a standard AST, but they are a
_lot_ easier to write if there is one, and don't require the user to
delve into the guts of a compiler or invent ad-hoc solutions with
crufty little languages like awk, perl ... etc.
Stephen J. Bevan be...@cs.man.ac.uk
[Can you really do a DIANA-like thing for C? I'd think that the syntactic
messiness introduced by the preprocessor would make it awfully hard to
come up with anything that was simultaneously easier to handle than the
original source and didn't lose something significant in translation. -John]
From what I remember, TDF is like Ten15, in that it's a set of algebraic
lattices, using constructors to make bits of TDF out of smaller bits of
TDF (so you end up with a tree-shaped structure). A full textual
description of the tree was typically vast, because it includes verbose
constructors for list-like comprehensions, but of course there is a more
compact notation, and a binary encoding.
Nick Haines ni...@cs.cmu.edu
[Found it, it's in my UNCOL bibliography, message 91-12-026. Also see the
next message.]
Can you really do a DIANA-like thing for C? I'd think that the syntactic
messiness introduced by the preprocessor would make it awfully hard to
come up with anything that was simultaneously easier to handle than the
original source and didn't lose something significant in
translation.
C macros are lexical macros rather than syntax macros, so in general a
Diana-like representation isn't feasible. However, if you're willing to
restrict yourself to syntactically well-formed macro bodies (you forbid ##
and, e.g., "#define BEGIN {"), you should be able to manage. This is done
by the ANDF producer (front end), in order to support installer (back end)
macro expansion, which is necessary for platform-dependent macro
expansion.
I can't see how to get around expanding syntactically ill-formed
macros, though.
-s
[I was more worried about things like typedef inside #if which can have
pervasive and rather messy effects, e.g. a particular name may be a scalar
or an array depending on preprocessor values. And can you tell us more about
experience with ANDF? -John]
ASTs are resource hogs, particularly of memory, and in the early 80s most
machines simply couldnt handle a DIANA-based compiler. Plus DIANA was
specified using an object-oriented language (IDL), which had to be mapped
into a non-object-oriented language language like C or ADA. A lot was lost
along the way, including readibility and performance.
My experience suggests that each tool in a programming environment will
have a particular set of requirements for the AST, including the compiler.
The skill is designing an AST that can satisfy them all, without allowing
the AST design to grow like topsy. To achieve this you need a good core
design coupled with extensibility.
> However, IMHO this ignores some of the good uses that a standard AST can
> be put to. Some examples, that spring to mind are :-
>
> + pretty printers.
> Tools like indent for C are hopless as they consistently fail on
> source that doesn't match its limited input format.
The relationship between the abstract syntax and concrete syntax is an
area that most AST specifications do not address. DIANA didn't, it only had
the notion of an "lx_comments" attribute on each node, with no definition
of how a comment was mapped to a node. Typically a given AST supports only
a canonical concrete representation, with the original lexical whitespace
removed. I used to believe that this was acceptable, but experience has
persuaded me that an AST needs to preserve all the original input and be
able to regenerate it exactly. This does not preclude a pretty-printer, of
course.
> + code analysers
> e.g. somebody has written a program that trawls over the standard AST
> produced by the SRC Modula III compiler and finds which procedures
> are never called.
For the record, the SRC Modula-3 compiler does not make available a public
AST interface. The program in question is based on another compiler
front-end, which does have a public AST interface, and is derived from the
now defunct Olivetti implementation of Modula-3. The AST specification was
strongly influenced by the DIANA work. However, it is directly specified
in a language with object types (Modula-3 itself), in which the compiler
and associated tools are also written. The end result is reasonably
efficient in both space and time, but still more resource hungry than some
people would find acceptable. You can find more about it in:
"An Extensible Programming Environment for Modula-3", Mick Jordan,
Proceedings of the Fourth ACM SIGSOFT Symposium on Software Development
Environments, Software Engineering Notes, 15, 6, Dec 1990.
A copy of the system can be had from gatekeeper.dec.com from the file
/pub/DEC/Modula-3/release/m3tk-2.06.tar.Z. You will need the standard SRC
Modula-3 distribution installed before you can build it.
Mick Jordan
I have a short advertising brochure on TDF which has the following
contact:
Dr. Nic E. Peeling
Defense Research Agency
Electronics Division
RSRE Malvern
St. Andrews Road
Great Malvern
Worcestershire WR14 3PS
United Kingdom
Phone: +44 684 895314
Fax: +44 684 894303
Internet: peeling%hermes...@relay.mod.uk
Good luck,
--
Pedro Nuno Cruz Diniz
--
Pedro Cruz Diniz Email:p...@inesc.inesc.pt Ph:+351-1-3100328 Fax:+351-1-525843
INESC - Instituto de Engenharia de Sistemas e Computadores
Rua Alves Redol, No. 9 /Apartado 10105
[Can you really do a DIANA-like thing for C? I'd think that the syntactic
messiness introduced by the preprocessor would make it awfully hard to
come up with anything that was simultaneously easier to handle than the
original source and didn't lose something significant in translation. -John]
Considering the mess you can generate with macros and #ifdefs, I think a
good C-DIANA would be very tough (if not impossible) to design. However,
I think this says more about the design (sic) of C(++) than it does about
DIANA like standards.
Reverse engineering a standard AST from a concrete syntax is always going
to be tough. However, the simple solution is to design the language
around the abstract syntax instead of the concrete syntax. This has many
benefits, one of which is a standard AST for "free".
This is a simple idea that is almost 30 years old, however, it seems that
the many budding language designers are either ignorant of it or choose to
ignore it for some reason. Either way, it is very depressing as it just
leads to the proliferation of (syntactically) badly designed languages.
Stephen J. Bevan be...@cs.man.ac.uk
Considering the mess you can generate with macros and #ifdefs, I think a
good C-DIANA would be very tough (if not impossible) to design. However,
I think this says more about the design (sic) of C(++) than it does about
DIANA like standards.
In his defense, Stroustrup (the designer of C++) has tried to invent
features that replace existing usages of macros and are syntactically
better-behaved. The most recent such feature is templates (generics).
The closest thing to a DIANA-like thing for C comes from ProCASE; I think
their product is called SmartSystem. They store a syntax tree, and they
also remember the mapping between macroexpansions and that syntax tree.
As far as I know this data structure is entirely proprietary and is not
directly accessible even to customers; they use it to drive their user
interface.
----
Sam Kendall ken...@CenterLine.COM
CenterLine Software Inc. (formerly Saber Software) uunet!saber!kendall
+1 617 498-3238 FAX: +1 617 868-6655