backend question

Martin Doering

unread,

Nov 12, 2002, 2:14:18 PM11/12/02

to

Hi!

If I would like to write a native compiler for a toy language (just
for my personal interest), the hardest thing for me is to generate
code for a specific processor in the end. So my question is, if I
could skip this last step by just using some well know intermediate
format?

For Minix there is a amsterdam compiler kit, which seems to provide
something like this. Are there others? And is there something like
this not too huge in size? I just want to play around a bit. Yes, and
it should be free...

--
Martin Doering
[These days the most popular intermediate format is C. -John]

Diego Novillo

unread,

Nov 13, 2002, 12:14:35 PM11/13/02

to

On Tue, 12 Nov 2002, Martin Doering wrote:

> For Minix there is a amsterdam compiler kit, which seems to provide
> something like this. Are there others? And is there something like
> this not too huge in size? I just want to play around a bit. Yes, and
> it should be free...
>

You may also want to try creating a new front end to GCC using
treelang as a template. Current CVS snapshots should include it.

You'll find general information about writing front ends to GCC
in http://gcc.gnu.org/readings.html. In particular, chapter 14
of the COBOL front end project includes detailed information
about GCC front ends.

Diego.

Hannah Schroeter

unread,

Nov 13, 2002, 12:15:01 PM11/13/02

to

Hello!

Martin Doering <doer...@gmx.de> wrote:

>If I would like to write a native compiler for a toy language (just
>for my personal interest), the hardest thing for me is to generate
>code for a specific processor in the end. So my question is, if I
>could skip this last step by just using some well know intermediate

>format? ...

>[These days the most popular intermediate format is C. -John]

Yes. Another approach (been there, done that) is to parse your
toy language into Lisp data structures and to compile it using
(compile 'nil foo), heavily using macros to do the real transformation
work.

Kind regards,

Hannah.

Fergus Henderson

unread,

Nov 13, 2002, 12:19:35 PM11/13/02

to

"Martin Doering" <doer...@gmx.de> writes:

>If I would like to write a native compiler for a toy language (just
>for my personal interest), the hardest thing for me is to generate
>code for a specific processor in the end. So my question is, if I
>could skip this last step by just using some well know intermediate
>format?
>
>For Minix there is a amsterdam compiler kit, which seems to provide
>something like this. Are there others? And is there something like
>this not too huge in size? I just want to play around a bit. Yes, and
>it should be free...
>

>[These days the most popular intermediate format is C. -John]

Other alternatives include

- C--
See <www.cminusminus.org>.

- Java
- Java byte codes

- C#
- .NET IL
See <http://www.msdn.microsoft.com/net/>, <www.go-mono.com>,
and <http://www.southern-storm.com.au/portable_net.html>.

- Interfacing with the GCC back-end
See <gcc.gnu.org>; in particular see the "treelang" toy
language front-end.

- Interfacing with other compilers, such as VPO, MLRISC, or lcc.

--
Fergus Henderson <f...@cs.mu.oz.au> | "I have always known that the pursuit
The University of Melbourne | of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh> | -- the last words of T. S. Garp.

Joachim Durchholz

unread,

Nov 13, 2002, 12:21:07 PM11/13/02

to

John wrote:
> [These days the most popular intermediate format is C. -John]

Which doesn't mean it's particularly suitable... C gets you started
quickly, but for an intermediate format, it abstracts away the wrong
things in some places. This begins to bite if you're doing
concurrency, exceptions, fancy integer arithmetic, tail-call
elimination, or state machines.

I'm not sure how far C-- has progressed. The CVS changelog shows recent
activity, and there seem to be usable (if beta) executables available.
(I dimly remember there are more projects of that kind, but I don't have
any pointers right now. You might want to take a look at the gcc home
page, people have been working on isolating the code generation backend
for some years now, the results might have become usable.)

HTH
Joachim

t...@cs.ucr.edu

unread,

Nov 17, 2002, 11:20:16 PM11/17/02

to

Joachim Durchholz <joac...@gmx.de> wrote:
+ John wrote:
+ > [These days the most popular intermediate format is C. -John]
+
+ Which doesn't mean it's particularly suitable... C gets you started
+ quickly, but for an intermediate format, it abstracts away the wrong
+ things in some places. This begins to bite if you're doing
+ concurrency, exceptions, fancy integer arithmetic, tail-call
+ elimination, or state machines.

Standard C lacks an indirect jump (though one can be kludged a switch
statement). Anything that can be done in say MIPS assembly language
can be done in gcc, which has an indirect goto. Where necessary, one
can generates comments that preserve the memory of what got abstracted
away.

Tom Payne
[I didn't say that C was an ideal intermediate language, I said it was
popular. It does seem to have adequate performance for a lot of
translation applications. -John]

Joachim Durchholz

unread,

Nov 20, 2002, 3:17:18 PM11/20/02

to

t...@cs.ucr.edu wrote:
> Joachim Durchholz <joac...@gmx.de> wrote:
> +

> + [C] begins to bite if you're doing

> + concurrency, exceptions, fancy integer arithmetic, tail-call
> + elimination, or state machines.
>
> Standard C lacks an indirect jump (though one can be kludged a switch
> statement).

Agreed.

> Anything that can be done in say MIPS assembly language
> can be done in gcc, which has an indirect goto.

Indirect goto is good for state machines (and it's been used for that
purpose). It doesn't help with the other issues.

> Where necessary, one can generates comments that preserve the memory
> of what got abstracted away.

It's not abstracted away in the generated C code, it's abstracted away
by the language. For example, tail-call elimination goes like this:
For the last subroutine call in a routine, don't push the parameters
and do a JSR; instead, pop the return address and the parameters of
the caller, push the parameters of the callee, push the popped return
address, then do a direct jump to the subroutine. (You can optimize
the push-pop sequence if the new stack frame is not larger than the
old one.) Impossible in C since C abstracts away the call stack. (A
Good Thing for human-written programming languages. Unsuitable for
compiling functional languages where loops are represented as
recursions.)

You can work around a lot of limitations of C - it's going to be ugly
and inefficient, but it's possible. Tail-call elimination is not
feasible (unless gcc offers another extension to handle exactly this,
of course *g*).

Regards,
Joachim

David Chase

unread,

Nov 20, 2002, 3:23:06 PM11/20/02

to

On 17 Nov 2002 23:20:16 -0500, t...@cs.ucr.edu wrote:
> Joachim Durchholz <joac...@gmx.de> wrote:
> + Which doesn't mean it's particularly suitable... C gets you started
> + quickly, but for an intermediate format, it abstracts away the wrong
> + things in some places. This begins to bite if you're doing
> + concurrency, exceptions, fancy integer arithmetic, tail-call
> + elimination, or state machines.

> Standard C lacks an indirect jump (though one can be kludged a switch
> statement). Anything that can be done in say MIPS assembly language
> can be done in gcc, which has an indirect goto. Where necessary, one
> can generates comments that preserve the memory of what got abstracted
> away.

Standard C doesn't give you enough control to write a precise garbage
collector (one that can see all the pointers, and exactly all the
pointers), nor does it give you good control of things like rounding
modes. Standard C also doesn't give you the ability to move your
execution stack (if, for instance, it is too small). Standard C also
gives you less than adequate control of floating point.

David Chase

Fermin Reig

unread,

Nov 24, 2002, 1:18:04 AM11/24/02

to

"David Chase" <ch...@world.std.com> writes:

> On 17 Nov 2002 23:20:16 -0500, t...@cs.ucr.edu wrote:
> > Joachim Durchholz <joac...@gmx.de> wrote:
> > + Which doesn't mean it's particularly suitable... C gets you started
> > + quickly, but for an intermediate format, it abstracts away the wrong
> > + things in some places. This begins to bite if you're doing
> > + concurrency, exceptions, fancy integer arithmetic, tail-call
> > + elimination, or state machines.
>
> > Standard C lacks an indirect jump (though one can be kludged a switch
> > statement). Anything that can be done in say MIPS assembly language
> > can be done in gcc, which has an indirect goto. Where necessary, one
> > can generates comments that preserve the memory of what got abstracted
> > away.
>
> Standard C doesn't give you enough control to write a precise garbage
> collector (one that can see all the pointers, and exactly all the
> pointers),

However, see Fergus Henderson's recent paper:

Accurate Garbage Collection in an Uncooperative Environment
Proceedings of the 2002 International Symposium on Memory Management,
Berlin, Germany, June 2002, pages 150-156.

Available at www.cs.mu.oz.au/research/mercury/information/papers.html

Fermín Reig

felix

unread,

Nov 24, 2002, 1:21:29 AM11/24/02

to

Joachim Durchholz wrote:
>
> You can work around a lot of limitations of C - it's going to be ugly
> and inefficient, but it's possible. Tail-call elimination is not
> feasible (unless gcc offers another extension to handle exactly this,
> of course *g*).

Tail-call elimination is no problem in C (or even Standard C). There
are several ways of implementing it (as shown by Steele, or Baker).
Wether that's ugly, is in the eye of the beholder. I find
machine-generated C much more appealing than x86 assembly language.
Of course it's less efficient than a direct, native code translation
scheme. But pure, raw efficieny isn't *that* important
anymore...otherwise everybody would still code in Assembler.

Besides, then there is point about the costs of porting a compiler to
new hardware. An important issue, IMHO.

On the other hand several people seem to work towards providing TCO
for GCC (I just read one or two papers, though - I don't know what the
current state of affairs is).

cheers,
felix

t...@cs.ucr.edu

unread,

Nov 24, 2002, 1:25:18 AM11/24/02

to

Joachim Durchholz <joac...@gmx.de> wrote:
+ t...@cs.ucr.edu wrote:

+> Joachim Durchholz <joac...@gmx.de> wrote:
+> +
+> + [C] begins to bite if you're doing

+> + concurrency, exceptions, fancy integer arithmetic, tail-call
+> + elimination, or state machines.
+>
+> Standard C lacks an indirect jump (though one can be kludged a switch
+> statement).
+
+ Agreed.
+
+> Anything that can be done in say MIPS assembly language
+> can be done in gcc, which has an indirect goto.
+
+ Indirect goto is good for state machines (and it's been used for that
+ purpose). It doesn't help with the other issues.
+
+> Where necessary, one can generates comments that preserve the memory
+> of what got abstracted away.
+
+ It's not abstracted away in the generated C code, it's abstracted away
+ by the language. For example, tail-call elimination goes like this:
+ For the last subroutine call in a routine, don't push the parameters
+ and do a JSR; instead, pop the return address and the parameters of
+ the caller, push the parameters of the callee, push the popped return
+ address, then do a direct jump to the subroutine. (You can optimize
+ the push-pop sequence if the new stack frame is not larger than the
+ old one.) Impossible in C since C abstracts away the call stack. (A
+ Good Thing for human-written programming languages. Unsuitable for
+ compiling functional languages where loops are represented as
+ recursions.)

The assembly language output of, say, a MIPS-targeted compiler
translates almost one-for-one into the following subset of GCC, which
I call "C-minor":
* three-address ALU instructions, e.g., "x = y + z"
* load and store instructions, e.g.:
"x = p[i];"
"p[i] = x;"
* unconditional jumps, e.g., "goto L5;"
* conditional jumps, e.g., "if( x ) goto L5;"
* indirect jumps, e.g., "goto* p;" which can be kludged via
standard switch statements.
The widths of three-address and of load/store instructions can be
controlled via casts where necessary.

+ You can work around a lot of limitations of C - it's going to be ugly
+ and inefficient, but it's possible. Tail-call elimination is not
+ feasible (unless gcc offers another extension to handle exactly this,
+ of course *g*).

The assembly language code for managing the run-time stack consists of
instructions that increment and decrement the stack pointer, which
translate to C instructions such as: "SP = SP + size;" If you compile
that statement with a typical C compiler, it might increment something
other than the register you have in mind, e.g., it might increment a
memory location. But that's a matter of implementation.

When you generate MIPS assembly code as your target langage, you must
assemble it with an assembler that interprets SP according to your
intent. The same holds if you translate to the C language,
i.e. compiling the target code with an ordinary conforming C-compiler
will produce correct but inefficient behavior. (The efficiency will
be considerably better than instruction-by-instruction
interpretation.) It's not difficult to write a simple translator that
translates C-minor to almost one-for-one to assembly language that
assembles to machine code of normal efficiency. That's particularly
true for simple RISC architectures -- achieving efficiency with more
complex architectures requires a more complex translator.

Tom Payne
[If you know exactly what compiler and target machine you're compiling
to, you can indeed do lots of gross but effective hacks in C. But if
you want your translated code to be at all portable, you can't try to
go mucking with the stack. -John]

Mark

unread,

Nov 24, 2002, 1:26:55 AM11/24/02

to

joachim....@gmx.de writes:
>> + [C] begins to bite if you're doing
>> + concurrency, exceptions, fancy integer arithmetic, tail-call
>> + elimination, or state machines.

>Indirect goto is good for state machines (and it's been used for that
>purpose). It doesn't help with the other issues.

Since a program is, itself, already a state machine (goto label =
state, goto = state transition, etc.), there's generally no need to
put another state machine on top of that.

So, the need for an explicit state machine (along with the apparent
need for indirect gotos) is generally a symptom of another more
fundamental issue: the lack of concurrency at the language level.

For, that's precisely where you have to concern yourself with bringing
the implicit state machine structure of the program up to the
forefront: as the pretext to chopping up processes into snippets so
they can be interleaved and threaded together by hand.

In fact, I've learned long ago to put this in a hair trigger. As soon
as you even hear 'state machine', you immediately jump 'anh anh!
You're trying to do concurrency by hand. Stop that.' (Slap on the
wrist).

So, the actual problem is not the lack of facility for indirect gotos
in the context of doing state machines. It's the lack of concurrency.

Fergus Henderson

unread,

Nov 24, 2002, 1:23:02 AM11/24/02

to

"David Chase" <ch...@world.std.com> writes:

>Standard C doesn't give you enough control to write a precise garbage
>collector (one that can see all the pointers, and exactly all the
>pointers),

It does (albeit with a potentially significant performance cost).
See my recent paper at ISMM'02 [1].

>Standard C also gives you less than adequate control of floating point.

C99 offers some improvements in this area.

References:

[1] Fergus Henderson,
"Accurate garbage collection in an uncooperative environment".

Proceedings of the 2002 International Symposium on Memory Management,
Berlin, Germany, June 2002, pages 150-156.

<http://www.cs.mu.oz.au/research/mercury/information/papers.html#high_level_gc>

Nick Maclaren

unread,

Nov 24, 2002, 6:36:29 PM11/24/02

to

Fergus Henderson <f...@cs.mu.OZ.AU> wrote:
>"David Chase" <ch...@world.std.com> writes:
>
>>Standard C doesn't give you enough control to write a precise garbage
>>collector (one that can see all the pointers, and exactly all the
>>pointers),
>
>It does (albeit with a potentially significant performance cost).
>See my recent paper at ISMM'02 [1].

Well, I have looked at it but can't say that I have studied it. I
think that you have assumed quite a few extra constraints - mainly of
the form that the C program is half sane!

For example, it is perfectly conforming for a program to treat a
pointer variable as an array of unsigned char, encrypt that and write
it to disk, overwrite the location, do a lot of memory allocation and
deallocation, read in and decrypt the pointer, use it, and expect the
pointed-to data to still be there!

I don't believe that is a soluble problem, except by never garbage
collecting just because there are no pointers to an object. You CAN
check that a pointer has been abused (e.g. reused after the location
has been freed), but you have to rely on calls to free to tell you
when the space may be released.

While I have never seen the encryption aspect, I have seen program
designs where the only pointer to an in-core construct was held on
disk, under some circumstances.

>>Standard C also gives you less than adequate control of floating point.
>
>C99 offers some improvements in this area.

For some meaning of the word "improvements". Many of us don't think
that they are improvements, even for the purposes for which they are
intended. But there are certainly extra features, intended for this
purpose.

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679

Joachim Durchholz

unread,

Nov 24, 2002, 6:38:37 PM11/24/02

to

felix wrote:
> On the other hand several people seem to work towards providing TCO
> for GCC (I just read one or two papers, though - I don't know what the
> current state of affairs is).

Last time I heard anything, its state was "it works but the
opportunities for actually applying the optimization are quite rare". It
was a long list of requirements that the C function should fulfill.
The one keyword that stuck with me was setjmp/longjmp, but that was just
one entry in a far-too-long-for-my-taste list of potential TCO inhibitors.

Regards,
Joachim

Nick Maclaren

unread,

Nov 26, 2002, 10:07:36 PM11/26/02

to

Unless I have missed something fundamental, that restriction is only
to the use of setjmp in a function being optimised to remove its
return. Not a big deal. There is a much nastier one to do with never
passing the address of a local variable to an external function.

Fergus Henderson

unread,

Dec 1, 2002, 10:40:40 PM12/1/02

to

"Nick Maclaren" <nm...@cus.cam.ac.uk> writes:

>Fergus Henderson <f...@cs.mu.OZ.AU> wrote:
>>"David Chase" <ch...@world.std.com> writes:
>>
>>>Standard C doesn't give you enough control to write a precise garbage
>>>collector (one that can see all the pointers, and exactly all the
>>>pointers),
>>
>>It does (albeit with a potentially significant performance cost).
>>See my recent paper at ISMM'02 [1].
>
>Well, I have looked at it but can't say that I have studied it. I
>think that you have assumed quite a few extra constraints - mainly of
>the form that the C program is half sane!

Certainly. But we were talking about using C as a target language.
For that case, the high-level language compiler has complete control
over what C code is generated.

Note that even assembler doesn't give you enough control to write a
precise garbage collector that will work for arbitrary assembler programs!
So C is no worse than assembler in that respect.

Fergus Henderson

unread,

Dec 1, 2002, 10:41:08 PM12/1/02

to

"Nick Maclaren" <nm...@cus.cam.ac.uk> writes:

>"Joachim Durchholz" <joac...@gmx.de> writes:
>|> felix wrote:
>|> > On the other hand several people seem to work towards providing TCO
>|> > for GCC (I just read one or two papers, though - I don't know what the
>|> > current state of affairs is).
>|>
>|> Last time I heard anything, its state was "it works but the
>|> opportunities for actually applying the optimization are quite rare". It
>|> was a long list of requirements that the C function should fulfill.
>|> The one keyword that stuck with me was setjmp/longjmp, but that was just
>|> one entry in a far-too-long-for-my-taste list of potential TCO inhibitors.
>
>Unless I have missed something fundamental, that restriction is only
>to the use of setjmp in a function being optimised to remove its
>return. Not a big deal. There is a much nastier one to do with never
>passing the address of a local variable to an external function.

At least with that one it is possible to work around it, by changing
the generated C code so that all such local variables are put inside a
local scope, and all tail calls are done by first using a "goto" to
jump outside that scope, and then doing the tail call. But this would
get very messy.

The worst restriction, IMHO, is that GCC's tail call optimization doesn't
work for position independant code.

However, there have been some improvements. For example, Andreas
Bauer has been doing some work on tail calls in GCC; with the support
of other GCC developers he has recently been able to put in place the
infrastructure needed to do tail call optimization for calls via
function pointers, and has implemented this for x86.

Nick Maclaren

unread,

Dec 3, 2002, 12:39:36 AM12/3/02

to

"Fergus Henderson" <f...@students.cs.mu.OZ.AU> writes:
|> "Nick Maclaren" <nm...@cus.cam.ac.uk> writes:
|> >Fergus Henderson <f...@cs.mu.OZ.AU> wrote:
|> >>"David Chase" <ch...@world.std.com> writes:
|> >>
|> >>>Standard C doesn't give you enough control to write a precise garbage
|> >>>collector (one that can see all the pointers, and exactly all the
|> >>>pointers),
|> >>
|> >>It does (albeit with a potentially significant performance cost).
|> >>See my recent paper at ISMM'02 [1].
|> >
|> >Well, I have looked at it but can't say that I have studied it. I
|> >think that you have assumed quite a few extra constraints - mainly of
|> >the form that the C program is half sane!
|>
|> Certainly. But we were talking about using C as a target language.
|> For that case, the high-level language compiler has complete control
|> over what C code is generated.

Then I am puzzled by the original remarks. Any general purpose
language can be used in that way - Fortran II and 66 were, heavily.
As with mathematical Turing machines, any general purpose language can
be compiled into any other, with some niggles about details.

The only question is how many of the features have to be emulated,
which affects the overheads.

|> Note that even assembler doesn't give you enough control to write a
|> precise garbage collector that will work for arbitrary assembler programs!
|> So C is no worse than assembler in that respect.

Agreed.

t...@cs.ucr.edu

unread,

Dec 7, 2002, 8:00:58 PM12/7/02

to

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
+ "Fergus Henderson" <f...@students.cs.mu.OZ.AU> writes:
[...]
+ |> Certainly. But we were talking about using C as a target language.
+ |> For that case, the high-level language compiler has complete control
+ |> over what C code is generated.
+
+ Then I am puzzled by the original remarks. Any general purpose
+ language can be used in that way - Fortran II and 66 were, heavily.
+ As with mathematical Turing machines, any general purpose language can
+ be compiled into any other, with some niggles about details.
+
+ The only question is how many of the features have to be emulated,
+ which affects the overheads.

The commonly used three-address load/store subset of C is very close
to quadruple-based intermediate code, which AFAIK translates as
efficiently as other intermediate notations to most real
architectures.

Tom Payne
[For expressions, sure. But I think the issues have more to do with
control structures, particularly interesting ones like tail calls and
non-local gotos. -John]