Compilation of stack based vm-code

Mat

unread,

May 9, 2011, 1:38:57 PM5/9/11

to

Hello,

I'm working at current on a simple compiler for a vm of my own and try
to compile vm-code in respect to there stack state. At current the
compiler caches up to three data-stack entries in some register and
replacing stack-accesses though direct register operations for states
with less than four elements. My problem is to find an elegant way to
update the state for code which involves accesses to more then three
stack-elements which is compiled with only tos cached. At current, all
registers except tos are pushed and popped at demand but that can't be
an optimal way to go. Probably it would be a good idea to transform
the vm code in some kind of secondary IL which consumes max. 2 stack
elements but not sure.

Stephen Pelc

unread,

May 9, 2011, 1:55:11 PM5/9/11

to

On Mon, 9 May 2011 10:38:57 -0700 (PDT), Mat <dam...@web.de> wrote:

>I'm working at current on a simple compiler for a vm of my own and try
>to compile vm-code in respect to there stack state.

See Peter Knaggs thesis:
http://www.rigwit.co.uk/thesis/index.html

Also papers about the VFX code generator:
http://www.complang.tuwien.ac.at/anton/euroforth/ef98/pelc98.pdf
http://www.mpeforth.com/arena/insidevfx.pdf

Most of this is quite old now. In general, you'll find that caching
more than one register at basic block boundaries is not a good idea.

Stephen

--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Mat

unread,

May 9, 2011, 2:17:36 PM5/9/11

to

On 9 Mai, 19:55, stephen...@mpeforth.com (Stephen Pelc) wrote:

> On Mon, 9 May 2011 10:38:57 -0700 (PDT), Mat <damb...@web.de> wrote:
> >I'm working at current on a simple compiler for a vm of my own and try
> >to compile vm-code in respect to there stack state.
>
> See Peter Knaggs thesis:
> http://www.rigwit.co.uk/thesis/index.html
>
> Also papers about the VFX code generator:
> http://www.complang.tuwien.ac.at/anton/euroforth/ef98/pelc98.pdf
> http://www.mpeforth.com/arena/insidevfx.pdf
>
> Most of this is quite old now. In general, you'll find that caching
> more than one register at basic block boundaries is not a good idea.
>
> Stephen
>
> --

> Stephen Pelc, stephen...@mpeforth.com

> MicroProcessor Engineering Ltd - More Real, Less Time
> 133 Hill Lane, Southampton SO15 5AF, England
> tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
> web:http://www.mpeforth.com- free VFX Forth downloads

Thanks, the papers are very helpful.

Pablo Hugo Reda

unread,

May 9, 2011, 3:34:03 PM5/9/11

to

I'm working in a compiler too, don't work full now but I try a lot of
ideas.

My current state is keep the stack form for every word compiled,
I start with form 0, TOS in eax, esi in pointer to NOS
form 1, when TOS and NOS in registers
form 2, when TOS, NOS and next value in registers.
the compiler in stack form 0 is trivial but the code generated is the
worst.
the main idea is gain stack form with registers.

I make a static stack diagram for a word to compile and record the
stack cell activity (is used for memory access?, is only reading this
value?, etc)
then with the cell stack activity, assign a register free and change
the stack form for this word.
last compile the word from the stack form
when have a call, I not decide now how work; or change the "interface"
for use registers o generate a pre- and post- call movs

again, I not have this full implementation of this but I near finish.

the main dificult is the x86 limitations, ej, few register, idiv,
imul need eax,edx free, shr/l not constant need ecx, etc..
this add noise to implementation.

I don't remember who forth use a dual register for TOS and NOS, (eax
or ebx is TOS for example) and change this when compiling for better
use of register, nice idea.

Rod Pemberton

unread,

May 9, 2011, 8:05:38 PM5/9/11

to

"Mat" <dam...@web.de> wrote in message
news:96bd78e6-b9f6-46bb...@l18g2000yql.googlegroups.com...

I'm not entirely clear on what you're asking, if anything. You've described
your problem...

Ron Cain's SmallC uses two registers and a stack. It's structure is very
similar to Forth with a stack and TOS and NOS in registers. However, the
way the stack and/or registers is used is different from Forth. The two
registers are separate from the stack, unlike TOS and NOS which are part of
the stack.

The example(s) below shows "M=L+5;" 1) in C, 2) in 8086 assembly as compiled
by a version of Ron Cain's SmallC, 3) and as 8086 converted to the Forth
equivalent via a modified version of that version of SmallC. It's *not*
what a _person_ might code in Forth: 5 L @ + M !

SmallC uses two registers AX and BX, and a stack as it's execution model.
So, it's almost a stack only machine, like Forth. If you cache TOS and NOS
in registers, then it's almost identical. The addresses of the C variables
M or L from "M=L+5" are stored on the C stack. Therefore, their addresses
are an offset from the stack pointer: SP. SP is copied into BP due to
assembly addressing-mode restrictions of 16-bit x86.

The equivalent Forth code simulates x86 registers AX and BX by *storing*
them to Forth TOS and NOS. That requires the DROP when storing AX (into
TOS). I.e., they are not stacked, but stored. NIP and TUCK work as POP BX
(from NOS) and PUSH AX (to TOS), respectively. They too are not pushing and
popping like Forth, but storing into the stack like x86 assembly. M and L
in the Forth code are Forth VARIABLEs.

C code
---
M=L+5;

SmallC x86:
---
LEA AX, [BP+offset_of_M]
PUSH AX
LEA AX, [BP+offset_of_L]
MOV BX,AX
MOV AX,[BX]
PUSH AX
MOV AX,5
POP BX
ADD AX,BX
POP BX
MOV [BX],AX

SmallC produces truly horrid x86 code from C. But, it works. It'd be much
shorter if hand-coded. And, it's only using two registers and a stack to do
the job.

Forth equivalent emitted via modified SmallC:
---
DROP
M
TUCK
DROP
L
@
TUCK
DROP
5
NIP
OVER
+
NIP
SWAP
!

The converted Forth code is also horrid, IMO. I.e., versus: 5 L @ + M !
But, it works. Remember, the strange used of DROPs are because we are
storing, not stacking, the registers. I.e., we've got to clear out what was
there, and *not* shift the stack.

Unfortunately, some here may not recognize which line corresponds to which
line, so I'll attempt to present them side-by-side, as best they fit, with
some comments. The C code and/or comments is in C comments /* ... code ...
*/ The Forth code is in Forth comments ( ... code ... ) It's best if you
can read the C code or comments, then the 8086 assembly will be easier to
follow.

/* M=L+5; */

LEA AX, [BP+offset_of_M]
/* &M - get address of M */
( DROP M )

PUSH AX
/* save M's address onto a stack */
( TUCK )

LEA AX, [BP+offset_of_L]
/* &L - get address of L */
( DROP L )

MOV BX,AX
MOV AX,[BX]
/* *(&L) - dereference L's address to get L's value */
( @ )

PUSH AX
/* save L's value */
( TUCK )

MOV AX,5
/* get constant 5 */
( DROP 5 )

POP BX
/* retrieve M's value */
( NIP )

ADD AX,BX
/* add 5 and L's value, e.g., 1 (one) */
( OVER + )

POP BX
/* retrieve M's address */
( NIP )

MOV [BX],AX
/* save value into M: (*(&M))=6; */
( SWAP ! )

HTH,

Rod Pemberton

Colin MacIntyre

unread,

May 10, 2011, 2:09:49 AM5/10/11

to

On Tuesday, May 10, 2011 3:34:03 AM UTC+8, Pablo Hugo Reda wrote:
> I don't remember who forth use a dual register for TOS and NOS, (eax
> or ebx is TOS for example) and change this when compiling for better
> use of register, nice idea.

Haha, that's FreeForth, I was just studying ff.asm last night. But I can't quite figure out how and why he did it that way? He sets some kind of swap bit, and I just get confused about it.

Colin

Nils M Holm

unread,

May 10, 2011, 6:54:07 AM5/10/11

to

Mat <dam...@web.de> wrote:
> I'm working at current on a simple compiler for a vm of my own and try

> to compile vm-code in respect to there stack state. [...]

I have used a stack-based VM in one of my compilers, too,
and I have come to the conclusion that emulating s stack
machine on a non-stack machine is a pointless exercise.
So later versions of the compiler attempt to synthesize
reasonable register-machine code from the stack-based
intermediate form.

Given the expression

r = a + b + c

which would have an intermediate form like

LOAD a; LOAD b; ADD; LOAD c; ADD; STORE r

the first (naive) version would generate code like that
on the lefthand side (where A is the accumulator of the
real CPU and X is an auxilary register of the CPU). It
can be easily "optimized" to the code on the right by
synthesizing "mov X,A" from "push A; pop X" and
eliminating redundant push/pop sequences:

load A,a load A,a
push A push A
load A,b load A,b
push A move X,a
pop X
pop A pop A
add A,X add A,x
push A push A
load A,c load A,c
push A move X,a
pop X
pop A pop A
add A,X add A,X
push A
pop A
store r,A store r,A

At this point, caching the TOS seems tempting, but I decided
to go a different route and synthesize instructions using
the various VLSI addressing modes instead, i.e. generating

mov A,a instead of mov A,a
add A,b mov X,b
add A,X

So the final generator emitted the following code for
r = a + b + c:

load A,a
add A,b
add A,c
store r,A

I also played with stack-based a register allocator, which
spilled intermediate values to registers instead of working
with an accumulator, which would generate the following code
for

r = a - b * c

accumulator-based register-allocating
load A,a load A1,a
push A
load A,b load A2,b
mul A,c mul A2,c
pop X
swap A,X
sub A,X sub A1,A2

The entire genesis of the back end is described in detail
in my book "Lightweight Compiler Techniques", which is now
in the public domain and can be fetched as a PDF copy from
my home page (see "obsolete CS textbooks"). The book archive
also contains the complete compiler with 8086/i386/C/VM
backends, VM interpreter, VM code optimizer, etc.

Maybe it's worth a look.

--
Nils M Holm < n m h @ t 3 x . o r g > www.t3x.org
May all beings be happy, peaceful, and free from suffering!

Alex McDonald

unread,

May 10, 2011, 6:57:09 AM5/10/11

to

On May 10, 1:05 am, "Rod Pemberton" <do_not_h...@noavailemail.cmm>
wrote:
> "Mat" <damb...@web.de> wrote in message

If you eliminate all the stack shuffling (and there are also papers
that cover it on Anton Ertl's website), then the code requires no
stack and one register that is not TOS or NOS (or, at worst a PUSH and
POP if there's no free register). For the x86 two address operand
opcodes it can be decomposed to:

5 mov reg, <immed> 5
L @ + add reg, <addr L>
M ! mov <addr M>, reg

or possibly

L @ mov reg, <addr L>
5 + add reg, <immed> 5
M ! mov <addr M>, reg

Either way, if the code generator is smart all the stack shuffling
should disappear, and it should not require changing the cached TOS
and NOS. It's my suspicion that VFX is smart enough to do this. For
one; there are probably others. My code generator isn't one of them.

Mat

unread,

May 10, 2011, 7:05:50 AM5/10/11

to

On 9 Mai, 21:34, Pablo Hugo Reda <pablor...@gmail.com> wrote:
> I'm working in a compiler too, don't work full now but I try a lot of
> ideas.

Always nice to read someone is working on new concepts. Let us shear
our experiences.

> My current state is keep the stack form for every word compiled,
> I start with form 0, TOS in eax, esi in pointer to NOS
> form 1, when TOS and NOS in registers
> form 2, when TOS, NOS and next value in registers.
> the compiler in stack form 0 is trivial but the code generated is the
> worst.
> the main idea is gain stack form with registers.

My idea was similar but instead of compiling conventional forth alike
vm instructions I had defined a lousy VLIW inspired opcode for a misc
style instruction set (4 bit, 16 instructions), combining four of them
into a slot and bundle up to 8 instructions into a word (so with a
word-size of 64 bit this means an opcode holding 16 instructions). The
interpreter fetches each opcode and execute these bundles in a
"software pipeline". Because the stack-state is updated on the fly the
compiler don't need recalculate it for compilation. It is simply some
kind of trace compiler where each trace is identical to an opcode.
Compiled code extend the instruction set. Quite simple.

For register transformation the compiler replacing instruction
combinations of each opcode with register-based machine-code though a
branch table (pattern-matching) for states < 4, e.g:

LiLi 80 2 MulMul | SwapLi 2 MulAdd | LiAdd #&B8000 Rt-

compiles to:

initial stack-state: 2
MOV RBX, 80 ; LiLi + MulMul
SHL RBX, 1
IMUL RBX
XCHG RAX, [RSP] ; SwapLi + MulAdd
SHL RAX, 1
POP RDX
ADD RAX, RDX
ADD RAX, 0xB8000 ; LiAdd
RET
resulting stack-state: 1

This code isn't tested, My compiler just generate assembly listings at
current and it can be some instructions don't work as desired because
the x86-64 architecture is new to me.

two disadvantages are:

1.) Pattern-matching is code ineffective this way (large branch-table)
2.) The compiler must insert code for stack synchronisation (states >
3)

> I make a static stack diagram for a word to compile and record the
> stack cell activity (is used for memory access?, is only reading this
> value?, etc)
> then with the cell stack activity, assign a register free and change
> the stack form for this word.
> last compile the word from the stack form
> when have a call, I not decide now how work; or change the "interface"
> for use registers o generate a pre- and post- call movs

Ok, so you interpret the stack-state at compile time. Isn't this time
consuming ?

> again, I not have this full implementation of this but I near finish.
>
> the main dificult is the x86 limitations, ej, few register, idiv,
> imul need eax,edx free, shr/l not constant need ecx, etc..
> this add noise to implementation.

There exist a solution: Transform stack-based into CPS style code with
consumes max. 2 stack elements which can be statical assigned. I think
some functional programming languages profit from this representation
and a functional forth dialect has one big advantage: Because of its
more or less side-effect free nature the compiler can paralleling(?)
the code without much effort and CPS ensuring tail-call optimisation
for free.

> I don't remember who forth use a dual register for TOS and NOS, (eax
> or ebx is TOS for example) and change this when compiling for better
> use of register, nice idea.

as otherwise take a look at freeforth.

-Mat.

Alex McDonald

unread,

May 10, 2011, 7:13:38 AM5/10/11

to

Hmm, this;

L @ 5 + M !

gets from my optimizer (with a cached TOS);

() push reg

L @ mov reg, <addr L>
5 + add reg, <immed> 5
M ! mov <addr M>, reg

() pop reg

Not too shabby. The same isn't true of 5 L @ + M ! , where I get

() push reg

5 mov reg, <immed> 5

() push reg

L @ mov reg, <addr L>

+ add reg, <sp> \ stack ref to pushed 5

M ! mov <addr M>, reg

() pop reg
() pop reg

It's too straight line in dealing with the constructs.

Mat

unread,

May 10, 2011, 7:15:26 AM5/10/11

to

On 10 Mai, 12:54, Nils M Holm <news2...@t3x.org> wrote:
> snip

> The entire genesis of the back end is described in detail
> in my book "Lightweight Compiler Techniques", which is now
> in the public domain and can be fetched as a PDF copy from
> my home page (see "obsolete CS textbooks"). The book archive
> also contains the complete compiler with 8086/i386/C/VM
> backends, VM interpreter, VM code optimizer, etc.
>
> Maybe it's worth a look.

I will take a look, thanks.

Anton Ertl

unread,

May 10, 2011, 7:23:53 AM5/10/11

to

Mat <dam...@web.de> writes:
>Hello,
>
>I'm working at current on a simple compiler for a vm of my own and try
>to compile vm-code in respect to there stack state. At current the
>compiler caches up to three data-stack entries in some register and
>replacing stack-accesses though direct register operations for states
>with less than four elements. My problem is to find an elegant way to
>update the state for code which involves accesses to more then three
>stack-elements which is compiled with only tos cached. At current, all
>registers except tos are pushed and popped at demand but that can't be
>an optimal way to go.

As long as you have enough registers, it's relatively easy: load the
stack item from memory into a register when it is first accessed, and
store it (if it has been changed) when it is last accessed. Do the
actual operations between registers.

It gets a little trickier if you do not have sufficient registers. In
that case a good solution is to free the register of a stack item
whose next access is furthest in the future. If you do not want to
perform this liveness analysis, an approximation is to free the
register of the deepest stack element, but then you have to provide
for the case where this stack element is used in the next operation.

You free a register by saving its contents to memory if it has been
changed or if its home location has been changed (after a stack
manipulation word), or by doing nothing if the memory copy is
up-to-date.

> Probably it would be a good idea to transform
>the vm code in some kind of secondary IL which consumes max. 2 stack
>elements but not sure.

Yes, if you only have three registers, that's a good idea.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2010: http://www.euroforth.org/ef10/

Pablo Hugo Reda

unread,

May 10, 2011, 10:32:56 AM5/10/11

to

>
> Ok, so you interpret the stack-state at compile time. Isn't this time
> consuming ?
>

i'm sorry for my bad english, it's not in interpret-time, it's only
for compile..see this image for this analysis

http://reda4.googlecode.com/svn/trunk/screenshot/r4ana.png

the entire proyect is open source and you can see in googlecode. If
search the code for make this, is vtstack.txt (virtual stack)
http://code.google.com/p/reda4/source/browse/trunk/r4/System/vtstack.txt

Tell us when upload you proyect,
I try to read every free forth in the web !!

Stephen Pelc

unread,

May 10, 2011, 11:10:00 AM5/10/11

to

On Tue, 10 May 2011 03:57:09 -0700 (PDT), Alex McDonald
<bl...@rivadpm.com> wrote:

>Either way, if the code generator is smart all the stack shuffling
>should disappear, and it should not require changing the cached TOS
>and NOS. It's my suspicion that VFX is smart enough to do this. For
>one; there are probably others.

You suspicion is correct:
variable l ok
variable m ok
: t1 5 l @ + m ! ; ok
: t2 l @ 5 + m ! ; ok
dis t1
T1
( 004C6E30 8B15BCE04500 ) MOV EDX, [0045E0BC]
( 004C6E36 83C205 ) ADD EDX, 05
( 004C6E39 8915C0E04500 ) MOV [0045E0C0], EDX
( 004C6E3F C3 ) NEXT,
( 16 bytes, 4 instructions )
ok
dis t2
T2
( 004C6E60 8B15BCE04500 ) MOV EDX, [0045E0BC]
( 004C6E66 83C205 ) ADD EDX, 05
( 004C6E69 8915C0E04500 ) MOV [0045E0C0], EDX
( 004C6E6F C3 ) NEXT,
( 16 bytes, 4 instructions )
ok

We have a compile-time stack model with some form of data typing to do
this. Compilation speed is not as badly affected as one migh imagine,
although it is noticably slower than unoptimised code. On a big
application we produce about 500kb/sec of binary on a reasonable box.

Stephen

--
Stephen Pelc, steph...@mpeforth.com

MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691