The background is this: on fast stack machines, two or more registers
hold the top-of-stack operands (the B5000 does so; the faster the
machine, the bigger the $price$ tag on the CPU and the more such
registers, with two as the minimum).
For my threaded bytecode interpreter prototype I've now developed the
following; the example is for one of Smalltalk's binary special
selectors, say #+ :
o eax has receiver, no push
o ebx has argument, no push
// make room for 3 oops: receiver, argument, method
o sub $12 -> esp
o call threadedBytecodeRoutine
..............................
o mov eax -> $12+esp // receiver
o mov ebx -> $8+esp // argument
// free for local use: $4+esp
o do some primitive; return success cc
..............................
o bcc around // skip the send on success; fall through if the primitive failed
// recall thisContext will now become parentContext
o call polymorphicSend // fills method at $4+esp
around:
// answer is in eax
o add $12 -> esp // pops into limbo: receiver, argument, method
Thus my threaded bytecode interpreter now has the same stack
structure/context layout as Moebius :) and pairs of sub/add can be
consolidated :)
The advantage for my threaded bytecode interpreter is that dynamically
generated native code is more compact when the top-n registers (receiver +
some arguments) are saved by the *called* routine and *never* need to be
restored :) Of course the polymorphicSend expects everything already on
the stack, and will clobber registers.
Furthermore, a method in Moebius begins with:
o push <- parentContext
o push <- vtable
o lea <- thisContext
and cleans both slots off the stack, together with any temps, upon return.
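The matching epilogue is then something like this (just a sketch in the
same notation, assuming 4-byte oops and numTemps temps):
o add $(numTemps*4 + 8) -> esp // drop temps, then vtable and parentContext
o ret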
This prolog/epilog can be done by my prototype's threaded bytecode methods
as well ;)
Always passing the top-n registers instead of pushing them before the
call clearly has an advantage for code compactness, don't you think? In a
similar fashion, native Moebius methods can spill registers (consolidated
pairs of sub/add which manage the room and integrate polymorphicSend's room).
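For example, two back-to-back sends can share one stack adjustment; the
inner add/sub pair simply cancels out (a sketch, where routineA/routineB
stand for any two threaded bytecode routines):
// before:
o sub $12 -> esp
o call routineA
o add $12 -> esp
o sub $12 -> esp
o call routineB
o add $12 -> esp
// after consolidating the adjacent add/sub pair:
o sub $12 -> esp
o call routineA
o call routineB
o add $12 -> esp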
Cheers
Klaus
I changed things a bit in the last version, by moving the preserved
stack pointer down the stack. This is done to conform with materialized
contexts and to ease access to the preserved stack pointer.
The method prologue code is the following:
thisContext stackPointer push. "preserve stack pointer"
self push. "push parent context"
"set current context pointer"
thisContext contextPointer: thisContext stackPointer.
"push vtable"
thisContext stackContextVTable push.
"reserve space for temps"
thisContext stackPointer: thisContext stackPointer - (numTemps * thisContext wordSize).
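Assuming the stack grows downward, and that after a push the stack
pointer points at the last pushed slot, the resulting frame layout is
(higher addresses first):
  preserved stack pointer
  parent context          <- contextPointer
  vtable
  temp 1 .. temp numTemps
  <- stackPointer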
I'm not sure if it will break your code (I hope not). In any case, a
good overview of these things is required before moving forward.
> Always passing the top-n registers instead of pushing them before the
> call clearly has an advantage for code compactness, don't you think? In a
> similar fashion, native Moebius methods can spill registers (consolidated
> pairs of sub/add which manage the room and integrate polymorphicSend's room).
>
It is too good to be true :)
In Moebius the 'callee' methods are those which are the target of a
polymorphic send - you don't need a CompiledMethod for anything else.
And inlined native methods are _inlined_, and therefore optimized to use
any registers and without any stack frames/decorations whatsoever.
The problem with polymorphic sends is that method arguments are
evaluated long before the actual send is performed (entering a compiled
method).
And between pushing the args and activating the method there can be a
lot of processing in between, like method lookup, which, of course,
requires some/lots of spare registers...
Also, consider an expression like:
(self foo: x baz) a: y zork
which makes the whole idea useless; look:
argReg1 <- x "cool"
call #baz
argReg1 <- self "cool"
call #foo:
push returned_value "bad"
argReg1 <- y "cool"
call #zork
argReg1 <- pop "bad"
call #a:
So, you can't avoid using the stack, and this example illustrates that
you need even more instructions for methods which expect the first
n arguments in register(s) instead of on the stack, compared to those
which expect everything on the stack.
It's easy to add 'magic' accessor lambdas like:
<cpu> regA: value
<cpu> regB: value
and
#regA / #regB for accessing, and have the compiler put them to use.
But then the question is: who will implement a clever register spilling
mechanism? :)
Also, most native methods (and even some ST methods, like
accessors) can be compiled without using stack frames at all - a
native code generator with a liveness analyzer can clearly tell whether
or not it needs more than the available CPU registers to perform all of
the method's required operations.
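For example, a simple getter needs no frame at all; it could come out as
something like this (a sketch; the register names, the
receiver-in-a-register convention and instVarOffset are all made up):
mov reg1 <- [receiverReg + instVarOffset] "load the instance variable"
ret "nothing to clean up"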
The only problem I see with such optimizations is debugging :)
You're asking to support debugging of anything & everything, and that
implies some limitations on the generated code.
> Cheers
> Klaus
--
Best regards,
Igor Stasenko AKA sig.
>> The problem with polymorphic sends is that method arguments are
>> evaluated long before the actual send is performed (entering a compiled
>> method).
>> And between pushing the args and activating the method there can be a
>> lot of processing in between, like method lookup, which, of course,
>> requires some/lots of spare registers...
>
> ... whose spillage/temps have room on the stack already ...
>
>> Also, consider an expression like:
>>
>> (self foo: x baz) a: y zork
>>
>> which makes the whole idea useless; look:
>
> ... [ snipped, problem disappeared by design, yes ? ] ...
>
>> So, you can't avoid using the stack, and this example illustrates that
>> you need even more instructions
>
> Have an example which uses more instructions? And did you mean how
> many, or how much?
>
I think you misunderstood the example.
Consider the program flow when all arguments lie on the stack:
<evaluate 1st argument (it can be result of sub-expression, which
requires calling an unknown number of methods)>
<push 1st argument>
<evaluate 2nd argument>
<push 2nd argument>
...
<do method lookup>
<call the method>
and for methods which expect the first n arguments in registers:
<evaluate 1st argument>
<push 1st argument>
<evaluate 2nd argument>
<push 2nd argument>
...
<do method lookup>
<restore first n arguments and put them in registers>
<call the method>
So, it is obvious that the second approach will require more instructions, isn't it?
>> for methods which expect the first
>> n arguments in register(s) instead of on the stack, compared to those
>> which expect everything on the stack.
>
> No, I have
>
> o room for method temps/register spillage
> o variable part for receiver/argument passing
>
> and I don't mix between them. Okay?
>
>> It's easy to add 'magic' accessor lambdas like:
>> <cpu> regA: value
>> <cpu> regB: value
>> and
>> #regA / #regB for accessing, and have the compiler put them to use.
>>
>> But then the question is: who will implement a clever register spilling
>> mechanism? :)
>
> Have a look at TCC's; it's not just clever but also sufficient ;)
>
> - http://bellard.org/tcc/tcc-doc.html#SEC32
>
Very good design :)
Yes, I'm confused. :)
What I think I'm sure of now: a polymorphic send requires all args on
the stack. So this hasn't changed.
Then the question is: what other kinds of sends do you have where you
would want some n arguments to reside in registers instead?
Ahh.. a special selector send.. well, Moebius makes no preference among
selectors, and I think PICs will obviously render such an optimization
redundant.
>
> But instead (an idea while discussing): we could have several
> incarnations of the method lookup routine, which do the mov receiver ->
> $n+esp {; mov argument -> $m+esp}, the same trick as with the threaded
> bytecode example and its special selector case; so code could be even
> more compact for dynamically generated polymorphicSends as well.
>
I can imagine that a method could have two entry points:
point1:
  "load arguments from the stack"
  mov reg1 <- [Sp + offset]
  mov reg2 <- [Sp + offset2]
point2:
  "at this entry point, arguments are in reg1 and reg2"
so the compiler can choose the more preferable entry point depending on
the call site.
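For example, at two different call sites (a sketch; point1/point2 are
just the labels of the two entries above):
call method.point1 "arguments were pushed on the stack"
call method.point2 "arguments were just evaluated into reg1 and reg2"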
But I think this is premature optimization :)
Also, another kind of optimization is for cases like:
obj foo.
obj bar.
Here, a compiler could reuse the receiver value pushed on the stack
after returning from the call to #foo, and instead of pushing the
receiver again it simply doesn't clean the stack, avoiding an extra
load-receiver & push pair.
Such code sequences can be transformed into cascades before entering the
low-level stages of compilation, so the optimization can be implemented
for cascades only.
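Schematically (a sketch, ignoring the selector and argument-count pushes
of a full send):
"before:"
push obj
call #foo
pop "clean the stack"
push obj "load & push the receiver again"
call #bar
pop
"after the cascade transform (obj foo; bar):"
push obj
call #foo "receiver slot is left on the stack"
call #bar "and reused here"
pop "clean up once"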
So, the point is to optimize this given portion of code, which does a
polymorphic send:
ProtoObject >> callBind: selector numberOfArguments: argCount
    "This method is inlined by the compiler"
    <native>
    | vtable fmTable method |
    <variable: #vtable knownAs: #VTable>
    <variable: #method knownAs: #CompiledMethod>
    <variable: #fmTable knownAs: #FixedMethodTable>
    self push.
    selector push.
    argCount push.
    vtable := self vtable.
    fmTable := vtable fixedMethodsTable.
    method := fmTable bindMethod.
    ^ method enter: 3
This method is inlined at each call site and pushes 3 arguments
(receiver, selector and number of arguments) before calling the #bind:
method.
Also, as we agreed before, since the result of binding could be any
object, at the next step the compiler should also inline a call to a
special #activate method (by using the fixedMethodsTable), which is
responsible for activating the given object as a method.
What I'm also thinking is that both things should be merged into a
single one, so the compiler invokes only a single method per call site
which does everything: lookup & bind, then sends #activate to the
result, including #doesNotUnderstand etc.
Which makes me think: how to name such a method?
performSend: selector numArgs: num
and
performSendToSuper: selector numArgs: num
and with PICs it should be:
performSend: selectorIndex numArgs: num
and
performSendToSuper: selectorIndex numArgs: num
where selectorIndex is the index of the PIC entry in the current method,
which can easily be retrieved by knowing the parent context and its
method.
So, instead of passing the selector we pass a PIC entry index, and then
can extract the selector and check the cached lookup value(s) before
doing a real lookup.
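In Smalltalk-ish pseudocode the merged method might look like this (just
a sketch: #literalAt:, #cacheLookup:, #lookupFor: and #activate: as used
here are assumptions, not settled API):
ProtoObject >> performSend: selectorIndex numArgs: num
    "the call-site-info object lives in the literals of the sending method"
    | callSite method |
    callSite := thisContext parentContext method literalAt: selectorIndex.
    "cache check; the policy depends on callSite's class, see below"
    method := callSite cacheLookup: self.
    method isNil ifTrue: [ method := callSite lookupFor: self ]. "miss: real lookup"
    ^ method activate: num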
Btw, some OT about PICs.. I'm really thinking: what if we make PIC
behavior separate from methods? Yes, I'm talking about creating an
instance of PICEntry (or name it as you like) for each call site, and
simply putting it in the method literals.
Then we can pick a slot in the fixedMethodsTable for them, and make the
cache checking flexible and separate from the method :)
As an example, if we go that way, then we could implement this method
for Symbol as well, which would stand for an uncached version of the
lookup.
Then the compiler should ask the class where the method is going to be
installed to provide a class for the call-site info, and create an
instance of it using a special method, like:
"in the compiler"
callSiteClass := sourceClass callSiteInfoClass.
callSiteLiteral := callSiteClass createCallSiteInfoFor: selector.
callSiteLiteralIndex := self putInLiterals: callSiteLiteral.
and then the code performing a send will be:
push receiver
push (literals at: callSiteLiteralIndex)
push numberOfArguments
call #performSend[toSuper] "call using the fixed table, i.e. non-polymorphic"
Now the #performSend method, before doing anything, will send
#cacheLookup: (with the receiver as argument) to the given
call-site-info object.
In this way we can have different kinds of caching:
Symbol - no caching
PICEntry - PIC
PICEntryAndGlobalCache - for doing a PIC check and then, if it fails,
checking a global cache object, which is held in its class var.
CacheForMultidimensionalPolySend - hello from Slate :)
... etc..
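For instance, a one-element PICEntry's check could be as simple as this
(a sketch; the instance variable names cachedVTable/cachedMethod are
made up):
PICEntry >> cacheLookup: receiver
    "answer the cached method on a vtable hit, nil on a miss"
    receiver vtable == cachedVTable ifTrue: [ ^ cachedMethod ].
    ^ nil
Symbol >> cacheLookup: receiver
    "no caching: always force the real lookup"
    ^ nil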
The extra cost of keeping call-site-info entries as separate objects
is 2 words per entry (the header) compared to imprinting the entries
inside a method. But think how much more flexibility we get because of
it.