The background is this: on fast stack machines, two or more registers
hold the top-of-stack operands (the B5000 does so; the faster the
machine, the bigger the $price$ tag on the CPU and the more such
registers, with two as the minimum).
For my threaded bytecode interpreter prototype I've now developed the
following; the example is for one of Smalltalk's binary special
selectors, say #+ :
o eax has receiver, no push
o ebx has argument, no push
// make room for 3 oops: receiver, argument, method
o sub $12 -> esp
o call threadedBytecodeRoutine
..............................
o mov eax -> $12+esp // receiver
o mov ebx -> $8+esp // argument
// free for local use: $4+esp
o do some primitive; return success cc
..............................
o bcc around // skip the send on success; fall through if the primitive failed
// recall thisContext will now become parentContext
o call polymorphicSend // fills method at $4+esp
around:
// answer is in eax
o add $12 -> esp // pops into limbo: receiver, argument, method
Thus my threaded bytecode interpreter now has the same stack
structure/context layout as Moebius :) and pairs of sub/add can be
consolidated :)
The advantage for my threaded bytecode interpreter is that dynamically
generated native code is more compact when the top-n registers (receiver +
some arguments) are saved by the *called* routine and *never* need to be
restored :) Of course the polymorphicSend expects everything already on
the stack, and will clobber registers.
Furthermore, a method in Moebius begins with:
o push <- parentContext
o push <- vtable
o lea <- thisContext
and cleans both slots off the stack, together with any temps, upon return.
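The matching epilogue is then something like this (just a sketch in the
same notation, assuming 4-byte oops and numTemps temps):
o add $(numTemps*4 + 8) -> esp // drop temps, then vtable and parentContext
o ret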
This prolog/epilog can be done by my prototype's threaded bytecode methods
as well ;)
Always passing the top-n registers instead of pushing them before the
call clearly has an advantage for code compactness, don't you think? In a
similar fashion, native Moebius methods can spill registers (consolidated
pairs of sub/add which manage the room and integrate polymorphicSend's room).
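For example, two back-to-back sends can share one stack adjustment; the
inner add/sub pair simply cancels out (a sketch, where routineA/routineB
stand for any two threaded bytecode routines):
// before:
o sub $12 -> esp
o call routineA
o add $12 -> esp
o sub $12 -> esp
o call routineB
o add $12 -> esp
// after consolidating the adjacent add/sub pair:
o sub $12 -> esp
o call routineA
o call routineB
o add $12 -> esp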
Cheers
Klaus
I changed things a bit in the last version, by moving the preserved
stack pointer down the stack. This is done to conform with materialized
contexts and to ease access to the preserved stack pointer.
The method prologue code is the following:
thisContext stackPointer push. "preserve stack pointer"
self push. "push parent context"
"set current context pointer"
thisContext contextPointer: thisContext stackPointer.
"push vtable"
thisContext stackContextVTable push.
"reserve space for temps"
thisContext stackPointer: thisContext stackPointer - (numTemps * thisContext wordSize).
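Assuming the stack grows downward, and that after a push the stack
pointer points at the last pushed slot, the resulting frame layout is
(higher addresses first):
  preserved stack pointer
  parent context          <- contextPointer
  vtable
  temp 1 .. temp numTemps
  <- stackPointer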
I'm not sure if it will break your code (I hope not). In any case, a
good overview of these things is required before moving forward.
> Always passing the top-n registers instead of pushing them before the
> call clearly has an advantage for code compactness, don't you think? In a
> similar fashion, native Moebius methods can spill registers (consolidated
> pairs of sub/add which manage the room and integrate polymorphicSend's room).
>
It is too good to be true :)
In Moebius the 'callee' methods are those which are the target of a
polymorphic send - you don't need a CompiledMethod for anything else.
And inlined native methods are _inlined_, and therefore optimized to use
any registers and without any stack frames/decorations whatsoever.
The problem with polymorphic sends is that method arguments are
evaluated long before the actual send is performed (entering a compiled
method).
And between pushing the args and activating the method there can be a
lot of processing in between, like method lookup, which, of course,
requires some/lots of spare registers...
Also, consider an expression like:
(self foo: x baz) a: y zork
which makes the whole idea useless; look:
argReg1 <- x "cool"
call #baz
argReg1 <- self "cool"
call #foo:
push returned_value "bad"
argReg1 <- y "cool"
call #zork
argReg1 <- pop "bad"
call #a:
So, you can't avoid using the stack, and this example illustrates that
you need even more instructions for methods which expect the first
n arguments in register(s) instead of on the stack, compared to those
which expect everything on the stack.
It's easy to add 'magic' accessor lambdas like:
<cpu> regA: value
<cpu> regB: value
and
#regA / #regB for accessing, and have the compiler put them to use.
But then the question is: who will implement a clever register spilling
mechanism? :)
Also, most native methods (and even some ST methods, like
accessors) can be compiled without using stack frames at all - a
native code generator with a liveness analyzer can clearly tell whether
or not it needs more than the available CPU registers to perform all of
the method's required operations.
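For example, a simple getter needs no frame at all; it could come out as
something like this (a sketch; the register names, the
receiver-in-a-register convention and instVarOffset are all made up):
mov reg1 <- [receiverReg + instVarOffset] "load the instance variable"
ret "nothing to clean up"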
The only problem I see with such optimizations is debugging :)
You're asking to support debugging of anything & everything, and that
implies some limitations on the generated code.
> Cheers
> Klaus
--
Best regards,
Igor Stasenko AKA sig.
>> The problem with polymorphic sends is that method arguments are
>> evaluated long before the actual send is performed (entering a compiled
>> method).
>> And between pushing the args and activating the method there can be a
>> lot of processing in between, like method lookup, which, of course,
>> requires some/lots of spare registers...
>
> ... whose spillage/temps have room on the stack already ...
>
>> Also, consider an expression like:
>>
>> (self foo: x baz) a: y zork
>>
>> which makes the whole idea useless; look:
>
> ... [ snipped, problem disappeared by design, yes ? ] ...
>
>> So, you can't avoid using the stack, and this example illustrates that
>> you need even more instructions
>
> Have an example which uses more instructions? And did you mean how
> many, or how much?
>
I think you misunderstood the example.
Consider the program flow when all arguments lie on the stack:
<evaluate 1st argument (it can be result of sub-expression, which
requires calling an unknown number of methods)>
<push 1st argument>
<evaluate 2nd argument>
<push 2nd argument>
...
<do method lookup>
<call the method>
and for methods which expect the first n arguments in registers:
<evaluate 1st argument>
<push 1st argument>
<evaluate 2nd argument>
<push 2nd argument>
...
<do method lookup>
<restore first n arguments and put them in registers>
<call the method>
So, it is obvious that the second approach will require more instructions, isn't it?
>> for methods which expect the first
>> n arguments in register(s) instead of on the stack, compared to those
>> which expect everything on the stack.
>
> No, I have
>
> o room for method temps/register spillage
> o variable part for receiver/argument passing
>
> and I don't mix between them. Okay?
>
>> It's easy to add 'magic' accessor lambdas like:
>> <cpu> regA: value
>> <cpu> regB: value
>> and
>> #regA / #regB for accessing, and have the compiler put them to use.
>>
>> But then the question is: who will implement a clever register spilling
>> mechanism? :)
>
> Have a look at TCC's; it's not just clever but also sufficient ;)
>
> - http://bellard.org/tcc/tcc-doc.html#SEC32
>
Very good design :)
Yes, I'm confused. :)
What I think I'm sure of now: a polymorphic send requires all args on
the stack. So this hasn't changed.
Then the question is: what other kinds of sends do you have where you
would want some n arguments to reside in registers instead?
Ahh.. a special selector send.. well, Moebius makes no preference among
selectors, and I think PICs will obviously render such an optimization
redundant.
>
> But instead (an idea while discussing): we could have several
> incarnations of the method lookup routine, which do the mov receiver ->
> $n+esp {; mov argument -> $m+esp}, the same trick as with the threaded
> bytecode example and its special selector case; so code could be even
> more compact for dynamically generated polymorphicSends as well.
>
I can imagine that a method could have two entry points:
point1:
  "load arguments from the stack"
  mov reg1 <- [Sp + offset]
  mov reg2 <- [Sp + offset2]
point2:
  "at this entry point, arguments are in reg1 and reg2"
so the compiler can choose the more preferable entry point depending on
the call site.
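For example, at two different call sites (a sketch; point1/point2 are
just the labels of the two entries above):
call method.point1 "arguments were pushed on the stack"
call method.point2 "arguments were just evaluated into reg1 and reg2"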
But I think this is premature optimization :)
Also, another kind of optimization is for cases like:
obj foo.
obj bar.
Here, a compiler could reuse the receiver value pushed on the stack
after returning from the call to #foo, and instead of pushing the
receiver again it simply doesn't clean the stack, avoiding an extra
load-receiver & push pair.
Such code sequences can be transformed into cascades before entering the
low-level stages of compilation, so the optimization can be implemented
for cascades only.
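Schematically (a sketch, ignoring the selector and argument-count pushes
of a full send):
"before:"
push obj
call #foo
pop "clean the stack"
push obj "load & push the receiver again"
call #bar
pop
"after the cascade transform (obj foo; bar):"
push obj
call #foo "receiver slot is left on the stack"
call #bar "and reused here"
pop "clean up once"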
So, the point is to optimize this given portion of code, which does a
polymorphic send:
ProtoObject >> callBind: selector numberOfArguments: argCount
    "This method is inlined by the compiler"
    <native>
    | vtable fmTable method |
    <variable: #vtable knownAs: #VTable>
    <variable: #method knownAs: #CompiledMethod>
    <variable: #fmTable knownAs: #FixedMethodTable>
    self push.
    selector push.
    argCount push.
    vtable := self vtable.
    fmTable := vtable fixedMethodsTable.
    method := fmTable bindMethod.
    ^ method enter: 3
This method is inlined at each call site and pushes 3 arguments
(receiver, selector and number of arguments) before calling the #bind:
method.
Also, as we agreed before, since the result of binding could be any
object, at the next step the compiler should also inline a call to a
special #activate method (by using the fixedMethodsTable), which is
responsible for activating the given object as a method.
What I'm also thinking is that both things should be merged into a
single one, so the compiler invokes only a single method per call site
which does everything: lookup & bind, then sends #activate to the
result, including #doesNotUnderstand etc.
Which makes me think: how to name such a method?
performSend: selector numArgs: num
and
performSendToSuper: selector numArgs: num
and with PICs it should be:
performSend: selectorIndex numArgs: num
and
performSendToSuper: selectorIndex numArgs: num
where selectorIndex is the index of the PIC entry in the current method,
which can easily be retrieved by knowing the parent context and its
method.
So, instead of passing the selector we pass a PIC entry index, and then
can extract the selector and check the cached lookup value(s) before
doing a real lookup.
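In Smalltalk-ish pseudocode the merged method might look like this (just
a sketch: #literalAt:, #cacheLookup:, #lookupFor: and #activate: as used
here are assumptions, not settled API):
ProtoObject >> performSend: selectorIndex numArgs: num
    "the call-site-info object lives in the literals of the sending method"
    | callSite method |
    callSite := thisContext parentContext method literalAt: selectorIndex.
    "cache check; the policy depends on callSite's class, see below"
    method := callSite cacheLookup: self.
    method isNil ifTrue: [ method := callSite lookupFor: self ]. "miss: real lookup"
    ^ method activate: num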
Btw, some OT about PICs.. I'm really thinking: what if we make PIC
behavior separate from methods? Yes, I'm talking about creating an
instance of PICEntry (or name it as you like) for each call site, and
simply putting it in the method literals.
Then we can pick a slot in the fixedMethodsTable for them, and make the
cache checking flexible and separate from the method :)
As an example, if we go that way, then we could implement this method
for Symbol as well, which would stand for an uncached version of the
lookup.
Then the compiler should ask the class where the method is going to be
installed to provide a class for the call-site info, and create an
instance of it using a special method, like:
"in the compiler"
callSiteClass := sourceClass callSiteInfoClass.
callSiteLiteral := callSiteClass createCallSiteInfoFor: selector.
callSiteLiteralIndex := self putInLiterals: callSiteLiteral.
and then the code performing a send will be:
push receiver
push (literals at: callSiteLiteralIndex)
push numberOfArguments
call #performSend[toSuper] "call using the fixed table, i.e. non-polymorphic"
Now the #performSend method, before doing anything, will send
#cacheLookup: (with the receiver as argument) to the given
call-site-info object.
In this way we can have different kinds of caching:
Symbol - no caching
PICEntry - PIC
PICEntryAndGlobalCache - for doing a PIC check and then, if it fails,
checking a global cache object, which is held in its class var.
CacheForMultidimensionalPolySend - hello from Slate :)
... etc..
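For instance, a one-element PICEntry's check could be as simple as this
(a sketch; the instance variable names cachedVTable/cachedMethod are
made up):
PICEntry >> cacheLookup: receiver
    "answer the cached method on a vtable hit, nil on a miss"
    receiver vtable == cachedVTable ifTrue: [ ^ cachedMethod ].
    ^ nil
Symbol >> cacheLookup: receiver
    "no caching: always force the real lookup"
    ^ nil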
The extra cost of keeping call-site-info entries as separate objects
is 2 words per entry (the header) compared to imprinting the entries
inside a method. But think how much more flexibility we get because of
it.