The current calling conventions are optimized for the case where the
caller knows that a call is being made. They work well for that, and
I'm relatively happy with how things are functioning. (Nothing's
perfect, alas.) The current convention falls down badly for more
restricted or top-level calls--that is, calls that have a very
restricted calling convention or that look like there is no parent
bytecode. Both vtable and opcode functions fall into this category.
So. Two options. We can have horribly slow calls into bytecode that
implement vtable and opcode functions. Or we can have alternative
calling conventions for 'special' functions. (No, "alter the normal
calling conventions" isn't an option :)
Personally I'm up for alternative calling conventions in this case.
These functions *are* special, and treating them as such isn't
unwarranted. There's stuff that can't happen, like continuations
escaping, and returns exit a runloop rather than just transferring
bytecode control. There's no ambiguity or runtime
setting of inbound parameters either--parameters are fixed and known
at compiletime. As such, I'm thinking that it's fine to strip things
down a lot.
With that in mind, I'm up for some discussion of how to make these
look, and how the subs should behave WRT calling in and returning
things. So... go for it, and let's get this out of the way so we can
implement it for 0.1.1 and speed up objects just the tiniest bit. :)
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk
I don't think that calling conventions are actually a problem with
overridden vtable methods or such. Just the opposite - they are fine. A
HLL compiler and imcc both know one way to spit out the code for a sub -
be it a normal one, a method call, or an overridden vtable or opcode.
But - and that's likely the reason for your mail - it's a bit slow (not
horribly any more - but still slow).
So let's investigate the individual steps of an overridden vtable
method call, i.e. the delegate code:
1) Register preserving
We can't do much about that - unless we switch to a scheme where such
method functions have to preserve their registers - but that violates
symmetry and is a penalty if such a function is called directly.
Register preserving is optimized - it reuses allocated register frame
memory with a free list and doesn't take much time.
2) Method lookup
That's currently two hash lookups: one for the namespace, one for the
method. I've sped that up by using the hash functions directly instead
of the PerlHash interface. Using a method cache (or hoisting the
namespace PMC lookup out of the loop) reduces that to one hash lookup.
3) Setting up registers according to PCC
This boils down to almost nothing with the JIT core (~6 machine
instructions). The fib benchmark shows that nicely.
4) Setting up method arguments
That's currently done via the signature string. It loops over the
signature and moves the va_list-type arguments passed in into
registers. It shouldn't take much time - we typically have only 3
arguments. It could be hard-coded again, as in your first version;
OTOH the signature string keeps it flexible.
5) Creating a return continuation. Could be optimized away, *if* we
know that it's always a method sub and runs in its own interpreter
loop. An <end> opcode would do it. OTOH we might need it to restore
some context items. We could keep some return continuations around (in
a free_list) and only update their context: see the C<updatecc> opcode.
6) Reentering the run loop
This currently needs 5 function calls:
- runops pushes a new Parrot_exception
- runops_ex is a currently needed ugly hack to allow intersegment
branches (i.e. evaled code has a "goto main" inside)
- runops_int handles resumable opcodes like C<trace>
- runops_xxx does run loop specific setup, like JITting the code
if it isn't yet JITted.
- the runloop itself finally
We can call into some inner runops if a method call doesn't need all
this setup. We can also call a specialized runops wrapper that
shortcuts this setup. That doesn't achieve much though: see below.
7) Leaving all these run loops
8) Return value handling, if any
9) Register frame restore
So when the above sequence is run for a new object, we additionally
have the cost of constructing the object itself. It was already
discussed how to speed that up with a different object layout that
doesn't use any aggregate PMC containers.
Finally, some current timing results (parrot -O3, Athlon 800, JIT core):

  Create 100,000 new PerlInts + 100,000 invokecc __init   0.24 s
  Create 100,000 new delegate PMCs and call __init        0.60 s
  same, call runops_int directly                          0.57 s
  Create 100,000 new objects and call __init              1.00 s
Object instantiation is 40% of the total time. Let's start by
optimizing the object layout.
> Object instantiation is 40% of the total time. Let's start by
> optimizing the object layout.
I think there's definitely the potential for a big speed-up there.
For instance, simply replacing the Array that used to store the
class, classname and attributes gives me a speed-up of about 15%
on the object benchmarks; getting rid of the indirection entirely
should be a much bigger win.