Dan Sugalski <d...
> I've been thinking about vtable and opcode functions written in
> bytecode, and I think that we need an alternate form of sub calling.
> (And yes, this *is* everyone's chance to say "I told you so")
I don't think that the calling conventions are actually a problem for
overridden vtable methods and such. Just the opposite - they are fine. A
HLL compiler and imcc both know one way to emit the code for a sub,
be it a plain sub, a method call, or an overridden vtable or opcode.
But - and that's likely the reason for your mail - it's a bit slow (not
horribly so any more - but still slow).
So let's investigate the individual steps of an overridden vtable
method call, i.e. the delegate code:
1) Register preserving
We can't do much about that - except by switching to a scheme where such
method functions have to preserve their registers themselves - but that
violates symmetry and is a penalty if such a function is called directly.
Register preserving is already optimized - it reuses allocated register
frame memory via a free list and doesn't take much time.
2) Method lookup
That's currently two hash look ups: one for the namespace, one for the
method. I've sped that up by using the hash functions directly instead
of the PerlHash interface. Using a method cache (or hoisting the
namespace PMC out of the loop) reduces that to one hash look up.
3) Setting up registers according to PCC
This boils down to nothing with the JIT core (~6 machine instructions)
The fib benchmark shows that nicely.
4) Setting up method arguments
That currently uses the signature string. It loops over the signature
and moves the va_list arguments passed in into registers. This
shouldn't take much time - we typically have only 3 arguments.
It could be hard coded again, like in your first version. OTOH this is
the more general scheme.
5) Creating a return continuation. This could be optimized away, *if* we
   know that it's always a method sub and is run in its own interpreter
   loop. An <end> opcode would do it. OTOH we might need it to restore
   some context items. We could keep some return continuations around
   (in a free list) and only update their context: see the C<updatecc>
   opcode.
6) Reentering the run loop
This currently needs 5 function calls:
- runops pushes a new Parrot_exception
- runops_ex is a currently needed ugly hack to allow intersegment
branches (i.e. evaled code has a "goto main" inside)
- runops_int handles resumable opcodes like C<trace>
- runops_xxx does run loop specific setup, like JITting the code
if it isn't yet JITted.
- the runloop itself finally
We could call into some inner runops if a method call doesn't need all
this setup. We could also call a specialized runops wrapper that
shortcuts this setup. That doesn't achieve much though; see below.
7) Leaving all these run loops
8) Return value handling, if any
9) register frame restore
So when the above sequence is run for a new object, we additionally
have the cost of creating the object itself.
It was already discussed how to speed that up with a different object
layout and not using any aggregate PMC containers.
Finally, some current timing results (parrot -O3, Athlon 800, JIT core):

  Create 100,000 new PerlInts + 100,000 invokecc __init   0.24 s
  Create 100,000 new delegate PMCs and call __init        0.60 s
  same, but calling runops_int directly                   0.57 s
  Create 100,000 new objects and call __init              1.00 s
Object instantiation accounts for 40% of the total time. Let's start
by optimizing the object layout first.