the whole and everything

Leopold Toetsch

Jul 19, 2004, 3:40:00 PM

to Perl 6 Internals, Dan Sugalski

... I hope is (inline) attached below
leo

FRAMES.TXT

Dan Sugalski

Jul 19, 2004, 4:43:23 PM

to Leopold Toetsch, Perl 6 Internals

At 9:40 PM +0200 7/19/04, Leopold Toetsch wrote:
>Below [1] is a small test program, which basically shows the speed of
>calling generators aka coroutines. But *what* I want to discuss isn't
>restricted to calling coroutines. It's the same (more or less) with
>calling any subroutine-like thingy, be it a method, an overridden
>operator, or an internal method, like array.sort(), if the "optimizer"
>doesn't know what kind of method gets called, because e.g. the method
>was passed in as a function argument.

Leo, we've talked about this before. The sensible and straightforward
thing to do in a case like this is to tag in the sub pmc which
register frames are used by the sub. When a sub is invoked via C
code, those frames and nothing else are saved. It takes 8 bits and
some logical tests which can cut down on the allocating and copying
by only noting which frames are actually touched.

If we have the *calling* sub tag the dirty frames we can cut down on
it even more by only saving the frames that are touched and need
saving. We could even save the explicit save/restore this way if we
wanted to, if we allowed the call functions to update the
continuations we pass in, leaving it up to the calling mechanism to
determine what needs saving and what doesn't.

I'm not sure what you're thinking of with coroutines. Co-routines
which yield out data (rather than returning and resetting) need to
exit via a different mechanism (like, say, the *yield* op) which
saves off the relevant current state in the current sub object.

All sub objects need to keep part of the interpreter structure cached
in them. The context structure almost requires duplication, as do the
opcode-related pointers and info (so each sub can have its own handle
on what opcodes are in force) and some of the other things in there.
(Which ought to be in the context structure anyway, honestly) Subs
and the first invocation of a coroutine should have defaults in
there, while the current data should be copied out into the backing
store when a coroutine yields so it can be put right back again as it
was left when the coroutine exited. The big difference between subs
and coroutine PMCs is that subs only need one copy, while coroutines
need two (A current and a default, with the default restored on
normal exit) and that's arguably only needed in certain circumstances.
--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                        have teddy bears and even
                                      teddy bears get drunk

Leopold Toetsch

Jul 20, 2004, 4:35:35 AM

to Dan Sugalski, perl6-i...@perl.org

Dan Sugalski <d...@sidhe.org> wrote:

> Leo, we've talked about this before. The sensible and straightforward
> thing to do in a case like this is to tag in the sub pmc which
> register frames are used by the sub.

And what, if the sub calls another sub?

The current pushtopp() is already an illegal optimization. We are not
preserving P0, P1, P2 (for methods). So to preserve these, they are
copied up to the upper frame half, where they occupy valuable register
numbers that could otherwise get allocated. The same is true for all
function arguments.

That's all hidden by the PIR code. Look at the generated PASM code.

$ cat o.pir
.sub foo method,prototyped
    .param pmc a
    .param pmc b
    .param pmc c
    bar()
    ... # reuse a,b,c

$ parrot -o- o.pir
foo:
    set P21, P5
    set P20, P6
    set P19, P7
    set P18, P2
    set P16, P1
    ...

5 registers are already used. P0 isn't preserved. That's it.

You want to have improvements in the register allocator. Above is one
reason why it gets into spilling easily.

This is P registers only. We have I0..I4. Should be preserved. S0 for
methods. We don't do it currently. An illegal speed hack. And as said
(you are always snipping important things ;), whenever a subroutine or
method uses some native S or N registers, we end up saving 640
bytes. In *each* function call. And restore 640 bytes on each return.

My proposal showed a way to copy ~640 bytes *once* per subroutine
creation. You didn't even comment on that.

> I'm not sure what you're thinking of with coroutines.

I'm thinking of using 2 continuations for jumping back and forth.

leo

Piers Cawley

Jul 20, 2004, 9:47:13 AM

to l...@toetsch.at, Dan Sugalski, perl6-i...@perl.org

Leopold Toetsch <l...@toetsch.at> writes:

> Dan Sugalski <d...@sidhe.org> wrote:
>
>> Leo, we've talked about this before. The sensible and straightforward
>> thing to do in a case like this is to tag in the sub pmc which
>> register frames are used by the sub.
>
> And what, if the sub calls another sub?

Then the call to the inner sub saves the registers that are used by the
inner sub.

Dan Sugalski

Jul 20, 2004, 9:55:53 AM

to l...@toetsch.at, perl6-i...@perl.org

At 10:35 AM +0200 7/20/04, Leopold Toetsch wrote:
>Dan Sugalski <d...@sidhe.org> wrote:
>
>> Leo, we've talked about this before. The sensible and straightforward
>> thing to do in a case like this is to tag in the sub pmc which
>> register frames are used by the sub.
>
>And what, if the sub calls another sub?

That's straightforward, and I didn't flesh this out properly.

For this to work, you need two things:

1) A per-sub mask of register frames it uses
2) A currently dirty frame marker

When a sub is called (however it's called) we:

a) Do a logical and of the sub's used mask and the dirty mask
b) Save those frames
c) Set the mask to a logical or of the used and dirty mask

And yes, this will, with sufficient call depth, result in an
all-bits-set dirty mask, which is also why we allow bytecode to
*unset* bits in the dirty frame marker, but only if those bits are
set in the sub's mask of frames it uses.
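
A minimal C sketch of steps (a) to (c) above, assuming one bit per register half-frame in an 8-bit mask. The type and function names (`frame_mask`, `interp_state`, `sub_info`, `enter_sub`) are hypothetical and do not come from the Parrot source; the actual frame copy is left as a comment.

```c
#include <stdint.h>

/* Hypothetical layout: 8 bits, one per register half-frame
 * (4 register types x 2 halves). Names are illustrative only. */
typedef uint8_t frame_mask;

typedef struct {
    frame_mask dirty;   /* frames that currently hold live data */
} interp_state;

typedef struct {
    frame_mask used;    /* frames the body of this sub touches */
} sub_info;

/* Returns the set of frames the call has to save and updates the
 * dirty marker so that it also covers what the callee will touch. */
static frame_mask enter_sub(interp_state *interp, const sub_info *sub)
{
    frame_mask to_save = sub->used & interp->dirty;   /* step (a) */
    /* step (b): a real implementation would copy the frames
     * selected by to_save to a backing store here */
    interp->dirty = sub->used | interp->dirty;        /* step (c) */
    return to_save;
}
```

With a dirty marker of 0x0F and a sub whose used mask is 0x3C, only the overlap 0x0C would be saved, and the marker would become 0x3F for the duration of the call.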

We probably need to move the sub-being-called PMC into a separate
area and out of the P registers, the same way as we need to move the
return continuation out.

>This is P registers only. We have I0..I4. Should be preserved. S0 for
>methods. We don't do it currently.

Since we're *caller* save, the only time those registers should be
saved is if the *caller* cares about them, or the *caller* can't be
sure. The only time the caller can't be sure is when we're calling
into a vtable method. (If the caller fails to save registers on a
regular sub or method call, that's their own problem) Those are
relatively rare, and as such I'm not that inclined to optimize for
them at the expense of mainline code.

Having said that, I think the scheme outlined above handles the
issues of vtable calling and lets us toss the manual pushtop/savetop
around sub and method calls if we choose to do that.

>An illegal speed hack. And as said
>(you are always snipping important things ;),

No, I snip when you head way off into the west. You tend to get too
tied up in the details of the current implementation. I don't much
give a damn about the implementation (other than at a meta-level--the
code should be documented, commented, follow the coding conventions
(which, yes, I'm sometimes guilty of not) and understandable) since
it's often not relevant other than to point out places where the
architecture has issues.

> whenever a subroutine or
>method will use some native S or N registers, we end up saving 640
>bytes. In *each* function call. And restore 640 bytes on each return.
>
>My proposal did show a way, how to copy ~640 bytes *once* per subroutine
>creation. You didn't even comment that.

Well, it seemed obviously wrong, and inefficient in most cases, so I
didn't. If you don't like the scheme I outlined above I can go into
more detail.

> > I'm not sure what you're thinking of with coroutines.
>
>I'm thinking of using 2 continuations for jumping back and forth.

That can't work quite right as it stands, though I just realized what
we need to do to make it work. Requires we have a return continuation
slot in the interpreter structure separate from the P registers.
(Which we've sorta planned on anyway) Otherwise exception handlers
won't nest right.

Leopold Toetsch

Jul 20, 2004, 10:59:11 AM

to Dan Sugalski, perl6-i...@perl.org

Dan Sugalski <d...@sidhe.org> wrote:
> At 10:35 AM +0200 7/20/04, Leopold Toetsch wrote:

> And yes, this will, with sufficient call depth, result in an
> all-bits-set dirty mask, which is also why we allow bytecode to
> *unset* bits in the dirty frame marker, but only if those bits are
> set in the sub's mask of frames it uses.

How is the dirty mask usable when bits are reset? Anyway, any code
making use of Parrot native register types will have to preserve all
most of the time.

>> whenever a subroutine or
>>method will use some native S or N registers, we end up saving 640
>>bytes. In *each* function call. And restore 640 bytes on each return.
>>
>>My proposal did show a way, how to copy ~640 bytes *once* per subroutine
>>creation. You didn't even comment that.

> Well, it seemed obviously wrong,

Why?

> ... and inefficient in most cases, so I
> didn't.

Copying 640 bytes once, or 640 bytes * 2 * nr of calls? What is
inefficient?
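
The comparison being asked about is plain arithmetic. The sketch below takes the 640-byte figure from the thread at face value; the function names and any call counts are illustrative assumptions, and it counts copied bytes only, not allocation or indirection costs.

```c
/* Size of one full register save, as quoted in the thread. */
#define FRAME_BYTES 640ul

/* Bytes copied when every call saves and every return restores. */
static unsigned long per_call_bytes(unsigned long calls)
{
    return FRAME_BYTES * 2ul * calls;
}

/* Bytes copied when a frame is set up once per subroutine. */
static unsigned long per_sub_bytes(unsigned long subs)
{
    return FRAME_BYTES * subs;
}
```

For a hypothetical 1000 calls to a single sub, the first function gives 1,280,000 bytes and the second gives 640.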

> ... If you don't like the scheme I outlined above I can go into
> more detail.

I'm all for a better scheme. Moving P0-P2, S0, I0-I4 somewhere else is
fine. BTW five registers (I0-I4) for information that fits into one is
overkill anyway.

leo

Dan Sugalski

Jul 20, 2004, 11:34:32 AM

to l...@toetsch.at, perl6-i...@perl.org

At 4:59 PM +0200 7/20/04, Leopold Toetsch wrote:
>Dan Sugalski <d...@sidhe.org> wrote:
>> At 10:35 AM +0200 7/20/04, Leopold Toetsch wrote:
>
>> And yes, this will, with sufficient call depth, result in an
>> all-bits-set dirty mask, which is also why we allow bytecode to
>> *unset* bits in the dirty frame marker, but only if those bits are
>> set in the sub's mask of frames it uses.
>
>How is the dirty mask usable when bits are reset?

Since you'd only be allowed to reset bits that were set in your used
mask (that is, for frames that *must* have been saved on entry) then
it's usable just fine.

For example, if somewhere in your code you *only* use P registers,
you could do:

unset_used 11111100b # assuming registers in INSP order, two bits each

at the start, and turn off saving of the I, N, and S registers. Any
function call (or at least any call made through a vtable or other
mechanism where the caller can't know it's making a call) from then
on wouldn't save them.

If we go with this unconditionally and drop the requirement for the
caller to save the frames it's interested in (counting, instead, on
this mechanism to do it universally) then only the bits that were set
in the current function's 'used bits' flag (and, thus, saved when we
entered this function) could be unset.
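
A small C sketch of the rule that only bits set in the current sub's used mask may be cleared. It assumes the bit layout implied by the comment on the `unset_used` line (I in the top two bits, then N, S, and P in the bottom two); the enum names and the three-argument function signature are hypothetical.

```c
#include <stdint.h>

/* Assumed layout, following "INSP order, two bits each":
 * I = bits 7-6, N = bits 5-4, S = bits 3-2, P = bits 1-0. */
enum {
    FRAMES_I = 0xC0,
    FRAMES_N = 0x30,
    FRAMES_S = 0x0C,
    FRAMES_P = 0x03
};

/* Clear the requested bits in the dirty marker, but only those the
 * current sub declared in its own used mask, since only those frames
 * are known to have been saved on entry. Returns the new marker. */
static uint8_t unset_used(uint8_t dirty, uint8_t sub_used, uint8_t request)
{
    uint8_t allowed = request & sub_used;
    return (uint8_t)(dirty & ~allowed);
}
```

With a fully set marker (0xFF) and a sub that uses every frame, a request of `FRAMES_I | FRAMES_N | FRAMES_S` (0xFC) leaves only the P bits (0x03) set. If the sub's used mask were only 0x0F, the same request could clear just the S bits, giving 0xF3.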

> Anyway, any code
>making use of Parrot native register types will have to preserve all
>most of the time.

Nonsense. The low frame of I, S, and N registers will rarely need
saving because they're rarely dirtied in the normal course of
affairs, and when they *are* dirty then they'll need to be saved
regardless.

Once again, this is *only* an issue when making calls when the caller
can't know that a call's being made. This should, in general, be
unusual. When we *do* know we're making a call then we save those
registers that we care about which, at the point of the call, should
generally not be all of them. (And yes, this may require some code
analysis to see what's used at this point, or used from this point to
the next call)

> >> whenever a subroutine or
>>>method will use some native S or N registers, we end up saving 640
>>>bytes. In *each* function call. And restore 640 bytes on each return.
>>>
>>>My proposal did show a way, how to copy ~640 bytes *once* per subroutine
>>>creation. You didn't even comment that.
>
>> Well, it seemed obviously wrong,
>
>Why?
>
>> ... and inefficient in most cases, so I
>> didn't.
>
>Copying 640 bytes once, or 640 bytes * 2 * nr of calls? What is
>inefficient?

This *only* makes a difference for vtable functions written in
bytecode. For normal code we're already copying the frames in and out
when we make a call and there's no way around that.

> > ... If you don't like the scheme I outlined above I can go into
>> more detail.
>
>I'm all for a better scheme. Moving P0-P2, S0, I0-I4 somewhere else is
>fine.

The only downside is it makes the continuations larger, since they
need to preserve this information. OTOH if everyone's saving it
anyway we might as well.

>BTW five registers (I0-I4) for information that fits into one is
>overkill anyway.

Erm... no. You at least need 5 bytes, so that's two registers, (or
two words somewhere) and using full words rather than bytes is faster
since you skip the mask and shift on most processors.
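
To illustrate the mask-and-shift point, here is a C sketch comparing five byte-sized values packed into one 64-bit quantity (two 32-bit words) with five full-word fields. All names are hypothetical, and the sketch says nothing about what the five values mean.

```c
#include <stdint.h>

/* Packed form: five byte-sized values in one 64-bit quantity.
 * Every read needs a shift and a mask. */
static unsigned packed_get(uint64_t packed, unsigned index)
{
    return (unsigned)((packed >> (8u * index)) & 0xFFu);
}

/* Every write needs to clear the old byte, then shift the new one in. */
static uint64_t packed_set(uint64_t packed, unsigned index, unsigned value)
{
    unsigned shift = 8u * index;
    packed &= ~((uint64_t)0xFF << shift);
    return packed | ((uint64_t)(value & 0xFFu) << shift);
}

/* Full-word form: one machine word per value; a read is a plain load. */
typedef struct {
    intptr_t field[5];
} call_info;

static intptr_t word_get(const call_info *info, unsigned index)
{
    return info->field[index];
}
```

The packed form is smaller, while the full-word form trades space for simpler access, which is the trade-off described above.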

Hell with it, let's just do it. Gimme a bit and I'll spec the ops to
get/set the metainformation for the call things. It'll also make
doing stack back traces a damn sight easier, which'll be a win.

So much for not changing the calling conventions. :( (Unless we want
to have invoke and its friends automagically move this info over for
now to give people time to transition)

Leopold Toetsch

Jul 20, 2004, 11:56:17 AM

to Dan Sugalski, perl6-i...@perl.org

Dan Sugalski <d...@sidhe.org> wrote:
>>
>>Copying 640 bytes once, or 640 bytes * 2 * nr of calls? What is
>>inefficient?

> This *only* makes a difference for vtable functions written in
> bytecode. For normal code we're already copying the frames in and out
> when we make a call and there's no way around that.

Did you even read my proposal? It works[1] for *all* subroutines. I'm not
talking about calls from within C.

leo

[1] unless proven wrong.

Dan Sugalski

Jul 20, 2004, 12:21:31 PM

to l...@toetsch.at, perl6-i...@perl.org

Yeah, I read the proposal. It's desperately un-thread-safe, which is
one of the things that didn't make it out in my last reply. You're
moving state data out of the interpreter structure, which is
threadsafe and can't be shareable between threads, into the sub pmc,
which is shareable. You can't cache state data like this, it means
that you can't have two or more threads in the same sub PMC at once.
That just won't work.
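
The hazard can be shown without real threads by fixing one unlucky interleaving. The C sketch below is a deliberately simplified, hypothetical model in which a shared sub object keeps a single pointer to its most recent frame; it is not Parrot code, and the names `shared_sub`, `enter`, and `leave` are made up for illustration.

```c
#include <stddef.h>

/* One activation of the sub, tagged with the interpreter that made it. */
typedef struct frame {
    int owner;             /* id of the interpreter that entered */
    struct frame *prev;    /* frame that was current before this one */
} frame;

/* State cached in the sub object, which both interpreters share. */
typedef struct {
    frame *current;
} shared_sub;

/* Enter the sub: push a frame onto the sub's own chain. */
static void enter(shared_sub *sub, frame *f, int owner)
{
    f->owner = owner;
    f->prev = sub->current;
    sub->current = f;
}

/* Leave the sub: pop whatever frame is on top and report its owner. */
static int leave(shared_sub *sub)
{
    frame *f = sub->current;
    sub->current = f->prev;
    return f->owner;
}
```

If interpreter 1 enters, interpreter 2 enters, and interpreter 1 then returns first, `leave` pops the frame that belongs to interpreter 2. Without locking or a per-thread copy of the sub, the returning thread would resume with the wrong state.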

Larry Wall

Jul 20, 2004, 12:43:31 PM

to perl6-i...@perl.org

On Tue, Jul 20, 2004 at 11:34:32AM -0400, Dan Sugalski wrote:
: So much for not changing the calling conventions. :(

I think most of us would agree that you're allowed to break anything
you like this week. Worry about unbreaking things after OSCON...

Larry

Leopold Toetsch

Jul 20, 2004, 12:46:52 PM

to Dan Sugalski, perl6-i...@perl.org

Dan Sugalski <d...@sidhe.org> wrote:

> ... It's desperately un-thread-safe, which is
> one of the things that didn't make it out in my last reply.

Your recent words related to threads were: we don't optimize for threaded
programs. We optimize for the common case, that is single-threaded.

I forgot that in my proposal. Subroutine PMCs need duplication for
new threads.

> .. You can't cache state data like this, it means
> that you can't have two or more threads in the same sub PMC at once.

Yes, of course. So with the small cost of duplicating sub PMCs for
multiple threads we can toss all register saving code.

We have to reJIT and re-prederef for threads anyway. I said that it's
probably simplest to attach the code directly to the sub, or make each
sub a code segment.

leo

Dan Sugalski

Jul 20, 2004, 1:19:00 PM

to perl6-i...@perl.org

I'm not worried about OSCON. And honestly getting Parrot right is a
lot more important than a pie-throwing contest with Guido. (I'll take
a pie before we mutate parrot for that)

Dan Sugalski

Jul 20, 2004, 1:30:23 PM

to l...@toetsch.at, perl6-i...@perl.org

At 6:46 PM +0200 7/20/04, Leopold Toetsch wrote:
>Dan Sugalski <d...@sidhe.org> wrote:
>
>> ... It's desperately un-thread-safe, which is
>> one of the things that didn't make it out in my last reply.
>
>Your recent words related to threads were: we don't optimize for threaded
>programs. We optimize for the common case, that is single-threaded.
>
>I forgot that in my proposal. Subroutine PMCs need duplication for
>new threads.

That doesn't work for closures, which can be shared across threads in
a pool. Also, in the face of continuations the cache is useful
exactly *once* (since from then on there might be a handle on it),
after which you have to recreate anyway. The only way the cache would
be useful would be if you created it when the sub pmc is initially
created, but then you're going to end up creating frames that won't
otherwise be used, which is a waste, with no savings anywhere else,
since the cache doesn't buy you anything anyway, as it's one-use.

> > .. You can't cache state data like this, it means
>> that you can't have two or more threads in the same sub PMC at once.
>
>Yes. of course. So with the small cost of duplicating sub PMCs for
>multiple threads we can toss all register saving code.

Sorry, no, that just doesn't work. You still end up with a lot of
allocation and copying time, and add in an extra level of indirection
to register access for a 2-3% speed hit in general. I don't see it as
a win.

Leopold Toetsch

Jul 20, 2004, 1:24:52 PM

to perl6-i...@perl.org

There is no need to change these things before OSCON. All benches that
run are running from equal speed C<d=dict.fromkeys(xrange(1000000))> up
to around 3 times the speed of Python, e.g. the PI() coroutine in b2.py
:)

> Larry

leo

Leopold Toetsch

Jul 20, 2004, 3:18:55 PM

to Dan Sugalski, perl6-i...@perl.org

Dan Sugalski <d...@sidhe.org> wrote:
> At 6:46 PM +0200 7/20/04, Leopold Toetsch wrote:

>>I forgot that in my proposal. Subroutine PMCs need duplication for
>>new threads.

> That doesn't work for closures, which can be shared across threads in
> a pool.

We don't have shared closures. We don't have thread pools. Not in any
thread papers here on the list. "We don't optimize for threads" - Dan.

Can you please elaborate a bit more on that issue, and on why these
closures need to be shared?

> ... Also, in the face of continuations the cache is useful
> exactly *once*

No. The whole frame is the continuation. It's holding exactly the
interpreter state at the time of calling into the sub. Including
registers, which makes register preserving obsolete.

leo

Dan Sugalski

Jul 20, 2004, 4:00:51 PM

to l...@toetsch.at, perl6-i...@perl.org

At 9:18 PM +0200 7/20/04, Leopold Toetsch wrote:
>Dan Sugalski <d...@sidhe.org> wrote:
>> At 6:46 PM +0200 7/20/04, Leopold Toetsch wrote:
>
>>>I forgot that in my proposal. Subroutine PMCs need duplication for
>>>new threads.
>
>> That doesn't work for closures, which can be shared across threads in
>> a pool.
>
>We don't have shared closures. We don't have thread pools. Not in any
>thread papers here on the list.

This all came up with the big thread blowup at the beginning of the year.

http://www.nntp.perl.org/group/perl.perl6.internals/20181

amongst others.

>"We don't optimize for threads" - Dan.

Yeah, but neither do we design in a way that makes threading
impossible or phenomenally difficult.

>Can you please elaborate a bit more on that issue. And why these
>closures need to be shared.

Any closure, or any other anonymous sub can be stuck into a PMC.
Since they can be shared that means they'll cross threads. I suppose
you could defer cloning its info until it was actually shared.

Still, this scheme means as soon as you fire up a thread you have to
go and clone off the interpreter information on all the named
subroutines, which is a lot of work.

> > ... Also, in the face of continuations the cache is useful
>> exactly *once*
>
>No. The whole frame is the continuation. It's holding exactly the
>interpreter state at the time of calling into the sub. Including
>registers, which makes register preserving obsolete.

Which means that as soon as you use it the first time it becomes
useless, since you can't know when it stops being used. That makes
them one use--every time you enter the sub you need a new one, just
as if you were calling in recursively.

It's just not going to fly, Leo.

Larry Wall

Jul 21, 2004, 11:41:44 AM

to perl6-i...@perl.org

On Tue, Jul 20, 2004 at 07:24:52PM +0200, Leopold Toetsch wrote:
: There is no need to change these things before OSCON. All benches that
: run are running from equal speed C<d=dict.fromkeys(xrange(1000000))> up
: to around 3 times the speed of Python, e.g. the PI() coroutine in b2.py
: :)

So Evil Larry™ suggests that you embed a Python interpreter and hand off
the unsuccessful tests back to Python. :-)

However, Good Larry® does in fact appreciate the notion of integrity
even in the face of flying cream pies...

Larry

Leopold Toetsch

Jul 22, 2004, 6:54:08 AM

to Larry Wall, perl6-i...@perl.org

Larry Wall <la...@wall.org> wrote:

> So Evil Larry™ suggests that you embed a Python interpreter and hand off
> the unsuccessful tests back to Python. :-)

Good idea. Here is bx.pir from Evil Leo:

$ cat bx.pir
.sub main @MAIN
    .param pmc argv
    .const string PY = '/usr/local/bin/python -O '
    .local pmc pipe
    .local string cmd
    .local string file
    file = argv[1]
    cmd = PY . file
    open pipe, cmd, "-|"
    .local string res
lp:
    read res, pipe, 4096
    print res
    $I0 = length res
    if $I0 goto lp
    close pipe
.end

$ time parrot bx.pir b0.py
3141592653
3141592653

real 0m4.677s
user 0m0.040s
sys 0m0.010s

Surprise, it's equally fast :)

> Larry

leo

Leopold Toetsch

Jul 25, 2004, 5:45:48 AM

to Dan Sugalski, perl6-i...@perl.org

Dan Sugalski <d...@sidhe.org> wrote:
> At 9:18 PM +0200 7/20/04, Leopold Toetsch wrote:

[ I have to come back to my proposed scheme ]

>>No. The whole frame is the continuation. It's holding exactly the
>>interpreter state at the time of calling into the sub. Including
>>registers, which makes register preserving obsolete.

> Which means that as soon as you use it the first time it becomes
> useless, since you can't know when it stops being used. That makes
> them one use--every time you enter the sub you need a new one, just
> as if you were calling in recursively.

I don't think so. Since it might not be entirely clear what it would
look like, here are two code snippets showing the basics of my idea.

1.) Calling a sub

next = sub_pmc->invoke->(&interp, sub_pmc, next);
                         ^^^^^^^

void *
invoke(Interp** interp, PMC* self, void *next) {
    Interp *caller = *interp;
    parrot_sub_t sub = PMC_sub(self);
    Interp *frame = sub->frame;
    if (!frame || frame->caller)
        frame = copy_interp(caller);
    else
        update_context(frame, caller);
    frame->prev = sub->frame;
    sub->frame = frame;
    frame->caller = caller;
    frame->sub_pmc = self;
    copy_func_params(frame, caller);
    *interp = frame;
    next = switch_to_segment(frame, sub->seg);
    return next;
}

2.) return from a sub

PMC *self = interp->sub_pmc;
next = self->vtable->return(&interp, self)

void *
return(Interp** interp, PMC* self) {
    parrot_sub_t sub = PMC_sub(self);
    Interp *frame = sub->frame;
    Interp *caller = frame->caller;
    Interp *prev = sub->frame = frame->prev;
    if (prev && !prev->caller)
        add_frame_cache(prev, sub); [1]
    copy_return_values(caller, frame);
    *interp = caller;
    return switch_to_segment(caller, sub->seg);
}

> It's just not going to fly, Leo.

On the contrary ;)

I know that there are issues with threads. But we have to duplicate or
C<< pmc->vtable->share >> anyway. Copying the sub PMC is cheap; much
more expensive is that we have to reJIT (and/or re-predereference) the
code, because these run loops have absolute register addresses. Doing
that now at a much more fine-grained level, i.e. per subroutine that is
executed in a thread and not per file, might even be a big win.

The above scheme is thread-safe if there are locks around the code. This
could be an alternative to duplicating subroutine PMCs, it depends. But
for now, I'd like to get single-threaded execution running - fast.

[1] If there is no caller on that frame, it would go into some kind of
frame cache, which C<copy_interp()> could use, or it could be kept for
that sub, probably depending on memory usage.
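
The frame-reuse idea in the two snippets can be modelled in a few lines of standalone C. This is a simplified sketch under stated assumptions, not the proposed patch: the `frame` and `sub` structs, the `allocations` counter, and the rule of freeing a recursive frame on return (instead of placing it in a frame cache) are all made up for illustration.

```c
#include <stdlib.h>

typedef struct frame {
    struct frame *caller;   /* non-NULL while this frame is executing */
    struct frame *prev;     /* older frame attached to the same sub */
    long regs[4];           /* stand-in for the register file */
} frame;

typedef struct {
    frame *top;             /* most recent frame attached to this sub */
    unsigned allocations;   /* fresh frames created, for illustration */
} sub;

/* Enter a sub: reuse its idle frame, or allocate a new one when there
 * is none yet or the existing one is still active (recursion). */
static frame *invoke(sub *s, frame *caller)
{
    frame *f = s->top;
    if (f == NULL || f->caller != NULL) {
        f = calloc(1, sizeof *f);
        if (f == NULL)
            abort();
        f->prev = s->top;
        s->top = f;
        s->allocations++;
    }
    f->caller = caller;
    return f;
}

/* Leave a sub: mark the frame idle and hand control back to the
 * caller. A frame created for recursion is released here; the
 * outermost frame stays attached to the sub for the next call. */
static frame *ret(sub *s)
{
    frame *f = s->top;
    frame *caller = f->caller;
    f->caller = NULL;
    if (f->prev != NULL) {
        s->top = f->prev;
        free(f);
    }
    return caller;
}
```

In this model, repeated non-recursive calls reuse a single frame, so no registers are copied on the call path. A recursive call allocates a second frame, which corresponds to the `frame->caller` test in `invoke` above.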

leo

Leopold Toetsch

Jul 27, 2004, 7:21:47 AM

to Perl 6 Internals, Dan Sugalski

Leopold Toetsch wrote:

[ proposal about a new function calling scheme ]

Attached is a minimal patch that shows the concept of the proposed
function calling scheme.
* works only with "function-less" run-cores [1]
* the 2 new opcodes "mycall" and "return" abuse the _pointer_keyed
vtable slots of sub.pmc
* no recursive calls (these would require the prederefed code to be
attached to the sub)

Here are timings of the attached programs:

$ time parrot -j -Oc c.imc
in main
200000

real 0m1.069s
user 0m0.910s
sys 0m0.020s

$ time parrot -C ch.imc
in main
200000

real 0m0.356s
user 0m0.250s
sys 0m0.000s

I think that a factor of 3 improvement in function call speed (and a
factor of ~4 for overloaded vtables) is an argument to have a closer
look at this scheme. Here again are the key ideas:

* the interpreter structure is the context, and the continuation
* instead of pushing registers onto the register frame stacks an
interpreter structure gets attached to each sub, and the
subsequent code is running with that interpreter

Comments welcome,
leo

[1] switch, cgoto, CGP. Other run-cores would need a special
restart-notification that the interpreter had changed.
JIT needs a bit more.

hack_42.patch

c.imc

ch.imc

Luke Palmer

Jul 27, 2004, 2:10:46 PM

to Leopold Toetsch, Perl 6 Internals, Dan Sugalski

Leopold Toetsch writes:
> $ time parrot -j -Oc c.imc
> in main
> 200000
>
> real 0m1.069s
> user 0m0.910s
> sys 0m0.020s
>
> $ time parrot -C ch.imc
> in main
> 200000
>
> real 0m0.356s
> user 0m0.250s
> sys 0m0.000s
>
> I think that a factor 3 improvement in function call speed (and a factor
> of ~4 for overloaded vtables) is an argument to have a closer look at
> this scheme. Here are again the key ideas:
>
> * the interpreter structure is the context, and the continuation
> * instead of pushing registers onto the register frame stacks an
> interpreter structure gets attached to each sub, and the
> subsequent code is running with that interpreter

Yes, this scheme is definitely worth a look. I'm currently battling
Perl 5's function call speed in my own projects; i.e.:

for (1..4000) {
    # OpenGL calls
}
# Fast enough (30 FPS)

-----

sub draw {
    # OpenGL calls
}
for (1..4000) {
    draw()
}
# Not fast enough (18 FPS)

This isn't something I want to have to battle. When a language severely
limits my ability to abstract, I consider not using that language in
favor of something that doesn't punish me. I really don't want to give
C++ preference to Perl 6.

Luke