I have an idea about reordering how PCC works with CallContexts.
At the moment a CallContext is created by Parrot_pcc_build_signature on
the caller side and stored inside the current CallContext. This is a most
unnatural way of calling Subs and leads us to quite a few issues:
1. We don't always need a fresh CallContext on the callee side. For
example, C functions/methods don't have to have one.
2. A corollary of 1 is increased pressure on GC.
3. Another corollary of 1 is that a Sub PMC (and inherited PMCs such as
Coroutine, Continuation, etc.) cannot create the CallContext and the
storage for registers in a single call, which is "suboptimal" (i.e.
crappy slow).
I want to change it the other way around:
1. Caller fills params in current CallContext.
2. Callee creates new CallContext (if needed).
3. Callee unpacks params from parent CallContext.
After implementing this we can optimize CallContext creation to
allocate storage for registers in a single call.
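
The three steps above can be sketched as a toy model. This is only an
illustration in Python, not Parrot's C code: the CallContext class, the
outgoing_args field and both helper functions are hypothetical names, and
the sketch assumes the callee has at least as many registers as arguments.

```python
class CallContext:
    """Toy stand-in for a Parrot CallContext (not the real C structure)."""

    def __init__(self, parent=None, num_regs=0):
        self.parent = parent
        self.outgoing_args = []             # params staged for the callee
        self.registers = [None] * num_regs  # allocated together with the context


def call_pir_sub(caller_ctx, args, num_regs):
    # 1. Caller fills params in its *current* CallContext.
    caller_ctx.outgoing_args = list(args)
    # 2. Callee creates a new CallContext because it needs registers.
    callee_ctx = CallContext(parent=caller_ctx, num_regs=num_regs)
    # 3. Callee unpacks params from the parent CallContext.
    for i, value in enumerate(callee_ctx.parent.outgoing_args):
        callee_ctx.registers[i] = value
    return callee_ctx


def call_c_function(caller_ctx, args, func):
    # 1. Caller fills params in its current CallContext.
    caller_ctx.outgoing_args = list(args)
    # 2. A C function needs no CallContext of its own, so none is created.
    # 3. It reads its params straight from the caller's context.
    return func(*caller_ctx.outgoing_args)
```

In this model the C-function path allocates nothing per call, which is the
saving the list above is after.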
All of this should give us:
1. Some small performance boost (about 5%, I think).
2. Definitely faster C METHOD calls, which will help with something
like 6model.
3. Most importantly, a PCC that is much more sensible in terms of
allocating stuff. That leads us to various optimizations like "Scheme
style" CallContext stack allocation with trivial bump-pointer
allocation and a bit of magic for capturing a CallContext for
Continuations.
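
For readers unfamiliar with bump-pointer allocation, here is a minimal
sketch of the idea in Python. It is a toy arena over a list, not a proposal
for the actual C implementation; BumpStack and its methods are hypothetical
names, and capture() only hints at what the "bit of magic" for
Continuations might involve (copying a frame out before it is popped).

```python
class BumpStack:
    """Toy bump-pointer arena: alloc is one addition, release one assignment."""

    def __init__(self, size):
        self.slots = [None] * size
        self.top = 0                  # the bump pointer

    def alloc(self, n):
        if self.top + n > len(self.slots):
            raise MemoryError("context stack exhausted")
        base = self.top
        self.top += n                 # bump
        return base

    def release_to(self, base):
        # Returning from a call pops everything allocated since `base`.
        self.top = base

    def capture(self, base, n):
        # Copy a frame out of the arena so it can outlive the stack frame.
        return list(self.slots[base:base + n])
```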
Any comments, objections, volunteers to help with it?
--
Bacek
_______________________________________________
http://lists.parrot.org/mailman/listinfo/parrot-dev
Right now calling and returning from a sub are symmetric: the callee
fills the return values in the CallContext and jumps to the return
address. Then the caller unpacks the return values. How would the
proposed change affect this?
Best regards
--
Luben Karavelov
This approach is not thread-safe. There's a very good reason for keeping
all data relevant to the call contained within the CallContext for that
call.
The main solution for the performance problem is to replace the GC with
a reasonably performant modern implementation. Another improvement would
be to make CallContexts lazy, so storage for registers isn't allocated
until it's absolutely needed (in some cases, never). Polymorphism can
help too: it may be appropriate for calls to C functions to use a much
simpler CallContext that respects the same interface as the standard one.
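
A rough illustration of the two ideas in this paragraph, lazy register
storage and a simpler context behind the same interface. The class names
and the has_registers property are hypothetical, and this is a Python toy
rather than Parrot's C code.

```python
class LazyCallContext:
    """Toy context whose register storage is allocated on first use."""

    def __init__(self, num_regs):
        self.num_regs = num_regs
        self._registers = None        # nothing allocated yet

    @property
    def registers(self):
        if self._registers is None:
            self._registers = [None] * self.num_regs
        return self._registers

    @property
    def has_registers(self):
        return self._registers is not None


class CFunctionCallContext(LazyCallContext):
    """Same interface, but a C function never needs a register frame."""

    def __init__(self):
        super().__init__(num_regs=0)
```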
Allison
It won't change. The set_returns op fills in the grand-parent CallContext.
Really? How can it be non-thread-safe? And what other reasons are there
for "keeping all data" in a prematurely created CallContext?
> The main solution for the performance problem is to replace the GC with
> a reasonably performant modern implementation. Another improvement would
Niiice. Do tell me. Especially because I've put about a year of effort
into bringing Generational GC to Parrot. And it's the maximum we can do
now. All other algorithms require "movable" GCables (for compacting).
And without rewriting the whole PMC/Buffer handling in Parrot that's
virtually impossible to implement.
Yes, "6model" is a better foundation for implementing compacting than
the current PMC.data/VTABLE_mark approach. But it's just a foundation
and will require a lot of work to implement moving/compacting.
> be to make CallContexts lazy, so storage for registers isn't allocated
> until it's absolutely needed (in some cases, never). Polymorphism can
No. The current approach is exactly this. And it's slow. Twice as slow,
for the record, because in 99% of cases we call GC _twice_ to
allocate a CallContext.
> help too, it may be appropriate for calls to C functions to use a much
No. Polymorphism will _slow_ things down. chromatic and I broke the
"poor man's VTABLE polymorphism" after the current PCC landed just to
bring it up to speed with the previous one. And I'm talking about a 30%
performance improvement.
> simpler CallContext that respects the same interface as the standard one.
Anyway, the current PCC approach was wrong from the beginning. We are
always marshalling/demarshalling arguments for all calls. And it's
_slow_. Really really slow.
--
Bacek.
Think of Erlang. Every subroutine is a safe point to split off parallel
execution, because every subroutine is a self-contained unit. This is
absolutely critical in moving toward modern concurrent implementations.
They can handle things like data-parallelism in the background
automatically. It doesn't make sense to jettison one of Parrot's best
features and take a step backward toward stack-like dispatch.
>> The main solution for the performance problem is to replace the GC with
>> a reasonably performant modern implementation. Another improvement would
>
> Niiice. Do tell me. Especially because I've put about one year effort
> to bring Generational GC parrot. And it's maximum what we can do now.
> All other algorithms requires "movable" GCable (for compacting). And
> without rewriting whole PMC/Buffer handling in Parrot it's virtually
> impossible to implement.
Then maybe we should be looking at rewriting PMC/Buffer handling instead
of this. If the real problem is the fact that allocating CallContexts is
expensive, then attack the root and make it less expensive.
Do you have some profiling results that show where the current GC is
most expensive?
> Yes, "6model" is better foundation for implementing compacting than
> current PMC.data/VTABLE_mark approach. But it'sjust foundation and
> will require a lot of work to implement moving/compacting.
Something of a tangent, but how much of Parrot's current dispatch does
6model use? Anything? Parrot currently has a pile of pretty expensive
corner cases baked into dispatch that were added for Perl 6. But, if
Perl 6 isn't using them anymore, then ripping them out could give Parrot
some substantial speed gains (and improve maintainability at the same
time). The current multiple dispatch plumbing is a good example. It was
designed for Perl 6, but AFAIK, Perl 6 doesn't use it anymore.
>> be to make CallContexts lazy, so storage for registers isn't allocated
>> until it's absolutely needed (in some cases, never). Polymorphism can
>
> No. Current approach is exactly this. And it's slow. Twice slower for
> the record. Because in 99% of the cases we are calling GC _twice_ to
> allocate CallContext.
Then we need to work on calling GC only once. Or allocate CallContexts
from a separate short-lived pool to isolate them from the main body of GC.
TMTOWTDI. Always.
>> help too, it may be appropriate for calls to C functions to use a much
>
> No. Polymorphism will _slow_ things down. Me and chromatic broke "poor
> man VTABLE polymorphism" after landing of current PCC just to bring it
> on speed with previous one. And I'm talking about 30% performance
> improvement.
Based on what profiling results?
>> simpler CallContext that respects the same interface as the standard one.
>
> Anyway, current PCC approach is wrong from the beginning. We always
> doing marshalling/demarshalling of arguments for all calls. And it's
> _slow_. Really really slow.
I'm all for speeding things up. And I'll be the first to admit that
Parrot's current dispatch system was only intended as a "temporary"
partial fix of the old dispatch system (which was a horrible mass of
spaghetti code.) But further fixes need to be based on profiling data,
and not sacrifice Parrot's key competitive features.
Allison
I'm apprehensive about bacek's proposal, but I might not know enough
about his plan to really judge it. It's hard to say that CPS is one of
Parrot's strong points because we still don't implement it in a fully
leveraged, completely symmetric way. The idea that any Sub invocation
can be boxed up and dispatched to a different thread is a very
important part of the threading work that nine has been doing, and I
wouldn't want to do anything that damages the progress he has made.
If bacek says we can have speedups without sacrificing important
functionality, I'm inclined to trust him and see what he comes up
with.
>>> The main solution for the performance problem is to replace the GC with
>>> a reasonably performant modern implementation. Another improvement would
I don't think GC is a major bottleneck anymore, at least not to the
magnitude that it used to be. The case can definitely be made that PMC
allocation and initialization are too slow, but GC (mark and sweep) is
not a problem right now. Most problems we have with GC have more to do
with volume of allocated PMCs, and not with the underlying algorithm.
Allocating PMC headers and PMC data structures separately, from two
separate pools has drawbacks.
We already try to reuse CallContext PMCs between the call and the
return of a sub invocation. If we keep a pool of them around we can
try to reuse them more often than that. We already cache and attempt
to reuse register frames by size. More caching and reusing is probably
a good idea.
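
As a sketch of what "cache and reuse by size" means, here is a toy
free-list pool in Python. FramePool and its methods are hypothetical names
for illustration; Parrot's real caching lives in C and differs in detail.

```python
from collections import defaultdict


class FramePool:
    """Toy pool that recycles register frames, bucketed by frame size."""

    def __init__(self):
        self._free = defaultdict(list)   # size -> list of idle frames
        self.fresh_allocations = 0       # how often we had to allocate anew

    def acquire(self, size):
        bucket = self._free[size]
        if bucket:
            return bucket.pop()          # reuse an idle frame of this size
        self.fresh_allocations += 1
        return [None] * size

    def release(self, frame):
        for i in range(len(frame)):
            frame[i] = None              # clear stale values before reuse
        self._free[len(frame)].append(frame)
```

A frame released after one call is handed straight back to the next call
that asks for the same size, so only the first request per size allocates.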
> Something of a tangent, but how much of Parrot's current dispatch does
> 6model use? Anything? Parrot currently has a pile of pretty expensive
> corner cases baked into dispatch that were added for Perl 6. But, if
> Perl 6 isn't using them anymore, then ripping them out could give Parrot
> some substantial speed gains (and improve maintainability at the same
> time). The current multiple dispatch plumbing is a good example. It was
> designed for Perl 6, but AFAIK, Perl 6 doesn't use it anymore.
Perl 6 does use its own dispatcher, so there is a chance that we can
rip out some bits of our dispatcher that Perl 6 no longer relies on.
For instance, we now have a get_context_p opcode, which can get a call
context much more quickly than a get_params with :call_context.
Ripping out :call_context (which was never fully implemented anyway)
will be a small start. :named :optional and :named :slurpy args are
also much more expensive than many other arrangements. Of course,
ripping those things out does start to eat away at core dispatch
functionality and we don't do that just for fun.
I'm going off on a tangent, I know. Going through PCC and looking for
things that we no longer need to support for the cost would be a good
exercise.
>> No. Current approach is exactly this. And it's slow. Twice slower for
>> the record. Because in 99% of the cases we are calling GC _twice_ to
>> allocate CallContext.
Twice for what, the CallContext and the hash for named args? I'm not
sure how we expect to get much faster here. Copying a pointer to a
register is just as expensive as copying the contents of that
register. Rearranging the Caller's register frame to make for easy
access by the callee is just as expensive as unpacking contents out in
the callee.
Again, if bacek says it's possible I trust that it is. A more
constructive starting point, in my mind, is to start going through our
list of features and supported behaviors and start cutting out things
which cost more than they are worth. When we have fewer requirements
to meet, we will be much more free to rearrange the core algorithms.
>> Anyway, current PCC approach is wrong from the beginning. We always
>> doing marshalling/demarshalling of arguments for all calls. And it's
>> _slow_. Really really slow.
I would really like to see a breakdown of the costs involved. If we
turn GC mark/sweep off, what are the relative costs of CallContext
allocation and initialization, marshalling caller args, demarshalling
callee params, and resetting the CallContext to prepare for the
return? A comparison of these things will help to inform our
decisions.
> I'm all for speeding things up. And I'll be the first to admit that
> Parrot's current dispatch system was only intended as a "temporary"
> partial fix of the old dispatch system (which was a horrible mass of
> spaghetti code.) But further fixes need to be based on profiling data,
> and not sacrifice Parrot's key competitive features.
If that's true, I'm not sure I've ever seen what the long-term,
non-temporary plan was supposed to be. I've got plenty of long-term
plans of my own, but I developed those plans privately, long after the
initial PCC refactors. If other people have other ideas for the long
road to follow, I would be very interested to hear them.
--Andrew Whitworth
Reaaally bad example. I can point out at least 5 flaws in this
statement. For a start:
1. The Erlang VM is not a "generic VM" meant to support a variety of
dynamically typed languages.
2. Erlang is an immutable language with a message-passing architecture.
In that case it's _always_ safe to split execution into a different
thread, process, or server.
3. Erlang doesn't support multi-dispatch. Pattern matching happens
inside the subroutine.
4. Erlang is not a "modern concurrent implementation". Until the
mid-2000s it didn't support multicore CPUs properly.
5. "STM" is the "modern concurrent implementation" for the mutable
world. FSVO "modern".
6. Catering PCC for "always possible multithreaded execution" is
semantically the same error as "catering PCC for both internal and
external calls". Least Common Denominator. It is _slow_.
>>> The main solution for the performance problem is to replace the GC with
>>> a reasonably performant modern implementation. Another improvement would
>>
>> Niiice. Do tell me. Especially because I've put about one year effort
>> to bring Generational GC parrot. And it's maximum what we can do now.
>> All other algorithms requires "movable" GCable (for compacting). And
>> without rewriting whole PMC/Buffer handling in Parrot it's virtually
>> impossible to implement.
>
> Then maybe we should be looking at rewriting PMC/Buffer handling instead
> of this. If the real problem is the fact that allocating CallContexts is
> expensive, then attack the root and make it less expensive.
No. The real problem was described in my first mail. The current PCC
model is wrong.
> Do you have some profiling results that show where the current GC is
> most expensive?
It's not "CPU cache friendly".
>> Yes, "6model" is better foundation for implementing compacting than
>> current PMC.data/VTABLE_mark approach. But it'sjust foundation and
>> will require a lot of work to implement moving/compacting.
>
> Something of a tangent, but how much of Parrot's current dispatch does
> 6model use? Anything? Parrot currently has a pile of pretty expensive
Yes, rakudo/nqp use the current PCC to pass arguments. Not Parrot
multidispatch though.
> corner cases baked into dispatch that were added for Perl 6. But, if
> Perl 6 isn't using them anymore, then ripping them out could give Parrot
> some substantial speed gains (and improve maintainability at the same
> time). The current multiple dispatch plumbing is a good example. It was
> designed for Perl 6, but AFAIK, Perl 6 doesn't use it anymore.
>>> be to make CallContexts lazy, so storage for registers isn't allocated
>>> until it's absolutely needed (in some cases, never). Polymorphism can
>>
>> No. Current approach is exactly this. And it's slow. Twice slower for
>> the record. Because in 99% of the cases we are calling GC _twice_ to
>> allocate CallContext.
>
> Then we need to work on calling GC only once. Or allocate CallContexts
> from a separate short-lived pool to isolate them from the main body of GC.
>
> TMTOWTDI. Always.
BSCINABTE
>>> help too, it may be appropriate for calls to C functions to use a much
>>
>> No. Polymorphism will _slow_ things down. Me and chromatic broke "poor
>> man VTABLE polymorphism" after landing of current PCC just to bring it
>> on speed with previous one. And I'm talking about 30% performance
>> improvement.
>
> Based on what profiling results?
valgrind --tool=callgrind.
https://github.com/parrot/parrot/commit/12b59772e3146e5055e57f963236bfb700bbd48b
git log src/call/args.c, search for "%"
>>> simpler CallContext that respects the same interface as the standard one.
>>
>> Anyway, current PCC approach is wrong from the beginning. We always
>> doing marshalling/demarshalling of arguments for all calls. And it's
>> _slow_. Really really slow.
>
> I'm all for speeding things up. And I'll be the first to admit that
> Parrot's current dispatch system was only intended as a "temporary"
> partial fix of the old dispatch system (which was a horrible mass of
> spaghetti code.) But further fixes need to be based on profiling data,
> and not sacrifice Parrot's key competitive features.
Which "competitive features"??? The possibility of splitting execution
at any Sub call? Really? Sub.invoke will be called in the _same_ thread.
Inside Sub.invoke we can create/clone/whatever the arguments before
passing execution to a different thread. And we can do it _when_
needed. And most importantly _only_ _when_ needed.
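
A toy Python illustration of "clone only when needed": the function below
is always entered on the caller's thread and copies its arguments only on
the path that hands work to another thread. The invoke() helper and its
run_in_new_thread flag are hypothetical, not Parrot's Sub.invoke.

```python
import copy
import threading


def invoke(sub, args, run_in_new_thread=False):
    """Toy Sub.invoke: always entered on the caller's thread."""
    if not run_in_new_thread:
        # Common case: same thread, so pass the arguments through untouched.
        return sub(*args)

    # Rare case: snapshot the arguments, then hand them to another thread.
    private_args = copy.deepcopy(args)
    result = []
    worker = threading.Thread(
        target=lambda: result.append(sub(*private_args)))
    worker.start()
    worker.join()
    return result[0]
```

The copy cost is paid only on the cross-thread path; ordinary calls share
the caller's data directly.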
And redesigning Parrot to be a heavily multi-threaded VM is a really
interesting task. But I wouldn't call it "Parrot", just because it
would be easier to do from a clean start. Or use Erlang, for that matter.
He has totally proven his worth over the years, and if I said anything
that sounded otherwise, I apologize.
I'm only talking about some technical details of the proposal.
Specifically, about storing arguments in the parent's context. There's
no need for this to be anything other than a straightforward technical
conversation.
> If that's true, I'm not sure I've ever seen what the long-term,
> non-temporary plan was supposed to be. I've got plenty of long-term
> plans of my own, but I developed those plans privately, long after the
> initial PCC refactors. If other people have other ideas for the long
> road to follow, I would be very interested to hear them.
The biggest goal of the last major refactor was to cut down from a dozen
incompatible APIs for making calls, to a single streamlined path. But,
it was largely a surface fix, and the next step was to replace the old
crufty dispatch core behind the clean new API.
Life got in the way, I got sucked in by school and then work, but it's
still a good next step.
Allison
If I understand Bacek correctly, he wants to store arguments in the
parent context only temporarily, then copy them to the callee's context
immediately after the call.
Nick
Um... none of that is relevant to the question of encapsulated dispatch.
The fact that Erlang has some good features doesn't mean I'm elevating
it to superlanguage status.
> Yes, rakudo/nqp uses current PCC to pass arguments. Not Parrot
> multidispatch though.
Cool. I'd like to propose ripping it out. With a suitable delay, of
course, to make sure there aren't any other languages using it, and to
talk through what they do need and if there's a better way to provide it.
>> Based on what profiling results?
>
> valgrind --tool=callgrind.
>
> https://github.com/parrot/parrot/commit/12b59772e3146e5055e57f963236bfb700bbd48b
>
> git log src/call/args.c, search for "%"
Oh, I wasn't asking about profiling 2 years ago, I was asking about
profiling today. The code has changed quite a bit.
> And redesigning Parrot to be heavily multi-threaded VM is really
> interesting task. But I wouldn't call it "Parrot". Just because it
> will be easy to do it from clean start. Or use Erlang if it's matter.
Building a multi-threaded VM is one of the key reasons Parrot was started.
Allison
Rip out MMD? Winxed and NQP-rx currently use Parrot's MMD, and several
libraries such as PCT rely on it. I'm not saying it can't be done, of
course. Many existing cases can probably be replaced by better use of
inheritance. If we rip out MMD, I think it will expose shortcomings
in other systems, especially NameSpaces. I'm not against opening the
box, I just want to make sure nobody is surprised by what we will find
inside.
>> And redesigning Parrot to be heavily multi-threaded VM is really
>> interesting task. But I wouldn't call it "Parrot". Just because it
>> will be easy to do it from clean start. Or use Erlang if it's matter.
>
> Building a multi-threaded VM is one of the key reasons Parrot was started.
Again I mention nine's work. He's managed to do almost exactly this,
with a few remaining rough edges. Invocations can be bundled up into a
Task PMC. The Task can be dispatched onto any of a number of
underlying worker threads. Overhead for an individual task is very
small, which makes parallelization much more attractive for certain
tasks. So long as Task.invoke can still eagerly gobble up arguments,
it doesn't really matter where they come from or how they are
marshaled.
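
The shape described here can be sketched in Python with the standard
library's thread pool. This is only an analogy for nine's Task PMC, not
its implementation: the Task class and run_tasks() are hypothetical, and
the argument capture is a shallow tuple copy.

```python
from concurrent.futures import ThreadPoolExecutor


class Task:
    """Toy Task: bundles a callable with eagerly captured arguments."""

    def __init__(self, sub, *args):
        self.sub = sub
        self.args = tuple(args)      # gobbled up at creation time

    def invoke(self):
        return self.sub(*self.args)


def run_tasks(tasks, workers=4):
    # Dispatch each task onto one of a fixed number of worker threads and
    # collect the results in submission order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task.invoke) for task in tasks]
        return [f.result() for f in futures]
```

Because the Task owns its arguments from the moment it is built, it does
not care which context they were originally staged in.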
If we can make C-based method invocation faster that obviously is a
big help for built-in types. Once we have 6model and start moving
towards a system where more methods are PBC-based instead, those
savings become less attractive.
We still need an abstracted interface where arguments come from (or
appear to come from) a signature object. We can't have a Sub rely on
intimate details of the memory layout of its caller. So long as we
have some abstraction boundary, I think we can keep most things
working just as well as they are now. If we can make signatures hold
less state by default, load data lazily when possible, and be more
easily reused between invocations, I think we will see much win.
</lurk>
This is something that has been bugging me about the direction of Perl 6 for
a few years.
The relationship between immutability and implicit thread safety has been
well known for a very long time: I was reading old research papers about it
in 1989, and while the trend towards generation of code at runtime has
opened new opportunities for inferences, those are really complicated, hard
to get right, and dependent on language features that we've brushed aside in
designing Perl 6.
There seems to be an assumption (in P6) that auto-threading is only of
interest when the programmer has signalled that it's OK, either through
eigentypes, or using vector constructs like "map".
There's a bunch of hand-waving towards "detecting when parallelism is
possible", but I fear that won't get far in practice. Analysis at compile
time is an NP-complete problem; doing it at runtime is still NP-complete but
has a greater chance of finding stuff, at the expense of having to re-run
when immutability can't be proven; and "map" and eigenvalues aren't used
much in real-life programs.
We should be able to do much, MUCH better: parallelism should be the
default, all the time.
But to get there, we need a language and a VM that makes "doing it right"
easier than "doing it wrong".
* For the language that means making declaring & using a "constant" easier
(more concise) than a "variable".
* For the VM that means enabling the despatch mechanisms to see whether a
PMC represents a mutable container or an immutable value, and assisting
constructors to switch off mutability once initialization is complete.
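
To make the two bullets concrete, here is a toy Python sketch of an object
that switches off mutability once its constructor finishes, plus a check a
dispatcher could run. Value, Container, is_immutable() and
safe_to_parallelize() are hypothetical names, and the freeze is shallow: a
field holding a mutable object would still be mutable through that object.

```python
class Value:
    """Toy PMC-like object that becomes immutable once construction ends."""

    def __init__(self, **fields):
        object.__setattr__(self, "_frozen", False)
        for name, value in fields.items():
            setattr(self, name, value)
        object.__setattr__(self, "_frozen", True)   # switch off mutability

    def __setattr__(self, name, value):
        if self._frozen:
            raise AttributeError(f"{name}: value is immutable")
        object.__setattr__(self, name, value)


class Container:
    """Toy mutable container."""

    def __init__(self, item):
        self.item = item


def is_immutable(obj):
    return getattr(obj, "_frozen", False)


def safe_to_parallelize(args):
    # Dispatch-side check: every argument must be an immutable value.
    return all(is_immutable(a) for a in args)
```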
Folk worry that complicated despatch mechanisms are too costly, but I think
such concerns are short-sighted. Going 60% slower on a single core might
seem like a big loss, but as soon as you can utilize 3 or more cores, you're
ahead, and by 10 cores you're at quadruple the single-core speed.
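
The arithmetic behind those figures, as a small Python check. It assumes
that "60% slower" means running at 0.4x the baseline speed per core (the
reading under which 10 cores give exactly 4x) and that work scales
perfectly across cores, which real programs rarely manage.

```python
from fractions import Fraction

# Assumption: "60% slower" = 0.4x the baseline throughput on one core.
PER_CORE_SPEED = Fraction(4, 10)


def relative_speed(cores):
    # Assumes perfect linear scaling with the number of cores.
    return PER_CORE_SPEED * cores
```

Under these assumptions relative_speed(2) is 4/5 (still behind the
baseline), relative_speed(3) is 6/5 (ahead) and relative_speed(10) is 4.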
For several years chip manufacturers have been shipping sample quantities of
CPUs with many tens of cores. By the time that Perl 6 comes to the mainstream,
even mobile phones are likely to have upwards of eight cores; it would be
a great shame if the "ordinary" Perl 6 program won't be able to use them.
-Martin
<lurk>